Tracking Bias in Online News

Tracking Coverage of the 2018 Congressional Elections

During the 2018 elections we collected headlines via Google Alerts in order to track the coverage of various competitive races. By incorporating professional ratings of online news sources, we can show visually how the coverage of a given race varies across the political spectrum.

In this post we’ll take a closer look at the California 25th Congressional District (CA-25), where Katie Hill (D) unseated Steve Knight (R). Data for all 80 Senate and Congressional races1 can be accessed via our 2018 election tracker. By the end we’ll show how a single visualization can provide at-a-glance evidence that:

  1. The coverage spanned the entire left-right spectrum, with perhaps slightly more nationally-rated coverage from left-of-center news sites;
  2. The single largest source of coverage was a local site2, and a number of other local sites also provided coverage;
  3. There was a good deal of “secondary” activity, reflecting discussion on social sites (such as 4chan) and leverage by SEO or search-aggregation “re-posting” sites; and
  4. The extremist sites did not play a major role in the conversation.

The last week of coverage for the CA-25 race can be summarized at-a-glance with the following view.

US House CA-25 Coverage Summary, Week of November 4, 2018

With this view we can quickly assess the actors and topics in the coverage space.

Counting Mentions

Just by looking at how many stories mention each of the candidates (and the name of the district) we can get a pretty good sense of the coverage landscape.

The final week of coverage gave a slight edge to Katie Hill in the total number of “news” stories mentioning her, compared with her opponent, Steve Knight:

Stories mentioning CA-25 during the week ending November 4, 2018

But coverage of CA-25 increased as election day approached; over the course of the race, coverage was mostly even between the candidates and grew roughly proportionally for both:

CA-25 Weekly Mention Count (Aug-Nov 2018)

Accounting for Bias in Mainstream News

We can see who is providing this coverage with the help of “site buckets” that bring together the work of Media Bias Fact Check with a significant amount of internal research on local and internet media organizations3.

Source-Type Buckets Used by Marvelous AI’s Election Tracker

We can also visually plot the documents we find on the basis of their credibility, as revealed by these buckets. Inspired by Media Bias Fact Check, we identify two dimensions relevant to credibility: political bias and commitment to facts.

For example, the Wall Street Journal has the following MBFC rating:

Media Bias Fact Check Rating Details for the Wall Street Journal

We translate this rating in our system to x=2.0 and y=2.0. Our main guiding intuition is that left-right (x) should align with political bias and up-down (y) with factual reporting. So the Marvelous record for the Wall Street Journal is:

We can then plot all stories from this source at (2, 2) on a graph: in other words, as semi-credible, center-right material. Sites rated directly by MBFC are automatically assigned a score on the basis of their MBFC Factual Reporting buckets (i.e. LOW, MIXED, HIGH, with a 2x multiplier when the rating includes “VERY”). Other cases are computed as a sum of the trigger words highlighted in the MBFC report. Typically mainstream news lands in the top half of the plot, while blogs and aggregators land in the bottom half.
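As a sketch of the bucket-to-score translation described above, the mapping might look like the following. The specific numeric values and function name are illustrative assumptions chosen so that the Wall Street Journal example comes out as (2.0, 2.0); they are not Marvelous AI’s actual table.

```python
# Hypothetical bucket-to-coordinate tables; values are assumptions for
# illustration, spaced so scores fall within the [-5, 5] plotting range.
BIAS_X = {"LEFT": -4.0, "LEFT-CENTER": -2.0, "CENTER": 0.0,
          "RIGHT-CENTER": 2.0, "RIGHT": 4.0}

FACTUAL_Y = {"LOW": -2.0, "MIXED": 0.0, "HIGH": 2.0}

def mbfc_score(bias: str, factual: str) -> tuple[float, float]:
    """Translate an MBFC bias/factual-reporting rating into (x, y).

    A "VERY" prefix doubles the factual score (e.g. VERY HIGH -> 4.0),
    following the 2x multiplier described in the text.
    """
    factual = factual.upper()
    multiplier = 1.0
    if factual.startswith("VERY "):
        multiplier = 2.0
        factual = factual[len("VERY "):]
    x = BIAS_X[bias.upper()]
    y = FACTUAL_Y[factual] * multiplier
    return (x, y)

# The Wall Street Journal example: right-center bias, high factual reporting.
print(mbfc_score("RIGHT-CENTER", "HIGH"))  # (2.0, 2.0)
```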

Accounting for Bias in Blogs and Aggregators

For non-mainstream sources, MBFC does not provide political bias buckets and factual reporting ratings in the usual way. These sites, which MBFC calls “questionable sources,” are rated under a somewhat different system.

For example, the MBFC page for President 45 Donald Trump is:

MBFC Record for Questionable Source “President 45 Donald Trump”

Since the descriptions of “questionable source” sites vary from page to page, we could not assign scores directly by bucket as we did with left, right, and center bias (and with HIGH, LOW, and MIXED credibility). Instead we created a scoring system that derives values from the words used to describe each questionable source.

We do this by assigning a value to high-bias and low-credibility words. Most of the scoring components can be summarized as:

An Assortment of Marvelous AI Credibility Scores and Term Weights

In the questionable site example provided, we would expect a score with both very high x (right bias) and very low y (poor commitment to facts), which is what we see in our record.

Partial Marvelous AI Record for Questionable Source “President 45 Donald Trump”
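The term-weight summation described above could be sketched roughly as follows. The specific trigger phrases, weights, and clamping behavior here are illustrative assumptions, not the actual Marvelous AI scoring table.

```python
# Illustrative (x, y) weights per trigger phrase; the real table is
# larger and these specific values are assumptions for the sketch.
TERM_WEIGHTS = {
    "extreme right": (3.0, 0.0),
    "extreme left": (-3.0, 0.0),
    "conspiracy": (0.0, -2.0),
    "pseudoscience": (0.0, -2.0),
    "propaganda": (0.0, -1.0),
    "fake news": (0.0, -2.0),
}

def score_questionable(description: str) -> tuple[float, float]:
    """Sum the (x, y) weights of every trigger phrase found in an MBFC
    "questionable source" description, clamped to the [-5, 5] range."""
    text = description.lower()
    x = sum(dx for term, (dx, _) in TERM_WEIGHTS.items() if term in text)
    y = sum(dy for term, (_, dy) in TERM_WEIGHTS.items() if term in text)
    clamp = lambda v: max(-5.0, min(5.0, v))
    return (clamp(x), clamp(y))

# A description like this one yields high x (right bias) and low y
# (poor commitment to facts), as in the record above.
print(score_questionable("An extreme right conspiracy and propaganda site"))
```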

Absent MBFC ratings4, we do not take a stand on credibility or bias when plotting legitimate local news. Instead we rate known local affiliates as (0.0, 2.0) and totally unknown sites or aggregators as (0.0, -2.0).
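The fallback defaults above amount to a simple lookup; as a minimal sketch, assuming a hypothetical site-type label:

```python
def default_rating(site_type: str) -> tuple[float, float]:
    """Fallback (x, y) for sites with no MBFC rating: politically neutral,
    with credibility set by whether the site is a confirmed local outlet.
    The site_type label here is a hypothetical input, for illustration."""
    if site_type == "local":   # confirmed local newspaper or TV affiliate
        return (0.0, 2.0)
    return (0.0, -2.0)         # unknown site or aggregator
```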

Visualizing Credibility and Bias

Putting all of this together, we can plot the coverage of a given race as a 2D scatter plot ranging from -5 to 5 on both dimensions. In the last week of CA-25 coverage, for example, we see5:

Marvelous Visualization of Bias and Credibility in Election Coverage
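The random adjustment mentioned in footnote 5 could be sketched as follows; the function name and seeding are assumptions for illustration. The jittered coordinates would then feed any ordinary 2D scatter plot.

```python
import random

def jittered(points, spread=1.0, seed=None):
    """Offset each (x, y) score by a uniform random amount in
    (-spread, spread) on both axes, so that multiple sites with the
    same rating are not plotted at identical coordinates."""
    rng = random.Random(seed)
    return [(x + rng.uniform(-spread, spread),
             y + rng.uniform(-spread, spread)) for x, y in points]

# Two sites rated (2, 2) no longer overlap exactly after jittering.
print(jittered([(2.0, 2.0), (2.0, 2.0)], spread=1.0, seed=42))
```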

Red-to-blue denotes left-to-right bias (i.e. the MBFC bucket) in the usual way for MBFC-rated sites. Orange marks local newspaper or TV affiliates that we confirmed ourselves. Light cyan marks aggregators, and dark cyan marks extremist blogs. We can now see at-a-glance evidence that:

  1. The coverage was full spectrum;
  2. Local news played a role, with the local paper in the lead;
  3. Aggregators tried to leverage candidate coverage to a moderate extent; and
  4. The extremist sites did not play a major role in the conversation.

Looking more closely at a snapshot of the headlines and snippets from the local bucket, we can see that much of the final week of coverage, especially at the Signal, was designed to move public opinion toward Steve Knight, often with negative attitudes about Katie Hill or her perceived associates.

At Marvelous, we are in the process of decoding the language features within this corpus in our ongoing efforts to discover and characterize persuasive speech and viral deception. Expect an ongoing series of reports on these matters.

  1. We focused on competitive races that had completed their primaries by late August. Because manual relevance assessment was involved, we limited the scope of the project to these 80 races.
  2. The Santa Clarita Valley Signal
  3. We’ll save discussion of our internal research for another post. For rating purposes, most of these sites get (0,2) or (0,-2) ratings, for local sources and aggregators, respectively. Improving these default ratings is a matter for future research.
  4. The MBFC ratings are not perfect, but offer a stable objective backdrop and remove our own judgments from the bias/credibility assessment loop.
  5. Note that raw scores are adjusted by a random offset in (-1, 1) on both dimensions. This both ameliorates the problem of multiple sites being plotted at the same coordinates and reflects the approximate nature of the scoring in general. These plots are, no doubt, a work in progress.
