Using Graphs to Tell a Story

Jul 07, 2023

Math is often considered to be unbiased, and the general public is trained to simply believe any statistical figure quoted at them by a seemingly reputable source. However, even in cases such as simple physics diagrams, where all the data could be unbiased and apolitical, the presentation of data always reflects an ulterior motive. When I am creating graphs and diagrams for presentations at work, I am always asking myself “What am I trying to convey with this graphic?” and “What do I want the customer to do with this information?” There is nothing inherently wrong with this - it is the responsibility of the scientific expert to distill their experiments into information the layperson can digest and incorporate into their own life. The issues come in when the display of the data is manipulated in such a way that it no longer reflects what was actually discovered, or when public figures of all stripes willfully misrepresent the data for their own purposes.

This article will discuss different ways an example data set can be represented in a newspaper article using just a bar graph, and which of those representations are accurate. We will pretend I have interviewed 15,000 college-aged girls and asked one question:

“Which of the following reality TV shows have you watched at least one episode of in the last year? (check all that apply)”

The Bachelor
The Bachelorette
Love is Blind
Love Island

The responses are shown in the table below.

Our data set for this article - a survey of the reality TV viewing habits of college-aged girls

Due to the small number of data points in our set, the table is easy to read and could be used as-is in our final article. However, I have decided that it is not exciting enough and I want a graph to communicate this data instead. If the goal of my article is to convey that a large percentage of college-aged girls regularly watch reality television, I can make a basic bar graph showing the raw viewership numbers, such as the one below:

Question: How many “yes” responses did I get for each show? Focus: There are a lot of “yeses”

I could also be more direct and plot the percentage of “yes” responses instead, like in this graph:

Question: What percentage of respondents say that they watch each show? Focus: It is a high percentage for all shows

Because the Y axis starts at 0, it looks as if all four shows have similar viewership numbers, and those numbers represent a large share of college-aged girls. We have not done any data manipulation, and the Y-axis ticks are evenly spaced. Both of these graphs will allow me to write my article about how many college girls love reality TV, and they both accurately represent the answer to the question “How many ‘yes’ responses did I get for each show?”
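To make the step from raw counts to percentages concrete, here is a minimal sketch. The counts below are hypothetical placeholders, not the figures from the survey table; only the 15,000 total comes from the article.

```python
# Hypothetical "yes" counts, made up for illustration only.
# They are NOT the values from the article's table.
TOTAL_RESPONDENTS = 15_000  # number of girls surveyed, per the article

yes_counts = {
    "The Bachelor": 12_000,
    "The Bachelorette": 11_700,
    "Love is Blind": 11_400,
    "Love Island": 11_100,
}


def to_percentages(counts, total):
    """Convert raw "yes" counts into percentages of all respondents."""
    return {show: 100 * n / total for show, n in counts.items()}


percentages = to_percentages(yes_counts, TOTAL_RESPONDENTS)
for show, pct in percentages.items():
    print(f"{show:<18}{pct:5.1f}%")
```

Because every bar is divided by the same total, the percentage graph has exactly the same shape as the raw-count graph; only the Y-axis labels change.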

Let’s say I now decide to take a different tack - I applied to be on Love Island and was rejected, so now I want to write an article about how Love Island is the least popular reality show among college-aged girls. This can be accomplished by simply adjusting the range of our Y-axis to get the following graph:

Question: How many “yes” responses did I get for each show? Focus: The difference in “yes” responses between each show

Again, this graph is an accurate representation of the data, but instead of highlighting the large percentage of “yes” responses overall, it now highlights the differences in “yeses” between the shows. This would be a fair graph to use in the article.
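One way to see how much the Y-axis range changes the picture is to compare the drawn heights of two bars for different axis starting points. The sketch below uses hypothetical counts and a hypothetical helper name, `drawn_height_ratio`; it is an illustration, not the calculation behind the graphs in this post.

```python
def drawn_height_ratio(value_a, value_b, axis_min=0):
    """Ratio of the drawn heights of two bars when the Y axis starts at axis_min."""
    if axis_min >= min(value_a, value_b):
        raise ValueError("axis_min must be below both values")
    return (value_a - axis_min) / (value_b - axis_min)


# Hypothetical "yes" counts, for illustration only.
bachelor, love_island = 12_000, 11_100

# Axis starting at 0: the bars look nearly identical (ratio about 1.08).
print(f"{drawn_height_ratio(bachelor, love_island, axis_min=0):.2f}")

# Axis starting at 10,500: the taller bar is drawn 2.5 times as high.
print(f"{drawn_height_ratio(bachelor, love_island, axis_min=10_500):.2f}")
```

The data are identical in both cases; only the visible portion of each bar differs, which is why the choice of range decides what the reader notices first.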

However, let’s say I am really vindictive and decide to go even further. I combine the responses for the Bachelor and the Bachelorette into one category, like this:

Question: How many “yeses” did I get out of 15k for Love is Blind and Love Island, and out of 30k for all Bachelor shows?

This has now crossed a line, and I am misrepresenting the data for my own purposes. Our original graphs were answering the question “Out of 15,000 college-aged girls, how many of them said they watched this show?” The graph above does not answer this question for the “All Bachelor” bar, since we now have 30,000 possible “yeses” for that bar, but only 15,000 for the other shows. This can be visualized by using percentages again:

Percentage representation of the above graph - how is it possible that over 100% of respondents said they watched a Bachelor show?

Clearly, it does not make sense for more than 100% of respondents to have said “yes”. This graph is no longer a fair representation of our results, and it is irresponsible to publish it as such. If I really have it out for Love Island, I can re-compute my data to get the number of people who responded “yes” to either Bachelor, Bachelorette, or both and get this graph:

Question: How many girls said yes to watching Love is Blind, Love Island, OR at least one Bachelor show? Focus: The Bachelor shows are the most popular

This graph is slightly less misleading than the previous one, since every bar is now a percentage of the same 15,000 respondents. However, it is still confusing because one of the bars represents two possible shows, while the others each represent only one.
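The difference between the two “All Bachelor” bars comes down to adding up “yeses” versus counting people. Here is a minimal sketch using five made-up respondents (not real survey data), each represented by the set of shows they checked.

```python
# Five hypothetical respondents; each set holds the shows that person checked.
responses = [
    {"The Bachelor", "The Bachelorette"},
    {"The Bachelor", "The Bachelorette", "Love Island"},
    {"The Bachelor"},
    {"The Bachelorette", "Love is Blind"},
    {"Love is Blind", "Love Island"},
]
BACHELOR_SHOWS = {"The Bachelor", "The Bachelorette"}
n = len(responses)

# Misleading: add the two shows' "yes" counts, so a person who watches
# both shows is counted twice.
summed_yeses = sum(len(r & BACHELOR_SHOWS) for r in responses)

# Fairer: count each person once if they checked at least one Bachelor show.
watched_either = sum(1 for r in responses if r & BACHELOR_SHOWS)

print(f"Summed:  {100 * summed_yeses / n:.0f}% of respondents")
print(f"Either:  {100 * watched_either / n:.0f}% of respondents")
```

In this toy sample the summed version exceeds 100% while the "either" version cannot, because no respondent is counted more than once.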

Let’s pretend now that my editor catches this and says I need to cut the graph since it is misleading. I am still angry, so I mess around with the Y axis of the graph and produce this:

A bar graph that egregiously violates the Lie Factor rule

This graph has been exaggerated for effect, but the manipulation of axes in this way is so common that there is a name for measuring it: the Lie Factor. The Lie Factor compares the size of the effect shown in the graphic to the size of the effect in the data; in an accurate graph the two are proportional. If the Bachelor is twice as popular as Love Island, the bar representing it must be twice as big as the Love Island bar, not four times. Again, this graph is not an accurate representation of the data and should not be printed as such.
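The twice-versus-four-times example can be turned into a number. The sketch below follows Tufte's definition as I understand it: the size of an effect is the relative change from the first value to the second, and the Lie Factor is the effect shown in the graphic divided by the effect in the data. The function names are my own, chosen for illustration.

```python
def effect_size(first, second):
    """Relative change from the first value to the second."""
    return abs(second - first) / abs(first)


def lie_factor(shown_first, shown_second, data_first, data_second):
    """Effect shown in the graphic divided by the effect present in the data."""
    return effect_size(shown_first, shown_second) / effect_size(data_first, data_second)


# Data: the Bachelor is twice as popular as Love Island (1 -> 2).
# Graphic: the Bachelor bar is drawn four times as tall (1 -> 4).
print(lie_factor(shown_first=1, shown_second=4, data_first=1, data_second=2))

# An honest graph draws the bar twice as tall, giving a Lie Factor of 1.
print(lie_factor(shown_first=1, shown_second=2, data_first=1, data_second=2))
```

A value of 1 means the graphic shows the effect at its true size; the further the value is from 1, the more the graphic overstates or understates the data.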

Below, I have plotted all the graphs discussed in a single image - I am always amazed at how many different ways the same data can be represented and at the different reactions each one elicits. Whenever I read anything with graphics, I have to force myself to think of these factors because it is so easy to be taken in by the narrative. For more information, I recommend the classic The Visual Display of Quantitative Information by Edward Tufte.

All of the plots talked about in this article

Thank you for reading Science for the Unscientific. This post is public so feel free to share it.
