The "margin of error" and what that means for election polling
Hi all, I apologize for skipping yet another Friday post. I’ve started a new job that requires a longer commute, and it’s been hard to find time for this. Going forward, I’m going to be releasing one article a month, at 11am on the first Monday of every month. On to this month’s….
One only has to look at the front page of any American newspaper today to see that according to most election polls, Trump and Harris are effectively tied, with polls showing a one to three point difference between them in most battleground states. There’s a natural human tendency to take those polls at face value, interpreting a result of +1 for Trump in Georgia as meaning that Trump is going to win Georgia with a 1% lead. Newspapers often play into this mindset by simply displaying the results without any additional context. However, when the race is this close, it is critically important to understand how the survey was conducted and what potential blind spots there might be, so that we can answer two questions:
Given what we know about the methodology, can this survey tell us who is going to win?
If the survey can’t tell us who is going to win, or if the election ends and we realize our results were way off, is there anything we can do to improve our surveying in the future?
Let’s tackle the first question first. In cases like this election, where the polls are so close, the most important factor of the survey to consider is the margin of error. The margin of error describes how much the survey result may differ from the true result. It’s often reported in media as “+/- points”, as in the phrase, “Harris leads in Iowa with 47% of the vote, +/- 3 points”. The “+/-” number comes from calculating the confidence interval of your poll. The confidence interval describes how confident you are that the true result falls within the error bounds of your survey. A common confidence level is 95%, so we can interpret our statement above as “If we ran this survey of Iowans 100 times, then in about 95 of those runs, the interval we compute would contain the true support for Harris; here, that interval is 44 to 50%, or 47 +/- 3%.” This may seem like a small distinction, but when the race is this close, it is vital to understand. If Harris is polling at 47% and Trump is polling at 45% in a given state, there is only a 2-point difference between them. If the margin of error is 3 points, the true Harris support could be anywhere between 44 and 50%, while the true Trump support could be anywhere between 42 and 48%. Those ranges overlap, which means the polls cannot tell you who is going to win. Anyone who tells you differently is either not being honest or does not understand statistics.
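The overlap check described above can be sketched in a few lines of Python. The 47/45 split and the +/- 3-point margin are the hypothetical numbers from the paragraph, not a real poll:

```python
def interval(support: float, margin: float) -> tuple[float, float]:
    """Return the (low, high) bounds implied by a margin of error."""
    return (support - margin, support + margin)

def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """Two intervals overlap when neither ends before the other begins."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical polling numbers from the example above.
harris = interval(47.0, 3.0)   # (44.0, 50.0)
trump = interval(45.0, 3.0)    # (42.0, 48.0)

print(harris, trump, overlaps(harris, trump))
```

Because the intervals overlap, this hypothetical poll can’t distinguish the candidates; the margin would have to shrink below one point (half the 2-point gap) before the intervals separate.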
Let’s say the election is over, the victor is successfully installed in the White House, and our polls were completely wrong, similar to what happened in 2016. How can we improve our methodology so we can be more accurate in the future? There are several ways to improve our survey, and all rely on the fact that the confidence interval isn’t a made-up number, but instead a value we can calculate from statistics:
confidence interval = x +/- z * s / √n

where:

x is the mean of our data, or the percent of people who supported Harris
z is the z-score for the confidence level we want (1.96 for 95%)
s is the standard deviation of our data, or the variability between Trump and Harris supporters
n is the number of people we sampled.
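As a quick sketch of the equation above, here is the “+/-” term computed in Python, assuming a yes/no poll response so that s = sqrt(p * (1 - p)); the 47% support and n = 1000 sample size are illustrative, not real data:

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """z * s / sqrt(n), with s derived from the sample proportion p."""
    s = math.sqrt(p * (1 - p))  # std. dev. of a yes/no (Bernoulli) response
    return z * s / math.sqrt(n)

# Illustrative numbers: 47% support, 1000 respondents.
p, n = 0.47, 1000
moe = margin_of_error(p, n)
print(f"{p:.0%} +/- {moe:.1%}")
```

With n = 1000 this comes out to roughly the +/- 3 points quoted in the example above; quadrupling n to 4000 cuts it roughly in half.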
One improvement method might be familiar to those who recently read my article on evaluating scientific studies: it utilizes what is called the Law of Large Numbers. The Law of Large Numbers states that the larger the sample size, the closer you are to converging on the true numbers. So, simply increasing n will theoretically make our mean x and standard deviation s closer to the true values, and will thus increase the reliability of our poll. You can see this reflected in our equation - a large n means that our standard deviation s and z-score z will have a smaller effect on our final result, since they are being divided by the square root of the large n.
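The Law of Large Numbers is easy to see in a small simulation. Here the “true” Harris support of 47% is an assumed value for demonstration; as the sample grows, the polled mean settles toward the truth:

```python
import random
import statistics

random.seed(0)           # fixed seed so the run is reproducible
TRUE_SUPPORT = 0.47      # assumed true support, for demonstration only

results = {}
for n in (100, 1_000, 10_000, 100_000):
    # Each respondent supports Harris with probability TRUE_SUPPORT.
    sample = [random.random() < TRUE_SUPPORT for _ in range(n)]
    results[n] = statistics.mean(sample)
    print(n, round(results[n], 4))
```

The small samples bounce around; the 100,000-person sample lands within a fraction of a point of 47%.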
Increasing the number of people surveyed is great, but it’s important to make sure those additional people actually represent the entire country. Depending on the polling method, you may be missing large demographics of the country. For example, Quinnipiac polls call people using Random Digit Dialing, meaning a computer program selects random digits to form a phone number for the pollster to call. Younger generations are less likely than older generations to pick up the phone for an unknown number, meaning the poll will under-count the opinions of younger generations. The pollsters will need to specifically target younger voters or use statistics to extrapolate results from the smaller sample, which is very difficult to do accurately. In terms of our confidence interval equation, this represents a case where our calculated mean and standard deviation differ significantly from the truth.
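That extrapolation is essentially reweighting. A hedged sketch, with entirely invented numbers: if young respondents make up 10% of the poll but 30% of the population, their answers get weighted up to the population share.

```python
# Each row: (age group, share of poll respondents, share of the actual
# population, Harris support among that group's respondents).
# All figures are invented for illustration.
poll = [
    ("18-34", 0.10, 0.30, 0.60),
    ("35-64", 0.50, 0.50, 0.47),
    ("65+",   0.40, 0.20, 0.40),
]

# Raw result: weight each group's answer by its share of the poll.
raw = sum(share_poll * support for _, share_poll, _, support in poll)

# Reweighted result: weight by the group's share of the population instead.
weighted = sum(share_pop * support for _, _, share_pop, support in poll)

print(f"raw: {raw:.1%}, reweighted: {weighted:.1%}")
```

In this made-up example, under-sampling a pro-Harris age group drags the raw number down by four points; the reweighted figure corrects for that, but only if the respondents a pollster did reach in each group actually resemble the rest of that group.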
Even if we get a larger number of representative Americans, what if they tell the pollsters they support one candidate, but then never actually vote? This is a common problem, especially in states with stricter rules around mail-in and early voting, and fixing it falls to elected officials who can pass legislation that makes it easier for everyone to vote.
However, none of those improvements will help us if those people aren’t telling us the truth. This was cited as a reason the 2016 polls were so inaccurate, and with the rush of news stories about the Vote Common Good advertisement assuring women their vote is private and they can secretly vote differently from their husbands, it appears as if this may be an issue again. As in the other cases, this is a problem where our calculated mean and standard deviation differ from the truth. This is a tricky issue to untangle, as it has sociological roots and is not strictly math-based, although polling people in private may help. Building trust between average Americans and the pollsters, and working towards a less polarized society where people feel comfortable revealing their political affiliations, would reduce this problem even further, but it is clear that will not happen in this election cycle.
I write this not to sow uncertainty in the electoral process, but to help explain why it is so difficult to accurately predict who will win the election. Polling averages that differ significantly from the voting tallies are not immediate flags of election interference, but instead reflect the difficulty of trying to capture complex human behaviors in a simple question.