If you have worked with data, I bet you have committed a statistical fallacy or two at some point. I know I have!
In this series, we will look at fallacies that often come up when analyzing data or reading supposedly academic sources.
This fallacy, usually called sampling bias, arises when the sample you take is not representative of the population.
What does this mean? What is a sample and what do I mean by a population?
In statistics, a population is a set of all the things one is gathering statistics on. It is the collection of all the things you are interested in studying and getting data on.
For instance, if you are getting statistics on the height of males in the US, then the population is all the males in the US. If you are studying the lifespan of fruit flies, then your population is all fruit flies.
A sample is a subset of a population that is chosen as representative of the population in general.
Usually one cannot get data on the entire population. One is not able to measure the height of every male in the US or the lifespan of every fruit fly in existence.
If one is doing a poll on political views, then one is unlikely to be able to ask everyone in the population what their political views are.
So one must take a sample of the population: a subset that is assumed to be an accurate representation of the population.
So, if one is interested in the height of men in the US, one picks a bunch of men and infers things about the height of men in the US from this subset of men.
Or, if you are interested in studying political views, you pick a sample of people in the population and ask them about their views.
The sample must be a fairly accurate representation of the population. The sample must be chosen so that it is valid to analyze the sample and use information about the sample to form conclusions about the population as a whole.
The subject of sample selection is complex and we will not go into it here.
Suffice it to say that the sample must be chosen so that it is sufficiently representative of the population.
This fallacy deals with the situation in which the chosen sample is biased and does not accurately represent the population.
This often happens intentionally when people choose a sample so that it seems to prove their assertions about the population.
For instance, suppose I want to show that Scientology is a growing religion, but I mostly survey people with known associations with Scientology. This biases my results, so they do not accurately represent the population as a whole.
Or suppose that I want to sample the height of men in the US. Then I probably do not want to sample only men who are over 7 feet tall. This will not give me an accurate picture of the average height of men in the US!
If I want to get an idea of the attitude towards Communism in the US, then I probably do not want to sample only Communists or only those opposed to Communism!
In other words, I do not want to choose my sample in a way that misrepresents the population.
Otherwise, I run the risk of results that are not representative of the population. My results would show trends that are an artifact of how I selected my sample and are not truly indicative of the population.
I need to select my samples to accurately represent the population and not cherry-pick a sample that seems to make the point I want to make.
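To make the height example concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the population is synthetic (heights in inches drawn from a normal distribution with a made-up mean of 69 and standard deviation of 3), and the function names and the 72-inch cutoff are hypothetical. It compares the mean of a simple random sample with the mean of a sample drawn only from tall men.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 heights in inches.
# The mean (69) and standard deviation (3) are made-up illustration values.
population = [random.gauss(69, 3) for _ in range(100_000)]


def random_sample_mean(pop, n):
    """Mean of a simple random sample of size n drawn from the whole population."""
    return statistics.mean(random.sample(pop, n))


def biased_sample_mean(pop, n, cutoff):
    """Mean of a sample drawn only from members taller than `cutoff`."""
    tall_only = [h for h in pop if h > cutoff]
    return statistics.mean(random.sample(tall_only, n))


true_mean = statistics.mean(population)
fair = random_sample_mean(population, 500)
biased = biased_sample_mean(population, 500, cutoff=72)

print(f"population mean:    {true_mean:.2f}")
print(f"random sample mean: {fair:.2f}")
print(f"biased sample mean: {biased:.2f}")
```

Under these assumptions, the random sample should land close to the population mean, while the tall-only sample overstates it by several inches. The overstatement comes entirely from how the sample was selected, not from anything true of the population.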
The gambler's fallacy is named after a mistake typically made by gamblers, as well as by many other people who play games of chance and the like.
Suppose that you are betting on the roll of two six-sided dice. You notice that the dealer has rolled a ten a lot in the last few rounds. You therefore assume that he is less likely to roll a ten the next time he rolls the dice.
This is, however, not the case. For statistically independent events, it does not matter what happened in the past; any given outcome always has the same probability of occurring.
Events are statistically independent when the probability of each possible outcome is not affected by what has happened before.
That is to say, the outcome is not affected by previous outcomes.
Therefore, it does not matter if you roll a ten ten times in a row. The chance of rolling a ten with two fair six-sided dice is always 1/12 (3 of the 36 equally likely outcomes), even if you have just rolled a ten one hundred times in a row.
Past events, good or bad, do not affect the odds of statistically independent events.
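The 1/12 figure can be checked by brute force. The sketch below assumes two fair six-sided dice and independent rolls. It enumerates all 36 outcomes of one roll to get the probability of a ten, then enumerates every pair of consecutive rolls to get the probability of a ten given that the previous roll was also a ten. The variable names are my own.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
# All 36 equally likely outcomes of rolling two fair six-sided dice.
rolls = list(product(faces, repeat=2))

# Probability that a single roll of the two dice sums to ten.
p_ten = Fraction(sum(1 for a, b in rolls if a + b == 10), len(rolls))

# Every equally likely pair of consecutive rolls (36 * 36 of them).
pairs = list(product(rolls, repeat=2))

# Keep the pairs whose first roll was a ten, then count how many of
# those are followed by another ten.
first_ten = [(r1, r2) for r1, r2 in pairs if sum(r1) == 10]
both_ten = [(r1, r2) for r1, r2 in first_ten if sum(r2) == 10]
p_ten_given_ten = Fraction(len(both_ten), len(first_ten))

print(p_ten)            # 1/12
print(p_ten_given_ten)  # 1/12
```

Both numbers come out the same: knowing that the last roll was a ten tells you nothing about the next one.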
A typical example is assuming that, because you have had a streak of bad luck, you are due for some good luck. Say you play Lotto and fail to win anything for ten years, and you assume that after all this time you are bound to win something soon!
This is not the case; you are no more likely to win Lotto now than ever before.
Or suppose you believe that since you have had three girls in a row, this time you will most likely get a boy. No, you are just as likely to get a fourth girl as you are to get your first boy. The odds of getting a boy or a girl are still 50/50.
Or you assume that because you have been rejected for five jobs in a row, today you are more likely to get one. No, you are just as likely to get it as you would be if the five rejections had never happened. That is, all else being equal, of course, and assuming you leave everything to dumb luck instead of improving your chances by upskilling.
Streaks of good or bad luck are meaningless and do not affect the outcomes of future statistically independent events.
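A quick simulation makes the same point for the three-girls example. This is only a sketch under the article's simplifying assumptions: each birth is an independent 50/50 event. The function name and the number of simulated families are arbitrary choices of mine. It generates families of four children, keeps only those whose first three children are girls, and measures how often the fourth child is a boy.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable


def share_of_boys_after_three_girls(n_families):
    """Simulate families of four children, each birth an independent 50/50 event.

    Returns the fraction of families whose fourth child is a boy, counting
    only the families whose first three children were all girls.
    """
    after_streak = 0
    boys_after_streak = 0
    for _ in range(n_families):
        children = [random.choice("BG") for _ in range(4)]
        if children[:3] == ["G", "G", "G"]:
            after_streak += 1
            if children[3] == "B":
                boys_after_streak += 1
    return boys_after_streak / after_streak


share = share_of_boys_after_three_girls(200_000)
print(f"share of boys after three girls: {share:.3f}")
```

With independent births, the printed share should hover around 0.5 rather than drift toward boys. The streak of girls does not make a boy "due".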