Governments, companies, research analysts and data scientists rely on data to make vital decisions. For instance, through data, China realized its "one-child" policy had led to a rapidly ageing population and introduced a "two-child" policy in 2016. Then, in May 2021, China announced a "three-child" policy. Canada, which had no prior restriction on childbirth, turned to immigration to deal with its ageing population and shrinking workforce.
Data is used differently by different departments in different companies. Google and Facebook primarily use it for targeted ads and friend suggestions. But there's a problem! Some data sets have trends lurking within them that can reverse the conclusions completely. This is referred to as Simpson's paradox.
Simpson's paradox occurs when the same set of data shows opposite trends or conclusions depending on how the data is grouped. This problem is commonly encountered in social science statistics. Let's examine a scenario to get a better understanding. Suppose you have a sick elderly relative in need of an important surgery. You have to decide which of two hospitals - Hospital A or Hospital B - would be better. To make this decision, you look at the last 1000 patients of each hospital to see which one has the better track record.
A look at the data reveals that of the last 1000 patients admitted to hospital A, 900 survived, while in hospital B only 800 of the last 1000 patients survived. From this, hospital A seems the reasonable choice, given that it has a 90% survival rate while hospital B has an 80% survival rate. However, we should remember that not all patients arrive at the hospital with the same level of health. Some arrive in a critical, life-threatening state while others arrive in a mild, relatively stable condition. When we look at how each hospital performed among patients who arrived in good health and those who arrived in poor health, the picture begins to change.
In hospital A, 100 patients arrived in poor health, of which 30 survived, while in hospital B, 400 patients arrived in poor health, of which 210 survived - survival rates of 30% and 52.5% respectively. This makes hospital B the better choice for patients who arrive in poor health. But what if your relative is in good health? In hospital A, 900 patients arrived in good health, of which 870 survived, while in hospital B, 600 arrived in good health, of which 590 survived - survival rates of 96.7% and 98.3% respectively. Surprisingly, hospital B is still the better choice! How can hospital A have a better overall survival rate while hospital B has a better survival rate in both groups? This is Simpson's paradox. The explanation lies in the mix of patients: hospital B treated four times as many patients in poor health (400 versus 100), and that group has a much lower survival rate at either hospital, which drags hospital B's overall figure down. This phenomenon isn't just theoretical. It comes up often in real-world studies.
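The arithmetic in the hospital example can be checked with a few lines of Python. This is only a minimal sketch: the counts are the ones quoted above, while the names `hospitals`, `rate` and `summarize` are made up for illustration.

```python
# (survived, admitted) per health group, taken from the hospital example.
# All names here are illustrative, not part of any library.
hospitals = {
    "A": {"poor": (30, 100), "good": (870, 900)},
    "B": {"poor": (210, 400), "good": (590, 600)},
}


def rate(survived, admitted):
    """Survival rate as a percentage."""
    return 100 * survived / admitted


def summarize(groups):
    """Return the survival rate per group plus the pooled overall rate."""
    rates = {name: rate(s, n) for name, (s, n) in groups.items()}
    total_survived = sum(s for s, _ in groups.values())
    total_admitted = sum(n for _, n in groups.values())
    rates["overall"] = rate(total_survived, total_admitted)
    return rates


summary = {name: summarize(groups) for name, groups in hospitals.items()}

for name, rates in summary.items():
    print(name, {group: round(value, 1) for group, value in rates.items()})
# A {'poor': 30.0, 'good': 96.7, 'overall': 90.0}
# B {'poor': 52.5, 'good': 98.3, 'overall': 80.0}
```

Hospital B wins within each group, yet hospital A wins once the groups are pooled, because the pooled rate is a weighted average and the two hospitals weight the poor-health group very differently (10% of A's patients versus 40% of B's).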
A study in the UK appeared to show that smokers were more likely to outlive non-smokers over a 20-year period. But when participants were categorized by age group, it turned out that the non-smokers included a much larger share of elderly people, who were more likely to die of natural causes during those 20 years. This gave the false impression that smokers outlived non-smokers.
HOW TO AVOID FALLING FOR THE PARADOX
Unfortunately, there is no definite way to avoid this. Sometimes grouped data gives a more accurate picture than divided data, and vice versa. The best thing to do is to look carefully at each situation and the data being examined, and to check whether lurking variables are present. Consider this statement: "100% of humans who drink water end up dying; Together we can stop this poison from spreading further. Say No To Water." Phenomena like Simpson's paradox are what make statistics both intriguing and terrifying. Statistics do not have to lie to trick you.