Introduction
‘How to Lie with Statistics’ explains how people use numbers to mislead. Darrell Huff, a journalist, published the book in 1954. It was path-breaking at the time, since statistics was not widely taught in schools. Today, when schools teach statistics, many people already know about these deceptions.
This book would make most sense for people who have not been exposed to statistics.
I have tried to liven this up by providing some examples (which are not the same as the ones given in the book).
Recommendation
Read it if you have an hour to kill.
Unfortunately, none of the things mentioned in the book were new to me. But then, I have studied stats at a post-graduate level. If you have forgotten the basics, this book has some value for you.
Excerpts
Proper treatment will cure a cold in seven days, but left to itself a cold will hang on for a week.
The Sample with the Built-in Bias
When someone says that this group has a value of X, check for bias in the samples.
Generally, it is very difficult to go out and check every value. So we use samples. Samples have biases which skew the findings.
A truly random sample is very hard to create.
Notorious example: Colleges (mostly the MBA kind) tend to report that their alumni earn great salaries. The problem is that alumni may report an inflated salary because of peer pressure, or a lower one for tax reasons. It may also be that only the successful people are part of the alumni network in the first place.
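To see how much self-selection alone can move an average, here is a small sketch. Every number in it is invented for illustration: three salary bands, and the assumption that better-paid alumni are more likely to answer the survey.

```python
from statistics import mean

# Hypothetical alumni, salaries in thousands.
# Each entry: (salary, headcount, number who answer the survey).
# Assumption: the better paid are more likely to respond.
groups = [(30, 60, 6), (50, 30, 15), (120, 10, 10)]

population = [salary for salary, headcount, _answered in groups
              for _ in range(headcount)]
respondents = [salary for salary, _headcount, answered in groups
               for _ in range(answered)]

true_mean = mean(population)     # average over all 100 alumni
survey_mean = mean(respondents)  # average over the 31 who replied

print(f"True mean salary:     {true_mean:.1f}")
print(f"Surveyed mean salary: {survey_mean:.1f}")
```

Nobody lies about their salary in this toy survey, yet the reported average still comes out well above the real one, purely because of who chose to reply.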
The Well-chosen Average
An average can be
Mean – this is the arithmetic average
Median – this is the middle value, which separates the top 50% from the bottom 50%
Mode – this is the value which occurs most often
In a bell curve (heights of human beings for example), the mean = median = mode
In a skewed curve (salaries of employees for example), the three values will be very different
Notorious example: Salaries. Typically, companies will advertise a salary figure. Unfortunately this figure may not be representative of what you will actually get. There could be 100 people getting a low salary while 5 get enormous ones. If the mean is used to report the salary, it will give a false indication of what you can actually expect.
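The salary example can be checked with Python's `statistics` module. The amounts below are made up; only the 100-versus-5 split comes from the example above.

```python
from statistics import mean, median, mode

# Hypothetical payroll: 100 people on a low salary, 5 on an enormous one.
salaries = [20_000] * 100 + [1_000_000] * 5

mean_salary = mean(salaries)      # pulled up by the 5 big earners
median_salary = median(salaries)  # the middle value
mode_salary = mode(salaries)      # the most common value

print(f"Mean:   {mean_salary:,.0f}")
print(f"Median: {median_salary:,.0f}")
print(f"Mode:   {mode_salary:,.0f}")
```

The mean lands at more than three times what a typical employee earns, while the median and mode both sit at the low salary.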
The Little Figures that are Not There
Sample sizes have to be large enough that the sample is representative of the whole and the result is unlikely to be due to chance.
Notorious example: Toothpastes. 8 out of 10 dentists recommend our toothpaste. Of course they will, if you have hand-picked only 10 dentists.
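A rough sketch of why tiny samples are so easy to abuse. It assumes, purely for illustration, that only half of all dentists would actually recommend the toothpaste, and that the advertiser is free to survey several panels of 10 and report the best one. The panel count of 20 is also my assumption.

```python
from math import comb

def prob_at_least(k, n, p):
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Assumption: the true recommendation rate is only 50%.
p_one_panel = prob_at_least(8, 10, 0.5)

# Assumption: the advertiser surveys 20 separate panels of 10 dentists.
panels = 20
p_any_panel = 1 - (1 - p_one_panel) ** panels

print(f"Chance one panel of 10 gives 8 or more: {p_one_panel:.3f}")
print(f"Chance at least one of {panels} panels does: {p_any_panel:.3f}")
```

A single panel rarely produces "8 out of 10" by luck, but under these assumptions it becomes more likely than not once you are allowed to keep trying.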
Much Ado about Practically Nothing
Every measurement from a sample has an error
Probable error: the margin within which the true value falls 50% of the time; the other 50% of the time the error is larger than this.
Standard error: same as probable error, but the true value falls within the margin about 2/3 of the time and outside it about 1/3 of the time.
Notorious example: An IQ test (with a probable error of 3) is administered to Ram & Shyam. Ram scores 98 and Shyam scores 101. Does this mean that Shyam is more intelligent than Ram? (Leaving aside the efficacy of IQ in measuring intelligence.) No. Ram’s score of 98 means that there is a 50% probability that his true score is between 95 (98-3) and 101 (98+3). Shyam’s score means that there is a 50% probability that his true score is between 98 (101-3) and 104 (101+3). The two ranges overlap, so a difference of 3 points does not tell us who is ahead.
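The Ram and Shyam arithmetic as a small sketch. The helper names are mine. The last line uses the standard relationship for normally distributed errors (probable error is about 0.6745 times the standard error), which is an assumption about the test rather than something stated in the example.

```python
def error_range(score, error):
    """Range of plausible true scores: score minus/plus the error."""
    return (score - error, score + error)

def ranges_overlap(a, b):
    """True if two (low, high) ranges share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]

PROBABLE_ERROR = 3

ram = error_range(98, PROBABLE_ERROR)
shyam = error_range(101, PROBABLE_ERROR)

print("Ram:  ", ram)
print("Shyam:", shyam)
print("Ranges overlap:", ranges_overlap(ram, shyam))

# Assuming normally distributed errors: probable error = 0.6745 * standard error.
standard_error = PROBABLE_ERROR / 0.6745
print(f"Equivalent standard error: {standard_error:.2f}")
```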
The Gee-Whiz Graph
The scales of a graph determine how the data will be interpreted
Notorious example: Assume that over the course of 20 years, the national income went from Rs 2000 to Rs 2500. Assume that the y-axis of the graph is the income and the x-axis is the year. If the y-axis starts at 2000, it will appear as though the income has grown many times over in the 20 years. If the y-axis instead starts at 0, the true picture comes out: a rise of 25%.
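One way to put a number on the trick is to compare how tall the last point is drawn relative to the first. In this sketch the function name and the baseline of 1900 are my own choices. A baseline of exactly 2000 would give the first point zero height, which makes the ratio undefined.

```python
def apparent_growth(start, end, baseline):
    """How many times taller `end` is drawn than `start` when the y-axis begins at `baseline`."""
    return (end - baseline) / (start - baseline)

# Income goes from Rs 2000 to Rs 2500.
honest = apparent_growth(2000, 2500, baseline=0)
truncated = apparent_growth(2000, 2500, baseline=1900)  # assumed cut-off axis

print(f"Axis from 0:    looks like {honest:.2f}x")
print(f"Axis from 1900: looks like {truncated:.2f}x")
```

The data is identical in both cases. Only the starting point of the axis changes, and with it the impression the reader takes away.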
The One-dimensional Picture
Be wary of graphs or pictures where a one dimensional value is shown in n dimensions to convey increase or decrease.
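A quick sketch of why this misleads, assuming the picture is scaled uniformly so that every dimension grows with the value. The function name is mine.

```python
def visual_size(value_ratio, dimensions):
    """Apparent size of a picture whose every dimension is scaled by value_ratio."""
    return value_ratio ** dimensions

value_ratio = 2  # the quantity being shown has merely doubled

print("As a bar (1 dimension):     ", visual_size(value_ratio, 1))
print("As a picture (2 dimensions):", visual_size(value_ratio, 2))
print("As a solid (3 dimensions):  ", visual_size(value_ratio, 3))
```

A value that has doubled ends up drawn with four times the area, or eight times the volume if the eye reads the picture as a solid object.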
The Semi-attached Figure
If you can’t prove what you want to prove, demonstrate something else and pretend that they are the same thing
Notorious Example: Soap. Our soap can kill 99% of micro-organisms on your body. So what? Are these harmful? What about your competitors?
Post Hoc Rides Again
Correlation does not imply causation
Notorious Example: Drinking cold water makes you get colds. Why do people think this? Because they have seen a few folks drink cold water and immediately get a cold. Here the correlation is between the cold water and getting a cold. But the cause is actually the presence of cold viruses.
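Here is a toy simulation of the cold water story. The probabilities are invented, and the model assumes that only exposure to a virus causes a cold and that drinking cold water has nothing to do with exposure.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

N = 100_000
P_COLD_WATER = 0.5  # assumed share of people who drink cold water
P_EXPOSED = 0.2     # assumed share exposed to a cold virus

colds = {True: 0, False: 0}   # colds among drinkers / non-drinkers
totals = {True: 0, False: 0}  # headcount of drinkers / non-drinkers
for _ in range(N):
    drinks_cold_water = random.random() < P_COLD_WATER
    exposed = random.random() < P_EXPOSED
    has_cold = exposed  # in this toy model only the virus causes a cold
    totals[drinks_cold_water] += 1
    colds[drinks_cold_water] += has_cold

rate_drinkers = colds[True] / totals[True]
rate_others = colds[False] / totals[False]
anecdotes = colds[True]  # people who drank cold water and then had a cold

print(f"Cold rate, cold water drinkers: {rate_drinkers:.3f}")
print(f"Cold rate, everyone else:       {rate_others:.3f}")
print(f"'Drank cold water, got a cold' stories: {anecdotes}")
```

Both groups catch colds at about the same rate, yet there are thousands of people who drank cold water and then fell ill. Those are the cases that get remembered and retold.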
Key Questions to Ask When Presented with Data
Look for conscious & unconscious bias
Is the sample large enough to be representative?
Is there really a correlation?
What is missing? (error values etc.)
Check whether the data says one thing & the conclusion says another
Does it make sense?