Lies, damned lies and statistics

It’s a cliche…the headline, I mean. Many of us, on reading yet another statistic from a government department, a merchant banker or a scaremongering tabloid health story, recognise that statistics are anything but infallible: they are malleable and can be manipulated to any given end. The raw data is raw, but give it a spin and it can turn any fact into fiction, or remove the fiction and replace it with fact. Of course, in scientific research there is no escaping statistics. They are essential to the interpretation of results, but even the most credible laboratory has to make a choice about how to process its raw data and how to retain its integrity in presenting results.

There are numerous ways to apply statistical analysis, each with its own merits and limitations. Two of the most fundamental are:

Regression
Time series analysis

Khalil Al Jerjawi of the University of Western Sydney discusses the application of these statistical tools in management research in a forthcoming issue of IJLSE.

Regression

“Regressions are used to compare the effects of two or more independent variables on a dependent variable,” he explains. There are numerous regression techniques, including linear regression (a parametric technique in which the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data). However, there are also non-parametric regression techniques, which can be multidimensional.
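To make the parametric idea concrete, here is a minimal sketch of the simplest case: a straight line with just two unknown parameters, a slope and an intercept, fitted by ordinary least squares. The function name `fit_line` and the sample numbers are invented for illustration; they are not taken from Al Jerjawi's paper.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up observations that lie close to the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The whole model is summed up by those two numbers, which is what makes it parametric. A non-parametric method would instead let the shape of the curve be dictated by the data.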

Regression analysis can be used to extrapolate data, a point often exploited in making wild predictions and forecasts based on the perception that a trend will continue even if data is limited. However, the application of regression relies on assumptions about the data and can lead to spurious results if data sets are relatively small or there are many outliers in a sample. Aside from wild extrapolations, regression often leads to the conclusion that because two or more variables are correlated, one causes the other. This is not necessarily the case. The number of calls to the emergency fire services rises as the number of fires increases, but this does not imply that calls to the emergency services cause fires.
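The danger of small samples and outliers is easy to demonstrate. The sketch below fits a least-squares line to six invented points, then changes a single value into an outlier and extrapolates both fits well beyond the observed range. The helper names and the data are hypothetical and chosen purely to illustrate the point.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(xs, ys, x_new):
    """Fit a line to (xs, ys) and extrapolate it to x_new."""
    slope, intercept = fit_line(xs, ys)
    return slope * x_new + intercept


xs = [1, 2, 3, 4, 5, 6]
clean = [2, 4, 6, 8, 10, 12]         # lies exactly on y = 2x
with_outlier = [2, 4, 6, 8, 10, 30]  # only the last reading differs

# Extrapolate far outside the observed range (x runs from 1 to 6).
print("clean data, forecast at x = 20:  ", predict(xs, clean, 20))
print("one outlier, forecast at x = 20: ", predict(xs, with_outlier, 20))
```

With only six points, a single rogue reading more than doubles the forecast at x = 20. A larger sample would dilute the effect, but it would not justify the extrapolation itself.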

Time series analysis

A time series is nothing more than data collected at uniform time intervals: the daily closing prices of the Dow Jones index or the height of the Nile River at Aswan, for example. Time series analysis then uses various techniques to pull out the characteristics of the data from the series and, of course, to use these to predict the next data point in the series. Depending on the data being sampled, this can be straightforward and reliable, perhaps as with rivers, or entirely spurious, as is more common with finance. Time series analysis of climate data is the perfect example of how statistics can be manipulated to prove or deny a “fact”.

Obviously, by definition, time series are ordered. This is in stark contrast to other data sampling techniques. For instance, it doesn’t really matter in what order you count the fruit hanging on apple trees in an orchard to assess fertiliser quality, but it does matter if you’re monitoring the impact of daily temperature rises and falls on when the fruit ripens. Similarly, time series analysis is distinct from spatial data analysis, where data points are recorded in relation to geography (where the trees are in relation to the farm buildings and walls, for instance). According to Wikipedia: “A time series model will generally reflect the fact that observations close together in time will be more closely related than observations further apart. In addition, time series models will often make use of the natural one-way ordering of time so that values for a given period will be expressed as deriving in some way from past values, rather than from future values.” The trap that those interpreting time series can fall into is that past events are not necessarily a good predictor of future outcomes in many, many cases.
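One common way to measure how closely neighbouring observations are related is the lag-1 autocorrelation, which compares each value with the one that follows it. The sketch below uses a synthetic, smoothly varying series (a sine wave standing in for something like a seasonal temperature record, which is an assumption made purely for illustration) and then shuffles the same values to show what is lost when the ordering is thrown away.

```python
import math
import random


def lag1_autocorrelation(values):
    """Correlation between each observation and the next one in the series."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    numer = sum((values[t] - mean) * (values[t + 1] - mean) for t in range(n - 1))
    return numer / denom


# Synthetic series: 200 readings of a cycle that repeats every 20 steps.
series = [math.sin(2 * math.pi * t / 20) for t in range(200)]

# Same values, but with the time ordering destroyed (fixed seed for repeatability).
shuffled = series[:]
random.Random(0).shuffle(shuffled)

print("ordered: ", round(lag1_autocorrelation(series), 3))
print("shuffled:", round(lag1_autocorrelation(shuffled), 3))
```

The ordered series scores close to 1 because each reading is a good guide to the next. The shuffled copy contains exactly the same numbers, yet the relationship between neighbours largely disappears. Counting apples is unaffected by shuffling; a time series is not.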

We all know of countless studies that seemingly contradict each other. The climate change “debate” is perhaps one of the most controversial, although the cases for vaccination, for genetically modified food and for pharmaceutical intervention in medicine all succumb to the intrinsic assumptions of statistics and the willingness of the media to manipulate them to their own ends. One week, coffee causes cancer; the next, the antioxidants in coffee prevent cancer. Yesterday, red wine was anathema; tomorrow it’s the best thing since…it’s not as if it’s the same data sets being interpreted.

Khalil Al Jerjawi (2012). Methods of statistical analysis: an overview and critique of common practices in research studies. Int. J. Liability and Scientific Enquiry, 5 (1), 32-36