Alain F. Zuur on Avoiding Common Statistical Errors in Data Exploration

July 2012
Alain Zuur, Highland Statistics Ltd., Newburgh, U.K.

According to a recent analysis of Essential Science Indicators (a subset of the Web of Knowledge from Thomson Reuters), a 2010 report entitled “A protocol for data exploration to avoid common statistical problems” (A. Zuur, et al., Methods in Ecology and Evolution, 1[1]: 3-14, 2010) has been identified as a New Hot Paper in the field of Environment/Ecology. Hot Papers, indexed by Thomson Reuters within the last two years, are distinguished by being cited at a rate notably above reports of comparable age and type published in the same journal. To date, this paper has been cited more than 60 times in the Web of Science.

First author Alain F. Zuur is affiliated with Highland Statistics Ltd., Newburgh, U.K., and with the University of Aberdeen. He is joined on this paper by his Highland Statistics colleague Elena N. Ieno, based in Alicante, Spain, and by Chris S. Elphick of the University of Connecticut, Storrs.

Below, Zuur answers a few questions about this new hot paper in Environment/Ecology.


SW: Why do you think your paper is highly cited?

While teaching statistics to ecologists, my coauthors and I noticed that many scientists make common statistical mistakes. If a random sample of these ecologists’ work (including scientific papers) produced before taking these courses were selected, half would probably contain violations of the underlying assumptions of the statistical techniques employed.

Some violations have little impact on the results or ecological conclusions, yet others increase type I or type II errors, potentially resulting in wrong ecological conclusions. Most of these violations can be avoided by applying better data exploration. These problems are especially troublesome in applied ecology, where management and policy decisions are often at stake.

In the paper, we provide a protocol for data exploration; we discuss current tools to detect outliers, heterogeneity of variance, collinearity, dependence of observations, problems with interactions, double zeros in multivariate analysis, zero inflation in generalized linear modelling, and the correct type of relationships between dependent and independent variables; and we provide advice on how to address these problems when they arise. We also address misconceptions about normality, and provide advice on data transformations.

A few other important reasons why this paper is so successful: (a) it provides people with a kind of "recipe" for doing a preliminary analysis, but also tries not to oversimplify the problems or present the protocol as a one-size-fits-all solution; (b) it uses real and messy data sets, the kind that ecologists can relate to; and (c) it was written with practicing ecologists in mind.

Also critically important is the fact that the paper was a collaboration between a statistician and ecologists, which meant that both parties had to work hard to answer each other’s questions.
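Editorial illustration: three of the checks named in the answer above (outliers, collinearity, and an excess of zeros) can be sketched in a few lines. This is a minimal sketch, not the authors' protocol. The paper works in R, and Python is used here only as a stand-in. The function names, the made-up data, and the 1.5 × IQR boxplot fence are assumptions chosen for illustration, and the variance inflation factor shown covers only the two-predictor case.

```python
from statistics import mean, quantiles


def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the boxplot fences (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def vif_two_predictors(x1, x2):
    """Variance inflation factor when a model has exactly two predictors."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)


def zero_fraction(counts):
    """Share of observations that are exactly zero."""
    return sum(1 for c in counts if c == 0) / len(counts)


# Hypothetical data, invented for this sketch (not taken from the paper).
bird_counts = [0, 0, 1, 0, 3, 0, 2, 0, 0, 41]
depth = [1, 2, 3, 4, 5]
temperature = [2.1, 3.9, 6.2, 8.0, 9.8]   # nearly a linear function of depth
salinity = [2, 1, 3, 1, 2]                # unrelated to depth

if __name__ == "__main__":
    print("possible outliers:", iqr_outliers(bird_counts))
    print("fraction of zeros:", zero_fraction(bird_counts))
    print("VIF depth/temperature:", vif_two_predictors(depth, temperature))
    print("VIF depth/salinity:", vif_two_predictors(depth, salinity))
```

A flagged value is a prompt to look at the observation, not a reason to delete it. A large variance inflation factor suggests that two covariates carry overlapping information, and a high share of zeros is a hint that a zero-inflated model may be worth considering.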

SW: Does it describe a new discovery or new synthesis of knowledge?

Data exploration has been advocated by many scientists for a long time, yet many scientists still dive into complicated statistical analyses without considering basic topics like outliers and collinearity. With the availability of the free software package R, which has fantastic graphical tools, all the techniques discussed in the paper are within easy reach of scientists.

SW: Would you summarize the significance of your paper in layman's terms?

The message we advocate in this paper is to formulate your underlying biological questions and, before engaging in any formal statistical analysis, apply an eight-step protocol to avoid “rubbish in, rubbish out.”

SW: How did you become involved in this research, and how would you describe the particular challenges, setbacks, and successes that you've encountered along the way?

I started as a statistical consultant at a biological research institute. Being one of two statisticians among hundreds of ecologists meant a high diversity of interesting projects and a steep learning curve. In such an environment you need to learn a thousand and one different statistical techniques. I also realized that the communication between a statistician and ecologist, and vice versa, is a very important process. You can apply the most advanced statistical technique, but if the person on the other side of the desk, or the reader of your paper, does not understand it, then there is no point in doing it.

Based on this experience, I wrote a book called Analysing Ecological Data (2007), together with the ecologists Ieno and Smith, and published it with Springer. This was shortly followed by two other books published with Springer, Mixed Effects Models and Extensions in Ecology with R (2009) by Zuur, Ieno, Saveliev and Smith, and A Beginner’s Guide to R (2009) by Zuur, Ieno and Meesters. All these books are based on cooperation between statisticians and biologists, and all use real and (very) messy data—something that many readers appreciate.

These books form the basis of a series of five-day statistical courses provided by the first two authors of this paper, Zuur and Ieno, who have taught over 7,000 ecologists from around the world. These hugely popular courses cover topics such as R, data exploration, regression, GLM, GAM, mixed effects models, multivariate analysis, and zero inflation.

Key components in these courses (and our books) are visualization of data and modelling results, knowing what you are doing, and communicating your results to an audience that is most likely not familiar with your data and the statistical techniques applied. After all, most scientists are competing with each other for a small amount of space in top journals. This means that you have to apply the appropriate statistical technique and communicate the results as efficiently as possible. This paper, our books, and our courses contribute to this process.

SW: Where do you see your research leading in the future?

Due to the success of our three books with Springer, and this paper, we started a book series. The first two books in this series were recently published; they are Zero Inflated Models and Generalized Linear Mixed Models with R (2012) by Zuur, Saveliev, and Ieno, and A Beginner’s Guide to Generalized Additive Models (2012) by Zuur. Three more books will be published in early 2013:

  • A Beginner’s Guide to GAMM with R, by Zuur, Saveliev, and Ieno.
  • Data Exploration and Visualisation with R, by Ieno and Zuur. This book is a spin-off from the paper “A protocol for data exploration to avoid common statistical problems.”
  • A Beginner’s Guide to GLM with OpenBUGS, by Zuur, Hilbe, and Ieno.

SW: Do you foresee any social or political implications for your research?

Data exploration helps avoid type I and type II errors, among other problems, thereby reducing the chance of drawing wrong ecological conclusions and making poor recommendations. It is therefore essential for good-quality management and policy based on statistical analyses.


Alain F. Zuur
Highland Statistics Ltd
Newburgh, U.K.

highstat@highstat.com
www.highstat.com

The data and citation records included in this report are from Thomson Reuters Web of Science™. Web of Science™ is a registered trademark of Thomson Reuters. All rights reserved.