Lately, a bevy of articles has come out describing the pitfalls of Amazon mTurk research (see “Mechanical Turk and Experiments in the Social Sciences,” “Don’t Trust the Turk,” et al.). These articles share a prevailing point of view that generalizes the problems with mTurk research, and a lay reading leaves the gross assumption that researchers who use mTurk view it as a panacea. I, however, am of the mind that this view will rapidly become stale and stodgy as research on mTurk itself (e.g., ExperimentalTurk) determines that the benefits of mTurk research outweigh its shortcomings.
In his post, for example, Gelman notes that mTurk allows for large sample sizes that virtually always lead to statistical significance: “…[P]apers are essentially exercises in speculation, ‘p=0.05’ notwithstanding.” Well, yes, this is true. A sample of 2,000 will likely yield a significant result. However, an understanding of basic statistics and sampling distributions dictates that large samples shouldn’t be leaned on simply because they are available, as the resulting significance says little about the size or generalizability of a phenomenon.
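The point about large samples can be made concrete with a quick simulation. This is a minimal sketch of my own (not from Gelman’s post): with roughly 2,000 participants split across two groups, even a true effect well below Cohen’s conventional “small” threshold of d = 0.2 comes out “significant” in most runs. The helper `z_test_p` is a hypothetical two-sample z-test using a normal approximation, chosen here to keep the example stdlib-only.

```python
import math
import random

def z_test_p(sample_a, sample_b):
    """Two-sided p-value for a two-sample z-test (normal approximation)."""
    na, nb = len(sample_a), len(sample_b)
    ma = sum(sample_a) / na
    mb = sum(sample_b) / nb
    va = sum((x - ma) ** 2 for x in sample_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in sample_b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p from |z|

random.seed(42)
n, d, reps = 1000, 0.15, 200  # 1,000 per group = 2,000 total; d = 0.15 effect
hits = sum(
    z_test_p([random.gauss(0, 1) for _ in range(n)],
             [random.gauss(d, 1) for _ in range(n)]) < 0.05
    for _ in range(reps)
)
print(f"p < .05 in {hits} of {reps} simulated studies for a d = {d} effect")
```

The simulation reports “significance” in well over half the runs, which is exactly the concern: at this scale, p &lt; .05 is nearly guaranteed for effects too small to matter, so significance alone tells us almost nothing.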
This example is not a problem specific to mTurk–it is a problem endemic to poorly designed research. (On a tangent, I’d like to know how a published paper in a decent journal printed the line: “As reflected by the size of F value and its p value of the result, the difference among the three countries was not as significant as that among the three segments.” This has nothing to do with mTurk–I’m just highlighting that poor research is poor research.) Technology is not to blame for poor research–the skills of the researchers and reviewers who publish the research play a role in this.
The fact is, those who are keen on mTurk and its potential recognize that it is both a vehicle for the research (a “wrapper,” if you will) and part of the design itself. They also recognize that it is not a panacea. We recognize that it has its limitations (see “How naive are MTurk workers?”), but accept those limitations as part of the research design itself, either working within the constraints of the method or acknowledging the limitations outright. Take, for example, the switch from paper surveys to online platforms such as Qualtrics (e.g., Greenlaw and Brown-Welty 2009 on mode of survey presentation). Factors like page breaks, questions per page, “Front/Back” buttons (and their presence or lack thereof), pagination, etc., are all relevant to design, yet frequently overlooked. As researchers, we still need to recognize their importance and limitations; we need to recognize where we make tradeoffs and justify why those tradeoffs are made.
That’s the thing with social science research: it has limitations. When was the last time we heard someone say:
“Let’s toss experiments out because we can’t control for some incidental confound!”
“Let’s toss surveys out because we can’t control for self-reporting!”
“Let’s toss grounded theory out because the author subjectively influences the outcomes!”
“Let’s toss student samples because we can’t get external validity!”
“Let’s toss out national samples because we can’t get internal validity!”
We don’t. We haven’t. (Well, except maybe the last two; see McGrath and Brinberg 1983.) We acknowledge why we’ve used a particular method, we acknowledge what outcomes we find with that method, and we acknowledge that the method has potential flaws that may be explored in the future. MTurk is no different in this regard. And as the technology improves and becomes more sophisticated, taking advantage of APIs (e.g., TurkCheck), behavioral research will adapt accordingly.
mTurk is both a vehicle for research and part of the design itself. Researchers who don’t “get” that it is both of these things will fail with it across the board. The rest of us who are trying to specialize in it are finding not a panacea, but something different that is workable within the methodological paradigm (see Berinsky, Huber, and Lenz 2012). This is a subject worth mooting, but in the end, it will be moot.
[Disclaimer: I used mTurk in my dissertation and in a couple of other studies and, based on my own findings, am planning to follow advances in mTurk technique to the best of my technical skills. There are advantages to being able to get reasonable sample sizes (i.e., for just-specified structural equation modeling), and my data across multiple studies imply fairly representative demographics across samples. In one study, for instance, participants were randomly assigned to one of four conditions–no significant differences were found in demographics across conditions, self-report bias notwithstanding.]
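For readers who want to run the same kind of balance check on their own samples, here is a minimal sketch of one common approach: a Pearson chi-square test of independence between condition assignment and a demographic category. The counts below are invented for illustration (they are not my study’s data), and `chi_square_sf` is a stdlib-only helper valid for even degrees of freedom.

```python
import math

def chi_square_sf(x, df):
    """Survival function of the chi-square distribution (even df only)."""
    assert df % 2 == 0, "closed form used here requires even df"
    k = df // 2
    return math.exp(-x / 2) * sum((x / 2) ** i / math.factorial(i) for i in range(k))

def chi_square_test(table):
    """Pearson chi-square test of independence on a contingency table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    stat = sum(
        (table[i][j] - rows[i] * cols[j] / total) ** 2 / (rows[i] * cols[j] / total)
        for i in range(len(table))
        for j in range(len(table[0]))
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, chi_square_sf(stat, df)

# Rows: four experimental conditions; columns: three invented age brackets.
counts = [
    [34, 41, 25],
    [30, 44, 26],
    [36, 38, 27],
    [31, 45, 24],
]
stat, p = chi_square_test(counts)
print(f"chi2 = {stat:.2f}, p = {p:.3f}")  # p > .05: no evidence of imbalance
```

A non-significant result here is consistent with random assignment having balanced the demographic across conditions, which is the pattern described in the disclaimer above.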