Complexity Theory: Amazon’s Mechanical Turk is a disaster for crowdworkers

(and nothing will change that until the platform’s incentive structure is fixed)

By Glenn Davis and Klint Kanopka

Crowdsourcing, or sourcing a large number of contributions from an open group, is a model that has been used to support product development and the creation of common goods. When the contribution is paid labor, the contributors are called crowdworkers. Amazon’s Mechanical Turk is a platform where requesters can post tasks for crowdworkers, known as “Turkers,” to complete for a small fee. These tasks typically take the form of surveys, image labeling, question answering and other tasks that don’t require specialized skills for humans, but are difficult or impossible for computers.

This sounds like a great way for COVID-displaced workers to supplement their income, but like a lot of other high-tech gig economy “disrupters,” it just doesn’t add up to money in workers’ pockets at nearly the same rate you might think.

A November 2019 blog post from Doordash claimed that overall driver earnings had recently increased to $18.54 per “active hour.” An independent study found that actual take-home pay from Doordash was about $1.45 after expenses per real hour worked, and that the pricing model relied on customer tips to stay afloat. Similarly, a 2018 MIT study found that median profit after expenses for Uber and Lyft drivers was $3.37 per hour — vastly different from Ridester’s 2018 survey finding that the average UberX driver made $13.70 per hour before tips.

The key point to consider in the above examples is that gig economy work doesn’t pay for expenses, nor does it pay for the downtime between gigs. In this race to attract the most users, competing ridesharing, food delivery and other gig economy platforms are incentivized to push customer pricing, and thus worker wages, as low as possible. Since customers may not understand the difference between ridesharing apps’ claims of drivers making $20-25 per hour and their actual take-home pay after expenses, they may not realize how little their driver is really being paid.

This brings us to Mechanical Turk (MTurk). When someone logs on to MTurk as a worker (or “Turker”), they can sort the available tasks (known as “Human Intelligence Tasks,” or HITs) by pay, but the only indication of how long a task will take to complete is an estimate that may or may not be provided by the person who posted it (the “requester”). The platform itself gives no information about how long others took on average to complete the task, even though that is a simple calculation the platform already performs and reports to requesters.

Even more telling is a new banner at the top of the Turker website. Workers can set a “HITs Goal” and a “Reward Goal” for themselves to track how many tasks they’ve completed that day and how much they’re slated to earn, but there is no indication of the total amount of time they’ve spent trying to accomplish these goals. The dashboard gives a historical record of how many tasks each Turker has completed and how much was earned by doing them, but again no indication of total time spent on platform.

Just like these other gig economy apps that are supposed to give low-income workers more flexibility, MTurk puts total earnings front and center and does its best to hide how many hours workers are putting in to achieve these earnings. For the Turkers who are more mindful of their time invested, however, there exists a cottage industry of MTurk statistics aggregators, requester reviews and browser add-ons that aim to help Turkers have less downtime between HITs — some of which even charge subscription fees.

Another troubling characteristic MTurk shares with other gigwork apps is its prioritization of loyal workers, again counter to the claims of increased flexibility. With other apps, rewards and rating systems incentivize workers to complete as many trips as possible without necessarily focusing on how much each trip is worth. MTurk has a similar system without even the explicit benefits that come with a tiered reward or rating system.

Instead, due to the problems of automated bots and people running through HITs as quickly as possible without paying attention to the tasks, it’s common for requesters to put some restrictions on their tasks, especially higher-paying tasks (and on MTurk, anything that pays $1 or more is “high-paying”). Typically, requesters will require that Turkers have completed some number of HITs (50, 100, 1000) with a high acceptance rate of 90%, 95% or even 99%. This incentivizes Turkers to begin their careers carefully completing a large number of short, low-paying HITs just to be able to qualify for better-paying HITs.

To show the problems with MTurk as a platform, we conducted small experiments both from the worker side and the requester side.

After playing around with the worker site and completing HITs for several hours, making somewhere in the neighborhood of $2 per hour, we decided to see what the best-case scenario for a hardworking Turker just starting out might be. We set a timer for an hour and one of us tried to maximize earnings over that hour by picking the tasks with the best perceived pay per minute. Several points stood out from this experiment.

First, doing this work requires attention and can be mentally taxing. Because of worries about people or bots gaming the system by filling out nonsense answers, most tasks have one or more attention checks built in, meaning that if workers don’t carefully read every question before answering and accidentally fail one of these checks, their HIT will be rejected. Not only is their time wasted, but worse, their all-important acceptance rate drops, disqualifying them from better-paying HITs in the future. Indeed, one of our own HITs during this maximum-efficiency hour was rejected, for reasons unknown. Judging from Turker communities, this is a common job hazard, exacerbated by confusing or poorly designed attention checks and by predatory requesters rejecting work in a seemingly arbitrary fashion. There is generally little recourse for Turkers in these cases, besides blacklisting requesters and warning others on the aforementioned forums and rating sites.

Second, it isn’t always clear how long a task will take before starting it, and time spent deciding whether to start a task is time wasted. Reading through task descriptions takes time in itself, and starting a task only to find that it’s impossible for technical reasons, is taking longer than expected, or simply isn’t worth doing means lost income.

In the end, we managed to make $6.30 in that hour, after which it was necessary to take a mental break. Except it wasn’t an hour, because the last task, a social science survey from a university in California that shall not be named, took 30 minutes to complete and paid $0.30. We emailed the researcher afterward to inquire about this pay rate, and were told that the task listed the expected time somewhere in the consent form (of course, time reading the consent form is time spent not actively making money). This dropped our experiment in maximum-efficiency MTurking to an effective pay rate of about $4.70 per hour — far higher than a 2018 paper’s estimate that the median wage was about $2 per hour after factoring in all of the unpaid time using the platform but not actively working. Of course, this 80-minute experiment also exhausted all of the highest paying HITs that were available, so subsequent work would pay even less.
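The effective-rate arithmetic above is simple to check. A minimal sketch, using the figures from our experiment (the function name is our own, for illustration):

```python
def effective_hourly_rate(earnings_usd, minutes_worked):
    """Effective pay rate: total earnings divided by total hours,
    including unpaid overrun time."""
    return earnings_usd / (minutes_worked / 60)

# $6.30 earned, but the "hour" actually ran 80 minutes because the
# final survey took 30 minutes and paid $0.30.
rate = effective_hourly_rate(6.30, 80)
print(round(rate, 2))  # ≈ 4.72, i.e. about $4.70 per hour
```

The same function makes it easy to see why a single long, underpaid task drags the rate down so sharply: that final 30-minute survey alone worked out to $0.60 per hour.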

On the requester side, we offered a short survey (estimated 5 minutes) about Turkers’ usage habits before and after the outbreak of coronavirus. We offered it to 100 respondents for $0.50 (the “suggested” $0.10/min rate — still under the federal minimum of $7.25 per hour but an attractive rate by MTurk standards) and then again to another 100 respondents for $0.05. Curiously, the batch of $0.05 HITs finished about four times faster than the $0.50 batch, and the lower paying HITs had Turkers successfully completing our simple attention check 47% of the time, while the higher paid workers successfully completed the attention check only 26% of the time — a statistically significant difference, even in our small sample. 
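Whether a 47% versus 26% pass-rate difference is statistically significant in samples of this size can be sanity-checked with a standard pooled two-proportion z-test. A minimal sketch, assuming exactly 100 responses per batch and pass counts of 47 and 26 as stated above:

```python
from math import sqrt

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Pooled two-proportion z-test statistic."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# 47/100 passed the attention check in the $0.05 batch,
# 26/100 in the $0.50 batch.
z = two_proportion_z(47, 100, 26, 100)
print(round(z, 2))  # ≈ 3.08
```

A z-statistic near 3.1 comfortably exceeds the 1.96 threshold for significance at the 5% level, consistent with the claim that the difference holds up even in a small sample.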

These results speak to a difference in how Turkers interact with the platform. Experienced Turkers use a suite of browser plugins that make searching for and queuing up large numbers of high-paying HITs much easier, leaving lower-paying jobs for less-experienced Turkers. Once a large number of these jobs are queued, the experienced Turker maximizes their payout by clearing the queue as fast as possible, which may explain why high-paying failed responses tend to come much faster than any other response category. Newer users are more likely to accept and complete tasks one at a time, finishing each HIT faster within the system but completing fewer HITs with more care. To this point, a 2019 study found that 5.7% of Turkers (those with 10,000+ HITs completed) accounted for 42.2% of all HITs completed.

The rewards structures in place on MTurk produce an almost adversarial relationship between requesters and Turkers. Since per-HIT pay is low, the only way to make money on MTurk is to complete lots of HITs, and do them quickly. This can lead to response quality issues, as one assumes that some accuracy is traded for speed. Requesters, on the other hand, are primarily interested in data quality. From their perspective, it seems that better quality data can come from posting HITs with reduced rewards (so as to target newer or more casual Turkers).

Despite pleas from prominent academics to pay Turkers a minimum of $10 (a 2016 Brookings Institution paper) or $15 (a 2019 Stanford paper) per hour, and despite easy-to-use tools that automate these payment guarantees, the platform is set up in such a way that this race to the bottom for both requesters and Turkers is inevitable. With the implicit requirement to spend dozens of hours completing hundreds or thousands of short, low-paying tasks, unethical requesters can easily find a willing audience of Turkers who will complete their tasks with higher levels of attention than undergraduate students doing psychology studies.

With more workers and researchers turning to the platform to replace lost face-to-face opportunities, we believe it is the ethical duty of Amazon to reform the incentive structure of Mechanical Turk and put a stop to this vicious circle. Furthermore, requesters and Turkers should step up to change the perception of MTurk as a place to source low quality labor for exploitative wages. Requesters, including academic researchers, need to pay fair wages, make use of filtering tools, accurately broadcast how much time a task should take, and not deny payment to workers based on deceptive attention checks. Turkers also need to do their part by not accepting HITs that don’t pay fair wages, reporting exploitative requesters and completing honest work. Only by changing the culture of the platform can both sets of users ensure their needs are met.

Contact Glenn Davis at gmdavis ‘at’ stanford.edu and Klint Kanopka at kkanopka ‘at’ stanford.edu.

The Daily is committed to publishing a diversity of op-eds and letters to the editor. We’d love to hear your thoughts. Email letters to the editor to eic ‘at’ stanforddaily.com and op-ed submissions to opinions ‘at’ stanforddaily.com. 

