Crowdsourcing can efficiently complete tasks that are difficult to automate, but the quality of crowdsourced data is tricky to evaluate. Algorithms to grade volunteer work often assume that all tasks are similarly difficult, an assumption that is frequently false. We use a cropland identification game with over 2,600 participants and 165,000 unique tasks to investigate how best to evaluate the difficulty of crowdsourced tasks and to what extent this is possible based on volunteer responses alone. Inter-volunteer agreement exceeded 90% for about 80% of the images and was negatively correlated with volunteer-expressed uncertainty about image classification. A total of 343 relatively difficult images were independently classified as cropland, non-cropland or impossible by two experts. The experts disagreed weakly (one said impossible while the other rated the image as cropland or non-cropland) on 27% of the images, but disagreed strongly (cropland vs. non-cropland) on only 7%. Inter-volunteer disagreement increased significantly with inter-expert disagreement. While volunteers agreed with expert classifications for most images, over 20% of the images would have been mis-categorized if only the volunteers’ majority vote had been used. We end with a series of recommendations for managing the challenges posed by heterogeneous tasks in crowdsourcing campaigns.
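To make the two volunteer-based quantities mentioned above concrete, the following is a minimal sketch of a per-image majority vote and inter-volunteer agreement score. It is not the study's implementation: the function name `summarize_votes`, the example votes, and the definition of agreement as the share of votes matching the most common label are all assumptions made for illustration.

```python
from collections import Counter


def summarize_votes(votes):
    """Return (majority_label, agreement) for one image.

    Assumption: agreement is the fraction of votes that match the most
    common label. The study may define agreement differently.
    """
    if not votes:
        raise ValueError("an image needs at least one vote")
    counts = Counter(votes)
    # On a tie, most_common returns the label that was seen first.
    label, top_count = counts.most_common(1)[0]
    return label, top_count / len(votes)


# Hypothetical votes, invented for illustration only.
votes_by_image = {
    "img_a": ["cropland"] * 9 + ["non-cropland"],
    "img_b": ["cropland", "non-cropland", "cropland", "non-cropland", "cropland"],
}

summary = {img: summarize_votes(v) for img, v in votes_by_image.items()}

# Images whose agreement falls below a chosen threshold could be flagged
# as potentially difficult and routed to expert review.
THRESHOLD = 0.9  # hypothetical cut-off
flagged = [img for img, (_, agreement) in summary.items() if agreement < THRESHOLD]
```

Under these assumptions, low agreement is only a proxy for task difficulty: a confident majority can still be wrong, which is why a comparison against expert labels remains useful.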