Machine Learning for Improving the Quality of Citizen Science Data.

Published as a Ph.D. dissertation at Oregon State University, 2013

Citizen science is a paradigm in which volunteers from the general public participate in scientific studies, often by collecting data. This paradigm is especially useful when the scope of a study is too broad for a limited number of trained scientists to cover. Although citizen scientists can contribute large quantities of data, data quality is often a concern because volunteers vary in skill. In my thesis, I investigate applying machine learning techniques to improve the quality of data submitted to citizen science projects. The context of my work is eBird, one of the largest citizen science projects in existence. In the eBird project, citizen scientists act as a large global network of human sensors, recording observations of bird species and submitting them to a centralized database, where they are used for ecological research such as species distribution modeling and reserve design. Machine learning can improve data quality in three ways: by modeling an observer’s skill level, by developing an automated data verification model, and by discovering groups of misidentified species.
