Learning Algorithms for Link Prediction based on Chance Constraints.
Published in The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2010
Although citizen science projects can engage a very large number of volunteers to collect large volumes of data, they are susceptible to data quality problems. Our experience with eBird, a broad-scale citizen science project that collects bird observations, has shown that a massive effort by volunteer experts is needed to screen data, identify outliers, and flag them in the database. The increasing volume of data collected by eBird places a huge burden on these volunteer experts, so automated approaches to improving data quality are needed. In this work, we describe a case study in which we evaluate an automated data quality filter that identifies outliers and categorizes them as either unusual valid observations or misidentified (invalid) observations. The filter is a two-step process. First, a data-driven method detects outliers (i.e., observations that are unusual for a given region and date). Next, a data quality model based on an observer’s predicted expertise decides whether an outlier should be flagged for review. We applied this filter retrospectively to eBird data from Tompkins Co., NY and found that it reduced the workload of reviewers by as much as 43% and identified 52% more potentially invalid observations.
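The two-step filter described above can be illustrated with a minimal sketch. The abstract does not specify the outlier detector or the expertise model, so everything here is an assumption made for illustration: the `Observation` record, the weekly frequency estimate used as the outlier test, the per-observer expertise scores, and both thresholds are hypothetical and are not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    observer: str
    species: str
    week: int  # week of the year for one region (hypothetical encoding)


def build_frequencies(history):
    """Estimate how often each species is reported in each week.

    Assumption: frequency = reports of the species in that week divided by
    all reports in that week. The paper's detector may differ.
    """
    per_week = Counter(o.week for o in history)
    per_species_week = Counter((o.species, o.week) for o in history)
    return {key: n / per_week[key[1]] for key, n in per_species_week.items()}


def classify(obs, frequencies, expertise,
             freq_threshold=0.05, expertise_threshold=0.5):
    """Apply the two steps to one observation.

    Step 1: an observation is an outlier if its species is rarely (or never)
    reported in that week. Step 2: an outlier is flagged for review only if
    the observer's predicted expertise is below a threshold.
    Both thresholds are illustrative placeholders.
    """
    if frequencies.get((obs.species, obs.week), 0.0) >= freq_threshold:
        return "accept"
    if expertise.get(obs.observer, 0.0) >= expertise_threshold:
        return "accept_unusual"
    return "flag_for_review"


# Toy data: 99 common reports and 1 rare report in the same week.
history = [Observation("h", "American Robin", 20)] * 99
history.append(Observation("h", "Snowy Owl", 20))
frequencies = build_frequencies(history)
expertise = {"expert": 0.9, "novice": 0.2}  # hypothetical predicted scores

for new_obs in (
    Observation("novice", "American Robin", 20),
    Observation("expert", "Snowy Owl", 20),
    Observation("novice", "Snowy Owl", 20),
):
    print(new_obs.observer, new_obs.species, classify(new_obs, frequencies, expertise))
```

In this sketch only the outlier reported by the low-expertise observer reaches a human reviewer, which is the mechanism by which such a filter could reduce reviewer workload.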