research-article

CrowdER: crowdsourcing entity resolution

Authors:
Jiannan Wang

AMPLab, UC Berkeley and Tsinghua University

AMPLab, UC Berkeley and Tsinghua University
View Profile

,
Tim Kraska

AMPLab, UC Berkeley

AMPLab, UC Berkeley
View Profile

,
Michael J. Franklin

AMPLab, UC Berkeley

AMPLab, UC Berkeley
View Profile

,
Jianhua Feng

Tsinghua University

Tsinghua University
View Profile

Proceedings of the VLDB Endowment Volume 5 Issue 11pp 1483–1494https://doi.org/10.14778/2350229.2350263

Published:01 July 2012Publication History

Proceedings of the VLDB Endowment

Abstract

Entity resolution is central to data integration and data cleaning. Algorithmic approaches have been improving in quality, but remain far from perfect. Crowdsourcing platforms offer a more accurate but expensive (and slow) way to bring human insight into the process. Previous work has proposed batching verification tasks for presentation to human workers but even with batching, a human-only approach is infeasible for data sets of even moderate size, due to the large numbers of matches to be tested. Instead, we propose a hybrid human-machine approach in which machines are used to do an initial, coarse pass over all the data, and people are used to verify only the most likely matching pairs. We show that for such a hybrid system, generating the minimum number of verification tasks of a given size is NP-Hard, but we develop a novel two-tiered heuristic approach for creating batched tasks. We describe this method, and present the results of extensive experiments on real data sets using a popular crowdsourcing platform. The experiments show that our hybrid approach achieves both good efficiency and high accuracy compared to machine-only or human-only alternatives.

References

A. Arasu, M. Götz, and R. Kaushik. On active learning of record matching packages. In SIGMOD Conference, pages 783--794, 2010. Google ScholarDigital Library
R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, pages 131--140, 2007. Google ScholarDigital Library
M. S. Bernstein, G. Little, R. C. Miller, B. Hartmann, M. S. Ackerman, D. R. Karger, D. Crowell, and K. Panovich. Soylent: a word processor with a crowd inside. In UIST, pages 313--322, 2010. Google ScholarDigital Library
M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In KDD, pages 39--48, 2003. Google ScholarDigital Library
S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5, 2006. Google ScholarDigital Library
P. Christen. Febrl: a freely available record linkage system with a graphical user interface. In HDKM, pages 17--25, 2008. Google ScholarDigital Library
P. Christen. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng., 99(PrePrints), 2011. Google ScholarDigital Library
V. Chvatal. A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3):pp. 233--235, 1979.Google ScholarDigital Library
A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the em algorithm. Applied Statistics, 28(1):20--28, 1979.Google ScholarCross Ref
A. Doan, R. Ramakrishnan, and A. Y. Halevy. Crowdsourcing systems on the world-wide web. Commun. ACM, 54(4):86--96, 2011. Google ScholarDigital Library
A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1--16, 2007. Google ScholarDigital Library
A. Feng, M. J. Franklin, D. Kossmann, T. Kraska, S. Madden, S. Ramesh, A. Wang, and R. Xin. Crowddb: Query processing with the vldb crowd. PVLDB, 4(12):1387--1390, 2011.Google ScholarDigital Library
M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin. Crowddb: answering queries with crowdsourcing. In SIGMOD, pages 61--72, 2011. Google ScholarDigital Library
P. C. Gilmore and R. E. Gomory. A linear programming approach to the cutting-stock problem. Operations Research, 9(6):849--859, 1961.Google ScholarDigital Library
O. Goldschmidt, D. S. Hochbaum, C. A. J. Hurkens, and G. Yu. Approximation algorithms for the k-clique covering problem. SIAM J. Discrete Math., 9(3):492--509, 1996. Google ScholarDigital Library
P. G. Ipeirotis, F. Provost, and J. Wang. Quality management on amazon mechanical turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation, pages 64--67, 2010. Google ScholarDigital Library
S. R. Jeffery, M. J. Franklin, and A. Y. Halevy. Pay-as-you-go user feedback for dataspace systems. In SIGMOD Conference, pages 847--860, 2008. Google ScholarDigital Library
H. Köpcke, A. Thor, and E. Rahm. Evaluation of entity resolution approaches on real-world match problems. PVLDB, 3(1):484--493, 2010. Google ScholarDigital Library
A. Marcus, E. Wu, D. R. Karger, S. Madden, and R. C. Miller. Human-powered sorts and joins. PVLDB, 5(1):13--24, 2011. Google ScholarDigital Library
A. Marcus, E. Wu, S. Madden, and R. C. Miller. Crowdsourced databases: Query processing with people. In CIDR, pages 211--214, 2011.Google Scholar
R. McCann, W. Shen, and A. Doan. Matching schemas in online communities: A web 2.0 approach. In ICDE, pages 110--119, 2008. Google ScholarDigital Library
A. Parameswaran, H. Park, H. Garcia-Molina, N. Polyzotis, and J. Widom. Deco: Declarative crowdsourcing. Technical report, Stanford University. http://ilpubs.stanford.edu:8090/1015/.Google Scholar
A. J. Quinn and B. B. Bederson. Human-machine hybrid computation. In Position paper for CHI 2011 Workshop On Crowdsourcing And Human Computation, 2011.Google Scholar
S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In KDD, pages 269--278, 2002. Google ScholarDigital Library
J. M. Valério and D. Carvalho. Exact solution of cutting stock problems using column generation and branch-and-bound. International Transactions in Operational Research, 5(1):35--44, 1998.Google ScholarCross Ref
C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang. Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst., 36(3):15, 2011. Google ScholarDigital Library
T. Yan, V. Kumar, and D. Ganesan. Crowdsearch: exploiting crowds for accurate real-time image search on mobile phones. In MobiSys, pages 77--90, 2010. Google ScholarDigital Library

Recommendations

Alliance Rules for Data Warehouse Cleansing
ICSPS '09: Proceedings of the 2009 International Conference on Signal Processing Systems

Data Cleansing is an activity performed on the data sets of data warehouse to enhance and maintain the quality and consistency of the data. This paper addresses the problems related with dirty data, entrance of dirty data and detection of dirty data in ...
Read More
TRR: Reducing Crowdsourcing Task Redundancy
Database and Expert Systems Applications
Abstract
In this paper, we address the problem of task redundancy in crowdsourcing systems while providing a methodology to decrease the overall effort required to accomplish a crowdsourcing task. Typical task assignment systems assign tasks to a fixed ...
Read More
Spatio-spectral fusion of satellite images based on dictionary-pair learning

This paper proposes a novel spatial and spectral fusion method for satellite multispectral and hyperspectral (or high-spectral) images based on dictionary-pair learning. By combining the spectral information from sensors with low spatial resolution but ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

Proceedings of the VLDB Endowment Volume 5, Issue 11
July 2012
608 pages
ISSN:2150-8097
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 July 2012
Published in pvldb Volume 5, Issue 11
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 182
  Total Citations
  View Citations
- 1,076
  Total Downloads
- Downloads (Last 12 months)22
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

CrowdER: crowdsourcing entity resolution

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

Alliance Rules for Data Warehouse Cleansing

TRR: Reducing Crowdsourcing Task Redundancy

Spatio-spectral fusion of satellite images based on dictionary-pair learning

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

CrowdER: crowdsourcing entity resolution

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

Alliance Rules for Data Warehouse Cleansing

TRR: Reducing Crowdsourcing Task Redundancy

Spatio-spectral fusion of satellite images based on dictionary-pair learning

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media