Data Cleaning. Ihab F. Ilyas
Figure 6.5 Anup Chalamalla, Ihab F. Ilyas, Mourad Ouzzani, and Paolo Papotti. Descriptive and prescriptive data cleaning. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 445–456, 2014. DOI: 10.1145/2588555.2610520.
Figure 6.6 Anup Chalamalla, Ihab F. Ilyas, Mourad Ouzzani, and Paolo Papotti. Descriptive and prescriptive data cleaning. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 445–456, 2014. DOI: 10.1145/2588555.2610520.
Figure 6.7 Based On: Anup Chalamalla, Ihab F. Ilyas, Mourad Ouzzani, and Paolo Papotti. Descriptive and prescriptive data cleaning. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 445–456, 2014. DOI: 10.1145/2588555.2610520.
Figure 6.9 Floris Geerts, Giansalvatore Mecca, Paolo Papotti, and Donatello Santoro. That’s all folks! LLUNATIC goes open source. Proceedings of the VLDB Endowment, Vol. 7, No. 13. Copyright 2014 VLDB Endowment 2150-8097/14/08:1565–1568.
Figure 6.12 Maksims Volkovs, Fei Chiang, Jaroslaw Szlichta, and Renée J. Miller. Continuous data cleaning. In Proc. 30th Int. Conf. on Data Engineering, pages 244–255, 2014.
Figure 6.14 George Beskales, Ihab F. Ilyas, and Lukasz Golab. Sampling the repairs of functional dependency violations under hard constraints. Proc. VLDB Endowment, 3(1–2): 197–207, DOI: 10.14778/1920841.1920870.
Figure 6.15 Solmaz Kolahi and Laks V. S. Lakshmanan. 2009. On approximating optimum repairs for functional dependency violations. In Proceedings of the 12th International Conference on Database Theory (ICDT ’09), Ronald Fagin (Ed.). ACM, New York, NY, USA, 53–62. DOI: 10.1145/1514894.1514901.
Figure 6.16 Mohamed Yakout, Ahmed K. Elmagarmid, Jennifer Neville, Mourad Ouzzani, and Ihab F. Ilyas. Guided data repair. Proc. VLDB Endowment, 4(5): 279–289, DOI: 10.14778/1952376.1952378.
Figure 6.17 Mohamed Yakout, Ahmed K. Elmagarmid, Jennifer Neville, Mourad Ouzzani, and Ihab F. Ilyas. Guided data repair. Proc. VLDB Endowment, 4(5): 279–289, DOI: 10.14778/1952376.1952378.
Figure 6.18 Wenfei Fan and Floris Geerts. Foundations of Data Quality Management. Synthesis Lectures on Data Management. 2012. © Morgan & Claypool.
Figure 6.19 Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, and Yin Ye. 2015. KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD ’15). ACM, New York, NY, USA, 1247–1261. DOI: 10.1145/2723372.2749431.
Figure 6.20 Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, and Yin Ye. 2015. KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD ’15). ACM, New York, NY, USA, 1247–1261. DOI: 10.1145/2723372.2749431.
Figure 6.23 George Beskales, Ihab F. Ilyas, and Lukasz Golab. Sampling the repairs of functional dependency violations under hard constraints. Proc. VLDB Endowment, 3(1–2): 197–207, DOI: 10.14778/1920841.1920870.
Figure 7.1 Sunita Sarawagi and Anuradha Bhamidipaty. 2002. Interactive deduplication using active learning. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ’02). ACM, New York, NY, USA, 269–278. DOI: 10.1145/775047.775087.
Figure 7.2 Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep learning for entity matching: A design space exploration. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD ’18). ACM, New York, NY, USA, 19–34. DOI: 10.1145/3183713.3196926.
Figure 7.3 Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep learning for entity matching: A design space exploration. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD ’18). ACM, New York, NY, USA, 19–34. DOI: 10.1145/3183713.3196926.
Figure 7.8 Jiannan Wang, Sanjay Krishnan, Michael J. Franklin, Ken Goldberg, Tim Kraska, and Tova Milo. A sample-and-clean framework for fast and accurate query processing on dirty data. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 469–480, 2014. DOI: 10.1145/2588555.2610505.
Figure 7.9 Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J. Franklin, and Ken Goldberg. ActiveClean: Interactive data cleaning for statistical modeling. Proc. VLDB Endowment, 9(12): 948–959, August 2016. DOI: 10.14778/2994509.2994514.
Tables
Table 3.2 Jens Bleiholder and Felix Naumann. 2009. Data fusion. ACM Comput. Surv. 41, 1, Article 1 (January 2009), 41 pages. DOI: 10.1145/1456650.1456651 and Xin Luna Dong and Felix Naumann. Data fusion: resolving data conflicts for integration. Proc. VLDB Endowment, 2(2): 1654–1655, 2009.
Table 4.1 Based On: Sean Kandel, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. 2011. Wrangler: interactive visual specification of data transformation scripts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’11). ACM, New York, NY, USA, 3363–3372. DOI: 10.1145/1978942.1979444.
Table 5.2 Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava, and Bei Yu. On generating near-optimal tableaux for conditional functional dependencies. Proc. VLDB Endowment, 1(1): 376–390, DOI: 10.14778/1453856.1453900.
Table 6.1 Based On: Xu Chu, Ihab F. Ilyas, and Paolo Papotti. Holistic data cleaning: Putting violations into context. In Proc. 29th Int. Conf. on Data Engineering, pages 458–469, 2013b.
Table 6.3 Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, and Wenyuan Yu. 2011. Interaction between record matching and data repairing. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data (SIGMOD ’11). ACM, New York, NY, USA, 469–480. DOI: 10.1145/1989323.1989373.
Table 7.1 Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. 2017. HoloClean: holistic data repairs with probabilistic inference. Proc. VLDB Endow. 10, 11 (August 2017), 1190–1201. DOI: 10.14778/3137628.3137631.
1
Introduction
Enterprises have been acquiring large amounts of data from a variety of sources in order to build large data repositories that power their applications, with the goal of enabling richer and more informed analytics. Data collection and acquisition often introduce errors in data, e.g., missing values, typos, mixed formats, replicated entries for the same real-world entity, and violations of business and data integrity rules. A survey about the state of data science and machine learning (ML) reveals that