Building Resilient Data Pipelines

Abstract

When building robust data pipelines, the quality of the data we use is key. While big data often provides sufficient information for large models to learn from, unbalanced, incomplete, or incorrect data may lead to critical errors. In this talk I will discuss recent research conducted by my group on human annotator behaviour and its implications for the quality of, and bias in, the collected data. I will first discuss how human bias is reflected in data collected through crowdsourcing and the consequences of unbalanced data for machine learning models. Then, I will present our work using fine-grained behavioural logs and eye-tracking to better model data curators and human annotators. Finally, I will present examples of biased labels and their impact on ML classification decisions.

Biography

Dr. Gianluca Demartini is an Associate Professor in Data Science at the University of Queensland, School of Information Technology and Electrical Engineering. His main research interests are Information Retrieval, Semantic Web, and Human Computation. His research has been supported by the Australian Research Council (ARC), the EU H2020 framework program, the UK Engineering and Physical Sciences Research Council (EPSRC), the Swiss National Science Foundation (SNSF), Meta, the Wikimedia Foundation, and Google. He received Best Paper awards at the AAAI Conference on Human Computation and Crowdsourcing (HCOMP) in 2018 and at the European Conference on Information Retrieval (ECIR) in 2016 and 2020. He has authored more than 200 peer-reviewed publications at top venues such as WWW, ACM SIGIR, VLDBJ, ISWC, and ACM CHI.