ACM CIKM 2022 Workshop
Human-In-the-Loop Data Curation


Although data quality is a long-standing problem, it has recently received renewed attention due to the rapid proliferation of data analytics, machine learning, and decision-support applications built upon the wide-scale availability and accessibility of (big) data. The success of such applications relies heavily on not only the quantity but also the quality of data. Data curation, which may include ingestion, annotation, cleaning, integration, etc., is a critical step in providing adequate assurances on the quality of analytics and machine learning results. Such data preparation activities are recognised as time- and resource-intensive for data scientists, as data often comes with a number of challenges that need to be tackled before it can be used in practice. Data re-purposing, and the resulting distance between the design and use intentions of the data, is a fundamental issue behind many of these challenges. These challenges include a variety of data issues such as noise and outliers, incompleteness, representativeness or biases, heterogeneity of format or semantics, etc. Mishandling these challenges can lead to negative and sometimes damaging effects, especially in critical domains like healthcare, transport, and finance. A distinctive feature of data quality in these contexts is the increasingly important role played by humans, who are often both the source of data generation and active players in data curation. This workshop will provide an opportunity to explore the interdisciplinary overlap between manual, automated, and hybrid human-machine methods of data curation.

Call for Papers

The full-day workshop on Oct 21, 2022 will include the following three parts:
  • Part 1 features plenary sessions, including the keynotes, invited talks, and panel.
  • Part 2 features selected presentations from speakers whose papers are peer-reviewed and who attend in person.
  • Part 3 features lightning talks for extended abstracts that are not formally peer-reviewed.
We invite submissions for novel research papers around the following topics:
  • Quality control for crowdsourced data curation
  • Data worker incentivization and engagement, including techniques from citizen science and collective intelligence
  • Expertise finding and engagement for data curation
  • Supporting crowd workers and experts in data task completion
  • Supporting data curation task design for data requesters
  • Collaborative data work among humans and between humans and AI
  • Human studies into the transparency, reliability, and biases in manual and hybrid data curation
  • Interaction techniques for manual, collaborative, and hybrid human-machine data curation, e.g., conversational interfaces
  • Database and machine learning techniques for supporting large-scale and hybrid data curation
  • Human intervention in data cascades and machine learning lifecycle management
  • Benchmarks in machine learning, AI, and related areas
  • Privacy and security issues of data quality, e.g., data poisoning attacks
Research papers must describe original work that has not been previously published, has not been accepted for publication elsewhere, and is not simultaneously submitted to or under review at another journal or conference.

Submissions of research papers must be in English, in PDF format, and 2-4 pages in length (including figures, tables, proofs, appendixes, acknowledgments, and any content except references), with unrestricted space for references, in the current ACM two-column conference format. Suitable LaTeX, Word, and Overleaf templates are available from the ACM Website (use “sigconf” proceedings template for LaTeX and the Interim Template for Word).

Submissions must be anonymous and should be submitted electronically via EasyChair:

At least one author of each paper accepted for Part 2 of the workshop is required to register for the workshop and present the work there.

Best papers will be invited to submit an extended version to a special issue of the ACM Journal of Data and Information Quality to be published in Q3 2023.

Important Dates

Important dates (23:59 Anywhere on Earth):
  • August 15, 2022: Paper submission deadline
  • September 15, 2022: Paper acceptance notification
  • October 15, 2022: Final paper submission
  • October 21, 2022: Full-day workshop at CIKM 2022
  • November 4, 2022: Final paper submission
ACM JDIQ Special Issue on Human-in-the-loop Data Curation timeline:
  • Submission deadline: February 2023
  • First-round review decisions: June 2023
  • Deadline for revision submissions: August 2023
  • Notification of final decisions: October 2023
  • Camera-ready Manuscript: November 2023
  • Tentative publication: December 2023



NB. Each accepted paper has 10 minutes for presentation and 5 more minutes for questions.
Each lightning talk has 5 minutes for presentation and 3 more minutes for questions.

Time Activity Title
09.00-09.10 Welcome and Opening
09.10-10.00 Keynote 1 - Abraham Bernstein: Towards a Collaboration between Humans and Machines for Data Curation and Analysis
10.00-10.30 Break
10.30-11.30 Keynote 2 - Ujwal Gadiraju: Human-Centered AI: A Crowd Computing Perspective
11.30-12.15 Session 1: Methods
  • Brendan Coon. HITL IRL: 12 Reflections on Expertise Finding and Engagement for a Large Data Curation Team
  • (lightning) Stephanie Eckman, Jacob Beck, Rob Chew and Frauke Kreuter. Improving Labeling Through Social Science Insights
  • (lightning) Sepideh Nikookar. Human-AI Complex Task Planning
  • Subhadip Paul, Anirban Chatterjee, Binay Gupta and Kunal Banerjee. Developing a Noise-Aware AI System for Change Risk Assessment with Minimal Human Intervention
12.15-14.00 Break
14.00-14.55 Session 2: NLP
  • (lightning) Shubhanshu Mishra and Jana Diesner. PyTAIL: Interactive and Incremental Learning of NLP Models with Human in the Loop for Online Data
  • Baihan Lin. Knowledge Management System with NLP-Assisted Annotations: A Brief Survey and Outlook
  • (lightning) Sara Pidò and Pietro Pinoli. A Paradigm to Put Back the User into the AutoML Loop through Natural Language
  • Deepa Muralidhar and Ashwin Ashok. Creating a framework for a Benchmark Religion Dataset
  • (lightning) Bipasha Banerjee, Palakh Mignonne Jude, William A. Ingram, Kurt Luther and Edward A. Fox. Help Me Help You - A Mixed-Initiative Approach To Explore Book-length Documents
15.00-15.55 Session 3: Multimedia
  • Meghana Deodhar, Xiao Ma, Yixin Cai, Alex Koes, Jilin Chen and Alex Beutel. A Human-ML Collaboration Framework for Improving Video Content Reviews
  • Kameswara Mantha, Ramanakumar Sankar, Yuping Zheng, Lucy Fortson, Thomas Pengo, Douglas Mashek, Mark Sanders, Trace Christensen, Jeffrey Salisbury, Laura Trouille, Jarrett Byrnes, Isaac Rosenthal, Henry Housekeeper and Kyle Cavanaugh. From Fat Deposits to Floating Forests: Cross-Domain Transfer Learning using PatchGAN-based Segmentation Model
  • Edwin Gamboa, Jose Alejandro Libreros Montaño, Dan Dubiner and Matthias Hirth. Human-AI Collaboration for Improving the Identification of Cars for Autonomous Driving
  • (lightning) Paola Santana-Morales, Antonio J. Tallón-Ballesteros, Tengyue Li and Simon Fong. Triple Attribute Subset Selection Metaheuristic for Multi-class High-dimensionality Problems
15.55-16.05 Closing
16.30 End

Program Committee

  • Ines Arous, University of Fribourg
  • Agathe Balayn, Delft University of Technology
  • Marco Brambilla, Politecnico di Milano
  • Fabio Casati, ServiceNow
  • Matthew Lease, The University of Texas at Austin
  • Jahna Otterbacher, Open University of Cyprus
  • Hailong Sun, Beihang University
  • Jie Zhang, Nanyang Technological University

Organizers

Send an email to j.yang-3[at] for questions.

  • Gianluca Demartini, The University of Queensland, Australia
  • Shazia Sadiq, The University of Queensland, Australia
  • Jie Yang, Delft University of Technology, Netherlands