The transformative power of machine learning (ML) is intrinsically linked to the availability of large, high-quality datasets. From deciphering the complexities of biological systems to navigating the nuances of human language, modern ML paradigms, particularly deep learning, thrive on data abundance [4, 7, 11]. However, the insatiable appetite of these algorithms presents a significant bottleneck: the efficient and effective collection of data at scale. This review explores the multifaceted landscape of data collection strategies tailored for large-scale ML projects, drawing upon recent advances in crowdsourcing, robotics, domain-specific methodologies, and data optimization techniques. We synthesize findings from diverse fields, highlighting both established best practices and emerging approaches that are shaping the future of data acquisition for an increasingly data-driven world. The challenge is no longer solely about algorithmic innovation; it is fundamentally about engineering scalable and robust data pipelines that fuel the next generation of intelligent systems.

Crowdsourcing and Human-in-the-Loop Data Annotation

The sheer volume of data required for large-scale ML often necessitates moving beyond traditional expert-driven annotation approaches. Crowdsourcing, leveraging the collective intelligence of online workers, has emerged as a powerful paradigm for rapidly and cost-effectively labeling massive datasets [1, 3]. [1] provides a comprehensive introduction to efficient data labeling via public crowdsourcing marketplaces, emphasizing its utility in scenarios where expert annotation is impractical due to time or cost constraints. They highlight the inherent trade-off in crowdsourced data: while scalability is a major advantage, the non-expert nature of crowd workers introduces noise and variability in label quality. To mitigate this, [1] underscores the importance of employing specialized techniques such as incremental relabeling, aggregation algorithms, and dynamic pricing strategies within crowdsourcing platforms. These techniques aim to refine noisy labels into high-quality annotations by iteratively improving label consensus and incentivizing worker accuracy.
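
To make the aggregation step concrete, here is a minimal sketch assuming redundant labels from several workers per item: a majority-vote baseline plus a simple accuracy-weighted refinement loop. It is a lightweight stand-in for the EM-style aggregation algorithms such platforms typically employ, not the specific method of [1].

```python
from collections import Counter, defaultdict

def majority_vote(labels_by_item):
    """Baseline aggregation: each item gets its most frequent label."""
    return {item: Counter(votes).most_common(1)[0][0]
            for item, votes in labels_by_item.items()}

def weighted_aggregate(annotations, n_rounds=5):
    """Iteratively re-estimate worker accuracy against the current consensus
    and re-weight votes (a simple EM-flavored refinement).

    annotations: list of (worker_id, item_id, label) triples.
    """
    labels_by_item = defaultdict(list)
    for worker, item, label in annotations:
        labels_by_item[item].append(label)
    consensus = majority_vote(labels_by_item)

    for _ in range(n_rounds):
        # 1. Score each worker by agreement with the current consensus.
        correct, total = defaultdict(int), defaultdict(int)
        for worker, item, label in annotations:
            total[worker] += 1
            correct[worker] += (label == consensus[item])
        weight = {w: correct[w] / total[w] for w in total}

        # 2. Re-aggregate with accuracy-weighted votes.
        scores = defaultdict(lambda: defaultdict(float))
        for worker, item, label in annotations:
            scores[item][label] += weight[worker]
        consensus = {item: max(s, key=s.get) for item, s in scores.items()}
    return consensus

# Toy usage: three workers, one of them noisy.
votes = [("w1", "a", "cat"), ("w2", "a", "cat"), ("w3", "a", "dog"),
         ("w1", "b", "dog"), ("w2", "b", "dog"), ("w3", "b", "dog")]
print(weighted_aggregate(votes))  # {'a': 'cat', 'b': 'dog'}
```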

The practical application of crowdsourcing extends beyond simple image classification tasks. [3] introduces CASHER (Crowdsourcing and Amortizing Human Effort for Real-to-Sim-to-Real), a novel pipeline that leverages crowdsourcing to generate digital twins of real-world scenes. This approach tackles the data scarcity challenge in robotics by shifting data collection from the expensive and time-consuming real world to the scalable realm of simulation. By crowdsourcing 3D reconstructions, CASHER facilitates the creation of diverse virtual environments where robots can accumulate vast amounts of training data. Data collection in simulation is initially driven by crowdsourced human demonstrations and progressively transitions to autonomous generation by reinforcement learning (RL) agents. This amortization of human effort, facilitated by the increasing competence of the RL agent, yields super-linear scaling of data collection with human input, a significant leap beyond traditional linear scaling limitations. This approach is particularly relevant for complex domains like robotics where real-world data acquisition is inherently resource-intensive.
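
The amortization dynamic can be illustrated with a self-contained toy simulation. This is purely illustrative of the scaling argument, not the paper's pipeline: the competence model and its numbers are invented for the example.

```python
import random

def simulate_amortization(n_scenes=20, demos_per_scene=50):
    """Toy model of CASHER-style amortization: as more scenes are processed,
    the policy's zero-shot success on a new scene grows, so fewer human demos
    are needed and more data comes from autonomous RL rollouts."""
    competence = 0.0          # proxy for policy generalization (assumed model)
    human_effort = []
    for scene in range(n_scenes):
        human = autonomous = 0
        for _ in range(demos_per_scene):
            if random.random() < competence:
                autonomous += 1   # successful self-generated rollout
            else:
                human += 1        # fall back to a crowdsourced demonstration
        human_effort.append(human)
        competence = min(0.95, competence + 0.1)  # policy improves with data
    return human_effort

print(simulate_amortization())  # human demos per scene shrink over time
```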

Further emphasizing the versatility of crowdsourcing in robotics, [8] presents ContactArt, a dataset and methodology for learning 3D interaction priors for hand-object manipulation. They address the annotation bottleneck in hand-object interaction data by developing a visual teleoperation system leveraging readily available iPhone technology. This system allows human operators to interact with simulated articulated objects, generating data with accurate pose and contact annotations derived directly from the simulator. The ease of data collection using iPhones, coupled with the simulated environment's rich ground truth information, significantly reduces the cost and complexity of acquiring large-scale, annotated datasets for robot learning. The data collected through this crowdsourcing-inspired approach enables the learning of valuable interaction priors, improving the performance of hand and articulated object pose estimation and demonstrating the power of combining simulation with scalable data acquisition methods.

Managing crowdsourced annotation projects effectively is critical for ensuring data quality and project success. [13] provides a valuable compendium of best practices for managing data annotation projects, derived from the extensive experience of annotation project managers at Bloomberg. This work, grounded in years of practical experience, offers actionable insights into various aspects of annotation project management, ranging from task design and worker selection to quality control and project monitoring. By systematizing the tacit knowledge accumulated in large-scale annotation efforts, [13] serves as a crucial guide for practitioners aiming to leverage crowdsourcing efficiently and reliably for their ML projects.

Robotics and Simulation: Scaling Data Acquisition in Embodied AI

Robotics presents unique challenges for data collection due to the inherent complexities of physical interaction and the high cost of real-world experimentation. Simulation has emerged as a critical tool for overcoming these limitations, enabling the generation of large datasets in controlled and cost-effective virtual environments [3, 8, 17]. However, the sim-to-real gap, the discrepancies between simulated and real-world physics, remains a significant hurdle for deploying policies trained solely in simulation.

[4] addresses the fundamental challenge of data scalability in robotics by introducing RoboNet, a large-scale open database for sharing robotic experience. Recognizing that most robotic learning experiments are small-scale and domain-specific, RoboNet aims to pool data from diverse robot platforms and tasks, creating a shared resource for the robotics community. The initial release of RoboNet comprises 15 million video frames from seven different robots, providing a substantial foundation for learning generalizable models for vision-based robotic manipulation. By training models on this diverse dataset, [4] demonstrates the potential for cross-robot generalization, showcasing that pre-training on RoboNet and fine-tuning on robot-specific data can outperform training from scratch with significantly more data. RoboNet exemplifies the power of collaborative data sharing in accelerating progress in robot learning and mitigating the data acquisition bottleneck.

Building upon the theme of sim-to-real transfer, [17] introduces Simulation-Guided Fine-tuning (SGFT), a framework designed to accelerate the adaptation of policies learned in simulation to the real world. While simulation provides a cost-effective source of data, direct zero-shot transfer often fails, especially in tasks requiring precise physical interaction. SGFT addresses this by leveraging value functions learned in simulation to guide exploration in the real world during fine-tuning. This structured exploration, informed by simulated priors, significantly enhances the efficiency of real-world adaptation, requiring an order of magnitude less real-world data compared to conventional fine-tuning methods. [17] provides both empirical validation across various dexterous manipulation tasks and theoretical justification for SGFT, underscoring the importance of incorporating simulated knowledge to bridge the sim-to-real gap effectively.
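
One simple way to let a simulated value prior guide real-world exploration is potential-based reward shaping with the value function learned in simulation. The sketch below shows that mechanism; SGFT's actual fine-tuning objective differs, so treat this as an illustration of the general idea rather than the paper's method.

```python
import numpy as np

def shaped_reward(r, s, s_next, v_sim, gamma=0.99):
    """Potential-based shaping with a simulation-learned value function:
    transitions that climb the simulated value landscape get a bonus,
    steering real-world exploration toward behaviors the sim deems promising."""
    return r + gamma * v_sim(s_next) - v_sim(s)

# Toy usage: a hand-rolled value function standing in for the sim critic.
v_sim = lambda s: -np.linalg.norm(s - np.array([1.0, 0.0]))  # distance-to-goal prior
s, s_next = np.array([0.0, 0.0]), np.array([0.5, 0.0])
print(shaped_reward(0.0, s, s_next, v_sim))  # positive: moving toward the goal
```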

[7] explores the concept of pre-training behavioral priors for reinforcement learning, drawing parallels to the successful pre-training paradigms in natural language processing and computer vision. They propose Parrot, a method for learning behavioral priors from large, previously collected datasets across a range of tasks. These priors capture complex input-output relationships observed in successful trials, enabling faster learning of new tasks by RL agents. Parrot demonstrates its effectiveness in challenging robotic manipulation domains, outperforming prior methods and showcasing the potential of transfer learning to alleviate the data demands of RL. By pre-training on diverse datasets, Parrot facilitates the development of more data-efficient RL agents capable of rapidly adapting to novel environments and tasks.
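
The interface of such a behavioral prior can be sketched as a conditional, invertible mapping from latent noise to actions: the RL agent explores in the latent space, and even random latents decode to plausible actions. The minimal, untrained module below uses a single conditional affine transform for illustration, whereas Parrot trains a deeper normalizing flow on prior data.

```python
import torch
import torch.nn as nn

class BehavioralPrior(nn.Module):
    """Illustrative conditional flow in the spirit of Parrot (one affine
    layer only; the paper uses a deeper flow trained on successful trials).
    Maps latent noise z plus an observation to an action."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * act_dim))  # predicts scale and shift

    def forward(self, obs, z):
        scale, shift = self.net(obs).chunk(2, dim=-1)
        return z * torch.exp(scale) + shift   # invertible affine map

prior = BehavioralPrior(obs_dim=10, act_dim=4)
obs, z = torch.randn(1, 10), torch.randn(1, 4)
action = prior(obs, z)   # the RL agent acts by choosing z, not raw actions
print(action.shape)      # torch.Size([1, 4])
```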

[12] delves into the crucial aspect of efficient data collection strategies for robotic manipulation, specifically focusing on the concept of compositional generalization. They investigate whether robot policies can generalize to unseen combinations of environmental factors (e.g., object types, textures) based on data collected across variations of individual factors. Their empirical studies reveal that policies do exhibit compositional abilities, particularly when leveraging prior robotic datasets. Based on these insights, [12] proposes data collection strategies that explicitly exploit compositionality, focusing data acquisition on factor variations rather than exhaustive combinations. This approach significantly improves generalization performance with the same data collection effort, demonstrating the value of understanding and leveraging compositional generalization to optimize data collection in robotics.
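
The budget savings from exploiting compositionality are easy to quantify. The sketch below contrasts an exhaustive cross-product collection plan with a "vary one factor at a time" plan in the spirit of [12] (the factor names and the exact recipe are illustrative, not the paper's protocol).

```python
from itertools import product

def cross_product_plan(factors):
    """Exhaustive plan: one scene per combination of factor values."""
    names, values = zip(*factors.items())
    return [dict(zip(names, combo)) for combo in product(*values)]

def compositional_plan(factors, base=None):
    """Vary each factor individually from a base configuration and rely on
    the policy to compose unseen combinations at test time."""
    base = base or {name: vals[0] for name, vals in factors.items()}
    plan = [dict(base)]
    for name, vals in factors.items():
        for v in vals:
            if v != base[name]:
                plan.append({**base, name: v})
    return plan

factors = {"object": ["cube", "mug", "bowl"],
           "texture": ["wood", "metal"],
           "position": ["left", "center", "right"]}
print(len(cross_product_plan(factors)))   # 18 scenes
print(len(compositional_plan(factors)))   # 6 scenes, same per-factor coverage
```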

Domain-Specific Data Collection Methodologies

While general-purpose data collection techniques like crowdsourcing and simulation are widely applicable, specific domains often necessitate tailored methodologies that address unique data characteristics and domain-specific knowledge. Several papers highlight innovative approaches to data collection in diverse fields ranging from cultural heritage to astronomy and software engineering.

[2] addresses the critical need for responsible and ethical data practices in the application of machine learning to cultural heritage collections. They introduce the "Collections as ML Data" checklist, a detailed set of guiding questions and practices for practitioners embarking on ML projects utilizing cultural heritage data. This checklist promotes a critical sociotechnical lens, encouraging practitioners to consider the manifold stakes and sensitivities inherent in cultural heritage data, including issues of provenance, representation, and community impact. By providing a structured framework for ethical data collection and utilization, [2] contributes to the responsible development and deployment of ML within the cultural heritage sector.

[5] presents the Council Data Project (CDP), an open-source platform designed to automate the curation of municipal governance data for research. Recognizing the scarcity of high-quality data for large-scale comparative studies of municipal governance, CDP leverages recent advances in speech-to-text and natural language processing to extract information from municipal council meetings. This automated data collection pipeline significantly reduces the barrier to accessing and analyzing municipal governance data, enabling new avenues of research into local government operations and performance. [5] demonstrates the platform's capabilities and outlines future directions, including the integration of machine learning models for tasks such as speaker annotation and named entity recognition, further enhancing the efficiency and richness of the collected data.
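
A minimal sketch of such an ingestion step is shown below, using off-the-shelf speech-to-text (openai-whisper) and NLP (spaCy) libraries. This is an assumption about how such a pipeline could be wired together, not CDP's actual code, and the input filename is hypothetical.

```python
import whisper  # pip install openai-whisper
import spacy    # pip install spacy && python -m spacy download en_core_web_sm

# Transcribe a council meeting recording, then extract named entities so the
# transcript becomes structured, searchable research data.
asr = whisper.load_model("base")
transcript = asr.transcribe("council_meeting.mp3")["text"]  # hypothetical file

nlp = spacy.load("en_core_web_sm")
doc = nlp(transcript)
entities = [(ent.text, ent.label_) for ent in doc.ents]  # people, orgs, laws
print(entities[:10])
```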

In the realm of astronomy, [10] provides a practical guide to setting up machine learning projects, leveraging the wealth of freely available astronomical datasets. While astronomy is data-rich, [10] emphasizes the importance of clarity in problem definition and rigorous workflows for verifying and calibrating ML models to ensure robust scientific insights. They offer a collection of guidelines, drawing from astronomical examples, to facilitate the development of scientifically sound and impactful ML projects in astronomy. These guidelines address crucial aspects such as data selection, model validation, and the interpretation of ML results in the context of astronomical phenomena.

[14] focuses on improving the quality of machine learning for software engineering (ML4SE) models through project-level fine-tuning. They investigate the potential for enhancing model performance by adapting models to the specific characteristics of individual software projects. Through experiments on the method name prediction task, [14] demonstrates that fine-tuning models on project-specific data can significantly improve quality compared to models trained solely on large, general datasets. This project-level adaptation highlights the value of incorporating domain-specific data and context to enhance the effectiveness of ML models in software engineering tasks. The open-sourced tools for data collection and experimentation further contribute to the accessibility and reproducibility of research in this area.
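
The adaptation step itself is a standard low-learning-rate fine-tuning pass over a single project's examples. The generic sketch below illustrates the idea; the stand-in model and data are placeholders, whereas [14] fine-tunes code models on method-name prediction examples drawn from one software project.

```python
import torch

def project_finetune(model, project_batches, lr=1e-5, epochs=2):
    """Take a few low-learning-rate passes over the target project's own
    examples, starting from a model trained on a large general corpus."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for inputs, targets in project_batches:
            opt.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            opt.step()
    return model

# Toy usage with a stand-in model and random data.
model = torch.nn.Linear(16, 8)
batches = [(torch.randn(4, 16), torch.randint(0, 8, (4,)))]
project_finetune(model, batches)
```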

Data Visualization and Dimensionality Reduction for Large Datasets

As datasets grow in size and dimensionality, effective visualization and dimensionality reduction techniques become indispensable tools for exploration, analysis, and understanding. These techniques aid in uncovering hidden patterns, identifying data clusters, and gaining insights from complex, high-dimensional data.

[6] introduces the Collection Space Navigator (CSN), an interactive browser-based visualization tool designed for exploring large collections of visual digital artifacts associated with multidimensional data. CSN addresses the challenge of navigating and interpreting high-dimensional spaces often generated by machine learning embeddings or metadata associated with visual collections. It combines dimensionality reduction projections (e.g., t-SNE, UMAP) with configurable multidimensional filters, allowing users to interactively explore collections by zooming, scaling, transforming projections, and filtering data based on specific dimensions. This interactive exploration empowers users to uncover meaningful structures and patterns within large visual datasets, facilitating tasks such as research, curation, and data-driven discovery.

[9] proposes a deep learning approach to dimensionality reduction projections, offering a computationally efficient alternative to traditional methods like t-SNE. While t-SNE and related techniques are effective for visualizing data clusters, they are computationally expensive for large datasets and struggle with out-of-sample data. [9] trains a deep neural network to learn projections from a collection of data samples and their corresponding low-dimensional representations. This learned projection network can then rapidly generate projections for new data points, achieving significant speedups compared to SNE-class methods. The deep learning approach also offers stability, handles out-of-sample data effectively, and can be adapted to learn various projection techniques, providing a versatile tool for large-scale data visualization and analysis.
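
The core recipe can be sketched in a few lines: fit t-SNE once on a training sample, then train a regressor to reproduce the 2-D coordinates, so new points are projected with a fast forward pass. The sketch below uses scikit-learn for brevity; the network architecture and training details differ from the setup in [9].

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.neural_network import MLPRegressor

X = load_digits().data.astype(np.float32)
X_train, X_new = X[:1000], X[1000:]

# Expensive step, done once: ground-truth 2-D coordinates from t-SNE.
coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X_train)

# Learn the projection as a regression problem.
proj = MLPRegressor(hidden_layer_sizes=(256, 128), max_iter=500, random_state=0)
proj.fit(X_train, coords)

fast_coords = proj.predict(X_new)  # out-of-sample points, no re-fitting
print(fast_coords.shape)
```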

[18] explores the application of minimally supervised topological projections of self-organizing maps (SOMs) for phase of flight identification in general aviation. They address the challenges of large-scale, class-imbalanced flight data and the expense of manual labeling by proposing a novel minimally supervised SOM approach. This method utilizes nearest neighbor majority votes in the SOM U-matrix for class estimation, requiring significantly less labeled data compared to fully supervised approaches. [18] demonstrates that this minimally supervised method can achieve comparable or even superior performance to naive SOM approaches with only a small fraction of labeled data, highlighting the potential of topological projections for effective data analysis in data-scarce and class-imbalanced scenarios.
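
The following sketch conveys the mechanism: train a SOM without labels, locate the few labeled samples on the map, and classify everything else by a majority vote over the nearest labeled map positions. Plain grid distance stands in here for the U-matrix-based distances of [18], so this is an approximation of their idea, not their algorithm.

```python
import numpy as np
from collections import Counter
from minisom import MiniSom  # pip install minisom

def minimally_supervised_som(X, labeled_idx, y_labeled, grid=10, iters=5000):
    """Classify with very few labels via a SOM trained unsupervised."""
    som = MiniSom(grid, grid, X.shape[1], sigma=1.5, learning_rate=0.5,
                  random_seed=0)
    som.train_random(X, iters)
    # Map positions of the handful of labeled samples.
    labeled_pos = [np.array(som.winner(X[i])) for i in labeled_idx]

    preds = []
    for x in X:
        pos = np.array(som.winner(x))
        dists = [np.linalg.norm(pos - p) for p in labeled_pos]
        nearest = np.argsort(dists)[:3]  # 3 nearest labeled map positions
        preds.append(Counter(y_labeled[k] for k in nearest).most_common(1)[0][0])
    return np.array(preds)

# Toy usage: label only 2% of the data.
X = np.random.default_rng(0).random((500, 4))
y = (X[:, 0] > 0.5).astype(int)
labeled = np.arange(0, 500, 50)
preds = minimally_supervised_som(X, labeled, y[labeled])
```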

[16] investigates the use of sign stable projections and sign Cauchy projections for efficient computation of Lp distances in high-dimensional data, particularly relevant in streaming data scenarios. They propose using only the signs of projected data to estimate collision probabilities, deriving bounds and approximations for these probabilities. Interestingly, they find that for Cauchy random projections (p=1), the collision probability can be accurately approximated as a function of the chi-square similarity, a popular measure for non-negative data. This approach offers a computationally efficient method for similarity estimation in large-scale learning applications, especially when dealing with streaming data and histogram-based features.
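
The chi-square approximation is easy to verify empirically. The sketch below, assuming L1-normalized non-negative vectors (e.g., histogram features), compares the observed sign-collision rate of Cauchy projections against the approximation Pr[collision] ≈ 1 - (1/π) arccos(ρ_χ²) discussed in [16].

```python
import numpy as np

rng = np.random.default_rng(0)

# Two non-negative, L1-normalized vectors (e.g., histogram features).
x = rng.random(100); x /= x.sum()
y = 0.7 * x + 0.3 * rng.random(100); y /= y.sum()

# Sign Cauchy projections: project with standard Cauchy entries, keep signs.
k = 20000
R = rng.standard_cauchy((100, k))
collisions = np.mean(np.sign(x @ R) == np.sign(y @ R))

# Chi-square similarity and the approximation from [16].
rho = np.sum(2 * x * y / (x + y))
approx = 1 - np.arccos(rho) / np.pi
print(f"empirical {collisions:.4f} vs chi-square approx {approx:.4f}")
```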

Optimizing Data Collection Strategies

Beyond specific data collection methods, optimizing the overall data collection process is crucial for maximizing efficiency and minimizing costs. This involves strategic decisions about how much data to collect, what type of data to prioritize, and how to adapt data collection efforts based on model performance and task requirements.

[11] introduces a formal framework for optimizing data collection workflows, framing it as an optimal data collection problem. This framework allows designers to specify performance targets, collection costs, time horizons, and penalties for failing to meet targets. It generalizes to scenarios involving multiple data sources, such as labeled and unlabeled data in semi-supervised learning. [11] develops Learn-Optimize-Collect (LOC), an algorithm that minimizes expected future collection costs while ensuring performance targets are met. Numerical comparisons demonstrate that LOC significantly reduces the risk of failing to achieve desired performance levels compared to traditional data requirement estimation methods, while maintaining low overall collection costs. This optimization-based approach provides a principled way to manage data collection efforts and allocate resources effectively.
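
A common baseline that LOC is compared against is learning-curve extrapolation: fit a saturating power law to scores observed at a few dataset sizes, then invert it to estimate how much data a target score requires. The sketch below shows that baseline with made-up scores; LOC itself goes further by optimizing collection costs and penalties over a horizon.

```python
import numpy as np
from scipy.optimize import curve_fit

def learning_curve(n, a, b, c):
    """Saturating power law: score approaches c as data grows."""
    return c - a * n ** (-b)

sizes = np.array([1000, 2000, 4000, 8000])
scores = np.array([0.796, 0.832, 0.858, 0.877])  # illustrative held-out accuracy

(a, b, c), _ = curve_fit(learning_curve, sizes, scores, p0=[1.0, 0.5, 0.95])

target = 0.90
n_needed = ((c - target) / a) ** (-1 / b)  # invert the fitted curve
print(f"estimated samples for {target:.0%} accuracy: {n_needed:,.0f}")
```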

Future Directions

The field of data collection for large-scale machine learning is rapidly evolving, driven by the ever-increasing demand for data and the continuous advancements in data acquisition technologies. Several promising directions emerge from the current literature. Moving beyond passive data collection, active learning strategies that intelligently select the most informative data points for annotation hold immense potential [11]. Furthermore, developing robust methods for data valuation, quantifying the contribution of individual data points or subsets to model performance, will enable more targeted and efficient data acquisition.

References

[1] Alexey Drutsa, Viktoriya Farafonova, Valentina Fedorova, Olga Megorskaya, Evfrosiniya Zerminova, Olga Zhilinskaya. Practice of Efficient Data Collection via Crowdsourcing at Large-Scale. arXiv:1912.04444 (2019). http://arxiv.org/abs/1912.04444v1
[2] Benjamin Charles Germain Lee. The "Collections as ML Data" Checklist for Machine Learning & Cultural Heritage. arXiv:2207.02960 (2022). http://arxiv.org/abs/2207.02960v1
[3] Marcel Torne, Arhan Jain, Jiayi Yuan, Vidaaranya Macha, Lars Ankile, Anthony Simeonov, Pulkit Agrawal, Abhishek Gupta. Robot Learning with Super-Linear Scaling. arXiv:2412.01770 (2024). http://arxiv.org/abs/2412.01770v2
[4] Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, Chelsea Finn. RoboNet: Large-Scale Multi-Robot Learning. arXiv:1910.11215 (2019). http://arxiv.org/abs/1910.11215v2
[5] Eva Maxfield Brown, Nicholas Weber. Councils in Action: Automating the Curation of Municipal Governance Data for Research. arXiv:2204.09110 (2022). http://arxiv.org/abs/2204.09110v3
[6] Tillmann Ohm, Mar Canet Solà, Andres Karjus, Maximilian Schich. Collection Space Navigator: An Interactive Visualization Interface for Multidimensional Datasets. arXiv:2305.06809 (2023). http://arxiv.org/abs/2305.06809v1
[7] Avi Singh, Huihan Liu, Gaoyue Zhou, Albert Yu, Nicholas Rhinehart, Sergey Levine. Parrot: Data-Driven Behavioral Priors for Reinforcement Learning. arXiv:2011.10024 (2020). http://arxiv.org/abs/2011.10024v1
[8] Zehao Zhu, Jiashun Wang, Yuzhe Qin, Deqing Sun, Varun Jampani, Xiaolong Wang. ContactArt: Learning 3D Interaction Priors for Category-level Articulated Object and Hand Poses Estimation. arXiv:2305.01618 (2023). http://arxiv.org/abs/2305.01618v2
[9] Mateus Espadoto, Nina S. T. Hirata, Alexandru C. Telea. Deep Learning Multidimensional Projections. arXiv:1902.07958 (2019). http://arxiv.org/abs/1902.07958v1
[10] Johannes Buchner, Sotiria Fotopoulou. How to set up your first machine learning project in astronomy. arXiv:2502.08222 (2025). http://arxiv.org/abs/2502.08222v1
[11] Rafid Mahmood, James Lucas, Jose M. Alvarez, Sanja Fidler, Marc T. Law. Optimizing Data Collection for Machine Learning. arXiv:2210.01234 (2022). http://arxiv.org/abs/2210.01234v1
[12] Jensen Gao, Annie Xie, Ted Xiao, Chelsea Finn, Dorsa Sadigh. Efficient Data Collection for Robotic Manipulation via Compositional Generalization. arXiv:2403.05110 (2024). http://arxiv.org/abs/2403.05110v2
[13] Tina Tseng, Amanda Stent, Domenic Maida. Best Practices for Managing Data Annotation Projects. arXiv:2009.11654 (2020). http://arxiv.org/abs/2009.11654v1
[14] Egor Bogomolov, Sergey Zhuravlev, Egor Spirin, Timofey Bryksin. Assessing Project-Level Fine-Tuning of ML4SE Models. arXiv:2206.03333 (2022). http://arxiv.org/abs/2206.03333v1
[15] Jun Hu, Bryan Hooi, Bingsheng He. Efficient Heterogeneous Graph Learning via Random Projection. arXiv:2310.14481 (2023). http://arxiv.org/abs/2310.14481v2
[16] Ping Li, Gennady Samorodnitsky, John Hopcroft. Sign Stable Projections, Sign Cauchy Projections and Chi-Square Kernels. arXiv:1308.1009 (2013). http://arxiv.org/abs/1308.1009v1
[17] Patrick Yin, Tyler Westenbroek, Simran Bagaria, Kevin Huang, Ching-an Cheng, Andrey Kobolov, Abhishek Gupta. Rapidly Adapting Policies to the Real World via Simulation-Guided Fine-Tuning. arXiv:2502.02705 (2025). http://arxiv.org/abs/2502.02705v1
[18] Zimeng Lyu, Pujan Thapa, Travis Desell. Minimally Supervised Topological Projections of Self-Organizing Maps for Phase of Flight Identification. arXiv:2402.11185 (2024). http://arxiv.org/abs/2402.11185v1