Let's compile the online data repositories proposed to enhance the reproducibility of articles. I'll add my options, and you can add yours, with their pros and cons, so we can explore the possibilities and limitations. Here is my list:

CODE

  • GitHub (https://github.com): the most widely used platform for version control and collaborative development, commonly used in academia to share research code and documentation. Pros: Git-based version control, collaborative development, integration with CI/CD tools and academic repositories (e.g., Zenodo DOI integration). Cons: Not designed for data preservation or formal citation, lacks structured metadata for academic datasets and reproducibility standards.
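As one concrete way to use the GitHub–Zenodo link mentioned above: a repository can include a CITATION.cff file, which both GitHub and Zenodo read for citation metadata when a release is archived. The values below are purely illustrative:

```
# CITATION.cff — hypothetical example; field names follow the Citation File Format
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "example-analysis-code"
version: "1.0.0"
date-released: "2024-01-15"
authors:
  - family-names: "Doe"
    given-names: "Jane"
```

With this file in place, a tagged GitHub release archived to Zenodo carries author and version information without extra manual entry.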
  • GitLab (https://gitlab.com): an open-source DevOps platform that supports version-controlled academic code repositories with integrated CI/CD and institutional hosting options. Pros: Stronger privacy controls and self-hosting support, integrated issue tracking and CI, ideal for academic institutions. Cons: Smaller academic user base than GitHub, less third-party integration for academic publishing (e.g., Zenodo linkage not as seamless).
  • Code Ocean (https://codeocean.com): a cloud-based open-science platform for sharing executable research code and data as reproducible "capsules," tailored for academic transparency and open science. Pros: Supports fully reproducible research capsules, DOI assignment, peer review-friendly, journal integrations (e.g., Nature, IEEE). Cons: Limited free tier, commercial platform, smaller user community compared to GitHub.
  • Hugging Face (https://huggingface.co): an open AI hub for sharing machine learning models, datasets, and demos, with a strong community and support for reproducible, citation-ready AI research. Pros: Model hub with versioning and citation, dataset sharing, integration with leading ML frameworks (PyTorch, TensorFlow), strong academic presence. Cons: Primarily focused on NLP and ML communities, less suited for non-AI disciplines, not ideal for storing broader project files or raw research data.
  • Jupyter Notebook + nbviewer / Binder (https://mybinder.org / https://nbviewer.org): Jupyter-based tools like Binder and nbviewer allow researchers to share interactive, executable code notebooks for reproducible computational experiments. Pros: Great for tutorials, reproducibility, and open science education; integrates with GitHub and Zenodo. Cons: Not a repository per se—relies on external hosting (e.g., GitHub), not ideal for long-term archival.
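As a sketch of how Binder is typically used: a public GitHub repository containing notebooks plus a dependency file can be launched interactively via a URL of the following form (the user, repo, and branch names are placeholders):

```
# Binder launch URL pattern ("gh" = GitHub; angle-bracketed parts are placeholders)
https://mybinder.org/v2/gh/<user>/<repo>/<branch>

# A requirements.txt in the repository root tells Binder what to install, e.g.:
numpy
matplotlib
```

This is why Binder pairs naturally with GitHub: the repository itself is the unit of sharing, and Binder only adds the execution environment.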
  • Dockstore (https://dockstore.org): a platform for sharing bioinformatics tools and workflows using Docker and CWL/WDL, widely adopted in genomics and biomedical research. Pros: Standards-compliant (e.g., GA4GH), strong reproducibility, integration with major cloud platforms and bioinformatics pipelines. Cons: Limited to life sciences; requires workflow language expertise (WDL, CWL, Nextflow).
DATA SHARE

  • Figshare (https://figshare.com): a general-purpose open-access repository that allows researchers to upload and share a wide variety of research outputs, including datasets, figures, and presentations. Pros: DOI assignment for every upload, accepts nearly any output type, versioning, easy to use. Cons: Limited storage for free users, commercial ownership (part of Digital Science) may raise concerns for some institutions.
  • Zenodo (https://zenodo.org): an open-access research data repository developed by CERN and OpenAIRE, designed to support sharing of data, software, and publications. Pros: Free to use, provides DOIs, strong integration with GitHub, EU-supported, and non-commercial. Cons: File size limit (50GB per upload), less customization of metadata compared to others.
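For programmatic uploads, Zenodo also exposes a REST API. The sketch below only builds the JSON metadata payload that API expects; the field names (title, upload_type, description, creators) come from Zenodo's deposition API, while the helper function name and sample values are my own illustration:

```python
# Sketch: build the "metadata" payload for a Zenodo deposition.
# build_zenodo_metadata is a hypothetical helper; the nested field names
# follow Zenodo's REST deposition API.
def build_zenodo_metadata(title, creator_names, description):
    return {
        "metadata": {
            "title": title,
            "upload_type": "dataset",  # e.g. "dataset", "software", "publication"
            "description": description,
            "creators": [{"name": name} for name in creator_names],
        }
    }

# Hypothetical usage with placeholder values:
payload = build_zenodo_metadata(
    "Example survey data", ["Doe, Jane"], "Raw responses for a pilot study."
)
print(payload["metadata"]["upload_type"])  # dataset
```

The payload would then be sent to the deposition endpoint with an access token; the actual upload step is omitted here.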
  • Dryad (https://datadryad.org): a curated, non-profit repository for data underlying scientific and medical publications. Pros: Focus on data curation, compliance with journal and funder requirements, and DOI assignment. Cons: Submission fee required, primarily designed for datasets associated with published research.
  • IEEE DataPort (https://ieee-dataport.org): IEEE DataPort is a data repository focused on datasets in engineering, technology, and computer science, supporting both open and subscription-based access. Pros: Supports very large datasets (up to 2TB), DOI assignments, and integration with IEEE publications. Cons: Some features require a subscription, and there is a narrower disciplinary focus.
  • Harvard Dataverse (https://dataverse.harvard.edu): an open-source data repository platform for sharing, citing, and preserving research data, widely used by academic institutions worldwide. Pros: Institutional support, versioning, DOI assignment, and support for metadata standards. Cons: Primarily geared toward datasets rather than other research outputs; requires setup for institutional hosting unless Harvard's instance is used.
  • Mendeley Data (https://data.mendeley.com): an Elsevier-hosted repository for sharing datasets across disciplines, with an emphasis on citation and collaboration. Pros: Easy integration with Elsevier journals, DOI assignment, and support for large files. Cons: Commercial ownership (Elsevier), limited customization for metadata and access controls.
Please let me know what you think about these options. Which ones do you use? If any options are missing from the list, please share them!
