ReproNim Principle 2: Data and Metadata Management

Data and metadata management is essential for scientific reproducibility because it provides the foundation for understanding, validating, and building upon research findings. Through comprehensive documentation of experimental conditions, data collection methods, and analysis workflows, researchers can capture the critical context needed to understand and replicate their work (Borghi and Van Gulick, 2018). Management includes preserving both raw and processed data in accessible formats, maintaining clear version control, and implementing standardized metadata schemas. When data are properly managed and documented in detail, researchers can access, understand, and work with the exact same dataset well beyond the initial time frame in which the data were produced.

  1. Use standard data formats and extend them to meet your needs. Standardized approaches to data and metadata management provide a common framework that ensures consistency in how data are documented, organized, and shared across the laboratory and with colleagues. This standardization makes it easier for you, your lab mates, your collaborators, and your colleagues to understand and use the data, as they can quickly interpret the metadata structure and data organization. Adopting established community standards streamlines data integration, enables automated processing, and facilitates long-term preservation by ensuring that data remain accessible and interpretable even as technology evolves. Use of community standards is explicitly called out in the FAIR data principles (Wilkinson et al., 2016) as a key requirement for reusability. Whether through standardized metadata schemas, controlled vocabularies, or common file formats, these practices ultimately enhance data quality, research efficiency, and the potential for data reuse. (A minimal sketch of writing standard metadata appears after this list.)
  2. Use version control from start to finish. Versioning provides a systematic way to track changes and maintain the integrity of research data over time. When data are versioned properly, researchers can trace the evolution of their datasets, understanding exactly what changes were made, when they occurred, and who made them. When errors are discovered, teams can easily revert to previous versions or identify when and where any issues were introduced. Versioning also supports collaboration by allowing multiple researchers to work with the same dataset while maintaining a clear record of modifications and preventing conflicting changes. In addition, version control ensures reproducibility by enabling researchers to reference and access specific versions of datasets that were used for particular analyses or publications. So even as datasets continue to be updated and refined, the exact data state used for any given research output can be preserved and accessed, making it possible to verify and build upon previous findings with confidence.
  • Some things you can do
    • Learn more:

      • Tutorial: Version control systems
      • YODA principles: Version everything and build everything off these versions. Three simple rules for making it easier to track versions across datasets and code through consistent directory names and structures.
    • Some things you can try:

      • Git-based approaches: Adapt Git (with Git-LFS for large files) to version data alongside code. They benefit from familiar branching and merging workflows but can struggle with very large datasets.
      • DVC (Data Version Control): Purpose-built for data science, storing large files in cloud storage while tracking metadata in Git. Integrates with ML pipelines and tracks experiments while efficiently handling large datasets.
      • Pachyderm: Container-based system that versions both data and processing code together, ensuring reproducibility and providing automatic lineage tracking in data pipelines.
      • DataLad: Advanced software platform for managing and sharing data. Built on Git and git-annex, DataLad automatically tracks all versions of data, including files too large for Git alone (see the version-control sketch after this list).
  3. Annotate data using standard, reproducible procedures. Data annotation is the systematic process of adding labels, tags, or descriptive information to raw data to make it more meaningful and usable for analysis. Using standardized annotation practices, i.e., common vocabularies and well-documented annotation guidelines, creates a consistent framework for documenting research data, making it easier for other lab members and colleagues to annotate, understand, and interpret the information correctly. When annotations follow established procedures, they provide clear, unambiguous descriptions of data elements, experimental conditions, and methodological choices. This standardization reduces the risk of misinterpretation and makes it possible to compare data across different studies or time periods reliably. Standardized annotations are a cornerstone of the FAIR principles, as they make the data more findable, interoperable, and reusable. When annotations follow well-defined standards used across the broader community, they create a reliable foundation for data sharing, reuse, and long-term preservation, ultimately contributing to the broader goals of open science and research reproducibility. (A minimal data-dictionary sketch appears after this list.)
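
To make the first recommendation (standard data formats) concrete, here is a minimal sketch that writes a BIDS-style dataset_description.json file from Python. The field names follow the BIDS specification, but the study name, authors, and output path are illustrative placeholders, not values from any real dataset.

  import json
  from pathlib import Path

  # Minimal BIDS-style dataset description; field names follow the BIDS
  # specification, the values here are illustrative placeholders.
  dataset_description = {
      "Name": "Example fMRI study",            # hypothetical study name
      "BIDSVersion": "1.8.0",
      "DatasetType": "raw",
      "Authors": ["A. Researcher", "B. Colleague"],
      "License": "CC0",
  }

  out_dir = Path("my_bids_dataset")            # hypothetical dataset root
  out_dir.mkdir(exist_ok=True)
  with open(out_dir / "dataset_description.json", "w") as f:
      json.dump(dataset_description, f, indent=2)

Because the file uses a community-agreed schema, validators and downstream tools can check and read it automatically, which is exactly the kind of interoperability the FAIR principles call for.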
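
For the second recommendation (version control from start to finish), the sketch below uses DataLad's Python interface (datalad.api) to create a dataset and save a change with a descriptive message. The dataset path and commit message are placeholders; the same steps can equally be run with the datalad command-line tool or with plain Git/DVC.

  import datalad.api as dl

  # Create a new DataLad dataset (a Git/git-annex repository under the hood).
  ds = dl.create(path="my-study-data")         # hypothetical dataset location

  # ... add or modify files inside my-study-data/ ...

  # Record the current state of the data with a descriptive message; every
  # save creates a version that can be referenced, compared, or restored later.
  dl.save(dataset="my-study-data", message="Add raw behavioral data for subject 01")

Each saved state can later be cited or checked out exactly, so the data version behind any analysis or publication remains recoverable.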
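
For the third recommendation (standard, reproducible annotation), one possible approach is a machine-readable data dictionary stored alongside a tabular file. The sketch below mirrors the sidecar convention used by standards such as BIDS (participants.json); the column names, descriptions, and level codes are illustrative assumptions.

  import json

  # Data dictionary for a hypothetical participants.tsv file; descriptions,
  # units, and level codes are illustrative, not taken from a real study.
  data_dictionary = {
      "age": {
          "Description": "Age of the participant at the time of scanning",
          "Units": "years",
      },
      "handedness": {
          "Description": "Self-reported handedness",
          "Levels": {"L": "left-handed", "R": "right-handed"},
      },
  }

  with open("participants.json", "w") as f:
      json.dump(data_dictionary, f, indent=2)

Writing annotations in a structured, documented form like this keeps the meaning of every column explicit and lets collaborators and software interpret the data the same way you do.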