Software metadata extraction, consolidation and evaluation

⚠️ Under development. The whole project is currently being restructured to follow a clean architecture and to become more modular and scalable.

We have developed a pipeline that gathers metadata about research software specific to Computational Biology, harmonizes and integrates it, and then monitors certain features and evaluates their compliance with the FAIRsoft indicators. FAIRsoft is a set of research software FAIRness indicators specifically devised to be assessed automatically.

This repository contains the code for:

Harmonization of raw metadata.
Integration of pieces of metadata belonging to the same software: integration use case.
Calculation of FAIRsoft indicators compliance and FAIRsoft scores.
Evaluation of language models for software identity resolution.

The code for the previous steps can be found in the repositories specified as follows:

Data extraction: each importer, which is responsible for extracting metadata from a specific source, has a repository of its own:

Installation

Install the dependencies

pip install -r requirements.txt

Usage

Data transformation

Data transformation is one use case. It can be executed from the CLI as follows:

python3 src/adapters/cli/transformation.py -l INFO
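
The command above passes a -l option with the value INFO. As a minimal sketch, assuming that -l selects the log level, a CLI entry point could parse it as shown below. The long option name --loglevel, the function names and the list of accepted values are hypothetical and are not taken from src/adapters/cli/transformation.py.

```python
import argparse
import logging


def parse_args(argv=None):
    """Parse command-line options for an illustrative transformation CLI."""
    parser = argparse.ArgumentParser(
        description="Run the data transformation use case (illustrative sketch)."
    )
    parser.add_argument(
        "-l",
        "--loglevel",  # hypothetical long name; only -l appears in the README
        default="INFO",
        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
        help="Log level (assumed meaning of -l).",
    )
    return parser.parse_args(argv)


def main(argv=None):
    """Apply the requested level to the pipeline logger and return it."""
    args = parse_args(argv)
    # Map the level name (e.g. "INFO") to the numeric constant in logging.
    level = getattr(logging, args.loglevel)
    logging.getLogger("rs-etl-pipeline").setLevel(level)
    return level
```

Restricting the option with choices makes argparse reject an unknown level name with a usage error instead of failing later inside the pipeline.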

Data storage

Throughout the process, metadata is stored in a MongoDB database (INB Mongo oeb-research-software). The database connection is configured through environment variables.
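
The README does not list the environment variables, so the following is only a sketch of how such a configuration could be read. The variable names (MONGO_HOST, MONGO_PORT, MONGO_USER, MONGO_PWD, MONGO_DB), the defaults, and the use of oeb-research-software as the database name are all assumptions for illustration. Check the project's configuration code for the names it actually reads.

```python
import os
from dataclasses import dataclass
from urllib.parse import quote_plus


@dataclass(frozen=True)
class MongoSettings:
    """Connection settings collected from the environment."""

    host: str
    port: int
    user: str
    password: str
    database: str

    def uri(self) -> str:
        # Percent-encode credentials so characters such as "@" or ":"
        # do not break the connection string.
        credentials = ""
        if self.user:
            credentials = f"{quote_plus(self.user)}:{quote_plus(self.password)}@"
        return f"mongodb://{credentials}{self.host}:{self.port}/{self.database}"


def load_mongo_settings(env=os.environ) -> MongoSettings:
    # Every variable name and default here is hypothetical.
    return MongoSettings(
        host=env.get("MONGO_HOST", "localhost"),
        port=int(env.get("MONGO_PORT", "27017")),
        user=env.get("MONGO_USER", ""),
        password=env.get("MONGO_PWD", ""),
        database=env.get("MONGO_DB", "oeb-research-software"),
    )
```

The resulting URI can be handed to a MongoDB client. Passing the environment as a parameter keeps the function easy to test without touching the real process environment.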

Development

Testing

To run tests, go to the root directory of this repository and use:

PYTHONPATH=$(pwd) pytest -v -s tests/

The previous command runs all tests except those marked as "manual". To run the tests marked as "manual", use:

PYTHONPATH=$(pwd) pytest -v -s -m manual tests/

Logging

To add logging to a module, use:

import logging

logger = logging.getLogger("rs-etl-pipeline")

The logger configuration can be found in src/infrastructure/logging_config.py. INFO logs are written to the terminal and all the rest to a file (re_etl_pipeline.log).
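
The split between terminal and file output can be sketched with two handlers and level filters. This is an illustration under assumptions, not a copy of src/infrastructure/logging_config.py: the function name configure_logging, the message format and the filter approach are hypothetical.

```python
import logging
import sys


def configure_logging(log_file="re_etl_pipeline.log"):
    """Send INFO records to the terminal and every other level to a file."""
    logger = logging.getLogger("rs-etl-pipeline")
    logger.setLevel(logging.DEBUG)  # let handlers decide what to keep
    logger.handlers.clear()         # avoid duplicate handlers on repeated calls
    logger.propagate = False        # do not also emit through the root logger

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

    # Terminal handler: keep only records whose level is exactly INFO.
    console = logging.StreamHandler(sys.stdout)
    console.addFilter(lambda record: record.levelno == logging.INFO)
    console.setFormatter(formatter)

    # File handler: keep every record that is not INFO.
    file_handler = logging.FileHandler(log_file)
    file_handler.addFilter(lambda record: record.levelno != logging.INFO)
    file_handler.setFormatter(formatter)

    logger.addHandler(console)
    logger.addHandler(file_handler)
    return logger
```

After calling configure_logging() once at start-up, any module can obtain the same logger with logging.getLogger("rs-etl-pipeline"), as shown above.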

