Pipeline

Ontopop is a repository for experimenting with RAG-based scientific information extraction. Our current experiment includes the following steps:

Download: Downloads a snapshot from the ORKG.
Correct: Cleans the raw ORKG dataset dump so that it can be ingested into GraphDB without errors. Validation is done with the RDF4J 3.7.4 validator.
Ingest: Ingests the dataset into GraphDB.
Create_Dataset:
- Ontopop: Queries a subset from the ORKG and applies a set of preprocessing steps, resulting in the Ontopop dataset for evaluation. Also downloads the paper PDFs via links from the Ontopop dataset.
- Templates: Queries a subset from the ORKG, processes it, and outputs the Templates dataset.
Generate: Uses the Ontopop dataset to extract/generate property values for a given property. The context in the prompt includes the k most relevant snippets from the paper, the property description, and optionally the contribution label.
Evaluate: Computes the semantic similarity between paired sets of contributor property values and generated property values. Similarity is the cosine similarity of the respective sentence embeddings, aggregated with a max function and an average function. It also computes the average number of tokens per generated property value.
Visualize: Creates the following plots:
- Template usages grouped by the number of properties.
- Usage and utilization of the top 5% most used ORKG templates, with the Contribution template in more detail.
- Average semantic similarities between the contributor property values and the generated property values.
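
The Evaluate step can be illustrated with a minimal sketch. It assumes that every contributor value is compared against every generated value and that the max and the average are taken over all of these pairs; the repository may aggregate differently. The function names, the toy 2-d vectors standing in for sentence embeddings, and the whitespace tokenization are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def aggregate_similarity(contributor_embs, generated_embs):
    """Return (max, average) over all pairwise cosine similarities.

    Assumption: all contributor/generated pairs are compared.
    """
    sims = [cosine(c, g) for c in contributor_embs for g in generated_embs]
    if not sims:
        return 0.0, 0.0
    return max(sims), sum(sims) / len(sims)


def average_token_count(values):
    """Average number of whitespace-separated tokens per value.

    Assumption: whitespace tokenization; the real tokenizer may differ.
    """
    if not values:
        return 0.0
    return sum(len(v.split()) for v in values) / len(values)


# Toy vectors stand in for real sentence embeddings.
contributor = [[1.0, 0.0], [0.0, 1.0]]
generated = [[1.0, 0.0]]
best, mean = aggregate_similarity(contributor, generated)
print(best, mean)  # 1.0 0.5
print(average_token_count(["deep learning", "graph neural network"]))  # 2.5
```

The max aggregation rewards a single good match, while the average penalizes generated values that are unrelated to any contributor value.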

Reproducing the experiment

Requirements

We use Apptainer version 1.3.6-1.el9 for this experiment. It must be available in the target environment and can be installed following this documentation. The space and hardware requirements are:

60G of disk space for the data, i.e. the location that $DATA in .env points to.
A machine that can run 64 threads in parallel. Otherwise, adjust the last argument of the create_dataset command (the fourth command under Execution) to a different number of threads.
A CUDA-capable GPU. We use an NVIDIA A100.
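
The disk and thread requirements can be checked before starting, as in the following sketch. It is not part of the repository: `check_requirements` and its parameters are hypothetical names, treating 60G as 60 * 10^9 bytes is an assumption, and the GPU requirement is deliberately not checked here.

```python
import os
import shutil
import tempfile


def check_requirements(data_dir, min_free_gb=60, min_threads=64):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if not os.path.isdir(data_dir):
        problems.append(f"{data_dir} is not a directory")
        return problems
    # The pipeline needs to read from and write to the data directory.
    if not os.access(data_dir, os.R_OK | os.W_OK):
        problems.append(f"{data_dir} is not readable and writable")
    # Assumption: 60G is interpreted as 60 * 10**9 bytes.
    free_gb = shutil.disk_usage(data_dir).free / 1e9
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GB free, need {min_free_gb}")
    threads = os.cpu_count() or 1
    if threads < min_threads:
        problems.append(f"only {threads} threads available, need {min_threads}")
    return problems


# Usage: check the directory from the DATA environment variable if it is
# set, falling back to the system temp directory for demonstration.
for problem in check_requirements(os.environ.get("DATA", tempfile.gettempdir())):
    print("WARNING:", problem)
```

If the thread check fails, the thread count passed to create_dataset can be lowered instead of changing the machine.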

Execution

Open the .env file in the root directory of this project and update the variables $HOME and $DATA to match your local environment.

Give read and write permissions to your $DATA directory.

Build the apptainer image by executing the following command:

./setup.sh "datalab"

Execute each of the steps in Pipeline using a self-written orchestration script, similar to docker-compose:

apptainer-compose download "orkg"
apptainer-compose correct "orkg"
apptainer-compose ingest "orkg"
apptainer-compose create_dataset "ontopop" 64
apptainer-compose generate "tika" "meta-llama/Meta-Llama-3-8B-Instruct" "two_shot"
apptainer-compose generate "tika" "tiiuae/Falcon3-10B-Instruct" "two_shot"
apptainer-compose generate "tika" "mistralai/Mistral-7B-Instruct-v0.3" "two_shot"
apptainer-compose visualize "tika" "two_shot"
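
The commands above have to run in order, and each depends on the output of the previous one. A small driver that stops at the first failure is sketched below; it is not part of the repository. The names `build_commands`, `run_pipeline`, `extractor`, and `prompt_style` are hypothetical, and the meaning given to the "tika" and "two_shot" arguments is a guess. Only the command lines themselves are taken from the list above.

```python
import subprocess

# Models as listed in the generate commands above.
MODELS = [
    "meta-llama/Meta-Llama-3-8B-Instruct",
    "tiiuae/Falcon3-10B-Instruct",
    "mistralai/Mistral-7B-Instruct-v0.3",
]


def build_commands(threads=64, extractor="tika", prompt_style="two_shot"):
    """Build the pipeline command lines in execution order."""
    commands = [
        ["apptainer-compose", "download", "orkg"],
        ["apptainer-compose", "correct", "orkg"],
        ["apptainer-compose", "ingest", "orkg"],
        ["apptainer-compose", "create_dataset", "ontopop", str(threads)],
    ]
    for model in MODELS:
        commands.append(
            ["apptainer-compose", "generate", extractor, model, prompt_style]
        )
    commands.append(["apptainer-compose", "visualize", extractor, prompt_style])
    return commands


def run_pipeline(commands, runner=subprocess.run):
    """Run commands sequentially; check=True aborts on a non-zero exit code."""
    for command in commands:
        runner(command, check=True)


# Dry run: print the commands for a 32-thread machine without executing them.
for command in build_commands(threads=32):
    print(" ".join(command))
```

Calling `run_pipeline(build_commands())` would execute the steps, provided `apptainer-compose` is on the PATH.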

