
Metadata quality assessment of Deutsche Digitale Bibliothek metadata

A metadata quality assessment tool customized to the requirements of the Deutsche Digitale Bibliothek for its incoming metadata records. It is an extension of the Metadata Quality Assessment Framework.

Installation

The software depends on the following technologies:

  • MySQL
  • SQLite3
  • Java 11
  • R
  • Apache Solr
  • PHP

The following installation instructions work on Ubuntu. They contain simplified steps; for further details, please consult the official documentation of these tools.

Auxiliary tools

sudo apt install jq wget curl

MySQL

sudo apt install mysql-server
sudo service mysql start

SQLite 3

sudo apt install sqlite3

Java

sudo apt install openjdk-11-jre-headless

R

wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo gpg --dearmor -o /usr/share/keyrings/r-project.gpg
echo "deb [signed-by=/usr/share/keyrings/r-project.gpg] https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/" | sudo tee -a /etc/apt/sources.list.d/r-project.list
sudo apt update
sudo apt install --no-install-recommends r-base r-cran-tidyverse r-cran-stringr r-cran-gridextra

Apache Solr

export SOLR_VERSION=9.1.1
cd /opt
curl -s -L https://archive.apache.org/dist/solr/solr/${SOLR_VERSION}/solr-${SOLR_VERSION}.tgz --output solr-${SOLR_VERSION}.tgz
tar zxf solr-${SOLR_VERSION}.tgz
rm solr-${SOLR_VERSION}.tgz
ln -s solr-${SOLR_VERSION} solr

run Apache Solr

/opt/solr/bin/solr start -m 2g
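
To verify that Solr is running, you can query its core admin API (a standard Solr endpoint; at this point the list of cores may still be empty):

# optional check: list the cores known to the running Solr instance
curl 'http://localhost:8983/solr/admin/cores?action=STATUS&wt=json'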

PHP

sudo apt install php php-http-request2 php-mysql php-sqlite3

Installing the software

Download and unzip the release package.

wget https://github.com/pkiraly/metadata-qa-ddb/releases/download/v1.0.0/metadata-qa-ddb-1.0.0-release.zip
unzip metadata-qa-ddb-1.0.0-release.zip
cd metadata-qa-ddb-1.0.0-release

Configuration

Set the configuration file.

  1. Create a configuration file:
cp configuration.cnf.template configuration.cnf
  2. Edit the configuration file:
INPUT_DIR=<path to input directory>
OUTPUT_DIR=<path to output directory>

# FTP user name and password for the DDB FTP server
MQAF_FTP_USER=<FTP user name>
MQAF_FTP_PW=<FTP password>

# MySQL database settings
MQAF_DB_HOST=localhost
MQAF_DB_PORT=3306
MQAF_DB_DATABASE=<MySQL database name>
MQAF_DB_USER=<MySQL user name>
MQAF_DB_PASSWORD=<MySQL password>
# the type of configuration used when calling the 'mysql' command.
# valid values: LOCAL_FILE, TEMP_FILE or ENVIRONMENT_VARIABLES
MYSQL_CONFIG_TYPE=LOCAL_FILE

# Apache Solr settings
MQAF_SOLR_HOST=localhost
MQAF_SOLR_PORT=8983
# the prefix of the Solr core, default is 'qa_ddb'
# the cores will be named [SOLR_CORE_PREFIX]_[METADATA_SCHEMA], e.g. qa_ddb_lido, qa_ddb_marc etc.
MQAF_SOLR_CORE_PREFIX=ddb_qa

# validation related settings
MQAF_VALIDATION_PARAMS=

With MQAF_VALIDATION_PARAMS you can set individual parameters for running the validation. For example, to skip the image dimension check and the content type check, you can add:

MQAF_VALIDATION_PARAMS="--skipDimension --skipContentType"

Log in to MySQL, create a database and a dedicated user:

CREATE DATABASE ddb;

CREATE USER '<user name>'@'localhost' IDENTIFIED BY '<password>';
GRANT ALL PRIVILEGES ON ddb.* TO '<user name>'@'localhost' WITH GRANT OPTION;
FLUSH PRIVILEGES;
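
On a fresh Ubuntu installation of mysql-server, the MySQL root account typically authenticates via auth_socket, so you can usually open the shell for the statements above as shown below. Make sure the database, user name and password match MQAF_DB_DATABASE, MQAF_DB_USER and MQAF_DB_PASSWORD in configuration.cnf.

# open a MySQL root shell (typical on Ubuntu; adjust if your root account uses a password)
sudo mysql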

Running the software

Download files and import file information into the database

create the database tables

scripts/create_database.mysql.sh
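
A quick, optional sanity check is to list the freshly created tables (assuming the database is named ddb as in the example above):

# optional: verify that the tables exist
mysql -u <user name> -p ddb -e 'SHOW TABLES;'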

download files from FTP server

scripts/ingest/01_download_from_ftp.sh

unzip the downloaded zip files

scripts/ingest/02_extract_downloaded_files.sh

extract file information (path, size etc.) from the directory. The file paths contain semantic information about the data providers. The following data elements are extracted: file path, metadata schema, provider identifier, provider name, data set identifier, data set name, last modification date, and file size. These data are saved into the $OUTPUT_DIR/files.csv file.

scripts/ingest/03_extract_basic_info_from_downloaded_files.sh
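
For illustration only, a files.csv row might look like the sketch below; the header names and column order are assumptions based on the field list above, and the values are invented.

file,schema,provider_id,provider_name,dataset_id,dataset_name,last_modified,size
lido/00123/4567/records.xml,lido,00123,Example Museum,4567,Example collection,2023-05-04,1048576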

import file info into MySQL (it first transforms CSV to SQL)

scripts/ingest/04_import_basic_info.mysql.sh

(optional) harvest Europeana-EDM records for each data set

scripts/ingest/05_harvest_edm.mysql.sh

Index and store basic data

These commands will index the following fields into Apache Solr: identifier, provider identifier and title. Which fields are indexed is defined by the schema's YAML definition file in the main/resources directory. The record ID and the container file are stored in MySQL. The record's full XML representation is stored in a SQLite3 database; it is needed only for displaying records in the web user interface. (A quick way to spot-check the Solr index is shown after the commands below.)

index DDB-EDM records

scripts/index/01_index_ddb-edm.sh

index MARC records

scripts/index/02_index_marc.sh

index DDB-DC records

scripts/index/03_index_ddb-dc.sh

index LIDO records

scripts/index/04_index_lido.sh

index METS-MODS records

scripts/index/05_index_mets-mods.sh
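
After indexing, a quick spot-check of one Solr core is to count its documents; the core name below follows the [SOLR_CORE_PREFIX]_[METADATA_SCHEMA] pattern with the default prefix (e.g. qa_ddb_lido), so adjust it to your MQAF_SOLR_CORE_PREFIX.

# count the documents indexed into the LIDO core (core name depends on your prefix)
curl 'http://localhost:8983/solr/qa_ddb_lido/select?q=*:*&rows=0'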

Run quality assessment

The quality assessment reads from the $INPUT_DIR directory and writes the results as CSV files, such as edm-ddb.csv, to the $OUTPUT_DIR directory. (A quick check of the output is shown after the commands below.)

quality assessment of DDB-EDM records

scripts/process/01_process_ddb-edm.sh

quality assessment of MARC

scripts/process/02_process_marc.sh

quality assessment of DDB-DC

scripts/process/03_process_ddb-dc.sh

quality assessment of LIDO

scripts/process/04_process_lido.sh

quality assessment of METS-MODS

scripts/process/05_process_mets-mods.sh
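
Before importing the results, you can take a quick look at one of the generated CSV files (substitute the OUTPUT_DIR value from configuration.cnf if the variable is not exported in your shell):

# list the result files and inspect the first lines of one of them
ls -l $OUTPUT_DIR
head -n 5 $OUTPUT_DIR/edm-ddb.csv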

Importing measurement results into the database

import DDB-EDM

scripts/process/11_import_ddb-edm.mysql.sh

import MARC

scripts/process/11_import_marc.mysql.sh

import DDB-DC

scripts/process/11_import_dc.mysql.sh

import LIDO

scripts/process/11_import_lido.mysql.sh

import METS-MODS

scripts/process/11_import_mets-mods.mysql.sh

calculate aggregated results

scripts/process/12_calculate_aggregations.mysql.sh

Dockerized version

The tool can also be run in a dockerized fashion. The most straightforward way to do that is using docker compose; a minimal usage sketch follows the component list below. The setup contains the following components:

  • mqaf-ddb-db: a MySQL server container (it uses mysql:latest from Docker Hub)
  • mqaf-ddb-solr: an Apache Solr server container (it uses solr:9.6.1 from Docker Hub)
  • mqaf-ddb-cli: the application backend that contains a command line interface (it uses metadata-qa-ddb:main from GitHub)
  • mqaf-ddb-report: the application web frontend (it uses metadata-qa-ddb-web:v2.0 from GitHub)
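
A minimal sketch of running the stack, assuming the compose file shipped with the repository and an .env file holding the variables described in the next section (the compose service names may differ from the container names listed above):

# start all four services in the background
docker compose up -d
# check that the db, solr, cli and report containers are running
docker compose ps
# follow the backend logs (service name is an assumption)
docker compose logs -f mqaf-ddb-cli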

Variables

MySQL server container

  • MQAF_DB_CONTAINER: the name of the MySQL container (default: mqaf-ddb-db)
  • MQAF_DB_PORT: the MySQL port (default: 3307)
  • MQAF_DB_DATABASE: database name (default: ddb)
  • MQAF_DB_USER: database user (default: ddb)
  • MQAF_DB_PASSWORD: database user password (default: ddb)

Apache Solr server container

  • MQAF_SOLR_CONTAINER: the name of the Apache Solr container (default: mqaf-ddb-solr)
  • MQAF_SOLR_PORT: the Apache Solr port (default: 8983)
  • MQAF_SOLR_DATA: the Apache Solr data directory outside the container (default: ./solr-data)
  • MQAF_SOLR_ENTRY: a directory outside the container containing scripts (e.g. for copying config files or setting variables) that Apache Solr runs at setup (default: ./docker-configuration/solr)

application backend container

  • MQAF_CLI_CONTAINER: the name of the backend container (default: mqaf-ddb-cli)
  • MQAF_CLI_IMAGE: the name of the image used for the container (default: ghcr.io/pkiraly/metadata-qa-ddb:main)
  • DDB_INPUT: the input data directory that will be mounted to both backend and frontend (default: test-ddb/input)
  • DDB_OUTPUT: the output data directory that will be mounted to both backend and frontend (default: test-ddb/output)
  • MQAF_SOLR_CORE_PREFIX: the prefix for the Solr index names (default: ddb-qa)
  • MQAF_VALIDATION_PARAMS: validation parameters (default: not set)
  • MQAF_DATA: (default: not set)

application web frontend container

  • MQAF_REPORT_CONTAINER: the name of the web frontend container (default: mqaf-ddb-report)
  • MQAF_REPORT_IMAGE: the name of the image used for the container (default: ghcr.io/pkiraly/metadata-qa-ddb-web:v2.0)
  • MQAF_REPORT_WEBPORT: the web port (default: 90)
  • DDB_CONFIG: the configuration directory (default: ./test-ddb/config)
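
Putting the defaults together, a hypothetical .env file for docker compose could look like the sketch below. Only the variables you want to override need to be set, and the assumption that docker compose picks them up from an .env file next to the compose file should be checked against the repository.

# hypothetical .env with the defaults listed above
MQAF_DB_PORT=3307
MQAF_DB_DATABASE=ddb
MQAF_DB_USER=ddb
MQAF_DB_PASSWORD=ddb
MQAF_SOLR_PORT=8983
MQAF_SOLR_DATA=./solr-data
DDB_INPUT=test-ddb/input
DDB_OUTPUT=test-ddb/output
MQAF_SOLR_CORE_PREFIX=ddb-qa
MQAF_REPORT_WEBPORT=90
DDB_CONFIG=./test-ddb/config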
