This repository contains all the code and instructions you'll need to begin your journey on self-hosting data from the Semantic Scholar datasets API using free AWS services. This code is provided by the team at Moara.io as a thank you to Semantic Scholar and Ai2 for their efforts in propelling the world forward. Enjoy building!
These instructions will walk you through:
Downloading the Semantic Scholar datasets → Creating searchable tables → Querying the data → Optionally joining datasets to create more meaningful views.
- Python 3.8 or higher
- Semantic Scholar API Key (You can request it from Semantic Scholar)
- An AWS Account
- Basic understanding of Python and SQL
SS-self-hosting/
│
├── src/ # Folder for source code
│ ├── download_datasets.py # Script to download and upload datasets into AWS S3.
│ ├── query_datasets.py # Script to query saved data using AWS Athena
│
├── config/ # Configuration files
│ └── .env.template # Template for environment variables
│
├── requirements.txt # Python package dependencies at the project root
├── README.md # Comprehensive setup and usage instructions
├── LICENSE # License for the repository
├── samplelines.json # Example lines from all 10 datasets.
Start by cloning the repository to your local machine:
git clone https://github.com/moaraio/SS-self-hosting.git
cd SS-self-hosting
To ensure a clean workspace and avoid conflicts between dependencies, create and activate a virtual environment:
python -m venv venv
On macOS/Linux:
source venv/bin/activate
On Windows:
venv\Scripts\activate
Once activated, your terminal prompt will be prefixed with (venv) to indicate that you're working within the virtual environment.
With the virtual environment activated, install the required Python packages by running:
pip install -r requirements.txt
This will install boto3, pandas, requests, and python-dotenv, which are needed for working with AWS services, handling data, and downloading datasets. It will also install tqdm, which displays progress bars while the datasets are downloaded and uploaded.
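For reference, the requirements.txt at the project root covers these packages. The file in the repository is authoritative and may pin specific versions, but it amounts to roughly:
boto3
pandas
requests
python-dotenv
tqdm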
Once your AWS account is created, follow these steps to create an IAM user with the necessary permissions:
- Log in to the AWS Management Console.
- In the search bar, type IAM and select the IAM service.
- On the left sidebar, select Users and click Create User.
- Enter a User Name (e.g., your-name).
- Click Next.
- On the Set permissions page, choose Attach policies directly option.
- In the search box, search for AmazonS3FullAccess.
- Select the checkbox next to AmazonS3FullAccess.
- Search for AmazonAthenaFullAccess and select the checkbox for that policy.
- Click Next.
- Review the details of the user and make sure the correct policies are attached: AmazonS3FullAccess and AmazonAthenaFullAccess.
- Click Create User.
- Once the user is created, open the user details by clicking the user's name.
- In the summary pane, select Create access key.
- From the use case menu, select Local code.
- Select the confirmation and click Next.
- Enter a brief description and click Create access key.
- You will see the Access Key ID and Secret Access Key. Download these credentials as a .csv file or copy them to a secure location. You will use these credentials to configure your AWS access in the .env file for this project.
After creating the AWS IAM user and retrieving your Semantic Scholar API key, you can now configure your .env file.
cp config/.env.template .env
AWS_REGION=your-aws-region
AWS_ACCESS_KEY_ID=your-access-key-id
AWS_SECRET_ACCESS_KEY=your-secret-access-key
S3_BUCKET_NAME=your-s3-bucket-name
ATHENA_OUTPUT_BUCKET=s3://<your-s3-bucket-name>/query-results/
SEMANTIC_SCHOLAR_API_KEY=your-api-key
Make sure to replace your-access-key-id, your-secret-access-key, and the other placeholders with the appropriate values. Note that you will use your S3 bucket name twice: once for S3_BUCKET_NAME and once inside ATHENA_OUTPUT_BUCKET.
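If you'd like to confirm that these values are picked up correctly before downloading anything, a quick sanity check along the following lines can help. This is a minimal sketch, not part of the repository; it assumes you run it from the project root where the .env file lives.
import os
import boto3
from dotenv import load_dotenv

# Load the variables defined in .env into the process environment.
load_dotenv()

# Ask STS who these credentials authenticate as; this fails fast if the
# access key, secret key, or region is wrong.
sts = boto3.client(
    "sts",
    region_name=os.getenv("AWS_REGION"),
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
)
print("Authenticated as:", sts.get_caller_identity()["Arn"])
print("Target bucket:", os.getenv("S3_BUCKET_NAME"))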
Once the setup is complete, you can download datasets from Semantic Scholar and upload them to your S3 bucket.
The script download_datasets.py will automatically check whether the bucket exists. If the bucket does not exist, the script will create it for you. This ensures the process is streamlined and ready for downloading the datasets.
Run the following script:
python src/download_datasets.py
This script will:
- Check if the S3 bucket exists; if not, it will create the bucket, named according to S3_BUCKET_NAME in your .env file (see the sketch after this list).
- Download/Stream the papers and abstracts datasets from Semantic Scholar.
- Upload these datasets to the specified S3 bucket.
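Conceptually, the bucket check boils down to something like the sketch below. This is a simplified illustration only; the actual implementation in src/download_datasets.py is the source of truth.
import os
import boto3
from botocore.exceptions import ClientError
from dotenv import load_dotenv

load_dotenv()
region = os.getenv("AWS_REGION")
bucket = os.getenv("S3_BUCKET_NAME")
s3 = boto3.client("s3", region_name=region)

try:
    # head_bucket succeeds only if the bucket exists and is accessible.
    s3.head_bucket(Bucket=bucket)
    print(f"Bucket {bucket} already exists.")
except ClientError:
    # Outside us-east-1, S3 requires an explicit LocationConstraint.
    if region == "us-east-1":
        s3.create_bucket(Bucket=bucket)
    else:
        s3.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
    print(f"Created bucket {bucket}.")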
If you prefer to manually create the S3 bucket, follow these steps:
- Log in to the AWS Management Console.
- Navigate to S3.
- Click Create bucket.
- Name your bucket (e.g., my-semanticscholar-bucket), select a region, and click Create.
Important: Once your S3 bucket is ready, ensure that S3_BUCKET_NAME and ATHENA_OUTPUT_BUCKET in your .env file match the name of the bucket you created manually. After creating the bucket, you must still run the following script to download and upload the datasets to your S3 bucket:
python src/download_datasets.py
Once you have uploaded your datasets to S3, the next step is to query this data using AWS Athena.
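As a preview, querying Athena from Python follows a start-poll-fetch pattern, roughly as sketched below. The database and table names here are placeholders for illustration; src/query_datasets.py handles the real table setup and queries for you.
import os
import time
import boto3
from dotenv import load_dotenv

load_dotenv()
athena = boto3.client("athena", region_name=os.getenv("AWS_REGION"))

# Start the query; Athena writes result files to ATHENA_OUTPUT_BUCKET.
execution = athena.start_query_execution(
    QueryString="SELECT * FROM papers LIMIT 10",             # placeholder table
    QueryExecutionContext={"Database": "semantic_scholar"},  # placeholder database
    ResultConfiguration={"OutputLocation": os.getenv("ATHENA_OUTPUT_BUCKET")},
)
query_id = execution["QueryExecutionId"]

# Poll until the query reaches a terminal state, then fetch the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(f"Fetched {len(rows)} rows (the first row is the column header).")
else:
    print("Query did not succeed:", state)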