This repository contains all the code and instructions you'll need to begin your journey on self-hosting data from the Semantic Scholar datasets API using free AWS services. This code is provided by the team at Moara.io as a thank you to Semantic Scholar and Ai2 for their efforts in propelling the world forward. Enjoy building!
These instructions will walk you through:
Downloading the Semantic Scholar datasets → Creating searchable tables → Querying the data → Optionally joining datasets to create more meaningful views.
- Python 3.8 or higher
- Semantic Scholar API Key (You can request it from Semantic Scholar)
- An AWS Account
- Basic understanding of Python and SQL
SS-self-hosting/
│
├── src/ # Folder for source code
│ ├── download_datasets.py # Script to download and upload datasets into AWS S3.
│ ├── query_datasets.py # Script to query saved data using AWS Athena
│
├── config/ # Configuration files
│ └── .env.template # Template for environment variables
│
├── requirements.txt # Python package dependencies at the project root
├── README.md # Comprehensive setup and usage instructions
├── LICENSE # License for the repository
├── samplelines.json # Example lines from all 10 datasets.
Start by cloning the repository to your local machine:
git clone https://github.com/moaraio/SS-self-hosting.git
cd SS-self-hosting
To ensure a clean workspace and avoid conflicts between dependencies, create and activate a virtual environment:
python -m venv venv
On macOS/Linux:
source venv/bin/activate
On Windows:
venv\Scripts\activate
Once activated, your terminal prompt will be prefixed with (venv) to indicate that you're working within the virtual environment.
With the virtual environment activated, install the required Python packages by running:
pip install -r requirements.txt
This will install boto3, pandas, requests, and python-dotenv, which are needed for working with AWS services, handling data, and downloading datasets. It will also install tqdm, which displays progress bars while the datasets are downloaded and uploaded.
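For reference, the requirements.txt at the project root covers these packages. The file in the repository is authoritative and may pin specific versions, but it amounts to roughly:
boto3
pandas
requests
python-dotenv
tqdm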
Once your AWS account is created, follow these steps to create an IAM user with the necessary permissions:
- Log in to the AWS Management Console.
- In the search bar, type IAM and select the IAM service.
- On the left sidebar, select Users and click Create User.
- Enter a User Name (e.g., your-name).
- Click Next.
- On the Set permissions page, choose Attach policies directly option.
- In the search box, search for AmazonS3FullAccess.
- Select the checkbox next to AmazonS3FullAccess.
- Search for AmazonAthenaFullAccess and select the checkbox for that policy.
- Click Next.
- Review the details of the user and make sure the correct policies are attached: AmazonS3FullAccess and AmazonAthenaFullAccess.
- Click Create User.
- Once the user is created, open the user details by clicking the user's name.
- In the summary pane, select Create access key.
- From the use case menu, select Local code.
- Select the confirmation and click Next.
- Enter a brief description and click Create access key.
- You will see the Access Key ID and Secret Access Key. Download these credentials as a .csv file or copy them to a secure location. You will use these credentials to configure your AWS access in the .env file for this project.
After creating the AWS IAM user and retrieving your Semantic Scholar API key, you can now configure your .env file.
cp config/.env.template .env
AWS_REGION=your-aws-region
AWS_ACCESS_KEY_ID=your-access-key-id
AWS_SECRET_ACCESS_KEY=your-secret-access-key
S3_BUCKET_NAME=your-s3-bucket-name
ATHENA_OUTPUT_BUCKET=s3://<your-s3-bucket-name>/query-results/
SEMANTIC_SCHOLAR_API_KEY=your-api-key
Make sure to replace your-access-key-id, your-secret-access-key, and the other placeholders with the appropriate values. Note that you will use your S3 bucket name twice: once for S3_BUCKET_NAME and once inside ATHENA_OUTPUT_BUCKET.
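If you'd like to confirm that these values are picked up correctly before downloading anything, a quick sanity check along the following lines can help. This is a minimal sketch, not part of the repository; it assumes you run it from the project root where the .env file lives.
import os
import boto3
from dotenv import load_dotenv

# Load the variables defined in .env into the process environment.
load_dotenv()

# Ask STS who these credentials authenticate as; this fails fast if the
# access key, secret key, or region is wrong.
sts = boto3.client(
    "sts",
    region_name=os.getenv("AWS_REGION"),
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
)
print("Authenticated as:", sts.get_caller_identity()["Arn"])
print("Target bucket:", os.getenv("S3_BUCKET_NAME"))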
Once the setup is complete, you can download datasets from Semantic Scholar and upload them to your S3 bucket.
The script download_datasets.py will automatically check whether the bucket exists. If the bucket does not exist, the script will create it for you. This ensures the process is streamlined and ready for downloading the datasets.
Run the following script:
python src/download_datasets.py
This script will:
- Check if the S3 bucket exists; if not, it will create the bucket, named according to S3_BUCKET_NAME in your .env file (see the sketch after this list).
- Download/Stream the papers and abstracts datasets from Semantic Scholar.
- Upload these datasets to the specified S3 bucket.
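Conceptually, the bucket check boils down to something like the sketch below. This is a simplified illustration only; the actual implementation in src/download_datasets.py is the source of truth.
import os
import boto3
from botocore.exceptions import ClientError
from dotenv import load_dotenv

load_dotenv()
region = os.getenv("AWS_REGION")
bucket = os.getenv("S3_BUCKET_NAME")
s3 = boto3.client("s3", region_name=region)

try:
    # head_bucket succeeds only if the bucket exists and is accessible.
    s3.head_bucket(Bucket=bucket)
    print(f"Bucket {bucket} already exists.")
except ClientError:
    # Outside us-east-1, S3 requires an explicit LocationConstraint.
    if region == "us-east-1":
        s3.create_bucket(Bucket=bucket)
    else:
        s3.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
    print(f"Created bucket {bucket}.")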
If you prefer to manually create the S3 bucket, follow these steps:
- Log in to the AWS Management Console.
- Navigate to S3.
- Click Create bucket.
- Name your bucket (e.g., my-semanticscholar-bucket), select a region, and click Create.
Important: Once your S3 bucket is ready, ensure that S3_BUCKET_NAME and ATHENA_OUTPUT_BUCKET in your .env file match the name of the bucket you created manually. After creating the bucket, you must still run the following script to download and upload the datasets to your S3 bucket:
python src/download_datasets.py
Once you have uploaded your datasets to S3, the next step is to query this data using AWS Athena.
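As a preview, querying Athena from Python follows a start-poll-fetch pattern, roughly as sketched below. The database and table names here are placeholders for illustration; src/query_datasets.py handles the real table setup and queries for you.
import os
import time
import boto3
from dotenv import load_dotenv

load_dotenv()
athena = boto3.client("athena", region_name=os.getenv("AWS_REGION"))

# Start the query; Athena writes result files to ATHENA_OUTPUT_BUCKET.
execution = athena.start_query_execution(
    QueryString="SELECT * FROM papers LIMIT 10",             # placeholder table
    QueryExecutionContext={"Database": "semantic_scholar"},  # placeholder database
    ResultConfiguration={"OutputLocation": os.getenv("ATHENA_OUTPUT_BUCKET")},
)
query_id = execution["QueryExecutionId"]

# Poll until the query reaches a terminal state, then fetch the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(f"Fetched {len(rows)} rows (the first row is the column header).")
else:
    print("Query did not succeed:", state)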