In this data project you will write a single file of code to import a bundle of fifty (50) separate JSON files into a new collection within a MongoDB database. Each file contains one or more records.
The script you write should import every individual record when executed once. You cannot run the script 50 separate times to import all files and records.
The data for import can be found in the `data/` directory of this repository.
Follow the steps below carefully and create a solution of your own.
Fork this repo so that you are working with your own copy of the code and can add, commit, and push freely. You will be submitting the URL to your fork for grading.
Open your repository in Gitpod by appending `https://gitpod.io/#` before the GitHub URL.
You are free to write this assignment in a `bash` script using a CLI tool called `mongoimport`, in Python using `pymongo`, or both.
See this page in Canvas for how to set up connection credentials if you need them.
DO NOT commit the password to your GitHub repository!
For command-line access to MongoDB you have `MONGO-ATLAS` as an available command pre-built into your Gitpod configuration.
If using Python in Gitpod you already have the `MONGOPASS` environment variable and can then connect to a db/collection in Atlas using this code:
```python
from pymongo import MongoClient, errors
from bson.json_util import dumps
import os
import json

MONGOPASS = os.getenv('MONGOPASS')
uri = "mongodb+srv://cluster0.pnxzwgz.mongodb.net/"
client = MongoClient(uri, username='nmagee', password=MONGOPASS, connectTimeoutMS=200, retryWrites=True)

# specify a database
db = client.<db-name>

# specify a collection
collection = db.<collection-name>
```
If using `mongoimport` within a `bash` script in Gitpod, you should create a new Gitpod User Variable with your full URI (including username, password, host address, and DB name).
For instance, you could create a new variable named `MONGODB` with a value of `mongodb+srv://nmagee:xxxxxxxxxxx@cluster0.pnxzwgz.mongodb.net/<YOUR-DB>`. Note that you should replace the `xxxxxxxxxxx` with the actual password, and replace `<YOUR-DB>` with the name of your MongoDB database. Your URI must specify a database at the end.
Then you can call this URI within a terminal command to connect automatically with each invocation, like this:

```bash
mongoimport --uri $MONGODB --collection test data.json --jsonArray
```
It is up to you to determine how you want to import the fifty files within `data/`. But here are some methods you might want to draw from:
In `bash` you can traverse a directory's contents, item by item:

```bash
for file in data/*
do
  echo "$file"
done
```
This means that each file name, in the course of each for-loop, becomes a variable you can use or pass to other code or commands.
For instance, this version of the above code would pass the file name of each file into a separate Python script:

```bash
for file in data/*
do
  /usr/bin/python3 my-import-script.py "$file"
done
```
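As a sketch of the receiving side, a hypothetical `my-import-script.py` could read the file name from `sys.argv`. Here it simply counts the records in the file — the script name and the counting behavior are illustrative assumptions; your real script would insert into MongoDB instead:

```python
import json
import sys

def count_records(filepath):
    """Parse one JSON file and report how many records it holds."""
    with open(filepath) as fh:
        data = json.load(fh)
    # a file may hold a single document or a list of documents
    return len(data) if isinstance(data, list) else 1

if __name__ == "__main__" and len(sys.argv) > 1:
    # the bash loop passes each file name as the first argument
    print(count_records(sys.argv[1]))
```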
In Python you can also traverse a directory's contents, item by item:

```python
import os

path = "data"
for (root, dirs, files) in os.walk(path):
    for f in files:
        print(f)
```
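Note that `os.walk` yields bare file names without their directory, so you typically rejoin them with `os.path.join` before opening them. A small sketch — the helper name `collect_json_files` is my own:

```python
import os

def collect_json_files(path):
    """Walk a directory tree and return full paths to its .json files."""
    found = []
    for root, dirs, files in os.walk(path):
        for name in files:
            if name.endswith(".json"):
                # os.walk yields bare names, so rebuild the full path
                found.append(os.path.join(root, name))
    return sorted(found)
```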
To import a single JSON file into MongoDB using `mongoimport`, use this syntax:

```bash
mongoimport --uri $MONGODB --collection test data.json --jsonArray
```
Note these details:

- The DB name should already be specified as part of the URI you created above.
- The collection should be specified in the command. You can create a new collection by simply naming it here. This example imports into a collection named "test".
- The data file itself is then passed.
- Finally, you must indicate that you are passing in a `--jsonArray` for the command to succeed.
To import a single JSON file into MongoDB using `pymongo` in a Python script, use this syntax:

```python
# assuming you have defined a connection to your db and collection already:

# load the JSON file
with open('data.json') as file:
    file_data = json.load(file)

# insert the loaded data into the collection:
# insert_many if the JSON contains more than one entry,
# insert_one otherwise
if isinstance(file_data, list):
    collection.insert_many(file_data)
else:
    collection.insert_one(file_data)
```
You may encounter an error when importing this stack of files. Your code should be able to handle this without breaking. You do not need to log your errors but your code should continue processing additional records as much as possible.
Remember to assess where in your flow the errors occur, and handle them accordingly at each/any/every point with separate handlers as needed.
If you cannot elegantly handle the errors you may want to change your approach, or even your programming language.
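One way to keep the run going past a corrupted file is to isolate the parsing step in its own `try`/`except`. A sketch — the helper name `load_documents` is my own; on the insert side, calling `insert_many(docs, ordered=False)` and catching `pymongo.errors.BulkWriteError` lets valid documents land even when some in a batch fail:

```python
import json

def load_documents(filepath):
    """Parse one JSON file; return its documents as a list,
    or an empty list if the file is corrupted."""
    try:
        with open(filepath) as fh:
            data = json.load(fh)
    except json.JSONDecodeError:
        # corrupted JSON: skip this file but keep processing the rest
        return []
    # normalize: a file may hold one document or a list of them
    return data if isinstance(data, list) else [data]
```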
As you test your code, MongoDB will not allow you to re-import files that have already been imported. To drop the collection so you can test again, open `mongosh` by typing `MONGO-ATLAS` in the Gitpod terminal. Then:

```
use nem2p # specify your database name
db.COLLECTION.drop() # where COLLECTION is the name of your collection
```
After successfully importing all complete records contained in all the files, determine how many:
- complete documents have been imported into your collection
- complete documents could not be imported
- corrupted documents exist in the fileset
These counts can be calculated manually and entered by hand; your code does not need to output this file. Note the number of records for each category in a file named `count.txt` and commit it to your repository.
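Although the counts may be entered by hand, if your script already tracks them you could write `count.txt` directly — a sketch in which the category names and zero values are placeholder assumptions:

```python
# hypothetical tallies accumulated while your import script ran
counts = {"imported": 0, "not_imported": 0, "corrupted": 0}

# write one "category: total" line per tally to count.txt
with open("count.txt", "w") as out:
    for category, total in counts.items():
        out.write(f"{category}: {total}\n")
```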
Add, commit, and push your working script and the `count.txt` file to your fork of the repository.
Submit the GitHub URL of your fork for grading. This DP is worth 17 possible points.
| Element | Points |
|---|---|
| Setup | 1 |
| Connection | 2 |
| Looping through files | 3 |
| Inserting into DB | 4 |
| Error Handling | 4 |
| Count accuracy | 2 |