The goal of this project is to assess Wikipedia article comments and label them as "semantics" or "syntax". For this purpose we have built the following components:
- Wikipedia events listener
- Active learning (for labeling of data)
- Machine learning classification
NOTE: This project is not intended for use in production and comes without any warranty. It is supposed to be a learning playground for understanding the building blocks of a common ML project.
The fastest way to set up and test everything is by using Docker Linux containers. Technically, this also means you should be able to test it on any platform of your choice (Windows/Linux/Mac). However, the following instructions have only been tested on macOS (mostly Mojave).
The order in which containers are going to run is important. Please follow the instructions in the same sequence as they are provided.
Make sure that the following software is already installed. An active Azure subscription (Trial, Visual Studio Enterprise, etc.) is also needed to store the files that are generated as part of the data processing and ML modeling/training.
- Azure CLI 2.0.52+
- Git 2.17.2+
- Docker for Mac 18.09.0+
- Python 3.6+
- Download the code:

  ```shell
  git clone https://github.com/rbinrais/py-ml.git && cd py-ml
  ```
- Log in to Azure using the CLI. Skip this step if you are already logged in.

  ```shell
  az login
  ```
- Create a new resource group. You may want to choose a location (`-l`) that is closest to your physical location.

  ```shell
  az group create -l eastus -n pyml-rg
  ```
- Create a new storage account. You may want to choose a location (`-l`) that is closest to your physical location.

  ```shell
  az storage account create -n pyml -g pyml-rg -l eastus --sku Standard_LRS
  ```
- Read the storage account keys. Run the following command and copy the value of either key1 or key2. You will need this key later whenever a reference to AZURE-STORAGE-KEY is made.

  ```shell
  az storage account keys list -n pyml -o table
  ```
NOTE: In the instructions below you need to replace the REGISTRY placeholder with the name of your preferred Docker registry. You can also omit the registry name if you don't want to push the image to a remote Docker registry. I have uploaded all the container images to the public Docker Hub under the registry named rbinrais. However, you can use Azure Container Registry or any other private registry for that matter. Also, replace the AZURE-STORAGE-KEY placeholder with the actual storage key captured earlier.
- First, create a directory that will store Wikipedia comments and titles in CSV format.

  ```shell
  mkdir wikipedia
  ```
- Build the container:

  ```shell
  docker build -t <<REGISTRY>>/pyml-wikipedia:1.0 -f Dockerfile.Wiki .
  ```
- Run the container. Replace PATH with the full path to the wikipedia folder; you can always run the `pwd` command to get it.

  ```shell
  docker run --name "pyml-wikipedia" --rm -it -e "UPLOAD_TO_AZURE_BLOB=True" -e "TIMEOUT_IN_SEC=15" -e "AZURE_STORAGE_KEY=<<AZURE-STORAGE-KEY>>" -e "AZURE_STORAGE_ACCOUNT=pyml" -e "AZURE_CONTAINER_NAME=wikidata" -v <<PATH>>/wikipedia:/usr/src/app/wikipedia <<REGISTRY>>/pyml-wikipedia:1.0
  ```
At this point the container will start listening to events as they are generated by the Wikipedia event stream. Comments, along with the article titles, are displayed on the screen. Wait until the container finishes processing the events, then check the wikipedia folder for the comments.csv and titles.csv files.
These files are also uploaded to the Azure blob storage account `pyml` under a container named `wikidata`. You can list all the blob files stored in the `pyml` Azure storage account under the `wikidata` container by using the command below. Alternatively, you can also view these files by browsing to the Azure Portal: https://portal.azure.com

```shell
az storage blob list --container-name wikidata --account-name pyml --auth-mode key -o table
```
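For reference, the core of the events listener can be sketched in Python. The snippet below only illustrates parsing a recentchange event from the Wikimedia EventStreams feed and appending rows to the two CSV files; the `title` and `comment` fields are part of the public recentchange schema, but the helper names and exact CSV layout here are assumptions, not the container's actual code.

```python
import csv
import json

# Each recent-change event from the Wikimedia EventStreams feed
# (https://stream.wikimedia.org/v2/stream/recentchange) is a JSON object.
def extract_row(event_json: str):
    """Return a (title, comment) pair, or None if the event has no comment."""
    event = json.loads(event_json)
    comment = event.get("comment", "").strip()
    if not comment:
        return None
    return event.get("title", ""), comment

def append_rows(rows, comments_path="wikipedia/comments.csv",
                titles_path="wikipedia/titles.csv"):
    """Append comments and titles to the two CSV files the later steps expect."""
    with open(comments_path, "a", newline="") as cf, \
         open(titles_path, "a", newline="") as tf:
        comment_writer, title_writer = csv.writer(cf), csv.writer(tf)
        for title, comment in rows:
            comment_writer.writerow([comment])
            title_writer.writerow([title])

sample = '{"title": "Python (programming language)", "comment": "fix typo in lead"}'
row = extract_row(sample)  # → ("Python (programming language)", "fix typo in lead")
```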
You are now going to build and run a container that enables you to label comments manually as these labels are needed in the next step.
- Build the container:

  ```shell
  docker build -t <<REGISTRY>>/pyml-manually_label:1.0 -f Dockerfile.Manual .
  ```
- Run the container:

  ```shell
  docker run --name "pyml-manually_label" --rm -it -e "AZURE_STORAGE_KEY=<<AZURE-STORAGE-KEY>>" -e "AZURE_STORAGE_ACCOUNT=pyml" -e "AZURE_CONTAINER_NAME=wikidata" -v <<PATH>>/wikipedia:/usr/src/app/wikipedia <<REGISTRY>>/pyml-manually_label:1.0
  ```
As soon as the container starts, the console presents a prompt asking you to provide a label for each comment (stored in the comments.csv file from the previous step), one at a time. The choices are either "syntax" or "semantics" (without quotes). Use your best judgement to mark each comment. Press Enter to move to the next comment.
After all of the comments are labeled, the output file comments_with_labels.csv will be generated and stored on the Azure blob storage.
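The manual labeling loop amounts to something like the following sketch. The function names and output column layout are assumptions (the real container's prompt may differ); the `ask` parameter exists only so the loop can be exercised without a live console.

```python
import csv

VALID_LABELS = {"syntax", "semantics"}

def label_comments(comments, ask=input):
    """Prompt for a label for each comment; re-ask until a valid label is given."""
    labeled = []
    for comment in comments:
        label = ""
        while label not in VALID_LABELS:
            label = ask(f'Label for "{comment}" (syntax/semantics): ').strip().lower()
        labeled.append((comment, label))
    return labeled

def save(labeled, path="wikipedia/comments_with_labels.csv"):
    """Write (comment, label) pairs to the output CSV."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["comment", "label"])
        writer.writerows(labeled)

# Example with a canned "user" instead of the real console prompt:
answers = iter(["syntax", "semantics"])
result = label_comments(["fix typo", "rewrite intro"], ask=lambda _: next(answers))
```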
You are now going to build and run a container that performs auto-labeling of comments (no manual work is needed).
- Build the container:

  ```shell
  docker build -t <<REGISTRY>>/pyml-auto_label_gen:1.0 -f Dockerfile.Automated .
  ```
- Run the container:

  ```shell
  docker run --name "pyml-auto_label_gen" -it -e "AZURE_STORAGE_KEY=<<AZURE-STORAGE-KEY>>" -e "AZURE_STORAGE_ACCOUNT=pyml" -e "AZURE_CONTAINER_NAME=wikidata" -v <<PATH>>/wikipedia:/usr/src/app/wikipedia <<REGISTRY>>/pyml-auto_label_gen:1.0
  ```
The output file labeled_data.csv is generated and uploaded to the Azure storage as a blob.
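The auto-labeler's job can be approximated with a simple keyword heuristic like the one below. This is purely illustrative: the marker words and the rule itself are assumptions, and the actual container may use a different rule or an active-learning strategy.

```python
def auto_label(comment: str) -> str:
    """Keyword heuristic: surface-level edits -> "syntax", everything else -> "semantics"."""
    syntax_markers = ("typo", "spelling", "grammar", "punctuation", "format")
    text = comment.lower()
    return "syntax" if any(marker in text for marker in syntax_markers) else "semantics"
```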
You are now going to build and run a container that generates a model.
- Build the container:

  ```shell
  docker build -t <<REGISTRY>>/pyml-modeling:1.0 -f Dockerfile.Modeling .
  ```
- Run the container:

  ```shell
  docker run --name "pyml-modeling" -it -e "AZURE_STORAGE_KEY=<<AZURE-STORAGE-KEY>>" -e "AZURE_STORAGE_ACCOUNT=pyml" -e "AZURE_CONTAINER_NAME=wikidata" <<REGISTRY>>/pyml-modeling:1.0
  ```
The output contains two files: clf.joblib (the model) and label_encoder.joblib (the label encoder); both are uploaded to Azure blob storage.
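Conceptually, the modeling step trains a text classifier and serializes it with joblib, producing the two files named above. A minimal sketch with scikit-learn: the choice of TF-IDF plus logistic regression and the tiny inline dataset are assumptions for illustration; only the output file names come from this document.

```python
from joblib import dump
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder

# Tiny illustrative training set standing in for labeled_data.csv.
comments = ["fix typo in heading", "corrected spelling",
            "added election results", "updated population figures"]
labels = ["syntax", "syntax", "semantics", "semantics"]

# Encode string labels as integers for the classifier.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)

# Vectorize the comments and fit a simple linear classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(comments, y)

# Serialize both artifacts, mirroring the files the container uploads.
dump(clf, "clf.joblib")
dump(label_encoder, "label_encoder.joblib")
```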
Prediction is done by reading the clf.joblib model created in the previous step. The container automatically downloads the model file along with the encoder from Azure blob storage. It also uses the comments.csv file (also stored on Azure storage as a blob) created earlier by the pyml-wikipedia container.
- Build the container:

  ```shell
  docker build -t <<REGISTRY>>/pyml-predict:1.0 -f Dockerfile.Predict .
  ```
- Run the container:

  ```shell
  docker run --name "pyml-predict" -it -e "AZURE_STORAGE_KEY=<<AZURE-STORAGE-KEY>>" -e "AZURE_STORAGE_ACCOUNT=pyml" -e "AZURE_CONTAINER_NAME=wikidata" <<REGISTRY>>/pyml-predict:1.0
  ```
The predicted_labels.csv file is generated as output and uploaded to the Azure blob storage. This file contains all the comments along with the predicted labels, "syntax" or "semantics".
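The prediction step boils down to loading the serialized model and encoder and applying them to the comments. A hedged sketch: the `predict_labels` helper and the CSV column names are hypothetical, while the joblib file names follow the modeling step above.

```python
import csv
from joblib import load

def predict_labels(comments, clf_path="clf.joblib",
                   encoder_path="label_encoder.joblib",
                   out_path="predicted_labels.csv"):
    """Label each comment with the trained model and write the results to CSV."""
    clf = load(clf_path)
    label_encoder = load(encoder_path)
    # Map integer predictions back to "syntax"/"semantics" strings.
    labels = label_encoder.inverse_transform(clf.predict(comments))
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["comment", "predicted_label"])
        writer.writerows(zip(comments, labels))
    return list(zip(comments, labels))
```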