Example ML project for Azure

The goal of this project is to assess Wikipedia article comments and label them as "semantics" or "syntax". For this purpose we have built the following components (a minimal sketch of the listener appears after this list):

  • Wikipedia events listener
  • Active learning (for labeling of data)
  • Machine learning classification
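
The listener sketch below is not the repository's actual code: the Wikimedia EventStreams URL is real, but the CSV layout (one comment or title per row) and the timeout handling are assumptions modeled on the container options described later.

    # Hedged sketch of a Wikipedia events listener: consume the public
    # Wikimedia EventStreams SSE feed and collect edit comments and article
    # titles into comments.csv and titles.csv. File layout is an assumption.
    import csv
    import json
    import time

    import requests

    STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"
    TIMEOUT_IN_SEC = 15  # mirrors the container's TIMEOUT_IN_SEC setting

    def listen(timeout=TIMEOUT_IN_SEC):
        deadline = time.time() + timeout
        with requests.get(STREAM_URL, stream=True) as resp, \
             open("wikipedia/comments.csv", "w", newline="") as c_out, \
             open("wikipedia/titles.csv", "w", newline="") as t_out:
            comments, titles = csv.writer(c_out), csv.writer(t_out)
            for line in resp.iter_lines():
                if time.time() > deadline:
                    break
                if not line or not line.startswith(b"data:"):
                    continue  # skip SSE event names and keep-alives
                event = json.loads(line[len(b"data:"):])
                if event.get("comment"):
                    comments.writerow([event["comment"]])
                    titles.writerow([event.get("title", "")])

    if __name__ == "__main__":
        listen()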

NOTE: This project is not intended to be used in production and comes without any warranty. It is supposed to be a learning playground for understanding the building blocks of a common ML project.

Development Machine Setup

The fastest way to set up and test everything is by using Docker Linux containers. Technically, this also means you should be able to test it on any platform of your choice (Windows/Linux/Mac). However, the following instructions have only been tested on macOS (mostly Mojave).

The order in which the containers run is important. Please follow the instructions in the sequence in which they are provided.

Prerequisites:

Make sure that the following software is already installed. You also need an active Azure subscription (Trial, Visual Studio Enterprise, etc.) to store the files that are generated as part of the data processing and ML modeling/training.

  • Azure CLI 2.0.52+

  • Git 2.17.2+

  • Docker for Mac 18.09.0+

  • Python 3.6+

  • Download the code:

    git clone https://github.com/rbinrais/py-ml.git && cd py-ml

  • Log in to Azure using the CLI. Skip this step if you are already logged in.

    az login

  • Create a new resource group. You may want to choose a location (-l) that is closest to your physical location.

    az group create -l eastus -n pyml-rg

  • Create a new storage account. You may want to choose a location (-l) that is closest to your physical location.

    az storage account create -n pyml -g pyml-rg -l eastus --sku Standard_LRS

  • Read the storage account keys. Run the following command and copy the value of either key1 or key2. You will need this key later wherever a reference to AZURE-STORAGE-KEY is made.

    az storage account keys list -n pyml -o table

NOTE: In the instructions below you need to replace the REGISTRY placeholder with the name of your preferred Docker registry. You can also choose not to use any registry name in case you don't want to push the image to a remote Docker registry. I have uploaded all the container images to the public Docker Hub under the registry named rbinrais. However, you can use Azure Container Registry or any other private registry for that matter. Also, replace the AZURE-STORAGE-KEY placeholder with the actual storage key captured earlier.

Build & Run Wikipedia Listener Container:

  • First, create a directory that will store Wikipedia comments and titles in CSV format.

    mkdir wikipedia

  • Build the container:

    docker build -t <<REGISTRY>>/pyml-wikipedia:1.0 -f Dockerfile.Wiki .

  • Run the container:

    Replace the PATH placeholder with the full path to the wikipedia folder. You can always run the pwd command to get it.

    docker run --name "pyml-wikipedia" --rm -it -e "UPLOAD_TO_AZURE_BLOB=True" -e "TIMEOUT_IN_SEC=15" -e "AZURE_STORAGE_KEY=<<AZURE-STORAGE-KEY>>" -e "AZURE_STORAGE_ACCOUNT=pyml" -e "AZURE_CONTAINER_NAME=wikidata" -v <<PATH>>/wikipedia:/usr/src/app/wikipedia rbinrais/pyml-wikipedia:1.0

At this point the container starts listening to the events generated by the Wikipedia event stream. Comments, along with the article titles, are displayed on the screen. Wait until the container finishes processing the events and then check the wikipedia folder for the comments.csv and titles.csv files.

These files are also uploaded to the Azure blob storage account 'pyml' under a container named 'wikidata'. You can list all the blob files stored in the pyml Azure storage account under the 'wikidata' container by using the command below. Alternatively, you can view these files by browsing to the Azure Portal: https://portal.azure.com

az storage blob list --container-name wikidata --account-name pyml --auth-mode key -o table
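
For reference, the upload side of this can be sketched in a few lines of Python with the azure-storage-blob package (v12 API); the containers themselves may use a different SDK version, and the local file path is an assumption.

    # Hedged sketch: upload comments.csv to the wikidata container in the
    # pyml storage account, authenticating with the key captured earlier.
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient(
        account_url="https://pyml.blob.core.windows.net",
        credential="<<AZURE-STORAGE-KEY>>",
    )
    blob = service.get_blob_client(container="wikidata", blob="comments.csv")
    with open("wikipedia/comments.csv", "rb") as f:
        blob.upload_blob(f, overwrite=True)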

Build & Run Manual Labeling Container:

You are now going to build and run a container that enables you to label comments manually as these labels are needed in the next step.

  • Build the container:

    docker build -t <<REGISTRY>>/pyml-manually_label:1.0 -f Dockerfile.Manual .

  • Run the container:

    docker run --name "pyml-manually_label" --rm -it -e "AZURE_STORAGE_KEY=<<AZURE-STORAGE-KEY>>" -e "AZURE_STORAGE_ACCOUNT=pyml" -e "AZURE_CONTAINER_NAME=wikidata" -v <<PATH>>/wikipedia:/usr/src/app/wikipedia <<REGISTRY>>/pyml-manually_label:1.0

As soon as the container starts, the console presents you with a prompt asking you to provide a label for each comment (stored in the comments.csv file from the previous step) manually, one at a time. The choices are either "syntax" or "semantics" (without quotes). Use your best judgement to mark each comment. Press Enter to move to the next comment.

After all of the comments are labeled, the output file comments_with_labels.csv is generated and stored in Azure blob storage.
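
The labeling loop itself is simple; a minimal sketch (assuming comments.csv holds one comment per row, which may differ from the container's actual layout and prompt text) looks like this:

    # Hedged sketch of the manual labeling loop: prompt for a label per
    # comment and write comments_with_labels.csv.
    import csv

    VALID = {"syntax", "semantics"}

    with open("wikipedia/comments.csv", newline="") as src, \
         open("wikipedia/comments_with_labels.csv", "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            comment = row[0]
            label = ""
            while label not in VALID:
                label = input(f"{comment}\nlabel (syntax/semantics): ").strip()
            writer.writerow([comment, label])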

Build & Run Auto Label Container:

You are now going to build and run a container that performs automatic labeling of comments (no manual work is needed).

  • Build the container:

    docker build -t <<REGISTRY>>/pyml-auto_label_gen:1.0 -f Dockerfile.Automated .

  • Run the container:

    docker run --name "pyml-auto_label_gen" -it -e "AZURE_STORAGE_KEY=<<AZURE-STORAGE-KEY>>" -e "AZURE_STORAGE_ACCOUNT=pyml" -e "AZURE_CONTAINER_NAME=wikidata" -v <<PATH>>/wikipedia:/usr/src/app/wikipedia:/usr/src/app/wikipedia <<REGISTRY>>/pyml-auto_label_gen:1.0

The output file labeled_data.csv is generated and uploaded to Azure storage as a blob.
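
The exact heuristic the auto-labeler applies is not documented here; one hypothetical rule-based approach (keyword matching, purely illustrative and not the container's actual logic) would look like this:

    # Hypothetical auto-labeler sketch: comments mentioning typo/spelling/
    # grammar fixes are tagged "syntax", everything else "semantics".
    import csv

    SYNTAX_HINTS = ("typo", "spelling", "grammar", "punctuation", "format")

    def auto_label(comment):
        text = comment.lower()
        return "syntax" if any(h in text for h in SYNTAX_HINTS) else "semantics"

    with open("wikipedia/comments.csv", newline="") as src, \
         open("wikipedia/labeled_data.csv", "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow([row[0], auto_label(row[0])])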

Build & Run Modeling Container:

You are now going to build and run a container that generates a model.

  • Build the container:

    docker build -t <<REGISTRY>>/pyml-modeling:1.0 -f Dockerfile.Modeling .

  • Run the container:

    docker run --name "pyml-modeling" -it -e "AZURE_STORAGE_KEY=<<AZURE-STORAGE-KEY>>" -e "AZURE_STORAGE_ACCOUNT=pyml" -e "AZURE_CONTAINER_NAME=wikidata" <<REGISTRY>>/pyml-modeling:1.0

The output contains two files: clf.joblib (the model) and label_encoder.joblib (the label encoder). Both are uploaded to Azure blob storage.
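
A minimal sketch of such a modeling step follows, assuming the labeled_data.csv layout from the previous sketch (comment, label per row); the repository's actual feature extraction and model choice may differ, but the two artifact names match the ones above.

    # Hedged sketch: train a text classifier and dump the two artifacts
    # named in the text, clf.joblib and label_encoder.joblib.
    import csv

    import joblib
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import LabelEncoder

    with open("wikipedia/labeled_data.csv", newline="") as f:
        rows = list(csv.reader(f))
    comments = [r[0] for r in rows]
    labels = [r[1] for r in rows]  # "syntax" or "semantics"

    encoder = LabelEncoder()
    y = encoder.fit_transform(labels)

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(comments, y)

    joblib.dump(clf, "clf.joblib")
    joblib.dump(encoder, "label_encoder.joblib")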

Build & Run Predict Labels Container:

Prediction is done by reading the model clf.joblib created in the previous step. The container automatically downloads the model file along with the encoder from Azure blob storage. It also uses the comments.csv file (also stored in Azure storage as a blob) created earlier by the pyml-wikipedia container.

  • Build the container:

    docker build -t <<REGISTRY>>/pyml-predict:1.0 -f Dockerfile.Predict .

  • Run the container:

    docker run --name "pyml-predict" -it -e "AZURE_STORAGE_KEY=<<AZURE-STORAGE-KEY>>" -e "AZURE_STORAGE_ACCOUNT=wikipedia" -e "AZURE_CONTAINER_NAME=pyml" <<REGISTRY>>/pyml-predict:1.0

The predicted_labels.csv file is generated as output and uploaded to Azure blob storage. This file contains all the comments along with the predicted labels (syntax or semantics).
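
Put together with the modeling sketch above, the prediction step reduces to loading the two artifacts and applying them to the raw comments (again a hedged sketch; the container also downloads these files from blob storage first):

    # Hedged sketch of prediction: load the model and encoder, classify
    # each comment, and write predicted_labels.csv.
    import csv

    import joblib

    clf = joblib.load("clf.joblib")
    encoder = joblib.load("label_encoder.joblib")

    with open("wikipedia/comments.csv", newline="") as src, \
         open("wikipedia/predicted_labels.csv", "w", newline="") as dst:
        writer = csv.writer(dst)
        comments = [row[0] for row in csv.reader(src)]
        labels = encoder.inverse_transform(clf.predict(comments))
        for comment, label in zip(comments, labels):
            writer.writerow([comment, label])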

ML Pipeline Using Azure Logic Apps

(Diagram: ML pipeline implemented with Azure Logic Apps)
