christophergaughan/SageMaker: Keep Public SageMaker Notebooks

🛡️ Disclaimer: This repository was created as part of my independent learning in machine learning and MLOps.
While I consulted various public resources along the way, including open tutorials and documentation, all code and commentary in this repo were written and developed by me.
No proprietary, paid, or licensed content has been copied or redistributed.

SageMaker ML Pipeline: NLP Classification & MLOps Platform Engineering

This repo demonstrates a full pipeline for deploying NLP models with AWS SageMaker, from data ingestion and preprocessing through model training, deployment, and inference. It is built with real-world, cloud-native practices in mind and is designed to slot into an internal ML platform such as the one described by Genmab's AI & Analytics Engineering team.
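
For orientation, here is a condensed sketch of that flow using the SageMaker Python SDK. The role ARN, S3 path, and instance types are hypothetical placeholders; the notebooks in this repo wire up the real values step by step:

from sagemaker.pytorch import PyTorch

# Hypothetical execution role ARN; in practice this comes from the Terraform/IAM setup below
role = "arn:aws:iam::123456789012:role/sagemaker_execution_role"

# Train: SageMaker runs script.py inside a managed PyTorch container
estimator = PyTorch(
    entry_point="script.py",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="2.0",
    py_version="py310",
)
estimator.fit({"train": "s3://my-bucket/newsCorpora/"})  # hypothetical S3 path

# Deploy: host the trained model behind a real-time HTTPS endpoint
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")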


Project Goals

  • Classify news headlines into topics using a fine-tuned DistilBERT model
  • Train and deploy the model using Amazon SageMaker
  • Enable real-time inference via a containerized API (ready for SageMaker hosting)
  • Showcase MLOps: containerization, CI/CD automation, infrastructure-as-code
  • Document core transformer theory (positional encodings) for team enablement

Tech Stack

Layer           Tools & Frameworks
--------------  ------------------------------------------------
Language        Python 3.10
Modeling        PyTorch, HuggingFace Transformers
Data            Pandas, S3-hosted corpus (newsCorpora)
Training        Custom PyTorch class, script.py
Inference       inference.py using model_fn / predict_fn / etc.
Infrastructure  AWS SageMaker, IAM, S3
MLOps           Docker, GitHub Actions (CI), Terraform (infra)

Core Features

Multiclass Text Classification with DistilBERT

  • Four target categories: Business, Entertainment, Science, Health
  • Tokenization and padding via HuggingFace
  • Dropout + fully connected classification head
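
A minimal sketch of what such a head might look like (hyperparameters are illustrative; the actual class lives in script.py and may differ):

import torch
from transformers import DistilBertModel

class HeadlineClassifier(torch.nn.Module):
    def __init__(self, num_classes=4, dropout=0.3):
        super().__init__()
        self.bert = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.dropout = torch.nn.Dropout(dropout)             # regularization before the head
        self.classifier = torch.nn.Linear(768, num_classes)  # DistilBERT hidden size is 768

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        first_token = hidden[:, 0]                           # embedding at the [CLS] position
        return self.classifier(self.dropout(first_token))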

Containerized Inference

  • inference.py defines a SageMaker-compatible model serving interface
  • Supports JSON-based inputs
  • Returns probabilities + predicted label
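
A minimal sketch of that interface (the artifact filename and label order are assumptions; see inference.py for the actual implementation):

import json
import torch
from transformers import DistilBertTokenizer

LABELS = ["Business", "Entertainment", "Science", "Health"]  # assumed label order
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

def model_fn(model_dir):
    # Called once at container start to load the trained model artifact
    model = torch.load(f"{model_dir}/model.pth", map_location="cpu")
    model.eval()
    return model

def input_fn(request_body, content_type):
    # Accepts JSON such as {"text": "Stocks rally after earnings report"}
    return json.loads(request_body)["text"]

def predict_fn(text, model):
    batch = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        logits = model(batch["input_ids"], batch["attention_mask"])
    return torch.softmax(logits, dim=-1).squeeze(0)

def output_fn(probabilities, accept):
    # Returns class probabilities plus the most likely label
    return json.dumps({
        "probabilities": probabilities.tolist(),
        "label": LABELS[int(probabilities.argmax())],
    })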

🐳 Docker-Ready

# Lightweight Python base image
FROM python:3.10-slim
WORKDIR /app
# Install dependencies first so Docker can cache this layer
COPY requirements.txt .
RUN pip install -r requirements.txt
# Copy the rest of the project
COPY . .
CMD ["python", "inference.py"]

Infrastructure-as-Code

Provisioned via Terraform:

resource "aws_sagemaker_notebook_instance" "sagemaker_notebook" {
  name          = "nlp-model-notebook"
  instance_type = "ml.t2.medium"
  role_arn      = aws_iam_role.sagemaker_execution_role.arn
}

Also provisions an IAM execution role with the AmazonSageMakerFullAccess managed policy attached.

CI/CD Automation

GitHub Actions workflow included in .github/workflows/train.yml:

  • Runs lint checks with flake8
  • Installs dependencies
  • Optionally runs a training or test workflow

Educational Notebooks

  • Positional Encoding Explained
    Walkthrough of how transformer models encode position information using sinusoidal functions — includes visuals + annotated code (see the runnable sketch after this list).

  • TrainingNotebook.ipynb
    Experimental training loop with dropout tuning, batch sizes, and activation function exploration.
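
To give a flavor of the positional-encoding notebook, here is a self-contained version of the sinusoidal encoding it walks through (variable names are mine, not necessarily the notebook's):

import numpy as np

def positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(max_len)[:, np.newaxis]    # shape (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]         # shape (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((max_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])      # even dimensions get sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])      # odd dimensions get cosine
    return encoding

print(positional_encoding(50, 16).shape)  # (50, 16)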

Weights, Biases, and Core ML Concepts

This repo includes notebooks that explore how loss functions, optimizers, and activation functions affect model performance — especially in neural networks like DistilBERT.
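
As a toy illustration of how those three pieces fit together in a single training step (purely pedagogical; not taken from the notebooks):

import torch

model = torch.nn.Sequential(
    torch.nn.Linear(10, 32),
    torch.nn.ReLU(),             # activation: swap in GELU or Tanh to compare behavior
    torch.nn.Linear(32, 4),      # 4 outputs, matching the four headline topics
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randn(8, 10)           # a toy batch of 8 feature vectors
y = torch.randint(0, 4, (8,))    # random class targets
loss = torch.nn.functional.cross_entropy(model(x), y)  # loss function

optimizer.zero_grad()
loss.backward()                  # gradients of the loss w.r.t. weights and biases
optimizer.step()                 # optimizer nudges parameters downhill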

For anyone looking to build rock-solid intuition on these core ideas, I highly recommend StatQuest with Josh Starmer on YouTube.
His explanations of gradient descent, activation functions, and loss landscapes are some of the clearest out there.

Rough Folder Guide (files have been moved around, so paths may differ)

├── script.py                      # Core model training script
├── inference.py                   # SageMaker-style inference pipeline
├── Positional_Encodings.ipynb     # Transformer theory explainer
├── Dockerfile                     # Container config for inference
├── requirements.txt               # All Python dependencies
├── main.tf                        # Terraform infra for SageMaker
└── .github/workflows/train.yml    # CI/CD with GitHub Actions

Author

Christopher Gaughan, Ph.D.

Machine Learning Engineer | Biochemical Engineer | Cloud Architect

License

MIT License — free to use, modify, and deploy.

⚠️ Note: This repository is actively being organized to support fully runnable notebooks from end to end.
Currently, some notebooks (e.g., TrainingNotebook.ipynb) may require running from specific directories or setting paths manually.
A streamlined execution flow and directory structure are in progress as time allows.
