XGBlog Exercise: Machine Learning

Article Link: (March 2025)
(Dealing With Missing Values, Part 1.)

Topic:

No real-world data collection process is perfect, and we are often left with all sorts of noise in our dataset: incorrectly recorded values, non-recorded values, corruption of data, etc. If we are able to spot all those irregular points, oftentimes the best we can do is treat them as missing values. Missing values are a fact of life if you work in data science, machine learning, or any other field that relies on real-world data. Most of us hardly give those data points much thought, and when we do we rely on many ready-made tools, algorithms, or rules of thumb to deal with them. However, to do them proper justice you sometimes need to dig deeper, and make a judicious choice of what to do with them. And what you end up doing with them, like in many other circumstances in data science, can be boiled down to the trusted old phrase of “it depends”. Missing data can significantly impact the results of analyses and models, potentially leading to biased or misleading outcomes. -Bojan Tunguz

Article Link: (June 2025)
(Dealing With Missing Values, Part 2.)

Topic:

Multivariate Imputation by Chained Equations (MICE) is a sophisticated approach for handling missing data, especially effective when the missing values follow a complex pattern under the Missing At Random (MAR) assumption. Unlike simpler methods that handle each feature independently, MICE iteratively models each feature based on the others, capturing the inherent relationships within your dataset.

The MICE process works by first filling missing values with initial estimates, often simple ones like mean or median values. It then iteratively refines these estimates by modeling each feature with regression techniques, conditional on the others. This approach allows MICE to accurately preserve multivariate relationships and provide uncertainty estimates for the imputed values.

However, MICE requires careful tuning. It involves deciding on the number of iterations to run and handling the computational complexity that arises from modeling each feature iteratively. -Bojan Tunguz
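
The fill-then-refine loop described above can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not the article's implementation: it handles only two columns, uses ordinary least squares, and produces a single deterministic imputation, whereas full MICE draws from predictive distributions and generates multiple imputed datasets. The helper names `fit_line` and `chained_impute` are hypothetical. In practice a library routine such as scikit-learn's `IterativeImputer` would likely be the starting point.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return my, 0.0  # constant predictor: fall back to the mean
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def chained_impute(rows, n_iter=10):
    """Impute None entries in two-column rows by chained regressions.

    Assumption: every column has at least two observed values.
    """
    n_cols = 2
    missing = [[r[j] is None for j in range(n_cols)] for r in rows]
    # Step 1: initial fill with each column's observed mean.
    col_means = [
        mean([r[j] for r in rows if r[j] is not None]) for j in range(n_cols)
    ]
    data = [
        [col_means[j] if r[j] is None else float(r[j]) for j in range(n_cols)]
        for r in rows
    ]
    # Step 2: repeatedly re-estimate each column from the other one.
    for _ in range(n_iter):
        for target in range(n_cols):
            other = 1 - target
            # Fit only on rows where the target was actually observed.
            obs = [i for i in range(len(data)) if not missing[i][target]]
            a, b = fit_line(
                [data[i][other] for i in obs], [data[i][target] for i in obs]
            )
            # Overwrite only the originally missing cells.
            for i in range(len(data)):
                if missing[i][target]:
                    data[i][target] = a + b * data[i][other]
    return data


rows = [(1, 2), (2, 4), (3, None), (None, 8), (5, 10)]
imputed = chained_impute(rows)
```

Observed cells are never changed; only the cells that were missing at the start are refined on each pass. The `n_iter` argument corresponds to the iteration count the text mentions as a tuning decision.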

Article Link: (June 2025)
(Dealing With Missing Values, Part 3.)

Topic:
Three more ways of dealing with missing values.

Probabilistic & Statistical Approaches (Bayesian/EM-style) explicitly model missingness under a rigorous framework.
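
As one illustration of the EM-style idea, here is a minimal sketch that fits a bivariate normal model when some `y` values are missing and `x` is fully observed. It is an assumption-laden toy, not the article's method: the function name `em_bivariate` is hypothetical, the data are assumed to be missing at random, and `x` must not be constant. The E-step replaces each missing `y` (and `y**2`) with its conditional expectation, and the M-step re-estimates the moments from those expected statistics.

```python
def em_bivariate(xs, ys, n_iter=100):
    """EM for a bivariate normal where some ys are None and xs is complete.

    Returns (mu_y, slope, filled_ys); filled_ys holds E[y | x] under the
    fitted model wherever y was missing.
    """
    n = len(xs)
    mu_x = sum(xs) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n
    observed = [y for y in ys if y is not None]
    # Start from the observed-only moments and zero covariance.
    mu_y = sum(observed) / len(observed)
    s_yy = sum((y - mu_y) ** 2 for y in observed) / len(observed)
    s_xy = 0.0
    for _ in range(n_iter):
        slope = s_xy / s_xx
        resid_var = s_yy - slope * s_xy  # Var(y | x) under current estimates
        # E-step: expected y and y^2 for every row.
        e_y, e_y2 = [], []
        for x, y in zip(xs, ys):
            if y is None:
                m = mu_y + slope * (x - mu_x)
                e_y.append(m)
                e_y2.append(m * m + resid_var)
            else:
                e_y.append(y)
                e_y2.append(y * y)
        # M-step: re-estimate the moments from the expected statistics.
        mu_y = sum(e_y) / n
        s_xy = sum((x - mu_x) * m for x, m in zip(xs, e_y)) / n
        s_yy = sum(e_y2) / n - mu_y ** 2
    slope = s_xy / s_xx
    filled = [
        mu_y + slope * (x - mu_x) if y is None else y for x, y in zip(xs, ys)
    ]
    return mu_y, slope, filled


mu_y, slope, filled = em_bivariate(
    [1, 2, 3, 4, 5, 6], [2, 4, None, 8, None, 12]
)
```

Unlike a plain mean fill, the estimate of `mu_y` here uses the relationship between `x` and `y`, so rows with a missing `y` still contribute information through their observed `x`.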

Interpolation (Time-Series & Ordered Data) fills in missing data by estimating values based on surrounding known data points.
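
A minimal sketch of linear interpolation over an evenly spaced series, assuming gaps are marked with `None`. The function name `interpolate_linear` is hypothetical; for real work, pandas offers `Series.interpolate`, which covers this case and several other methods.

```python
def interpolate_linear(values):
    """Fill interior None gaps by linear interpolation between known neighbours.

    Leading and trailing gaps are left as None because they have only one
    known neighbour.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        # Constant step between two consecutive known points.
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


series = [1.0, None, 3.0, None, None, 6.0, None]
result = interpolate_linear(series)
```

This assumes equal spacing between positions. With irregular timestamps, the step would need to be weighted by the actual time differences.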

Robust Model Design (Tree-Based & Native Missing-Value Handling) inherently manages missing values during training, eliminating the need for a separate imputation step and often resulting in excellent predictive performance.
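
To make "native missing-value handling" concrete, here is a toy decision stump that learns which side of a split missing values should go to, in the spirit of the default-direction idea used by gradient boosting libraries such as XGBoost. It is a simplified sketch under stated assumptions: a single feature, squared-error loss, an exhaustive search, and hypothetical helper names `sse` and `best_stump`. Real implementations are considerably more involved.

```python
def sse(ys):
    """Sum of squared errors around the mean (0.0 for an empty group)."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def best_stump(xs, ys):
    """Return (loss, threshold, missing_goes_left) minimising squared error.

    Missing feature values are None; for each candidate threshold the stump
    tries routing them left and right and keeps the better direction.
    Returns None if fewer than two distinct observed values exist.
    """
    present = sorted({x for x in xs if x is not None})
    thresholds = [(a + b) / 2 for a, b in zip(present, present[1:])]
    best = None
    for t in thresholds:
        for missing_left in (True, False):
            left, right = [], []
            for x, y in zip(xs, ys):
                goes_left = missing_left if x is None else x < t
                (left if goes_left else right).append(y)
            loss = sse(left) + sse(right)
            if best is None or loss < best[0]:
                best = (loss, t, missing_left)
    return best


xs = [1, 2, None, 8, 9, None]
ys = [0, 0, 1, 1, 1, 1]
loss, threshold, missing_left = best_stump(xs, ys)
```

No imputation step is needed: the rows with a missing feature stay in the training data, and the split itself decides how to treat them.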

rb-thompson/xgb-exercises

Folders and files

Latest commit

History

Repository files navigation

XGBlog Exercise: Machine Learning

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages