MDI

Missing Data Imputation Python Library (version 0.1)

This repository offers techniques for handling missing data and encoding categorical data so that it is suitable for neural network classifiers and other tasks. We provide six imputation strategies and include examples using the Adult dataset. Data, Python code, and LaTeX sources for a work-in-progress paper on MDI, random forests, and neural networks will be added soon.

Techniques for handling categorical missing data

We categorize proposed imputation methods into six groups listed below:

Case substitution: An observation with missing data is replaced with another, non-sampled observation.

Summary statistic: Replace the missing data with the mean, median, or mode of the feature vector. A numerical statistic such as the mean or median is not appropriate for non-ordinal categorical data.

One-hot: Create a binary variable to indicate whether or not a specific feature is missing.

Hot deck and cold deck: Compute the K nearest neighbors of the observation with missing data and assign the mode of those K neighbors to the missing value.

Prediction model: Train a prediction model (e.g., random forests) to predict the missing value.

Factor analysis: Perform factor analysis (e.g., principal component analysis (PCA)) on the design matrix, project the design matrix onto the first N eigenvectors, and replace the missing values with the corresponding values from the projected design matrix.
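Three of the strategies above (summary statistic with the mode, the binary missing indicator described under One-hot, and K-nearest-neighbor mode assignment) can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code in this repository: the function names, the use of `None` as the missing marker, the Hamming distance, and the toy rows are all assumptions made for the example.

```python
from collections import Counter

MISSING = None  # assumed marker for a missing categorical value


def mode_of(values):
    """Most frequent non-missing value; ties go to the value seen first."""
    counts = Counter(v for v in values if v is not MISSING)
    return counts.most_common(1)[0][0]


def impute_mode(rows, col):
    """Summary statistic: fill missing entries of column `col` with its mode."""
    fill = mode_of(row[col] for row in rows)
    out = [list(row) for row in rows]  # copy so the input stays untouched
    for row in out:
        if row[col] is MISSING:
            row[col] = fill
    return out


def add_missing_indicator(rows, col):
    """Append a 0/1 column flagging whether column `col` was missing."""
    return [list(row) + [1 if row[col] is MISSING else 0] for row in rows]


def hamming(a, b, skip):
    """Count mismatched features, ignoring column `skip` and missing values."""
    return sum(
        1
        for i, (x, y) in enumerate(zip(a, b))
        if i != skip and x is not MISSING and y is not MISSING and x != y
    )


def impute_knn_mode(rows, col, k=3):
    """Fill each missing entry with the mode of `col` among its k nearest rows."""
    donors = [row for row in rows if row[col] is not MISSING]
    out = [list(row) for row in rows]
    for row in out:
        if row[col] is MISSING:
            nearest = sorted(donors, key=lambda d: hamming(row, d, col))[:k]
            row[col] = mode_of(d[col] for d in nearest)
    return out


# Made-up toy rows (workclass, education, occupation); not taken from the data.
ROWS = [
    ["Private", "Bachelors", "Sales"],
    ["Private", "Bachelors", "Sales"],
    ["Private", "HS-grad", "Craft-repair"],
    ["Self-emp", "HS-grad", "Craft-repair"],
    ["Self-emp", "HS-grad", "Craft-repair"],
    ["Private", "Bachelors", MISSING],
]

print(impute_mode(ROWS, 2)[-1])            # last row filled with "Craft-repair"
print(impute_knn_mode(ROWS, 2, k=2)[-1])   # last row filled with "Sales"
print(add_missing_indicator(ROWS, 2)[-1])  # last row gains a trailing 1
```

On these toy rows the two imputers disagree: the global mode is "Craft-repair", while the two rows closest to the incomplete one both have "Sales". This is the kind of difference that shows up when comparing category frequencies before and after imputation.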

Adult Dataset example

The figure below shows the frequency of each job category in the Adult dataset before and after applying the imputation techniques above.
Code can be found here

Congressional voting records dataset example

Code can be found here

TO DO

Compute error bars for prediction accuracy for each classifier/method - J

Make sure that PCA only operates on complete features - R

Use Non-Negative Matrix Factorization instead of PCA - R

