Lightweight prediction model (AUC 0.77 from just 800 rows of data)
Predict does:
- exploratory data analysis
- feature engineering
- predictive modeling
With pip, run:
pip install predict
To work from source and run the tests:
git clone https://github.com/melvynkim/predict.git
cd predict
pip install -r requirements.txt
py.test tests
For rich visualizations, run Predict from a Jupyter notebook.
For classification, use:
%matplotlib inline
import predict
clf = predict.Classifier(
    train_data='train.csv',
    test_data='test.csv',
    target_col='Survived',
    id_col='PassengerId')
clf.analyze()  # run exploratory data analysis
clf.model()    # build and evaluate models
For regression, use the predict.Regressor class.
Tip: To prevent scrolling in notebooks, select Cell > Current Outputs > Toggle Scrolling.
There are two primary methods:
- analyze: runs exploratory data analysis
- model: builds and evaluates different models
Optionally pass test data if you want to generate a CSV file with predictions.
Data can be a file
predict.Classifier(train_data='train.csv', ...)
Or a data frame
import pandas as pd
train_df = pd.read_csv('train.csv')
# do preprocessing
# ...
predict.Classifier(train_data=train_df, ...)
Specify datetime columns with:
predict.Classifier(datetime_cols=['created'], ...)
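This README does not say which features Predict derives from datetime columns. As a rough illustration of what datetime feature engineering commonly looks like, here is a minimal sketch using only the standard library. The function name `datetime_features`, the timestamp format, and the chosen fields are assumptions made for this example, not Predict's API.

```python
from datetime import datetime


def datetime_features(value, fmt="%Y-%m-%d %H:%M:%S"):
    """Split a timestamp string into simple numeric features.

    Hypothetical helper for illustration only; Predict may derive
    different features from the columns listed in datetime_cols.
    """
    ts = datetime.strptime(value, fmt)
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "weekday": ts.weekday(),  # Monday is 0, Sunday is 6
        "hour": ts.hour,
    }


features = datetime_features("2017-03-15 14:30:00")
```

Numeric parts like these are easier for tree-based models to split on than a raw timestamp string.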
Predict supports several eval metrics.
Classification
- accuracy: # correct / total (default)
- auc: area under the ROC curve
- mlogloss: multi class log loss
Regression
- rmse: root mean square error (default)
- rmsle: root mean square logarithmic error
Specify an eval metric with:
predict.Classifier(eval_metric='mlogloss', ...)
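For reference, the metrics listed above can be written out in a few lines of plain Python. This is a sketch of the standard textbook definitions, not Predict's own implementation. The function signatures, the binary 0/1 labels assumed for `auc`, and the `eps` clipping value in `mlogloss` are choices made for this example.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve for binary labels (0 or 1).

    Computed as the probability that a random positive is scored
    above a random negative, counting ties as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mlogloss(y_true, probs, eps=1e-15):
    """Multi class log loss; probs[i][k] is the probability of class k."""
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    sq = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sq / len(y_true))


def rmsle(y_true, y_pred):
    """Root mean square logarithmic error, using log(1 + x)."""
    sq = sum((math.log1p(p) - math.log1p(t)) ** 2
             for t, p in zip(y_true, y_pred))
    return math.sqrt(sq / len(y_true))
```

Higher is better for accuracy and auc; lower is better for mlogloss, rmse, and rmsle. rmsle penalizes relative rather than absolute differences, which suits targets such as prices that span several orders of magnitude.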
Predict builds and compares different models. Currently, it uses:
- boosted trees
- simple benchmarks (mode for classification, mean and median for regression)
XGBoost is required for boosted trees. Install it with:
pip install xgboost
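The simple benchmarks mentioned above are constant predictions computed from the training target. The sketch below shows the idea with the standard library; the function names are hypothetical and this is not Predict's internal code.

```python
from statistics import mean, median, mode


def classification_benchmark(train_labels, n_test):
    """Predict the most frequent training label for every test row."""
    return [mode(train_labels)] * n_test


def regression_benchmarks(train_targets, n_test):
    """Constant predictions from the training mean and median."""
    return {
        "mean": [mean(train_targets)] * n_test,
        "median": [median(train_targets)] * n_test,
    }


class_preds = classification_benchmark([0, 1, 0, 0, 1], n_test=3)
reg_preds = regression_benchmarks([1, 2, 3, 10], n_test=2)
```

A boosted-tree model that cannot beat these constants on the chosen eval metric is not learning anything useful from the features.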
Dataset | Eval Metric | v0.1 | Current |
---|---|---|---|
House Prices | RMSLE | 0.14069 | 0.13108 |
Rental Listing Inquiries | Multi Class Log Loss | - | 0.61861 |
Titanic | Accuracy | 0.77512 | 0.77512 |