Description
I almost feel bad for reporting this one.
Using the yacht hydrodynamics UCI dataset, I got this error:
(env) (base) C:\Users\josep\Jeenee\AutoML\automl_train>python model.py -d ..\automl-testbench\yacht-hydrodynamics\data.csv -m train
Traceback (most recent call last):
File "model.py", line 46, in <module>
model_train(df, encoders, args, model)
File "C:\Users\josep\Jeenee\AutoML\automl_train\pipeline.py", line 347, in model_train
X, y = process_data(df, encoders)
File "C:\Users\josep\Jeenee\AutoML\automl_train\pipeline.py", line 296, in process_data
df['Length-beam ratio'].values, encoders['length_beam_ratio_bins'], labels=False, include_lowest=True, duplicates='drop')
File "C:\Users\josep\Jeenee\AutoML\venv\lib\site-packages\pandas\core\reshape\tile.py", line 235, in cut
raise ValueError('bins must increase monotonically.')
ValueError: bins must increase monotonically.
Hmmm, odd. Let's take a look at pipeline.py...
# Length-beam ratio
length_beam_ratio_enc = df['Length-beam ratio']
length_beam_ratio_bins = length_beam_ratio_enc.quantile(
    np.linspace(0, 1, 10+1))
encoders['length_beam_ratio_bins'] = length_beam_ratio_bins
# ....
# Length-beam ratio
length_beam_ratio_enc = pd.cut(
    df['Length-beam ratio'].values, encoders['length_beam_ratio_bins'],
    labels=False, include_lowest=True, duplicates='drop')
The error points at the pd.cut line, which I had previously patched to include the duplicates='drop' bit. But the current error isn't related to that; it's complaining about the encoder's bins. Hmmm, nothing looks odd in the data for that column. Let's open up pdb and take a look...
>>> encoders['length_beam_ratio_bins']
[2.73, 2.76, 3.15, 3.15, 3.1499999999999995, 3.15, 3.17, 3.32, 3.51, 3.51, 3.64]
facepalm
Well now! I suppose I'll concede that's technically not monotonically increasing!
I appended a .round(4) to the two .quantile lines of encoders/numeric (lines 12 and 15), which worked for this test case. This is certainly not an adequate general solution, however, as e.g. it'll break on data that needs precision at the 5th decimal place...
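One possible alternative to rounding, sketched here as a suggestion only: sort and de-duplicate the edges before storing them in the encoder. `clean_bin_edges` is a hypothetical helper name, not something that exists in the repo, and I have only tried the idea against the edge values shown above.

```python
import numpy as np
import pandas as pd

def clean_bin_edges(raw_edges):
    """Hypothetical helper: return strictly increasing bin edges.

    np.unique sorts the values and removes exact duplicates, so no
    precision is thrown away the way .round(4) would.
    """
    return np.unique(np.asarray(raw_edges, dtype="float64"))

# Edges as printed from the encoder in this report.
raw = [2.73, 2.76, 3.15, 3.15, 3.1499999999999995,
       3.15, 3.17, 3.32, 3.51, 3.51, 3.64]
edges = clean_bin_edges(raw)

# pd.cut now accepts the edges without complaint.
sample = np.array([2.73, 3.15, 3.64])
codes = pd.cut(sample, edges, labels=False, include_lowest=True)
print(edges)
print(codes)
```

The trade-off is that near-identical edges such as 3.1499999999999995 and 3.15 survive as a vanishingly narrow bin rather than being merged. If that matters, merging edges closer than some tolerance would be a further step, but the right tolerance depends on the data.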