Float conversion issue screwing with numeric encoders. · Issue #27 · minimaxir/automl-gs · GitHub
Float conversion issue screwing with numeric encoders. #27
Open
@germanjoey

I almost feel bad for reporting this one.

Using the yacht hydrodynamics UCI dataset, I got this error:

(env) (base) C:\Users\josep\Jeenee\AutoML\automl_train>python model.py -d ..\automl-testbench\yacht-hydrodynamics\data.csv -m train
Traceback (most recent call last):
  File "model.py", line 46, in <module>
    model_train(df, encoders, args, model)
  File "C:\Users\josep\Jeenee\AutoML\automl_train\pipeline.py", line 347, in model_train
    X, y = process_data(df, encoders)
  File "C:\Users\josep\Jeenee\AutoML\automl_train\pipeline.py", line 296, in process_data
    df['Length-beam ratio'].values, encoders['length_beam_ratio_bins'], labels=False, include_lowest=True, duplicates='drop')
  File "C:\Users\josep\Jeenee\AutoML\venv\lib\site-packages\pandas\core\reshape\tile.py", line 235, in cut
    raise ValueError('bins must increase monotonically.')
ValueError: bins must increase monotonically.

Hmmm, odd. Let's take a look at pipeline.py...

    # Length-beam ratio
    length_beam_ratio_enc = df['Length-beam ratio']
    length_beam_ratio_bins = length_beam_ratio_enc.quantile(
        np.linspace(0, 1, 10+1))
    encoders['length_beam_ratio_bins'] = length_beam_ratio_bins
    
    # ....

    # Length-beam ratio
    length_beam_ratio_enc = pd.cut(
        df['Length-beam ratio'].values, encoders['length_beam_ratio_bins'], labels=False, include_lowest=True, duplicates='drop')

The error points at the pd.cut line, which I had previously patched to include the duplicates='drop' bit. The current error isn't related to that patch, though; it's complaining about the encoder's bins. Hmmm, nothing looks odd in the data for that column. Let's open up pdb and take a look...

>>> encoders['length_beam_ratio_bins']
[2.73, 2.76, 3.15, 3.15, 3.1499999999999995, 3.15, 3.17, 3.32, 3.51, 3.51, 3.64]

facepalm

Well now! I suppose I'll concede that's technically not monotonically increasing!
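For anyone who wants to reproduce this without the dataset, here is a minimal sketch that feeds the bin edges from the pdb session straight into pd.cut. The `values` array is made up for illustration; only the bin edges come from the actual run.

```python
import numpy as np
import pandas as pd

# Bin edges copied from the pdb session above.
bins = [2.73, 2.76, 3.15, 3.15, 3.1499999999999995, 3.15,
        3.17, 3.32, 3.51, 3.51, 3.64]

# Positions where an edge is smaller than the edge before it.
steps = np.diff(bins)
decreasing_at = (np.flatnonzero(steps < 0) + 1).tolist()

# Hypothetical sample values, only used to trigger the check in pd.cut.
values = np.array([2.73, 3.15, 3.64])

try:
    pd.cut(values, bins, labels=False, include_lowest=True, duplicates='drop')
    cut_error = None
except ValueError as exc:
    # duplicates='drop' only handles exact repeats, not a decreasing edge.
    cut_error = str(exc)

print(decreasing_at, cut_error)
```

The one offending edge is 3.1499999999999995, which sits a single float step below the 3.15 that precedes it.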

I appended a .round(4) to the two .quantile lines of encoders/numeric (lines 12 and 15), which worked for this test case. This is certainly not an adequate general solution, however, as it will break on data that needs precision at the 5th decimal place or beyond...
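One possible alternative that avoids picking a fixed number of decimals is to force the edges to be non-decreasing and then drop exact duplicates. This is only a sketch, not something automl-gs does: `monotonic_bins` is a hypothetical helper name, and the sample values passed to pd.cut are made up.

```python
import numpy as np
import pandas as pd

def monotonic_bins(quantiles):
    """Return strictly increasing bin edges (hypothetical helper, not in automl-gs)."""
    edges = np.asarray(quantiles, dtype=float)
    # A running maximum flattens tiny downward steps caused by float error.
    edges = np.maximum.accumulate(edges)
    # np.unique drops exact duplicates and returns the values sorted.
    return np.unique(edges)

# Quantile output copied from the pdb session above.
raw = [2.73, 2.76, 3.15, 3.15, 3.1499999999999995, 3.15,
       3.17, 3.32, 3.51, 3.51, 3.64]

bins = monotonic_bins(raw)

# Hypothetical sample values to check that pd.cut now accepts the edges.
codes = pd.cut(np.array([2.73, 3.15, 3.64]), bins,
               labels=False, include_lowest=True)

print(bins.tolist(), codes.tolist())
```

The eleven raw edges collapse to seven here, so the number of bins shrinks. The cleaned edges would have to be the ones stored in the encoder so that training and prediction cut on the same bins.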
