-
Notifications
You must be signed in to change notification settings - Fork 133
FEAT add the Frequency Encoder #707
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
FEAT add the Frequency Encoder #707
Conversation
Thanks for this feature, @coconutattitude
|
Hi @coconutattitude, are you planning to continue this PR? |
shall we try to revive this? :) it would be a nice addition |
@rcap107 are you interested in continuing this PR once StringEncoder is merged? |
I can work on it, yes |
This feature was implemented in #1235 |
I don't think it was because this PR aims to encode categories using their frequencies; hence, I don't see how the periodic encoder solves this. WDYT? |
I don't think it was because this PR aims to encode categories using their frequencies; hence, I don't see how the periodic encoder solves this. WDYT?
I agree. I might have misunderstood this PR.
Given that such encoder is quite popular, I am +1 with the proposed feature.
|
My bad, I mixed up this PR and the old discussion on the periodic encoder. This is absolutely unrelated from #1235 |
Hey, I'm picking up this one. Looking at the possible implementation, there are some points that we should consider:
|
I think that identifying a dataset where this transformer performs well would strongly motivate the inclusion of this feature. It would also help us craft a narrative around it and clarify its role among the other encoders. If I recall correctly, OP was using it on IP addresses. |
I would go for the number of bins, which is easier to set and more intuitive than bin edges
Do you mean cut for pandas and qcut polars? |
Good point, I do not have good suggestions for that unfortunately 😅
I agree
No, I mean having both uniform and quantile bins in the same transformer, rather than having a transformer that does uniform binning and one that does quantile binning. There would be a flag that selects the transformer to use. Both pandas and polars have a qcut function, so it's just a matter of deciding where to put it. |
Actually, doesn't the KBinsDiscretizer already do most of what this feature needs? |
Yes indeed! The only drawback I see is that the KBinsDiscretizer converts Pandas and Polars dataframes into NumPy arrays, which creates a copy and has a larger memory footprint. @GaelVaroquaux, should we make our tiny KBinsDiscretizer with skrub's sauce? |
Yes indeed! The only drawback I see is that the KBinsDiscretizer converts Pandas and Polars dataframes into NumPy arrays, which creates a copy and has a larger memory footprint. @GaelVaroquaux, should we make our tiny KBinsDiscretizer with skrub's sauce?
Let's say not for now, because we are not sure how important the usecase is. But if the usecase pops up in datasets and examples, why not.
|
Addresses #706 issue.
Here's a first iteration of the class and it's a proof of concept made during EuroSciPy 2023. Documentation and tests are needed. I'll add quantile strategies in the future. What do you think about this proof of concept?