Refactor CLI scraping: fix trending words, handle dynamic classes, improve outputs, and add request headers by spithash · Pull Request #1 · agmmnn/etym-cli · GitHub

Refactor CLI scraping: fix trending words, handle dynamic classes, improve outputs, and add request headers #1


Open · spithash wants to merge 2 commits into master

Conversation

spithash

Addresses multiple issues in the CLI scraper for the Online Etymology Dictionary:

Trending Words:
The previous method relied on CSS classes that are dynamically generated and change frequently, causing scraping failures. To fix this, the trending words are now extracted from the /word/test page where the sidebar is consistently present, using stable selectors instead of brittle class names.
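
A minimal sketch of the idea, not the code in this PR: instead of matching generated class names, it collects links by their `href` pattern. The sample markup, the names `WordLinkParser` and `extract_trending`, and the assumption that trending entries are links starting with `/word/` are all illustrative; the real sidebar markup may differ. It uses the standard library's `html.parser` only to stay self-contained.

```python
from html.parser import HTMLParser


class WordLinkParser(HTMLParser):
    """Collect the text of <a href="/word/..."> links, ignoring class names."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word_link = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Attribute values can be None, so fall back to an empty string.
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.startswith("/word/"):
            self._in_word_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_word_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_word_link:
            text = "".join(self._buffer).strip()
            if text and text not in self.words:
                self.words.append(text)
            self._in_word_link = False


def extract_trending(html_text):
    """Return link texts in document order, without duplicates."""
    parser = WordLinkParser()
    parser.feed(html_text)
    return parser.words


# Hypothetical sidebar markup with generated-looking class names.
SAMPLE = """
<aside>
  <h3 class="x9f2-title">Trending</h3>
  <ul class="css-1a2b3c">
    <li><a class="css-zz91" href="/word/nice">nice</a></li>
    <li><a class="css-zz92" href="/word/silly">silly</a></li>
  </ul>
</aside>
"""

print(extract_trending(SAMPLE))  # ['nice', 'silly']
```

On a full page this sketch would pick up every `/word/` link, so a real implementation would still need to scope the search to the sidebar.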

Request Headers:
Added a User-Agent header to all HTTP requests to mimic a browser and reduce the risk of being blocked.
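
As a rough illustration of sharing one header set across requests, here is a sketch using the standard library's `urllib.request` (the project itself may use a different HTTP client). The header string, the `DEFAULT_HEADERS` and `build_request` names, and the host are assumptions; no request is actually sent.

```python
import urllib.request

# Hypothetical header value; the PR does not state the exact string it sends.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) etym-cli-sketch"}


def build_request(url, headers=DEFAULT_HEADERS):
    """Build a request object that carries the shared headers."""
    return urllib.request.Request(url, headers=headers)


# Assumed host for the Online Etymology Dictionary; /word/test is the path
# mentioned in the description above.
req = build_request("https://www.etymonline.com/word/test")

# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```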

Output Formatting:
Improved both plain text and rich output functions for better title extraction, whitespace trimming, and readability.
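
The whitespace trimming could look roughly like the sketch below. `clean_text` and `format_entry` are hypothetical helpers, not the functions changed in this PR, and the sample strings are placeholders rather than real dictionary output.

```python
import re
from html import unescape


def clean_text(raw):
    """Decode HTML entities, collapse whitespace runs, and trim the ends."""
    return re.sub(r"\s+", " ", unescape(raw)).strip()


def format_entry(title, body):
    """Return a plain-text entry: cleaned title, blank line, cleaned body."""
    return f"{clean_text(title)}\n\n{clean_text(body)}"


print(format_entry("  example (n.) \n", "first line &amp;\n   second   line"))
```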

Fuzzy Search:
Added headers and error handling for robustness.
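
One way to get this kind of robustness, sketched with the standard library only: wrap the request so that network failures return `None` instead of raising. `safe_fetch`, its `opener` parameter, and the header string are assumptions for illustration; the fuzzy search URL is deliberately left to the caller because it is not stated here.

```python
import urllib.error
import urllib.request

# Hypothetical header value, as an illustration only.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; etym-cli-sketch)"}


def safe_fetch(url, opener=urllib.request.urlopen, timeout=10):
    """Return the response body as text, or None if the request fails."""
    request = urllib.request.Request(url, headers=HEADERS)
    try:
        with opener(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, TimeoutError) as exc:
        # HTTPError is a subclass of URLError, so HTTP error statuses land here too.
        print(f"Request failed: {exc}")
        return None
```

Passing the opener as a parameter keeps the function testable without network access; a caller would then treat `None` as "no suggestions" rather than crashing.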

General:
Updated selectors to be more resilient against frontend changes and dynamic classes.

spithash added 2 commits June 18, 2025 09:14
…prove outputs, and add request headers

'from html import unescape'