RSS scrape image from article if otherwise none is found by rakicjovan · Pull Request #630 · glanceapp/glance · GitHub

RSS scrape image from article if otherwise none is found #630


Open: rakicjovan wants to merge 3 commits into base: dev

Conversation

rakicjovan

Some RSS feeds don't provide a media tag, and the existing fallbacks for finding an image to display sometimes fail as well. One example is the RSS feed of Bleeping Computer:
https://www.bleepingcomputer.com/feed/

This PR implements another fallback, which uses the worker pool to scrape an image from the article itself.
It uses the CSS selectors article img, main img, and .post-content img to look for images typically found in article content. The first element matched by these selectors that has a valid src attribute is used as the preview image.
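The selector logic described above can be approximated with the standard library alone. The following is a minimal sketch, not the code from this PR: it assumes that "first element" means first in document order, it uses the lenient HTML mode of `encoding/xml` instead of a real HTML parser, and the function names `firstArticleImage` and `firstImageInHTML` are hypothetical. It does not resolve relative URLs, and pages with inline scripts may trip up the lenient parser.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// isContainer reports whether an element matches one of the ancestor parts of
// the selectors from the PR description: article, main, or .post-content.
func isContainer(el xml.StartElement) bool {
	name := strings.ToLower(el.Name.Local)
	if name == "article" || name == "main" {
		return true
	}
	for _, a := range el.Attr {
		if !strings.EqualFold(a.Name.Local, "class") {
			continue
		}
		for _, c := range strings.Fields(a.Value) {
			if c == "post-content" {
				return true
			}
		}
	}
	return false
}

// firstArticleImage returns the src of the first <img> (in document order)
// that is a descendant of a matching container and has a non-empty src.
func firstArticleImage(r io.Reader) (string, bool) {
	d := xml.NewDecoder(r)
	d.Strict = false                 // tolerate common HTML mistakes
	d.AutoClose = xml.HTMLAutoClose  // treat void tags such as <img> as self-closing
	d.Entity = xml.HTMLEntity        // accept HTML named entities

	var stack []bool // one entry per open element: is it a matching container?
	inside := 0      // number of open matching containers
	for {
		tok, err := d.Token()
		if err != nil {
			return "", false // io.EOF or a parse error: nothing found
		}
		switch t := tok.(type) {
		case xml.StartElement:
			// Check the image before counting the element itself, so that an
			// <img class="post-content"> is not treated as its own ancestor.
			if inside > 0 && strings.EqualFold(t.Name.Local, "img") {
				for _, a := range t.Attr {
					src := strings.TrimSpace(a.Value)
					if strings.EqualFold(a.Name.Local, "src") && src != "" {
						return src, true
					}
				}
			}
			match := isContainer(t)
			stack = append(stack, match)
			if match {
				inside++
			}
		case xml.EndElement:
			if n := len(stack); n > 0 {
				if stack[n-1] {
					inside--
				}
				stack = stack[:n-1]
			}
		}
	}
}

// firstImageInHTML is a convenience wrapper for string input.
func firstImageInHTML(page string) (string, bool) {
	return firstArticleImage(strings.NewReader(page))
}

func main() {
	page := `<html><body><header><img src="/logo.png"></header>
<article><p>Text</p><img src=""><img src="https://example.com/a.png"></article></body></html>`
	src, ok := firstImageInHTML(page)
	fmt.Println(src, ok)
}
```

The header logo is skipped because it sits outside the matching containers, and the image with an empty src is skipped because it has nothing usable to display.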

Tested using Docker with the Bleeping Computer feed, where no pictures are displayed otherwise. Loading times don't seem to be affected, thanks to the worker pool.
If you try to replicate this with Bleeping Computer, remember to change the User-Agent; otherwise you will be blocked by Cloudflare.
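To illustrate why a worker pool keeps the extra requests from adding up serially, here is a minimal sketch of a bounded pool in Go. It is not Glance's own worker pool: the helper name `mapConcurrently` and the stand-in scraper in `main` are hypothetical, and the URLs are placeholders.

```go
package main

import (
	"fmt"
	"sync"
)

// mapConcurrently applies fn to every input using at most `workers`
// goroutines and returns the results in input order.
func mapConcurrently[T, R any](inputs []T, workers int, fn func(T) R) []R {
	if workers < 1 {
		workers = 1
	}
	results := make([]R, len(inputs))
	jobs := make(chan int) // indexes into inputs

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				// Each index is handled by exactly one worker, so writing
				// to results[i] needs no extra locking.
				results[i] = fn(inputs[i])
			}
		}()
	}

	for i := range inputs {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	// Placeholder article links; a real scraper would fetch and parse each page.
	links := []string{"https://example.com/1", "https://example.com/2", "https://example.com/3"}
	images := mapConcurrently(links, 2, func(link string) string {
		return link + "/preview.png" // stand-in for the scraped image URL
	})
	fmt.Println(images)
}
```

With a pool like this, the total wait is roughly the slowest batch of requests rather than the sum of all of them, at the cost of more simultaneous requests to the feed's site.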

Before:
image

After:
image

@dhanadhan

@rakicjovan Can you tell me how to fix this? What config do I need to set to generate images?

@rakicjovan
Author

@dhanadhan You'll have to clone my fork of the repo and build the Docker image or the binary yourself.
My config for the BleepingComputer RSS looks like this:

- type: rss
  style: detailed-list
  limit: 15
  collapse-after: 3
  feeds:
    - url: https://www.bleepingcomputer.com/feed/
      headers:
        User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3
      title: Bleeping Computer

I found that if I don't change the User-Agent, Glance gets blocked by Cloudflare.
You may also have to limit the widget to 10 items from BleepingComputer to avoid getting rate limited.
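For reference, the effect of the headers entry in the config above can be sketched in Go with net/http. This is only an illustration of sending per-feed headers with a request, not Glance's actual request code; the function name `newFeedRequest` is hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
)

// newFeedRequest builds a GET request for a feed or article URL and applies
// the per-feed headers, such as a custom User-Agent.
func newFeedRequest(url string, headers map[string]string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	for name, value := range headers {
		req.Header.Set(name, value)
	}
	return req, nil
}

func main() {
	req, err := newFeedRequest("https://www.bleepingcomputer.com/feed/", map[string]string{
		"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
	})
	if err != nil {
		fmt.Println("building request failed:", err)
		return
	}
	// The request is only constructed here, not sent.
	fmt.Println(req.Header.Get("User-Agent"))
}
```

If the article pages are scraped as well, the same headers would presumably need to be sent with those requests, since they hit the same Cloudflare-protected site.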

@svilenmarkov
Member

While this is potentially a big improvement in some cases, personally I feel that if a feed doesn't provide proper thumbnails, then just let it be. The RSS widget is already one of the slowest things to load, and this would exacerbate the issue further. There was another PR which adds fallback thumbnails; that wouldn't be as good as this, but it's much simpler and less costly to add.

@rakicjovan
Author

@svilenmarkov I've been daily driving it and haven't experienced any issues with loading times, thanks to the efficiency of Go and the worker pool. As you said, it could potentially be a big improvement for people using feeds which don't provide the image directly in the feed.
How about an option in the YAML config like "scrape-image" with a default value of false, so that the user would need to explicitly enable it if needed?

@rakicjovan
Author

@svilenmarkov RSS image scraping is now disabled by default. It must be explicitly enabled per RSS feed via the config file. The configuration documentation has been updated accordingly.
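As a rough sketch of what an opt-in, per-feed switch could look like in Go: the struct, the field names, and the helper `shouldScrapeImage` below are hypothetical, and the option name "scrape-image" is the one floated earlier in this thread, which may differ from the final name in the updated documentation.

```go
package main

import "fmt"

// feedConfig is a hypothetical per-feed config entry. Because the zero value
// of a bool is false, scraping stays off unless the user sets the option.
type feedConfig struct {
	URL         string `yaml:"url"`
	ScrapeImage bool   `yaml:"scrape-image"`
}

// feedItem is a hypothetical parsed RSS item.
type feedItem struct {
	Link     string
	ImageURL string // empty when the feed and other fallbacks gave no image
}

// shouldScrapeImage reports whether the article page should be fetched to
// look for a preview image: only when the feed opted in, no image was found
// by other means, and there is a link to fetch.
func shouldScrapeImage(cfg feedConfig, item feedItem) bool {
	return cfg.ScrapeImage && item.ImageURL == "" && item.Link != ""
}

func main() {
	item := feedItem{Link: "https://example.com/article"}
	fmt.Println(shouldScrapeImage(feedConfig{}, item))                  // option not set
	fmt.Println(shouldScrapeImage(feedConfig{ScrapeImage: true}, item)) // option enabled
}
```

Gating on the missing image as well as on the option keeps the extra page fetches limited to items that actually need them.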

4 participants