RSS scrape image from article if otherwise none is found #630

rakicjovan · 2025-05-01T21:13:12Z

Some RSS feeds don't provide a media tag, and the other fallbacks that try to find an image to display sometimes fail as well. One example is the Bleeping Computer RSS feed:
https://www.bleepingcomputer.com/feed/

This PR implements another fallback, which uses the worker pool to scrape an image from the article itself.
It uses the CSS selectors "article img", "main img", and ".post-content img" to look for images typically found in article content. The first element matched by these selectors that has a valid src attribute is used as the preview image.
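
For readers who want to see the idea in code, here is a minimal sketch of that lookup. It is not the code from this PR: it uses only the standard library's encoding/xml decoder in lenient mode rather than a real CSS selector engine, treats the three selectors as one set in document order, and takes "valid src" to mean "non-empty". The names isContainer and firstArticleImage are hypothetical.

```go
package main

import (
	"encoding/xml"
	"strings"
)

// isContainer reports whether an element is one of the containers named in
// the description: <article>, <main>, or anything with class "post-content".
func isContainer(el xml.StartElement) bool {
	switch strings.ToLower(el.Name.Local) {
	case "article", "main":
		return true
	}
	for _, attr := range el.Attr {
		if !strings.EqualFold(attr.Name.Local, "class") {
			continue
		}
		for _, class := range strings.Fields(attr.Value) {
			if class == "post-content" {
				return true
			}
		}
	}
	return false
}

// firstArticleImage (hypothetical name) returns the src of the first <img>
// nested inside a matching container, or "" if none is found.
func firstArticleImage(page string) string {
	dec := xml.NewDecoder(strings.NewReader(page))
	dec.Strict = false                // tolerate HTML that is not well-formed XML
	dec.AutoClose = xml.HTMLAutoClose // treat <img>, <br>, ... as self-closing
	dec.Entity = xml.HTMLEntity       // understand &nbsp; and friends

	var open []bool // one entry per open element: is it a container?
	inside := 0     // number of currently open containers
	for {
		tok, err := dec.Token()
		if err != nil {
			return "" // end of input, or markup the decoder cannot recover from
		}
		switch el := tok.(type) {
		case xml.StartElement:
			if inside > 0 && strings.EqualFold(el.Name.Local, "img") {
				for _, attr := range el.Attr {
					src := strings.TrimSpace(attr.Value)
					if strings.EqualFold(attr.Name.Local, "src") && src != "" {
						return src
					}
				}
			}
			container := isContainer(el)
			open = append(open, container)
			if container {
				inside++
			}
		case xml.EndElement:
			if n := len(open); n > 0 {
				if open[n-1] {
					inside--
				}
				open = open[:n-1]
			}
		}
	}
}
```

Images outside the listed containers, such as a site logo in the page header, are skipped. Pages with heavily malformed markup make the decoder stop early, in which case the sketch simply reports no image.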

Tested using Docker with the Bleeping Computer feed, which otherwise displays no pictures. Loading times don't seem to be affected, thanks to the worker pool.
Keep in mind that you need to change the User-Agent when trying to replicate this with Bleeping Computer; otherwise you will be blocked by Cloudflare.
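
To illustrate why a worker pool keeps the extra page fetches from adding up, here is a rough sketch of the pattern. It is not the pool used in this PR: the item type, the fillMissingImages function, and the scrape callback are all hypothetical stand-ins.

```go
package main

import "sync"

// item is a hypothetical stand-in for a feed entry.
type item struct {
	Link     string
	ImageURL string
}

// fillMissingImages calls scrape for every item that has no image yet, using
// at most `workers` goroutines at a time. scrape returns "" when it finds
// nothing, which leaves the item without an image.
func fillMissingImages(items []item, workers int, scrape func(link string) string) {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				// Each index is handed to exactly one worker, so the
				// slice elements can be written without a lock.
				items[i].ImageURL = scrape(items[i].Link)
			}
		}()
	}
	for i := range items {
		if items[i].ImageURL == "" {
			jobs <- i
		}
	}
	close(jobs)
	wg.Wait()
}
```

With this shape, the total wait is roughly the slowest batch of requests rather than the sum of all of them, and items that already have a thumbnail never trigger a fetch.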

Before:

After:

dhanadhan · 2025-05-18T05:43:54Z

@rakicjovan Can you tell me how to fix this? What config do I need to set to generate images?

rakicjovan · 2025-05-18T12:05:18Z

@dhanadhan You'll have to clone my fork of the repo and build the Docker image or the binary.
My config for the BleepingComputer RSS looks like this:

- type: rss
  style: detailed-list
  limit: 15
  collapse-after: 3
  feeds:
    - url: https://www.bleepingcomputer.com/feed/
      headers:
        User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3
      title: Bleeping Computer

I found that if I don't change the User-Agent, Glance gets blocked by Cloudflare.
You may also have to limit it to 10 items from BleepingComputer to avoid being rate limited.

svilenmarkov · 2025-05-19T20:20:16Z

While this is potentially a big improvement in some cases, personally I feel that if a feed doesn't provide proper thumbnails, then just let it be. The RSS widget is already one of the slowest things to load, and this would exacerbate the issue further. There was another PR which adds fallback thumbnails; that wouldn't be as good as this, but it's much simpler and less costly to add.

rakicjovan · 2025-05-19T22:10:51Z

@svilenmarkov I've been daily driving it and haven't experienced any issues with loading times, thanks to the efficiency of Go and the worker pool. As you said, it could potentially be a big improvement for people using feeds which don't provide the image directly in the feed.
How about an option in the YAML config like "scrape-image" with a default value of false, so the user would need to explicitly set the option if needed?

rakicjovan · 2025-06-06T17:55:44Z

@svilenmarkov RSS image scraping is now disabled by default. It must be explicitly enabled per RSS feed via the config file. The configuration documentation has been updated accordingly.
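
The opt-in behaviour can be pictured roughly as follows. This is only a sketch: feedOptions and shouldScrape are hypothetical, and the field name mirrors the "scrape-image" option proposed earlier in this thread, which may not match the name used in the final config.

```go
package main

// feedOptions is a hypothetical per-feed settings struct.
type feedOptions struct {
	URL         string
	ScrapeImage bool // zero value is false, so scraping stays off unless set
}

// shouldScrape reports whether the article page should be fetched for an
// image: only when the feed opted in and no thumbnail was found by other means.
func shouldScrape(feed feedOptions, existingImage string) bool {
	return feed.ScrapeImage && existingImage == ""
}
```

Because the flag defaults to false, existing configs behave as before and only feeds that explicitly enable it pay the cost of the extra requests.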

implemented RSS image scraping

fa7bc62

rakic-jovan added 2 commits June 6, 2025 19:04

Merge branch 'dev' into feature-devRSSImageScraper

1d84304

make image scraping from RSS optional via YAML

7353f63
