8000 Implement behavior on parsing of a SitemapIndex with a bad sitemap link in it · Issue #259 · crawler-commons/crawler-commons · GitHub

Implement behavior on parsing of a SitemapIndex with a bad sitemap link in it #259


Open
Chaiavi opened this issue Aug 18, 2019 · 6 comments

Comments

@Chaiavi
Member
Chaiavi commented Aug 18, 2019

Ran the main method of SiteMapTester with the argument https://www.google.com/sitemap.xml

Our recursive sitemap tester fails with the following exception:

[main] INFO crawlercommons.sitemaps.SiteMapTester - Parsing https://www.google.com/adwords/intl/ALL_ar/adwords/sitemap.xml 
Exception in thread "main" crawlercommons.sitemaps.UnknownFormatException: Failed to detect MediaType of sitemap 'https://www.google.com/adwords/intl/ALL_ar/adwords/sitemap.xml'
	at crawlercommons.sitemaps.SiteMapParser.parseSiteMap(SiteMapParser.java:269)
	at crawlercommons.sitemaps.SiteMapTester.parse(SiteMapTester.java:77)
	at crawlercommons.sitemaps.SiteMapTester.parse(SiteMapTester.java:85)
	at crawlercommons.sitemaps.SiteMapTester.parse(SiteMapTester.java:85)
	at crawlercommons.sitemaps.SiteMapTester.main(SiteMapTester.java:53)
@sebastian-nagel
Contributor

Good catch: SiteMapTester and the recursive parsing method SiteMapParser.walkSiteMap(...) do not catch exceptions when some of the sub-sitemaps of a sitemap index fail for some reason. https://www.google.com/adwords/intl/ALL_ar/adwords/sitemap.xml isn't a sitemap but an HTML page, so the failure is expected.

+1 to catch exceptions and log them instead. If anybody needs a validating parser which quits on the first error, it can be easily implemented. But we might also provide an optional argument to control the behavior on errors. Any thoughts?
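A minimal sketch of the catch-and-log behavior proposed here, with the fetch/parse step simulated by an in-memory map (the class and field names are hypothetical illustrations, not the crawler-commons API; in the real library the failing step would be SiteMapParser.parseSiteMap(...) throwing UnknownFormatException):

```java
import java.util.*;

// Hypothetical sketch: a recursive sitemap-index walk that logs per-entry
// failures and keeps going instead of aborting the whole traversal.
public class TolerantWalkSketch {
    // Simulated sitemap graph: URL -> child URLs; a missing key stands in
    // for an unparseable document (e.g. an HTML page served as sitemap.xml).
    static final Map<String, List<String>> GRAPH = new HashMap<>();
    static {
        GRAPH.put("https://example.com/sitemap_index.xml",
                  Arrays.asList("https://example.com/a.xml",
                                "https://example.com/bad.xml"));
        GRAPH.put("https://example.com/a.xml", Collections.emptyList());
        // bad.xml is absent -> simulates a thrown UnknownFormatException
    }

    static int parsed = 0, failed = 0;

    static void walk(String url) {
        List<String> children = GRAPH.get(url);
        if (children == null) {
            // In the real parser this branch would be a caught exception.
            failed++;
            System.err.println("Skipping unparseable sitemap: " + url);
            return;
        }
        parsed++;
        for (String child : children) {
            walk(child); // siblings are still visited after a failure
        }
    }

    public static void main(String[] args) {
        walk("https://example.com/sitemap_index.xml");
        System.out.println("parsed=" + parsed + " failed=" + failed);
    }
}
```

With this structure one bad entry in an index costs only that entry, which matches the log-and-continue behavior suggested above.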

@Chaiavi
Member Author
Chaiavi commented Sep 10, 2019

Just to clarify the scenario:

This specific scenario attempts to parse the AdWords main sitemap, which is a legitimate sitemap index containing links to other, bad sitemaps like this one.

I am +1 on catching the internal exception and logging it.

Regarding the validating parser option: if we want to implement it, can we use the strict or allowPartial flags?

@sebastian-nagel
Contributor

The option strict has a clearly different meaning, see #267. The option allowPartial may fit, but we could also define a new one with well-defined semantics: keep going when parsing sitemap indexes recursively.
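To make the three option names being discussed concrete, a hypothetical configuration sketch (none of these field names or comments define the actual crawler-commons API; the strict and allowPartial descriptions are paraphrases of this thread, and continueOnIndexError is the new option under discussion):

```java
// Hypothetical sketch only: contrasts the semantics of the existing
// options with the newly proposed one.
public class ParserOptionsSketch {
    // existing: enforce stricter validity rules on sitemap URLs (see #267)
    boolean strict = true;
    // existing: accept a sitemap that could only be partially parsed
    boolean allowPartial = false;
    // proposed: skip bad sub-sitemaps of an index instead of aborting
    boolean continueOnIndexError = true;

    public static void main(String[] args) {
        ParserOptionsSketch o = new ParserOptionsSketch();
        System.out.println(o.strict + " " + o.allowPartial + " "
                           + o.continueOnIndexError);
    }
}
```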

@Chaiavi Chaiavi changed the title Sitemap Tester Fails to Parse Google's Sitemap Implement behavior on parsing of a SitemapIndex with a bad sitemap link in it Sep 11, 2019
@evanhalley
Contributor

From an implementation standpoint, should it be the responsibility of the user of SiteMapParser to catch and handle exceptions? (i.e., the SiteMapTester class logs the exception and continues instead of terminating)

@sebastian-nagel
Contributor

@evanhalley: Yes, in principle I agree that users should handle the exceptions.

However, in the current SiteMapParser implementation all kinds of exceptions are caught, possibly logged, and then ignored. The parser is tuned to keep going and not to fail at the first invalid XML tag or entity, malformed URL, lastmod date, priority, etc. That makes sense because a web crawler wants as many URLs as possible.

In order to keep going, the parseSiteMap method cannot throw the exception, since it would lose its internal state that way. But maybe it's a good idea to allow users to register an exception handler? We could provide one (the default, same behavior as now) which ignores format errors whenever it is possible to continue, and one which reports or fails on any error (to implement a validating parser). Any thoughts?

the SiteMapTester class, logs the exception and continues

Yes, and strictly speaking the SiteMapTester is a user of the API (with #197 it would be moved to the tools module). It would need to catch the exception because it calls the parser in a loop for each (sub)sitemap of a sitemap index. The method SiteMapParser.walkSiteMap(...) is part of the API.
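The registered-handler idea could look roughly like this (the names SitemapErrorHandler and onError are hypothetical, not part of crawler-commons; the walk is simulated with string URLs rather than real parsing):

```java
// Hypothetical sketch of a pluggable error handler: one default that logs
// and continues, and one that aborts at the first error (validating mode).
public class ErrorHandlerSketch {

    interface SitemapErrorHandler {
        // Return true to keep walking the sitemap index, false to abort.
        boolean onError(String url, Exception e);
    }

    // Default behavior: log the failure and keep going.
    static final SitemapErrorHandler LENIENT = (url, e) -> {
        System.err.println("Ignoring " + e.getMessage() + " at " + url);
        return true;
    };

    // Validating behavior: stop at the first error.
    static final SitemapErrorHandler STRICT = (url, e) -> false;

    // Simulated walk over three sub-sitemaps; the second one is bad.
    static int walk(SitemapErrorHandler handler) {
        String[] subs = {"a.xml", "bad.xml", "c.xml"};
        int parsedOk = 0;
        for (String s : subs) {
            if (s.startsWith("bad")) {
                if (!handler.onError(s, new RuntimeException("unparseable"))) {
                    break; // abort on first error
                }
            } else {
                parsedOk++;
            }
        }
        return parsedOk;
    }

    public static void main(String[] args) {
        System.out.println("lenient=" + walk(LENIENT)
                           + " strict=" + walk(STRICT));
    }
}
```

The lenient handler reaches both good sub-sitemaps, while the strict one stops after the first, which is exactly the default-vs-validating split described above.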

@rzo1
Member
rzo1 commented Jul 7, 2022

In order to keep going, the parseSiteMap method cannot throw the exception, since it would lose its internal state that way. But maybe it's a good idea to allow users to register an exception handler? We could provide one (the default, same behavior as now) which ignores format errors whenever it is possible to continue, and one which reports or fails on any error (to implement a validating parser). Any thoughts?

I like the idea. In crawler4j we are doing something similar (although we do not use the concept of an exception handler): we just provide some (protected) hook methods like onUnhandledException(...) or onContentFetchException(...) to enable the user to do something with thrown exceptions.

So I am +1 here :)
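The crawler4j-style alternative, where subclasses override a protected hook instead of registering a handler object, might be sketched like this (the class and method bodies are hypothetical; only the hook-method pattern itself is taken from the comment above):

```java
// Hypothetical sketch of the protected hook-method pattern: the default
// implementation logs and continues; a subclass could override
// onUnhandledException to rethrow and get fail-fast (validating) behavior.
public class HookSketch {

    static class TolerantWalker {
        int visited = 0;

        // Hook: called whenever a sub-sitemap cannot be processed.
        protected void onUnhandledException(String url, RuntimeException e) {
            System.err.println("Skipping " + url + ": " + e.getMessage());
        }

        void walk(String[] urls) {
            for (String url : urls) {
                try {
                    if (url.contains("bad")) {
                        throw new RuntimeException("not a sitemap");
                    }
                    visited++;
                } catch (RuntimeException e) {
                    onUnhandledException(url, e); // policy decided by the hook
                }
            }
        }
    }

    public static void main(String[] args) {
        TolerantWalker w = new TolerantWalker();
        w.walk(new String[] {"a.xml", "bad.xml", "c.xml"});
        System.out.println("visited=" + w.visited);
    }
}
```

Compared with a registered handler, hooks keep the API surface smaller but require subclassing to change the policy.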
