8000 Implement behavior on parsing of a SitemapIndex with a bad sitemap link in it · Issue #259 · crawler-commons/crawler-commons · GitHub

Implement behavior on parsing of a SitemapIndex with a bad sitemap link in it #259


Open
Chaiavi opened this issue Aug 18, 2019 · 6 comments

Comments

@Chaiavi
Member
Chaiavi commented Aug 18, 2019

Ran the main method of SiteMapTester with the argument https://www.google.com/sitemap.xml

Our recursive sitemap tester fails with the following exception:

[main] INFO crawlercommons.sitemaps.SiteMapTester - Parsing https://www.google.com/adwords/intl/ALL_ar/adwords/sitemap.xml 
Exception in thread "main" crawlercommons.sitemaps.UnknownFormatException: Failed to detect MediaType of sitemap 'https://www.google.com/adwords/intl/ALL_ar/adwords/sitemap.xml'
	at crawlercommons.sitemaps.SiteMapParser.parseSiteMap(SiteMapParser.java:269)
	at crawlercommons.sitemaps.SiteMapTester.parse(SiteMapTester.java:77)
	at crawlercommons.sitemaps.SiteMapTester.parse(SiteMapTester.java:85)
	at crawlercommons.sitemaps.SiteMapTester.parse(SiteMapTester.java:85)
	at crawlercommons.sitemaps.SiteMapTester.main(SiteMapTester.java:53)
@sebastian-nagel
Contributor

Good catch: SiteMapTester and the recursive parsing method SiteMapParser.walkSiteMap(...) do not catch exceptions when some of the sub-sitemaps of a sitemap index fail for some reason. https://www.google.com/adwords/intl/ALL_ar/adwords/sitemap.xml isn't a sitemap but an HTML page, so the failure is expected.

+1 to catch exceptions and log them instead. If anybody needs a validating parser which quits on the first error, it can be easily implemented. But we might also provide an optional argument to control the behavior on errors. Any thoughts?
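A minimal sketch of the catch-and-log behavior proposed here, with the fetch/parse step simulated by an in-memory map (the class and field names are hypothetical illustrations, not the crawler-commons API; in the real library the failing step would be SiteMapParser.parseSiteMap(...) throwing UnknownFormatException):

```java
import java.util.*;

// Hypothetical sketch: a recursive sitemap-index walk that logs per-entry
// failures and keeps going instead of aborting the whole traversal.
public class TolerantWalkSketch {
    // Simulated sitemap graph: URL -> child URLs; a missing key stands in
    // for an unparseable document (e.g. an HTML page served as sitemap.xml).
    static final Map<String, List<String>> GRAPH = new HashMap<>();
    static {
        GRAPH.put("https://example.com/sitemap_index.xml",
                  Arrays.asList("https://example.com/a.xml",
                                "https://example.com/bad.xml"));
        GRAPH.put("https://example.com/a.xml", Collections.emptyList());
        // bad.xml is absent -> simulates a thrown UnknownFormatException
    }

    static int parsed = 0, failed = 0;

    static void walk(String url) {
        List<String> children = GRAPH.get(url);
        if (children == null) {
            // In the real parser this branch would be a caught exception.
            failed++;
            System.err.println("Skipping unparseable sitemap: " + url);
            return;
        }
        parsed++;
        for (String child : children) {
            walk(child); // siblings are still visited after a failure
        }
    }

    public static void main(String[] args) {
        walk("https://example.com/sitemap_index.xml");
        System.out.println("parsed=" + parsed + " failed=" + failed);
    }
}
```

With this structure one bad entry in an index costs only that entry, which matches the log-and-continue behavior suggested above.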

@Chaiavi
Member Author
Chaiavi commented Sep 10, 2019

Just to clarify the scenario:

This specific scenario attempts to parse the AdWords main sitemap, which is a legitimate sitemap index containing links to other, bad sitemaps like this one.

I am +1 on catching the internal exception and logging it.

Regarding the validating parser option: if we want to implement it, can we use the strict or allowPartial flags?

@sebastian-nagel
Contributor

The option strict has a clearly different meaning, see #267. The option allowPartial may fit, but we could also define a new one with well-defined semantics: keep going when parsing sitemap indexes recursively.
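To make the three option names being discussed concrete, a hypothetical configuration sketch (none of these field names or comments define the actual crawler-commons API; the strict and allowPartial descriptions are paraphrases of this thread, and continueOnIndexError is the new option under discussion):

```java
// Hypothetical sketch only: contrasts the semantics of the existing
// options with the newly proposed one.
public class ParserOptionsSketch {
    // existing: enforce stricter validity rules on sitemap URLs (see #267)
    boolean strict = true;
    // existing: accept a sitemap that could only be partially parsed
    boolean allowPartial = false;
    // proposed: skip bad sub-sitemaps of an index instead of aborting
    boolean continueOnIndexError = true;

    public static void main(String[] args) {
        ParserOptionsSketch o = new ParserOptionsSketch();
        System.out.println(o.strict + " " + o.allowPartial + " "
                           + o.continueOnIndexError);
    }
}
```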

@Chaiavi Chaiavi changed the title Sitemap Tester Fails to Parse Google's Sitemap Implement behavior on parsing of a SitemapIndex with a bad sitemap link in it Sep 11, 2019
@evanhalley
Contributor

From an implementation standpoint, should it be the responsibility of the user of SiteMapParser to catch and handle exceptions? (i.e., the SiteMapTester class logs the exception and continues instead of terminating)

@sebastian-nagel
Contributor

@evanhalley: Yes, in principle I agree that users should handle the exceptions.

However, in the current SiteMapParser implementation all kinds of exceptions are caught, possibly logged, and then ignored. The parser is tuned to keep going and not to fail at the first invalid XML tag or entity, malformed URL, lastmod date, priority, etc. That makes sense because a web crawler wants as many URLs as possible.

In order to keep going, the parseSiteMap method cannot throw the exception, since it would lose its internal state that way. But maybe it's a good idea to allow users to register an exception handler? We could provide one (the default, same behavior as now) which ignores format errors whenever it is possible to continue, and one which reports or fails on any error (to implement a validating parser). Any thoughts?

the SiteMapTester class, logs the exception and continues

Yes, and strictly speaking the SiteMapTester is a user of the API (with #197 it would be moved to the tools module). It would need to catch the exception because it calls the parser in a loop for each (sub)sitemap of a sitemap index. The method SiteMapParser.walkSiteMap(...) is part of the API.
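The registered-handler idea could look roughly like this (the names SitemapErrorHandler and onError are hypothetical, not part of crawler-commons; the walk is simulated with string URLs rather than real parsing):

```java
// Hypothetical sketch of a pluggable error handler: one default that logs
// and continues, and one that aborts at the first error (validating mode).
public class ErrorHandlerSketch {

    interface SitemapErrorHandler {
        // Return true to keep walking the sitemap index, false to abort.
        boolean onError(String url, Exception e);
    }

    // Default behavior: log the failure and keep going.
    static final SitemapErrorHandler LENIENT = (url, e) -> {
        System.err.println("Ignoring " + e.getMessage() + " at " + url);
        return true;
    };

    // Validating behavior: stop at the first error.
    static final SitemapErrorHandler STRICT = (url, e) -> false;

    // Simulated walk over three sub-sitemaps; the second one is bad.
    static int walk(SitemapErrorHandler handler) {
        String[] subs = {"a.xml", "bad.xml", "c.xml"};
        int parsedOk = 0;
        for (String s : subs) {
            if (s.startsWith("bad")) {
                if (!handler.onError(s, new RuntimeException("unparseable"))) {
                    break; // abort on first error
                }
            } else {
                parsedOk++;
            }
        }
        return parsedOk;
    }

    public static void main(String[] args) {
        System.out.println("lenient=" + walk(LENIENT)
                           + " strict=" + walk(STRICT));
    }
}
```

The lenient handler reaches both good sub-sitemaps, while the strict one stops after the first, which is exactly the default-vs-validating split described above.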

@rzo1
Member
rzo1 commented Jul 7, 2022

In order to keep going, the parseSiteMap method cannot throw the exception, since it would lose its internal state that way. But maybe it's a good idea to allow users to register an exception handler? We could provide one (the default, same behavior as now) which ignores format errors whenever it is possible to continue, and one which reports or fails on any error (to implement a validating parser). Any thoughts?

I like the idea. In crawler4j we are doing something similar (although we do not use the concept of an exception handler): we just provide some (protected) hook methods like onUnhandledException(...) or onContentFetchException(...) to enable the user to do something with thrown exceptions.

So I am +1 here :)
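The crawler4j-style alternative, where subclasses override a protected hook instead of registering a handler object, might be sketched like this (the class and method bodies are hypothetical; only the hook-method pattern itself is taken from the comment above):

```java
// Hypothetical sketch of the protected hook-method pattern: the default
// implementation logs and continues; a subclass could override
// onUnhandledException to rethrow and get fail-fast (validating) behavior.
public class HookSketch {

    static class TolerantWalker {
        int visited = 0;

        // Hook: called whenever a sub-sitemap cannot be processed.
        protected void onUnhandledException(String url, RuntimeException e) {
            System.err.println("Skipping " + url + ": " + e.getMessage());
        }

        void walk(String[] urls) {
            for (String url : urls) {
                try {
                    if (url.contains("bad")) {
                        throw new RuntimeException("not a sitemap");
                    }
                    visited++;
                } catch (RuntimeException e) {
                    onUnhandledException(url, e); // policy decided by the hook
                }
            }
        }
    }

    public static void main(String[] args) {
        TolerantWalker w = new TolerantWalker();
        w.walk(new String[] {"a.xml", "bad.xml", "c.xml"});
        System.out.println("visited=" + w.visited);
    }
}
```

Compared with a registered handler, hooks keep the API surface smaller but require subclassing to change the policy.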
