Implement behavior on parsing of a SitemapIndex with a bad sitemap link in it #259
Good catch: SitemapTester and also the recursive parsing method. +1 to catch exceptions and log them instead. If anybody needs a validating parser which quits on the first error, it can easily be implemented. But we might also provide an optional argument to control the behavior on errors. Any thoughts?
Just to clarify the scenario: this specific case is an attempt to parse the AdWords main sitemap, which is a legitimate sitemap index containing links to other, bad sitemaps like this one. I am +1 on catching the internal exception and logging it. About the validating parser option: if we would like to implement it, can we use the Strict or AllowPartial flags?
The option strict has a clearly different meaning, see #267. The option allowPartial may fit, but we could also define a new one with the defined semantics to keep going when recursively parsing sitemap indexes.
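To make the discussion concrete, a new option could look something like the sketch below. This is illustrative only: the class and enum names (`ParseOptions`, `SubSitemapErrorPolicy`) are hypothetical and not part of the crawler-commons API.

```java
// Hypothetical sketch of a parser option, distinct from "strict" and
// "allowPartial", governing errors met while recursively parsing a
// sitemap index. All names here are invented for illustration.
public class ParseOptions {
    public enum SubSitemapErrorPolicy {
        SKIP_AND_LOG, // current behavior: log the bad sub-sitemap, keep going
        FAIL_FAST     // validating parser: stop at the first bad sub-sitemap
    }

    // Default matches today's behavior: keep going on errors.
    private SubSitemapErrorPolicy policy = SubSitemapErrorPolicy.SKIP_AND_LOG;

    public SubSitemapErrorPolicy getPolicy() {
        return policy;
    }

    public void setPolicy(SubSitemapErrorPolicy policy) {
        this.policy = policy;
    }
}
```

Keeping the lenient policy as the default would preserve backward compatibility for existing crawler users while letting a validating parser opt in to fail-fast semantics.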
From an implementation standpoint, should it be the responsibility of the user of SiteMapParser to handle these exceptions?
@evanhalley: Yes, I principally agree that users shall handle the exceptions. However, in the current SiteMapParser implementation all kinds of exceptions are caught, eventually logged and then ignored. The parser is tuned to keep going and not to fail at the first invalid XML tag or entity, any malformed URL, lastmod date, priority, etc. That makes sense because a web crawler wants as many URLs as possible. In order to keep going, the parseSiteMap method cannot throw the exception, since it would lose its internal state that way. But maybe it's a good idea to allow users to register an exception handler? And we could provide one (default, same behavior as now) to ignore format errors if it's possible to continue, and one to report or fail on any error (to implement a validating parser). Any thoughts?
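The registered-handler idea could be sketched roughly as follows. This is a minimal sketch under stated assumptions: the interface name `SiteMapErrorHandler`, the method `onError`, and the two provided handlers are all hypothetical, not existing crawler-commons API.

```java
// Hypothetical sketch: a user-registered error-handler callback for
// SiteMapParser. All names are illustrative, not the real API.
import java.net.URL;

public class HandlerSketch {
    /** Hypothetical callback invoked when a (sub-)sitemap fails to parse. */
    public interface SiteMapErrorHandler {
        /** Return true to continue parsing, false to abort. */
        boolean onError(URL sitemapUrl, Exception cause);
    }

    // Default handler: log the error and keep going (today's behavior).
    public static final SiteMapErrorHandler LENIENT = (url, e) -> {
        System.err.println("Skipping " + url + ": " + e.getMessage());
        return true;
    };

    // Validating handler: stop at the first error.
    public static final SiteMapErrorHandler STRICT = (url, e) -> false;

    public static void main(String[] args) throws Exception {
        URL bad = new URL("https://example.com/bad-sitemap.xml");
        Exception cause = new Exception("unknown sitemap format");
        System.out.println(LENIENT.onError(bad, cause)); // prints true
        System.out.println(STRICT.onError(bad, cause));  // prints false
    }
}
```

Because the handler returns a boolean, the recursive parsing loop can consult it after each failed sub-sitemap and either continue or unwind, without parseSiteMap ever having to throw and lose its internal state.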
Yes, and strictly speaking the …
I like the idea. In … So I am +1 here :)
Ran the main method of SiteMapTester with the argument: https://www.google.com/sitemap.xml
Our recursive sitemap tester fails with the following exception: