Add namespace aware DOM/SAX parsing for XML Sitemaps. #173
Conversation
RSS and Atom parsing is also namespace aware, but finding elements is left "relaxed" by only matching on the element "localName" (legacy behavior).
Thanks @grokwich, looks good at first glance but would need a more thorough review. BTW you can apply the project's formatting rules with …
Cool. Thx for the formatting tip. Btw, if you wish to include the enhancement, feel free to hack up the code to suit your project.
@@ -436,18 +445,18 @@ protected SiteMap parseXmlSitemap(URL sitemapUrl, Document doc) {
     SiteMap sitemap = new SiteMap(sitemapUrl);
     sitemap.setType(SitemapType.XML);

-    NodeList list = doc.getElementsByTagName("url");
+    NodeList list = doc.getElementsByTagNameNS(Namespace.SITEMAP, "url");
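For context, a namespace-aware lookup like the one above only works if the underlying DOM parser was built with namespace support enabled. A minimal, self-contained sketch follows (the SITEMAP_NS constant mirrors the Namespace.SITEMAP value this PR introduces; class and method names are illustrative):

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class NamespaceAwareDomSketch {
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    static NodeList findUrlElements(InputStream xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        // without this, getElementsByTagNameNS sees no namespace information
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().parse(xml);
        // strict: matches only <url> elements bound to the sitemap namespace;
        // passing "*" instead of SITEMAP_NS would match the localName in any namespace
        return doc.getElementsByTagNameNS(SITEMAP_NS, "url");
    }
}
```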
What would happen here if the namespace is not set explicitly in the XML? Why not do as above and just match based on the localName, no matter what the NS is?
@jnioche I believe there would be no match without specifying the namespace. Yes, there would be a risk of breaking existing sitemaps that do not specify the namespace. I was more forgiving about RSS/Atom, since they optionally allow namespaces.
But I believe the Sitemap XML spec requires a namespace...
https://www.sitemaps.org/protocol.html
https://support.google.com/webmasters/answer/183669#errors
And this validator fails when the namespace declaration is missing...
http://freetools.webmasterworld.com/tools/site-validator/
If you match any namespace, then you also risk incorrectly processing an element belonging to another namespace (for example, I can imagine a "url" element existing in some other mixed-in namespace).
So the decision seems to boil down to whether you wish to support invalid sitemap XMLs that do not include the namespace declaration. If that is a common case, then I can see merit, but my impression is that Google would reject a sitemap XML without the namespace?
If you do go forward with this pull and require the namespace, then it may be a good idea to fail fast and report an error if the namespace is missing?
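To make the fail-fast idea concrete, a hypothetical check could inspect the root element before any extraction and report a clear error. This helper is a sketch only, not code from this PR, and assumes the document was parsed with a namespace-aware builder:

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class SitemapNamespaceCheck {
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // throws if the root element lacks the required sitemap namespace
    static void requireSitemapNamespace(Document doc) {
        Element root = doc.getDocumentElement();
        String ns = root.getNamespaceURI(); // null when no xmlns declaration is present
        if (!SITEMAP_NS.equals(ns)) {
            throw new IllegalArgumentException("Invalid sitemap: root element <"
                    + root.getTagName() + "> is in namespace '" + ns
                    + "' instead of '" + SITEMAP_NS + "'");
        }
    }
}
```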
Hi @grokwich,
I've tried your pull request on a set of a few 10,000 sitemaps using the SAX parser (see 1). There is a significant impact: the number of extracted URLs drops from 193 million to 188 million. Afaics, that's (mostly) not caused by absent namespace declarations but by a namespace URL not equal to http://www.sitemaps.org/schemas/sitemap/0.9. E.g., in this sitemap.xml:
<urlset xmlns="http://www.google.com/schemas/sitemap/0.9"
xmlns:xhtml="http://www.w3.org/1999/xhtml">
It could be just an outdated location or version (like https://www.google.com/schemas/sitemap/0.84/). But is this really a reason to make the parser fail? What happens if a new sitemap version is released?
I agree that the parser should support namespaces, but at least for the SAX parser robustness is most important, with the aim of extracting URLs even if the sitemap is not valid. A validation mode or parser would be nice to have, but it should be optional as it's just a different use case. If used by a crawler, robustness and fault tolerance matter.
Hi @sebastian-nagel.
193M URLs is quite the test bed :-)
Google Search Console has a sitemap testing tool. When I test with no namespace declaration I get a failure...
Also, if I provide a bad namespace URL I get a failure...
But interestingly, if I test using the URL examples you provided, then the test passes. I guess this makes sense, since the independent sitemap spec evolved from Google's old internal sitemap format. They don't seem to document those legacy namespaces, so I assume they are deprecated but still supported for legacy clients.
I also tried a few third-party validators, and they reject the legacy Google URLs.
http://tools.seochat.com/tools/site-validator
http://freetools.webmasterworld.com/tools/site-validator
The spec is clear that the namespace is required...
https://www.sitemaps.org/protocol.html
Here is also an excerpt from a Google doc to troubleshoot failed processing, where they state that the namespace must exist and have the exact URL specified (expand the "Complete error list" at the bottom of the following page)...
https://support.google.com/webmasters/answer/183669#error-list
Here is the quote...
Unsupported format
Your sitemap is not in a supported format. Sitemaps must be in XML format, and use the correct header.
Common XML mistakes:
Your sitemap must use the correct header. For example, if your sitemap contains video information, it would have the following header:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
The namespace in the header must be "http://www.sitemaps.org/schemas/sitemap/0.9" (not .9).
All XML attributes must be enclosed in either single quotes (') or double quotes (") and those quotes must be straight, not curly. Word-processing programs such as Microsoft Word can insert curly quotes.
Another consideration is when a sitemap adds additional namespaces. If one of those includes element names that overlap with element names from sitemap XML, then you run the risk of incorrectly processing an element from another namespace.
For example, here is a case where you would incorrectly process foobar:url...
<urlset
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:foobar="http://www.foobar.com/schemas/foobar">
<url>
...
</url>
<foobar:url>
...
</foobar:url>
...
</urlset>
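To illustrate how the namespace URI disambiguates the two url elements above: in a namespace-aware SAX parse both report the localName "url", but only the first carries the sitemap namespace. A minimal handler sketch, assuming SAXParserFactory.setNamespaceAware(true) and with illustrative names:

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class UrlCountingHandler extends DefaultHandler {
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";
    int sitemapUrls = 0;

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) {
        // <url> matches; <foobar:url> is skipped because its uri differs,
        // even though both have localName "url"
        if (SITEMAP_NS.equals(uri) && "url".equals(localName)) {
            sitemapUrls++;
        }
    }
}
```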
As for versioning, if a new namespace is introduced for sitemaps (unlikely), then presumably new capability is being added that will change the way a client should process the doc. In that case the sitemap parsers would have to be enhanced to support processing the new version, and clients would have to specify which version to parse with (the default being the old version).
Whatever pragmatic decision you guys make is good with me. Thx.
Hi @grokwich,
First off, many thanks for the very detailed response & testing! After dealing with all of the fun issues with invalid robots.txt files found in the wild, I appreciate your research and the time spent documenting things.
My two cents - I think your middle-ground solution is a good way to handle the two conflicting goals.
+1 to the middle-ground approach
Thanks for the testing and commenting
String changeFreq = getElementValue(elem, "changefreq");
String priority = getElementValue(elem, "priority");
String loc = getElementValue(elem, "loc");
String lastMod = getElementValue(Namespace.SITEMAP, elem, "lastmod");
same as above
} else if ("feed".equals(qName)) {
    elementStack.push(localName);

if ( Namespace.SITEMAP.equals(uri) ) {
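The strict check above could be relaxed behind a flag, as discussed below. A hypothetical helper (not the PR's code) that captures the two behaviors:

```java
class NamespaceMatching {
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    static boolean acceptNamespace(String uri, boolean strictNamespace) {
        if (strictNamespace) {
            // exact spec namespace only; rejects absent or legacy namespaces
            return SITEMAP_NS.equals(uri);
        }
        // relaxed: match on localName alone; note SAX reports "" for uri
        // when no namespace declaration is in scope
        return true;
    }
}
```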
same here - why don't we consider only localNames regardless of the NS? What if the NS is not set?
@crawler-commons/committers @grokwich Could you please review it?
Hi @jnioche
The use of the wildcard namespace looks good to me. Though I thought the "middle ground approach" was to match against the wildcard namespace by default (non-strict), and allow configuration that would require correct/strict use of the sitemap namespace? If I'm reading correctly, it appears that the default is "strict = true"? I would personally be fine with that, but I thought the concern was breaking the legacy cases that @sebastian-nagel identified?
You are also overloading Sitemap.strict. The legacy meaning is related to sitemap base-URL processing, and now you are also including the correct use of the sitemap namespace (if strict == true). That may be fine, but it will constrain configuration to all strict (URL + namespace processing) or all non-strict. It may also be the reason why you made namespace processing strict by default (since legacy URL processing is strict by default)?
One other personal nice-to-have: if I configure for strict namespace processing, it would be nice to confirm that the correct sitemap namespace is declared. If not, then I would prefer to at least have a warning logged (maybe even an error)? Currently the code will silently result in zero sitemap URLs if the namespace is not correctly declared (and strict namespace processing is configured).
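For the nice-to-have above, the warning could look roughly like this. A sketch only: the class, method, and logger usage are illustrative, assuming SLF4J as the logging facade:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class NamespaceWarningSketch {
    private static final Logger LOG = LoggerFactory.getLogger(NamespaceWarningSketch.class);
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // called with the root element's namespace URI before extraction starts
    static void warnIfWrongNamespace(String rootNamespaceUri) {
        if (!SITEMAP_NS.equals(rootNamespaceUri)) {
            LOG.warn("Strict namespace processing is enabled but the root element"
                    + " uses namespace '{}' instead of '{}'; no URLs will be extracted",
                    rootNamespaceUri, SITEMAP_NS);
        }
    }
}
```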
@grokwich thanks for your comments. Yes, I have lumped together the control over the URLs and the namespace processing under the existing strict variable. Although semantically different, they are relatively close in the sense that both are about enforcing the specifications. So you are right, by default we don't use the wildcard.
I am reluctant to change the API to add a separate boolean parameter for enforcing the namespace. Instead, we could pass a Map<String,Object> around to hold the configuration. This would future-proof the code, as we'd be able to add more config elements without breaking the API. However, this is a substantial change and I'd rather deal with it in a separate PR.
@sebastian-nagel in your benchmarks, is there any way to see what we gained with this PR? There are bound to be sitemaps with namespaced elements which we could not get before. Also, what do you think of the approach I described above, i.e. use the wildcard when strict = false and add more subtle configuration separately?
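A rough sketch of that Map-based configuration idea (the keys and defaults below are invented for illustration, not part of this PR):

```java
import java.util.HashMap;
import java.util.Map;

class ParserConfigSketch {
    // hypothetical keys; the real names would be decided in the follow-up PR
    static final String STRICT_URLS = "sitemap.strict.urls";
    static final String STRICT_NAMESPACE = "sitemap.strict.namespace";

    public static void main(String[] args) {
        Map<String, Object> config = new HashMap<>();
        config.put(STRICT_URLS, true);        // keep legacy base-URL enforcement
        config.put(STRICT_NAMESPACE, false);  // but match namespaces leniently
        boolean strictNs = (Boolean) config.getOrDefault(STRICT_NAMESPACE, Boolean.TRUE);
        System.out.println("strict namespace processing: " + strictNs);
    }
}
```
New options could then be added without changing method signatures, which addresses the API-stability concern.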
I've found only a single document in sitemap-test-2017-03-04.warc.gz
But the reasons to be strict are quite different: following the format spec vs. avoiding being spammed. If #85 is implemented, I would configure the parser for commoncrawl.org to be strict and accept only URLs from the list of cross-submit hosts and the sitemap host itself. But parsing should still be permissive regarding the format. The SAX parser already has a flag …
Have opened a new PR #176 with a separate field for configuring strictNamespace.
Sorry I was unable to review the latest pull. Just took a peek, looks great.