Add namespace aware DOM/SAX parsing for XML Sitemaps. by grokwich · Pull Request #173 · crawler-commons/crawler-commons

Add namespace aware DOM/SAX parsing for XML Sitemaps. #173


Closed

Conversation

grokwich commented Sep 5, 2017

Add namespace-aware DOM/SAX parsing for XML Sitemaps. RSS and Atom parsing is also namespace-aware, but finding elements has been left "relaxed" by matching only on the element localName (legacy behavior).

Here is test XML that was failing:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<sit:urlset xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:video="http://www.google.com/schemas/sitemap-video/1.1" xmlns:sit="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/1999/xhtml link.xsd http://www.google.com/schemas/sitemap-video/1.1 video.xsd http://www.sitemaps.org/schemas/sitemap/0.9 sitemap.xsd http://www.google.com/schemas/sitemap-image/1.1 image.xsd">
    <sit:url>
        <sit:loc>http://www.example.com/1</sit:loc>
        <sit:changefreq>daily</sit:changefreq>
    </sit:url>
    <sit:url>
        <sit:loc>http://www.example.com/2</sit:loc>
        <sit:changefreq>daily</sit:changefreq>
    </sit:url>
</sit:urlset>
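
For context, the usual root cause of this kind of failure is that a JAXP DocumentBuilderFactory is not namespace-aware by default, so prefixed elements like sit:url cannot be matched by (namespace, localName). A minimal standalone sketch (class name and trimmed XML are illustrative, not taken from the PR):

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class NamespaceAwareDemo {
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    public static void main(String[] args) throws Exception {
        String xml = "<sit:urlset xmlns:sit=\"" + SITEMAP_NS + "\">"
                + "<sit:url><sit:loc>http://www.example.com/1</sit:loc></sit:url>"
                + "</sit:urlset>";

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        // The JAXP default is NOT namespace-aware; without this flag the
        // prefixed elements above never match by (namespace, localName).
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        NodeList urls = doc.getElementsByTagNameNS(SITEMAP_NS, "url");
        System.out.println("matched <url> elements: " + urls.getLength()); // prints 1
    }
}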

jnioche (Contributor) commented Sep 5, 2017

Thanks @grokwich, looks good at first glance but would need a more thorough review.

BTW you can apply the project's formatting rules with mvn java-formatter:format.

grokwich (Author) commented Sep 5, 2017

Cool. Thx for the formatting tip. Btw, if you wish to include the enhancement, feel free to hack up the code to suit your project.

@@ -436,18 +445,18 @@ protected SiteMap parseXmlSitemap(URL sitemapUrl, Document doc) {
  SiteMap sitemap = new SiteMap(sitemapUrl);
  sitemap.setType(SitemapType.XML);

- NodeList list = doc.getElementsByTagName("url");
+ NodeList list = doc.getElementsByTagNameNS(Namespace.SITEMAP, "url");
jnioche (Contributor) commented on the diff:

What would happen here if the namespace is not set explicitly in the XML? Why not do as above and just match based on the localName no matter what the NS is?

grokwich (Author) replied:

@jnioche I believe there would be no match without specifying the namespace. Yes, there would be a risk of breaking existing sitemaps that do not specify the namespace. I was more forgiving about RSS/Atom, since they optionally allow namespaces.

But I believe the Sitemap XML spec requires a namespace...
https://www.sitemaps.org/protocol.html
https://support.google.com/webmasters/answer/183669#errors

And this validator fails when the namespace declaration is missing...
http://freetools.webmasterworld.com/tools/site-validator/

If you match any namespace, then you also risk incorrectly processing an element belonging to another namespace (for example, I can imagine a url element existing in another mixed-in namespace).

So the decision seems to boil down to whether you wish to support invalid sitemap XMLs that do not include the namespace declaration. If that is a common case, then I can see merit, but my impression is that Google would reject a Sitemap XML without the namespace.

If you do go forward with this PR and require the namespace, then it may be a good idea to fail fast and report an error if the namespace is missing.
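
In DOM terms, the two options discussed here differ only in the namespace argument. A small sketch, reusing the sitemap namespace URI (the class wrapper and method names are illustrative):

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class UrlLookup {
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Strict: only elements bound to the official sitemap namespace match.
    static NodeList strictUrls(Document doc) {
        return doc.getElementsByTagNameNS(SITEMAP_NS, "url");
    }

    // Relaxed: the DOM wildcard "*" matches url elements regardless of
    // namespace, i.e. the localName-only behavior asked about above,
    // at the cost of also matching e.g. a mixed-in foobar:url.
    static NodeList relaxedUrls(Document doc) {
        return doc.getElementsByTagNameNS("*", "url");
    }
}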

sebastian-nagel (Contributor) commented:

Hi @grokwich,

I've tried your pull request on a set of a few 10,000 sitemaps using the SAX parser (see 1). There is a significant impact: the number of extracted URLs drops from 193 million to 188 million. Afaics, that's (mostly) not caused by absent namespace declarations but by a namespace URL that is not equal to http://www.sitemaps.org/schemas/sitemap/0.9. E.g., in this sitemap.xml:

<urlset xmlns="http://www.google.com/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">

It could be just an outdated location or version (like https://www.google.com/schemas/sitemap/0.84/). But is this really a reason to make the parser fail? What happens if a new sitemap version is released?

I agree that the parser should support namespaces, but at least for the SAX parser robustness is most important, with the aim to extract URLs even if the sitemap is not valid. A validation mode or parser would be nice to have, but it should be optional as it's just a different use case. If used for a crawler, robustness and fault tolerance matter.
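
One permissive option short of a full wildcard would be an allow-list that also accepts the legacy Google URIs observed in the benchmark. A hypothetical sketch (the class, method, and exact list are illustrative, not the PR's approach):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SitemapNamespaces {
    // Hypothetical allow-list: the official URI plus legacy Google ones.
    private static final Set<String> KNOWN = new HashSet<>(Arrays.asList(
            "http://www.sitemaps.org/schemas/sitemap/0.9",
            "http://www.google.com/schemas/sitemap/0.9",
            "https://www.google.com/schemas/sitemap/0.84/"));

    static boolean accepted(String namespaceUri) {
        return namespaceUri != null && KNOWN.contains(namespaceUri);
    }
}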

grokwich (Author) replied:

Hi @sebastian-nagel,

193M URLs is quite the test bed :-)

Google Search Console has a sitemap testing tool. When I test with no namespace declaration I get a failure...
[screenshot]

Also if I provide a bad namespace URL I get a failure...
[screenshot]

But interestingly, if I test using the URL examples you provided, then the test passes. I guess this makes sense since the independent sitemap spec evolved from Google's old internal sitemap format. They don't seem to document those legacy namespaces, so I assume they are deprecated but still supported for legacy clients.

I also tried a few 3rd-party validators, and they reject the Google legacy URLs.

http://tools.seochat.com/tools/site-validator
http://freetools.webmasterworld.com/tools/site-validator

The spec is clear that the namespace is required...
https://www.sitemaps.org/protocol.html

Here is also an excerpt from a Google doc to troubleshoot failed processing, where they state that the namespace must exist and have the exact URL specified (expand the "Complete error list" at the bottom of the following page)...

https://support.google.com/webmasters/answer/183669#error-list

Here is the quote...

Unsupported format
Your sitemap is not in a supported format. Sitemaps must be in XML format, and use the correct header.
Common XML mistakes:

Your sitemap must use the correct header. For example, if your sitemap contains video information, it would have the following header:
<?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">

The namespace in the header must be "http://www.sitemaps.org/schemas/sitemap/0.9" (not 0.90).
All XML attributes must be enclosed in either single quotes (') or double quotes (") and those quotes must be straight, not curly. Word-processing programs such as Microsoft Word can insert curly quotes.

Another consideration is when a sitemap adds additional namespaces. If one of those includes element names that overlap with element names from sitemap XML, then you run the risk of incorrectly processing an element from another namespace.

For example, here is a case where you would incorrectly process foobar:url...

<urlset 
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:foobar="http://www.foobar.com/schemas/foobar">

   <url>
   ...
   </url>
   
   <foobar:url>
   ...
   </foobar:url>

   ....
</urlset>

As for versioning, if a new namespace is introduced for sitemaps (unlikely), then presumably new capability is being added that will change the way a client should process the doc. In that case the sitemap parsers would have to be enhanced to support processing the new version, and clients would have to specify which version to parse with (the default being the old version).

⚠️ All that being said, I completely understand the concern of breaking existing (invalid) behavior. One middle ground would be to make the default a wildcard, but allow the parser to be configured as strict (to avoid namespace clashes).
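
A minimal sketch of that middle ground for the DOM path, assuming the wildcard is used unless strict namespace checking is requested (names are illustrative):

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class MiddleGround {
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Default: wildcard, so legacy/invalid namespaces still yield URLs.
    // Strict: require the exact sitemap namespace (avoids clashes).
    static NodeList urls(Document doc, boolean strictNamespace) {
        String ns = strictNamespace ? SITEMAP_NS : "*";
        return doc.getElementsByTagNameNS(ns, "url");
    }
}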

Whatever pragmatic decision you guys make is good with me. Thx.

Contributor commented:

Hi @grokwich,

First off, many thanks for the very detailed response & testing! After dealing with all of the fun issues with invalid robots.txt files found in the wild, I appreciate your research and the time spent documenting things.

My two cents - I think your middle-ground solution is a good way to handle the two conflicting goals.

Contributor commented:

+1 to the middle-ground approach
Thanks for the testing and commenting

String changeFreq = getElementValue(elem, "changefreq");
String priority = getElementValue(elem, "priority");
String loc = getElementValue(elem, "loc");
String lastMod = getElementValue(Namespace.SITEMAP, elem, "lastmod");
Contributor commented on the diff:

same as above

} else if ("feed".equals(qName)) {
elementStack.push(localName);

if (Namespace.SITEMAP.equals(uri)) {
Contributor commented on the diff:

Same here - why don't we consider only localNames regardless of the NS? What if the NS is not set?
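
In the SAX handler the same choice amounts to whether the uri argument of startElement is checked at all. A sketch (handler and flag names are illustrative), assuming the parser was created namespace-aware so uri/localName are populated:

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SitemapUrlHandler extends DefaultHandler {
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";
    private final boolean strictNamespace;

    SitemapUrlHandler(boolean strictNamespace) {
        this.strictNamespace = strictNamespace;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) {
        // Relaxed: decide on localName alone (works even when no
        // namespace is declared). Strict: additionally require the
        // element to be bound to the official sitemap namespace.
        if ("url".equals(localName)
                && (!strictNamespace || SITEMAP_NS.equals(uri))) {
            // ... start collecting a <url> entry ...
        }
    }
}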

jnioche added this to the 0.9 milestone Sep 25, 2017
jnioche (Contributor) commented Oct 3, 2017

@crawler-commons/committers @grokwich
I have applied the formatting (although it looks a bit dodgy in places, but that doesn't matter too much) + implemented the approach suggested above by allowing the namespaces to be anything when the strict flag is set to false. It lives in the branch grokwich-add_namespace_aware.

Could you please review it?

grokwich (Author) commented Oct 3, 2017

Hi @jnioche

The use of the wildcard namespace looks good to me.

Though I thought the "middle ground approach" was to match against the wildcard namespace by default (non-strict), and allow configuration that would require correct/strict use of the sitemap namespace?

If I'm reading correctly, it appears that the default is "strict = true"? I would personally be fine with that, but I thought the concern was breaking legacy cases that @sebastian-nagel identified?

You are also overloading Sitemap.strict. The legacy meaning is related to sitemap base-URL processing, and now you are also including the correct use of the sitemap namespace (if strict == true).

That may be fine, but it will constrain configuration to all strict (URL + namespace processing), or all non-strict. It may also be the reason why you made namespace processing strict by default (since legacy URL processing is strict by default)?

One other personal nice-to-have: if I configure strict namespace processing, it would be nice to confirm that the correct sitemap namespace is declared. If not, then I would prefer to at least have a Warning logged (maybe even an Error)? Currently the code will silently result in zero sitemap URLs if the namespace is not correctly declared (and strict namespace processing is configured).
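
The suggested warning could be as simple as checking the document root before extraction. A hypothetical sketch (class, method, and logging choice are illustrative, not existing code):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.w3c.dom.Element;

class NamespaceCheck {
    private static final Logger LOG = LoggerFactory.getLogger(NamespaceCheck.class);
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Hypothetical fail-fast check for strict namespace processing:
    // warn instead of silently extracting zero URLs.
    static void checkRootNamespace(Element root) {
        if (!SITEMAP_NS.equals(root.getNamespaceURI())) {
            LOG.warn("Root element <{}> is not in the sitemap namespace (found: {});"
                    + " strict namespace processing will extract no URLs",
                    root.getLocalName(), root.getNamespaceURI());
        }
    }
}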

jnioche (Contributor) commented Oct 5, 2017

@grokwich thanks for your comments.

Yes, I have lumped together the control over the URLs and the namespace processing under the existing strict variable. Although semantically different, they are relatively close in the sense that both are about enforcing the specifications. So you are right, by default we don't use the wildcard.

I am reluctant to change the API to add a separate boolean parameter for enforcing the namespace. Instead, we could pass a Map<String,Object> around to hold the configuration. This would future-proof the code as we'd be able to add more config elements without breaking the API. However, this is a substantial change and I'd rather deal with it in a separate PR.
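
A sketch of what that config map might look like (keys and the constructor are hypothetical, as described above this would be a separate PR):

import java.util.HashMap;
import java.util.Map;

class ParserConfigSketch {
    static Map<String, Object> defaultConfig() {
        Map<String, Object> config = new HashMap<>();
        config.put("strict", true);            // existing URL checking
        config.put("strictNamespace", false);  // namespace checking
        // A future SiteMapParser(config) could accept new keys
        // without breaking the API.
        return config;
    }
}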

@sebastian-nagel in your benchmarks, is there any way to see what we gained with this PR? There are bound to be sitemaps with namespaced elements which we could not get before. Also, what do you think of the approach I described above, i.e. use the wildcard when strict = false and add more subtle configuration separately?

sebastian-nagel (Contributor) commented:

what we gained

I've found only a single document in sitemap-test-2017-03-04.warc.gz: http://www.vanguardiapopular.org/sitemap04.xml. Anyway, even if it's only a few, they're valid!

lumped together the control over the URLs and the namespace processing under the existing strict variable. Although semantically different they are relatively close in the sense that it is about enforcing the specifications.

But the reasons to be strict are quite different: following the format spec vs. avoiding being spammed. If #85 is implemented I would configure the parser for commoncrawl.org to be strict and accept only URLs from the list of cross-submit hosts and the sitemap host itself. But parsing should still be permissive regarding the format. The SAX parser already has a flag allowPartial to allow URLs from sitemaps only partially parsed due to XML errors or because they are truncated. I would opt for a new flag (strictParsing or strictNamespace).

jnioche (Contributor) commented Oct 10, 2017

Have opened a new PR #176 with a separate field for configuring strictNamespace
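
Based on that description, usage would presumably look something like the following (the setter name follows the field mentioned above; treat it as illustrative of #176 rather than its exact API):

import crawlercommons.sitemaps.SiteMapParser;

class StrictNamespaceUsage {
    public static void main(String[] args) {
        SiteMapParser parser = new SiteMapParser(false); // lenient URL checks
        parser.setStrictNamespace(true); // enforce the sitemap namespace
        // parser.parseSiteMap(...) as usual
    }
}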

jnioche (Contributor) commented Oct 17, 2017

Implemented in #176 - thanks again @grokwich for your contribution and comments, much appreciated!

jnioche closed this Oct 17, 2017
grokwich (Author) commented:

Sorry I was unable to review the latest PR. Just took a peek, looks great.
I'm glad I could contribute @jnioche, thank you all for taking the time to push this into your nice lib.
Amazing regression analysis @sebastian-nagel! :-) Makes me even more confident using the lib as it continues to evolve.
