Add namespace aware DOM/SAX parsing for XML Sitemaps. by grokwich · Pull Request #173 · crawler-commons/crawler-commons

Add namespace aware DOM/SAX parsing for XML Sitemaps. #173


Closed

Conversation

grokwich commented Sep 5, 2017

Add namespace-aware DOM/SAX parsing for XML Sitemaps. RSS and Atom parsing is also namespace-aware, but finding elements has been left "relaxed" by matching only on the element localName (legacy behavior).

Here is test XML that was failing:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<sit:urlset xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:video="http://www.google.com/schemas/sitemap-video/1.1" xmlns:sit="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/1999/xhtml link.xsd http://www.google.com/schemas/sitemap-video/1.1 video.xsd http://www.sitemaps.org/schemas/sitemap/0.9 sitemap.xsd http://www.google.com/schemas/sitemap-image/1.1 image.xsd">
    <sit:url>
        <sit:loc>http://www.example.com/1</sit:loc>
        <sit:changefreq>daily</sit:changefreq>
    </sit:url>
    <sit:url>
        <sit:loc>http://www.example.com/2</sit:loc>
        <sit:changefreq>daily</sit:changefreq>
    </sit:url>
</sit:urlset>
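
For context, the usual root cause of this kind of failure is that a JAXP DocumentBuilderFactory is not namespace-aware by default, so prefixed elements like sit:url cannot be matched by (namespace, localName). A minimal standalone sketch (class name and trimmed XML are illustrative, not taken from the PR):

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class NamespaceAwareDemo {
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    public static void main(String[] args) throws Exception {
        String xml = "<sit:urlset xmlns:sit=\"" + SITEMAP_NS + "\">"
                + "<sit:url><sit:loc>http://www.example.com/1</sit:loc></sit:url>"
                + "</sit:urlset>";

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        // The JAXP default is NOT namespace-aware; without this flag the
        // prefixed elements above never match by (namespace, localName).
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        NodeList urls = doc.getElementsByTagNameNS(SITEMAP_NS, "url");
        System.out.println("matched <url> elements: " + urls.getLength()); // prints 1
    }
}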

jnioche (Contributor) commented Sep 5, 2017

Thanks @grokwich, looks good at first glance but would need a more thorough review.

BTW you can apply the project's formatting rules with mvn java-formatter:format.

grokwich (Author) commented Sep 5, 2017

Cool. Thx for the formatting tip. Btw, if you wish to include the enhancement, feel free to hack up the code to suit your project.

@@ -436,18 +445,18 @@ protected SiteMap parseXmlSitemap(URL sitemapUrl, Document doc) {
  SiteMap sitemap = new SiteMap(sitemapUrl);
  sitemap.setType(SitemapType.XML);

- NodeList list = doc.getElementsByTagName("url");
+ NodeList list = doc.getElementsByTagNameNS(Namespace.SITEMAP, "url");
jnioche (Contributor) commented on the diff:

What would happen here if the namespace is not set explicitly in the XML? Why not do as above and just match based on the localName no matter what the NS is?

grokwich (Author) replied:

@jnioche I believe there would be no match without specifying the namespace. Yes, there would be a risk of breaking existing sitemaps that do not specify the namespace. I was more forgiving about RSS/Atom, since they optionally allow namespaces.

But I believe the Sitemap XML spec requires a namespace...
https://www.sitemaps.org/protocol.html
https://support.google.com/webmasters/answer/183669#errors

And this validator fails when the namespace declaration is missing...
http://freetools.webmasterworld.com/tools/site-validator/

If you match any namespace, then you also risk incorrectly processing an element belonging to another namespace (for example, I can imagine a url element existing in another mixed-in namespace).

So the decision seems to boil down to whether you wish to support invalid sitemap XMLs that do not include the namespace declaration. If that is a common case, then I can see merit, but my impression is that Google would reject a Sitemap XML without the namespace.

If you do go forward with this PR and require the namespace, then it may be a good idea to fail fast and report an error if the namespace is missing.
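
In DOM terms, the two options discussed here differ only in the namespace argument. A small sketch, reusing the sitemap namespace URI (the class wrapper and method names are illustrative):

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class UrlLookup {
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Strict: only elements bound to the official sitemap namespace match.
    static NodeList strictUrls(Document doc) {
        return doc.getElementsByTagNameNS(SITEMAP_NS, "url");
    }

    // Relaxed: the DOM wildcard "*" matches url elements regardless of
    // namespace, i.e. the localName-only behavior asked about above,
    // at the cost of also matching e.g. a mixed-in foobar:url.
    static NodeList relaxedUrls(Document doc) {
        return doc.getElementsByTagNameNS("*", "url");
    }
}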

sebastian-nagel (Contributor) commented:

Hi @grokwich,

I've tried your pull request on a set of a few 10,000 sitemaps using the SAX parser (see 1). There is a significant impact: the number of extracted URLs drops from 193 million to 188 million. Afaics, that's (mostly) not caused by absent namespace declarations but by a namespace URL that is not equal to http://www.sitemaps.org/schemas/sitemap/0.9. E.g., in this sitemap.xml:

<urlset xmlns="http://www.google.com/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">

It could be just an outdated location or version (like https://www.google.com/schemas/sitemap/0.84/). But is this really a reason to make the parser fail? What happens if a new sitemap version is released?

I agree that the parser should support namespaces, but at least for the SAX parser robustness is most important, with the aim to extract URLs even if the sitemap is not valid. A validation mode or parser would be nice to have, but it should be optional as it's just a different use case. If used for a crawler, robustness and fault tolerance matter.
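
One permissive option short of a full wildcard would be an allow-list that also accepts the legacy Google URIs observed in the benchmark. A hypothetical sketch (the class, method, and exact list are illustrative, not the PR's approach):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SitemapNamespaces {
    // Hypothetical allow-list: the official URI plus legacy Google ones.
    private static final Set<String> KNOWN = new HashSet<>(Arrays.asList(
            "http://www.sitemaps.org/schemas/sitemap/0.9",
            "http://www.google.com/schemas/sitemap/0.9",
            "https://www.google.com/schemas/sitemap/0.84/"));

    static boolean accepted(String namespaceUri) {
        return namespaceUri != null && KNOWN.contains(namespaceUri);
    }
}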

grokwich (Author) replied:

Hi @sebastian-nagel,

193M URLs is quite the test bed :-)

Google Search Console has a sitemap testing tool. When I test with no namespace declaration I get a failure...
[screenshot]

Also if I provide a bad namespace URL I get a failure...
[screenshot]

But interestingly, if I test using the URL examples you provided, then the test passes. I guess this makes sense since the independent sitemap spec evolved from Google's old internal sitemap format. They don't seem to document those legacy namespaces, so I assume they are deprecated but still supported for legacy clients.

I also tried a few 3rd-party validators, and they reject the Google legacy URLs.

http://tools.seochat.com/tools/site-validator
http://freetools.webmasterworld.com/tools/site-validator

The spec is clear that the namespace is required...
https://www.sitemaps.org/protocol.html

Here is also an excerpt from a Google doc to troubleshoot failed processing, where they state that the namespace must exist and have the exact URL specified (expand the "Complete error list" at the bottom of the following page)...

https://support.google.com/webmasters/answer/183669#error-list

Here is the quote...

Unsupported format
Your sitemap is not in a supported format. Sitemaps must be in XML format, and use the correct header.
Common XML mistakes:

Your sitemap must use the correct header. For example, if your sitemap contains video information, it would have the following header:
<?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">

The namespace in the header must be "http://www.sitemaps.org/schemas/sitemap/0.9" (not 0.90).
All XML attributes must be enclosed in either single quotes (') or double quotes (") and those quotes must be straight, not curly. Word-processing programs such as Microsoft Word can insert curly quotes.

Another consideration is when a sitemap adds additional namespaces. If one of those includes element names that overlap with element names from sitemap XML, then you run the risk of incorrectly processing an element from another namespace.

For example, here is a case where you would incorrectly process foobar:url...

<urlset 
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:foobar="http://www.foobar.com/schemas/foobar">

   <url>
   ...
   </url>
   
   <foobar:url>
   ...
   </foobar:url>

   ....
</urlset>

As for versioning, if a new namespace is introduced for sitemaps (unlikely), then presumably new capability is being added that will change the way a client should process the doc. In that case the sitemap parsers would have to be enhanced to support processing the new version, and clients would have to specify which version to parse with (the default being the old version).

⚠️ All that being said, I completely understand the concern of breaking existing (invalid) behavior. One middle ground would be to make the default a wildcard, but allow the parser to be configured as strict (to avoid namespace clashes).
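
A minimal sketch of that middle ground for the DOM path, assuming the wildcard is used unless strict namespace checking is requested (names are illustrative):

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class MiddleGround {
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Default: wildcard, so legacy/invalid namespaces still yield URLs.
    // Strict: require the exact sitemap namespace (avoids clashes).
    static NodeList urls(Document doc, boolean strictNamespace) {
        String ns = strictNamespace ? SITEMAP_NS : "*";
        return doc.getElementsByTagNameNS(ns, "url");
    }
}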

Whatever pragmatic decision you guys make is good with me. Thx.

Contributor commented:

Hi @grokwich,

First off, many thanks for the very detailed response & testing! After dealing with all of the fun issues with invalid robots.txt files found in the wild, I appreciate your research and the time spent documenting things.

My two cents - I think your middle-ground solution is a good way to handle the two conflicting goals.

Contributor commented:

+1 to the middle-ground approach
Thanks for the testing and commenting

String changeFreq = getElementValue(elem, "changefreq");
String priority = getElementValue(elem, "priority");
String loc = getElementValue(elem, "loc");
String lastMod = getElementValue(Namespace.SITEMAP, elem, "lastmod");
Contributor commented on the diff:

same as above

} else if ("feed".equals(qName)) {
elementStack.push(localName);

if (Namespace.SITEMAP.equals(uri)) {
Contributor commented on the diff:

Same here - why don't we consider only localNames regardless of the NS? What if the NS is not set?
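
In the SAX handler the same choice amounts to whether the uri argument of startElement is checked at all. A sketch (handler and flag names are illustrative), assuming the parser was created namespace-aware so uri/localName are populated:

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SitemapUrlHandler extends DefaultHandler {
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";
    private final boolean strictNamespace;

    SitemapUrlHandler(boolean strictNamespace) {
        this.strictNamespace = strictNamespace;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) {
        // Relaxed: decide on localName alone (works even when no
        // namespace is declared). Strict: additionally require the
        // element to be bound to the official sitemap namespace.
        if ("url".equals(localName)
                && (!strictNamespace || SITEMAP_NS.equals(uri))) {
            // ... start collecting a <url> entry ...
        }
    }
}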

jnioche added this to the 0.9 milestone Sep 25, 2017
jnioche (Contributor) commented Oct 3, 2017

@crawler-commons/committers @grokwich
I have applied the formatting (although it looks a bit dodgy in places, but that doesn't matter too much) + implemented the approach suggested above by allowing the namespaces to be anything when the strict flag is set to false. It lives in the branch grokwich-add_namespace_aware.

Could you please review it?

grokwich (Author) commented Oct 3, 2017

Hi @jnioche

The use of the wildcard namespace looks good to me.

Though I thought the "middle ground approach" was to match against the wildcard namespace by default (non-strict), and allow configuration that would require correct/strict use of the sitemap namespace?

If I'm reading correctly, it appears that the default is "strict = true"? I would personally be fine with that, but I thought the concern was breaking legacy cases that @sebastian-nagel identified?

You are also overloading Sitemap.strict. The legacy meaning is related to sitemap base-URL processing, and now you are also including the correct use of the sitemap namespace (if strict == true).

That may be fine, but it will constrain configuration to all strict (URL + namespace processing), or all non-strict. It may also be the reason why you made namespace processing strict by default (since legacy URL processing is strict by default)?

One other personal nice-to-have: if I configure strict namespace processing, it would be nice to confirm that the correct sitemap namespace is declared. If not, then I would prefer to at least have a Warning logged (maybe even an Error)? Currently the code will silently result in zero sitemap URLs if the namespace is not correctly declared (and strict namespace processing is configured).
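
The suggested warning could be as simple as checking the document root before extraction. A hypothetical sketch (class, method, and logging choice are illustrative, not existing code):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.w3c.dom.Element;

class NamespaceCheck {
    private static final Logger LOG = LoggerFactory.getLogger(NamespaceCheck.class);
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Hypothetical fail-fast check for strict namespace processing:
    // warn instead of silently extracting zero URLs.
    static void checkRootNamespace(Element root) {
        if (!SITEMAP_NS.equals(root.getNamespaceURI())) {
            LOG.warn("Root element <{}> is not in the sitemap namespace (found: {});"
                    + " strict namespace processing will extract no URLs",
                    root.getLocalName(), root.getNamespaceURI());
        }
    }
}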

jnioche (Contributor) commented Oct 5, 2017

@grokwich thanks for your comments.

Yes, I have lumped together the control over the URLs and the namespace processing under the existing strict variable. Although semantically different, they are relatively close in the sense that both are about enforcing the specifications. So you are right, by default we don't use the wildcard.

I am reluctant to change the API to add a separate boolean parameter for enforcing the namespace. Instead, we could pass a Map<String,Object> around to hold the configuration. This would future-proof the code as we'd be able to add more config elements without breaking the API. However, this is a substantial change and I'd rather deal with it in a separate PR.
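
A sketch of what that config map might look like (keys and the constructor are hypothetical, as described above this would be a separate PR):

import java.util.HashMap;
import java.util.Map;

class ParserConfigSketch {
    static Map<String, Object> defaultConfig() {
        Map<String, Object> config = new HashMap<>();
        config.put("strict", true);            // existing URL checking
        config.put("strictNamespace", false);  // namespace checking
        // A future SiteMapParser(config) could accept new keys
        // without breaking the API.
        return config;
    }
}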

@sebastian-nagel in your benchmarks, is there any way to see what we gained with this PR? There are bound to be sitemaps with namespaced elements which we could not get before. Also, what do you think of the approach I described above, i.e. use the wildcard when strict = false and add more subtle configuration separately?

sebastian-nagel (Contributor) commented:

what we gained

I've found only a single document in sitemap-test-2017-03-04.warc.gz: http://www.vanguardiapopular.org/sitemap04.xml. Anyway, even if it's only a few, they're valid!

lumped together the control over the URLs and the namespace processing under the existing strict variable. Although semantically different they are relatively close in the sense that it is about enforcing the specifications.

But the reasons to be strict are quite different: following the format spec vs. avoiding being spammed. If #85 is implemented I would configure the parser for commoncrawl.org to be strict and accept only URLs from the list of cross-submit hosts and the sitemap host itself. But parsing should still be permissive regarding the format. The SAX parser already has a flag allowPartial to allow URLs from sitemaps only partially parsed due to XML errors or because they are truncated. I would opt for a new flag (strictParsing or strictNamespace).

jnioche (Contributor) commented Oct 10, 2017

Have opened a new PR #176 with a separate field for configuring strictNamespace
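
Based on that description, usage would presumably look something like the following (the setter name follows the field mentioned above; treat it as illustrative of #176 rather than its exact API):

import crawlercommons.sitemaps.SiteMapParser;

class StrictNamespaceUsage {
    public static void main(String[] args) {
        SiteMapParser parser = new SiteMapParser(false); // lenient URL checks
        parser.setStrictNamespace(true); // enforce the sitemap namespace
        // parser.parseSiteMap(...) as usual
    }
}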

jnioche (Contributor) commented Oct 17, 2017

Implemented in #176 - thanks again @grokwich for your contribution and comments, much appreciated!

jnioche closed this Oct 17, 2017
grokwich (Author) commented:

Sorry I was unable to review the latest PR. Just took a peek, looks great.
I'm glad I could contribute @jnioche, thank you all for taking the time to push this into your nice lib.
Amazing regression analysis @sebastian-nagel! :-) Makes me even more confident using the lib as it continues to evolve.
