Identify and address Google search and Dataset search issues · Issue #12 · DataONEorg/dataone · GitHub
Open
@amoeba

Being comprehensively indexed by search engines such as Google is a substantial benefit for DataONE and DataONE Members. Ideally, the whole variety of information (datasets, people, portals, metrics, etc.) housed within DataONE would be findable through traditional search engines and, for datasets, also Google Dataset Search.

As of 2021, our primary tool for knowing whether or not we are comprehensively indexed is the Google Search Console, which provides a suite of tools for diagnosing issues.

Some of the problems we've addressed in the past include:

Issues we have ahead of us include:

  • There's a separation between how we're doing SEO on dataone.org and search.dataone.org. We should probably integrate the two under dataone.org
  • For specific types of content (e.g., portals), our search index presence is very far from complete. Portal users would find it very valuable to show up in Google search. We need to find a way to include all of this content, not just datasets
  • Our dataset URLs aren't being fully indexed (see below)

Dataset coverage

Summary

At the time of writing, we have 846,622 dataset and portal URLs listed in our sitemaps and Google has discovered them all correctly. 812k of these are marked as "Excluded". When we drill down into the index coverage of those URLs, we get this breakdown:

Type                                                  # Pages
Discovered - currently not indexed                    770,302
Duplicate, submitted URL not selected as canonical     30,977
Duplicate without user-selected canonical              12,265
Crawled - currently not indexed                         7,986
Pages with redirect                                       424
Blocked due to other 4xx issue                              6
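
To put those counts in proportion, here is a quick sketch that computes each category's share of the listed exclusions. The counts are copied from the table above; the variable and function names are made up for illustration.

```python
# Counts copied from the index coverage table in this issue.
SUBMITTED = 846_622  # dataset and portal URLs listed in the sitemaps

excluded = {
    "Discovered - currently not indexed": 770_302,
    "Duplicate, submitted URL not selected as canonical": 30_977,
    "Duplicate without user-selected canonical": 12_265,
    "Crawled - currently not indexed": 7_986,
    "Pages with redirect": 424,
    "Blocked due to other 4xx issue": 6,
}


def shares(counts):
    """Return each category's fraction of the summed counts."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


total_excluded = sum(excluded.values())

for label, share in shares(excluded).items():
    print(f"{label:52s} {share:6.1%}")

print(f"Listed categories, summed: {total_excluded:,}")
print(f"Share of submitted URLs:   {total_excluded / SUBMITTED:.1%}")
```

The first category dominates, so that is where most of the effort should probably go.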

Discovered - currently not indexed

The majority of these are "Discovered - currently not indexed", which is defined as:

Discovered - currently not indexed: The page was found by Google, but not crawled yet. Typically, Google wanted to crawl the URL but this was expected to overload the site; therefore Google rescheduled the crawl. This is why the last crawl date is empty on the report.

This makes sense to me, as our individual dataset landing pages are very slow relative to what Google expects. I'm hoping we can hear back from the Google team about whether this is truly what's going on or if it's something else. We know from unofficial sources that Google's crawl infrastructure has two queues: one for fast sites, and a separate, lower-priority one for sites it had hoped were fast but turned out not to be. I'd guess we're in the latter.
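
One way to check the "slow landing pages" hunch ourselves would be to time a sample of landing page requests. This is only a rough sketch: the helper names are hypothetical, it measures the raw HTTP response rather than any client-side rendering, and it says nothing about what threshold Google actually uses.

```python
import statistics
import time
import urllib.request
from typing import Dict, Iterable


def fetch_seconds(url: str, timeout: float = 30.0) -> float:
    """Return wall-clock seconds to download one URL (needs network access)."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.perf_counter() - start


def summarize(samples: Iterable[float]) -> Dict[str, float]:
    """Summarize a non-empty collection of timings, in seconds."""
    values = sorted(samples)
    return {
        "n": len(values),
        "median": statistics.median(values),
        "max": values[-1],
    }
```

Calling `summarize(fetch_seconds(u) for u in urls)` on a small sample of dataset landing page URLs would give a first impression; the sample should stay small so the check itself doesn't add load to the site.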

Duplicate, submitted URL not selected as canonical

What we see here is that Google is selecting a URL like https://dataone.org/datasets/R1-x138-079-0042-010 as the canonical rather than the URL we submitted in our sitemap. These URLs are reported in our embedded JSON-LD, and they really are our canonical URLs. I think we should consider switching our sitemap implementation on the CNs to list https://dataone.org/datasets URLs instead.
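
As a rough illustration of what that switch could look like, here is a minimal sketch that builds a sitemap from dataset identifiers using the https://dataone.org/datasets/ pattern quoted above. It is not the CN sitemap implementation: the function names are hypothetical, and percent-encoding the whole identifier (including any "/") is an assumption.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
DATASET_BASE = "https://dataone.org/datasets/"  # URL pattern quoted in this issue


def dataset_url(identifier: str) -> str:
    """Build the canonical-style URL for one dataset identifier.

    Assumption: the identifier is percent-encoded in full, "/" included.
    """
    return DATASET_BASE + quote(identifier, safe="")


def build_sitemap(identifiers) -> str:
    """Return a sitemap document with one <url><loc> entry per identifier."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for identifier in identifiers:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = dataset_url(identifier)
    return ET.tostring(urlset, encoding="unicode")


print(build_sitemap(["R1-x138-079-0042-010"]))
```

The sitemap protocol caps a single file at 50,000 URLs, so at our scale a real implementation would need to split the output across multiple files referenced from a sitemap index.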

Other things

Aside from the above categories, some of the odd stuff we've seen is:

[Note: Please feel free to edit this issue to be more complete]

Next steps

  • Correctly configure robots.txt and web server configs on various hosts
    • Production
      • Verify robots.txt Allow + Sitemap directives, with Disallow on API routes (i.e., /metacat)
    • Non-production
      • Header Set X-Robots-Tag "noindex, nofollow"
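
The checks in the list above could be scripted along these lines. This is a sketch under assumptions: the robots.txt content and the sitemap URL are placeholders rather than our real config, the helper names are hypothetical, and `is_noindexed` ignores the optional user-agent prefix that X-Robots-Tag allows.

```python
from urllib import robotparser

# Hypothetical production robots.txt; the Sitemap URL is a placeholder.
PRODUCTION_ROBOTS = """\
User-agent: *
Disallow: /metacat
Allow: /
Sitemap: https://dataone.org/sitemap_index.xml
"""


def load_rules(text: str) -> robotparser.RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = robotparser.RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def is_noindexed(headers) -> bool:
    """Return True if an X-Robots-Tag header contains a noindex directive."""
    value = ""
    for name, header_value in headers.items():
        if name.lower() == "x-robots-tag":
            value = header_value
    directives = {part.strip().lower() for part in value.split(",")}
    return "noindex" in directives


rules = load_rules(PRODUCTION_ROBOTS)
# Dataset landing pages should be crawlable; API routes should not be.
print(rules.can_fetch("Googlebot", "https://dataone.org/datasets/R1-x138-079-0042-010"))
print(rules.can_fetch("Googlebot", "https://dataone.org/metacat/example"))
print(rules.site_maps())  # site_maps() needs Python 3.8+
```

Python's parser applies the first matching rule while Google applies the most specific one, so the Disallow line is listed before `Allow: /` to get the same answer from both. For non-production hosts, `is_noindexed` can be run against the response headers of any page to confirm the X-Robots-Tag header is actually being sent.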
