update file `crawlers-user-agents.data` · Issue #2638 · coreruleset/coreruleset · GitHub

update file crawlers-user-agents.data #2638


Closed
fzipi opened this issue Jun 18, 2022 · 17 comments

@fzipi
Member
fzipi commented Jun 18, 2022

There is no information on where this list comes from. We need to get some sources and update the list.

This looks like a good place to start: https://github.com/JayBizzle/Crawler-Detect (other packages rely on this one, so it is probably a very good source)

Others:

May be useful:

Query used: https://github.com/search?o=desc&q=crawler+bot&s=stars&type=Repositories

@azurit
Member
azurit commented Jul 23, 2022

@fzipi Are you sure about https://github.com/JayBizzle/Crawler-Detect? I was looking at the database for a while and found lots of FPs.

How do you suggest we process this data? Simply taking it all doesn't seem to be a good idea, but processing and verifying thousands of unknown records isn't realistic either. Also, full User-Agent headers are missing.

@fzipi
Member Author
fzipi commented Jul 23, 2022

Those links are examples from a quick search.

The class uses that raw file to create a big regex to look for user agents.

They are not required, just input to help us make the decision. If they are not useful, let's bring in others 😄
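
For illustration, a minimal bash sketch of that "big regex" idea: join the raw patterns from Crawlers.txt into one alternation and test a User-Agent against it. The sample UA, the use of GNU grep -P and the one-liner itself are assumptions for the example, not how Crawler-Detect is actually implemented (it builds the regex in PHP).

        # Sketch only: mimic the combined-regex approach in shell.
        # Requires GNU grep for -P (PCRE); the sample UA is made up.
        UA='curl/7.68.0'
        REGEX="($(curl -s https://raw.githubusercontent.com/JayBizzle/Crawler-Detect/master/raw/Crawlers.txt | paste -sd'|' -))"
        if printf '%s' "$UA" | grep -qiP "$REGEX"; then
            echo "UA matches the crawler list"
        fi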

@fzipi
Member Author
fzipi commented Aug 17, 2022

@azurit Any updates on this list?

@azurit
Member
azurit commented Aug 17, 2022

@fzipi Well, I don't fully understand what I am supposed to do. I probably should not just take a list of crawlers from a random site and use it to update our rules. But validating hundreds of User-Agents by hand does not sound very sane either.

@fzipi
Member Author
fzipi commented Dec 12, 2022

@azurit Nobody does. Your proposal is as good as mine, or anyone's. Do you want to lead this or not?

@azurit
Member
azurit commented Dec 12, 2022

@fzipi Sorry, I'm currently not able to.

@azurit azurit removed their assignment Dec 12, 2022
@dune73
Member
dune73 commented Dec 15, 2022

I'm pondering overhauling the entire UA lists. This would be covered too. Let's keep it open for a moment.

@dune73
Member
dune73 commented Dec 15, 2022

OK, here we go.

Status Quo / Existing rules

913110 PL1 critical scanners-headers.data         8 entries  issue:#2647
913120 PL1 critical scanners-urls.data            17 entries  issue:#2648

913100 PL1 critical scanners-user-agents.data     93 entries  issue:#2645
913101 PL2 critical scripting-user-agents.data    18 entries  issue:#2646
913102 PL2 critical crawlers-user-agents.data     27 entries  issue:#2638

The scanner headers and scanner URLs lists are minimal and severely outdated. I am inclined to drop both rules - or else expand them a great deal. Or we leave them as they are.

It would be unfortunate to drop 913110 since it can be used to fingerprint CRS on a server (by enumerating one of the HTTP headers listed there you can raise the anomaly score to arbitrary levels...)
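
To illustrate the fingerprinting concern, a hypothetical probe (the header name and target URL are only examples): send the same request with and without a header of the kind 913110 looks for and compare the responses.

        # If the plain request succeeds but the probed one is rejected (e.g. 403),
        # that strongly suggests CRS with rule 913110 is in front of the site.
        curl -s -o /dev/null -w 'plain:  %{http_code}\n' https://example.com/
        curl -s -o /dev/null -w 'probed: %{http_code}\n' -H 'X-Scanner: probe' https://example.com/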

The UA, however, is covered in 913100 and its two PL2 strict siblings.

A way forward

The distinction between scripting user agents and crawlers is arbitrary and hard to maintain if we want to source existing online lists, since they do not make this distinction. It is probably easier to drop this distinction and join the two categories.

The trickier problem is the security scanners in 913100. I do not think there is a comprehensive list for them. I have written to Simon Bennetts, who maintains a list of open source security scanners, but I doubt he would be able to help us out.

Either way, I think we should start from scratch and create two new UA rules:

  • 913130 PL1 Security Scanners
  • 913131 PL2 Scripting Agents, Crawlers and everything else we do not want

How to get there

As stated above, I think we should start from scratch. And I think it would be best to make the holistic source file for 913131 the master file.

I presume we will not get a separate source for the security scanners, so there would be a manual selection process based on 913131 to get 913130. This would also have the side effect of being cumulative: a UA triggering 913130 would also trigger its stricter sibling 913131.

There is a real chance the src files will also mark googlebot and friends, so we also need to maintain a list of exclusions. Two of the sources proposed below carry such a list, but we will have to add a list of our own as well, to be doubly sure we never list googlebot.

I would rather not keep a manual src list, though. If something new pops up, we should redirect it to one of the three sources, or to all of them.

Sources for the new holistic source file

- Mitchell Krogza: Apache ultimate bad bot list (638 stars)
        mitchellkrog@gmail.com
        https://github.com/mitchellkrogza/apache-ultimate-bad-bot-blocker
        https://raw.githubusercontent.com/mitchellkrogza/apache-ultimate-bad-bot-blocker/master/Apache_2.4/custom.d/globalblacklist.conf

        curl https://raw.githubusercontent.com/mitchellkrogza/apache-ultimate-bad-bot-blocker/master/Apache_2.4/custom.d/globalblacklist.conf -s | grep BrowserMatchNoCase | grep bad_bot | cut -d\" -f2 | cut -b7- | sed -e "s/(?:.*//" | sed -e "s/\\\ / /g"
        
        Exclusions:
       	Same file marked with good_bot
 
- Jay Bizzle: Crawler Detect (1.7K stars)
        m@rkbee.ch
        https://github.com/JayBizzle/Crawler-Detect
        https://raw.githubusercontent.com/JayBizzle/Crawler-Detect/master/raw/Crawlers.txt

        Exclusions:
                https://raw.githubusercontent.com/JayBizzle/Crawler-Detect/master/raw/Exclusions.txt

- Martin Monperrus: Crawler User Agents (904 stars)
        martin.monperrus@gnieh.org
        https://github.com/monperrus/crawler-user-agents
        https://raw.githubusercontent.com/monperrus/crawler-user-agents/master/crawler-user-agents.json

- Loadkpi: Crawler Detect (99 stars)
        https://github.com/loadkpi/crawler_detect
        A Ruby gem based on JayBizzle's Crawler-Detect above

        -> uninteresting

- Nicolas Mure: CrawlerDetectBundle (23 stars, stale repo)

        -> uninteresting

From src list to reconciled list

graph TD;
      Krogza:UltimateBadBots-->SrcList
      JayBizzle:CrawlerDetect-->SrcList
      Monperrus:CrawlerUAs-->SrcList
      Krogza:good_bots-->ExclusionList
      JayBizzle:Exclusions-->ExclusionList
      CRS:ManualExclusionList-->ExclusionList
      SrcList-->ReconciledList
      ExclusionList-->ReconciledList
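
A minimal sketch of the reconciliation step shown in the diagram; the file names are hypothetical and the sources are assumed to be already extracted to plain text (one entry or regex per line).

        # Merge the three source lists, then drop everything matched by the
        # combined exclusion regexes (case-insensitive) to get the reconciled list.
        cat krogza.txt jaybizzle.txt monperrus.txt | sort -u > srclist.txt
        cat krogza-good-bots.regex jaybizzle-exclusions.regex crs-manual-exclusions.regex > exclusions.regex
        grep -viE -f exclusions.regex srclist.txt > reconciled.txt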

@fzipi
Member Author
fzipi commented Dec 15, 2022

Excellent write-up.

  • I agree with the approach.
  • Somehow this looks like the start of something more: "There is a certain risk of adding....", but maybe I'm wrong and the initial caps confused me there.
  • A reasonable curl + sort + grep + uniq pipeline will probably get us to a unified list.
  • The crawler/exclusion lists contain regular expressions. Are we going to use these lists as regexes or with pmFromFile?

@dune73
Member
dune73 commented Dec 15, 2022

Thank you for the confirmation.

I have meanwhile dropped the "There is a certain risk ..." sentence. It was an edit error; there was no additional thought behind that broken fragment.

I am thinking of a multifaceted curl + grep + sed + cut + awk monster to come up with the SrcList, and then using egrep to apply the regexes found in the exclusion lists to turn it into the ReconciledList. The final result should be consumed with pmFromFile.
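
As a rough analogy for that last point (a sketch, not the actual rule): pmFromFile treats every line of the data file as a literal, case-insensitive phrase to look for in the target, which in shell terms behaves like a fixed-string grep rather than a regex match. The file name and sample UA below are assumptions.

        # Approximate the @pmFromFile semantics: every line of reconciled.txt is a
        # literal, case-insensitive substring to search for in the User-Agent.
        UA='Mozilla/5.0 (compatible; examplebot/1.0)'
        if printf '%s' "$UA" | grep -qiF -f reconciled.txt; then
            echo "UA would be flagged by the pmFromFile rule"
        fi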

@theseion
Contributor

Sounds great! Thanks @dune73!

@dune73
Member
dune73 commented Dec 16, 2022

Simon Bennetts responded and the response is not very comforting:

You would probably have to run them and check what they send by default :/
And you should probably do that whenever a new version is released.
Unless you can find where they define it in the code of course ;)

He goes on to explain that ZAP changes its UA with every major release, and that it resembles a browser.

I think this actually confirms my proposal from above, with 913131 being the new workhorse at PL2 based on the automatically generated list, and 913130 being a selection done by hand.

@dune73
Member
dune73 commented Dec 16, 2022

Update: I have a minimal version covering "Apache ultimate bad bot list" and "Crawler Detect" - 1731 entries - and deployed it on netnea.com for testing purposes.

No exclusions yet, but I would like to get an idea of FPs / guidance for manual exclusions.

@dune73
Member
dune73 commented Dec 19, 2022

Still no exclusions, but this monster comes up with a list of >2K user agents:

        CURL_OPTIONS="--silent"
        (
           # Source 1: Apache ultimate bad bot blocker. Extract the bad_bot names
           # from the Apache config, strip regex leftovers and unescape spaces.
           curl $CURL_OPTIONS https://raw.githubusercontent.com/mitchellkrogza/apache-ultimate-bad-bot-blocker/master/Apache_2.4/custom.d/globalblacklist.conf -s | \
           grep BrowserMatchNoCase | \
           grep bad_bot | \
           cut -d\" -f2 | cut -b7- | \
           sed -e "s/(?:.*//" -e "s/\\\ / /g"

           # Source 2: JayBizzle Crawler-Detect raw pattern list. Split alternations
           # and collapse the more complex regexes into plain keywords.
           curl $CURL_OPTIONS https://raw.githubusercontent.com/JayBizzle/Crawler-Detect/master/raw/Crawlers.txt | sed \
            -e "/.*bot|crawl.*/s/[(|]/\n/g" \
            -e "/^\[a-z0-9.*/d" \
            -e "s/\^docker.*/docker/"    \
            -e "s/\^NG.*/NG/"     \
            -e "s/\^Ruby.*/Ruby/" \
            -e "s/\^VSE.*/VSE/"   \
            -e "s/\^XRL.*/XRL/"   \
            -e "s/^Aprc.*/Aprc/"  \
            -e "s/^CAAM.*/CAAM/"  \
            -e "s/^centuryb.o.t9.*/centuryb.o.t9/" \
            -e "s/^Daum.*/Daum/"   \
            -e "s/^developers.*google.*/developers.google/" \
            -e "s/^Drupal.*/Drupal/"  \
            -e "s/^Fetch.*/Fetch/"    \
            -e "s/^Fever.*/Fever/"    \
            -e "s/^Go .*/Go/"     \
            -e "s/HAA.A..RTLAND.*/HAARTLAND\nHAAARTLAND/"  \
            -e "s/^IPS.*/IPS/"    \
            -e "s/NING../NGIN\//"   \
            -e "s/newspaper../newspaper\//"   \
            -e "s/^PTST.*/PTST/"   \
            -e "s/^Realplayer%20/Realplayer /"  \
            -e "s/^Sitemap.*/Sitemap/"   \
            -e "s/^Wallpapers.*/Wallpapers/"  \
            -e "s/^xpymep.*/xpymep/"  \
            -e "s/^Y!J-.*/Y!J-/"   \
            -e "s/^ YLT/YLT/"   \
            -e "s/^Java.*/Java/"    \
            -e "s/^NewsBlur.*/NewsBlur/"  \
            -e "s/Yandex(.*/Yandex/"

           # Source 3: monperrus crawler-user-agents JSON. Pull the "pattern"
           # fields and simplify them.
           curl $CURL_OPTIONS -s "https://raw.githubusercontent.com/monperrus/crawler-user-agents/master/crawler-user-agents.json" | grep pattern | cut -d\" -f4 | sed \
            -e "s/Ahrefs.*/Ahrefs/" \
            -e "s/AdsBot-Google.*/AdsBot-Google/" \
            -e "s/Bark.rR.owler/BarkRowler/" \
            -e "s/BlogTraffic.*/BlogTraffic/" \
            -e "s/.Cc.urebot/Curebot/" \
            -e "s/Livelap.bB.ot/LivelapBot/" \
            -e "s/.pP.ingdom/pingdom/" \
            -e "s/.*PTST/PTST/" \
            -e "s/Mediapartners.*/Mediapartners/" \
            -e "s/S.eE..mM.rushBot/SemrushBot/" \
            -e "s/.*sentry/sentry/" \
            -e "s/..\.uptime..\./.uptime./" \
            -e "s/\[wW\]get/wget/"

           # Final pass below: strip leftover regex escaping and anchors,
           # then sort, lowercase and deduplicate the combined list.
        ) | sed \
                -e "s/\\\\\//\//" \
                -e "s/\\\././g" \
                -e "s/\\\\\\\//g" \
                -e "s/\\\\\././" \
                -e "s/\\\\(/(/" \
                -e "s/\\\\)/)/" \
                -e "s/^[\^]//" \
                -e "s/\$$//" \
        | sort | tr "A-Z" "a-z" | uniq

There is a ton of false positives now, and also entries that I really wonder whether we should block at PL2. Maybe we need a PL3 strict sibling with everything, and then stripped-down versions for PL2 and PL1. That would be

  • 10% at PL1
  • 10% + 80% at PL2
  • 10% + 80% + 10% at PL3

So the majority would be identified at PL2 (and again at PL3, since the siblings are cumulative). Every automation UA out there would be detected at PL3.

Well, every one, but maybe with the exception of the major search engines.

@dune73
Member
dune73 commented Feb 27, 2023

For the record: I picked up work on this again. It's just a lot of work, so it takes time.

@dune73
Member
dune73 commented Apr 5, 2023

Interesting list of "verified" bots that could be used to govern the exceptions to the lists posted above: https://radar.cloudflare.com/traffic/verified-bots

@dune73
Member
dune73 commented Jun 23, 2023

Fixed via #3236

@dune73 dune73 closed this as completed Jun 23, 2023