update file crawlers-user-agents.data
#2638
There is no information on where this list comes from. We need to get some sources and update the list.
This looks like a good place to start: https://github.com/JayBizzle/Crawler-Detect (other packages rely on this one, so it is probably a very good source)
Others:
May be useful:
Query used: https://github.com/search?o=desc&q=crawler+bot&s=stars&type=Repositories
Comments
@fzipi Are you sure about https://github.com/JayBizzle/Crawler-Detect? I was looking at the database for a while and found lots of FPs: How do you suggest we process this data? Simply taking it all doesn't seem to be a good idea, but processing and verifying thousands of unknown records isn't realistic either. Also, full |
Those links are examples from a quick search. The class uses that raw file to create a big regex to look for user-agents. They are not required, just input for us to make a decision. If they are not useful, let's bring others 😄 |
@azurit Any updates on this list? |
@fzipi Well, I don't fully understand what I am supposed to do. I probably should not just take a list of crawlers from a random site and use it to update our rules, but validating hundreds of User-Agents by hand does not sound very sane either. |
@azurit Nobody does. Your proposal is as good as mine, or anyone else's. Do you want to lead this or not? |
@fzipi Sorry, I'm currently not able to. |
I'm pondering overhauling the entire UA lists. This would be covered too. Let's keep it open for a moment. |
OK, here we go.
Status Quo / Existing rules
The headers and scanners lists are minimal and severely outdated. I am inclined to drop both rules - or we expand them a great deal. Or we leave them as they are. It would be unfortunate to drop 913110, since it can be used to fingerprint CRS on a server (by enumerating the HTTP headers listed there you can raise the anomaly score to arbitrary levels...). The UA, however, is covered in 913100 and its two PL2 strict siblings.
A way forward
The distinction between scripting user agents and crawlers is arbitrary and hard to maintain if we want to source existing online lists, since they do not make this distinction. It is probably easier to drop the distinction and join the two categories. The trickier problem is the security scanners in 913100. I do not think there is a comprehensive list for them. I have written to Simon Bennetts, who maintains a list of open source security scanners, but I doubt he will be able to help us out. Either way, I think we should start from scratch and create two new UA rules:
913130: security scanner UAs, a manual selection
913131: a holistic list of crawler, bot and scripting UAs, generated from external sources
How to get there
As stated above, I think we should start from scratch, and I think it would be best to make the holistic source file for 913131 the master file. I presume we will not get a separate source for the security scanners, so there would be a manual selection process based on 913131 to get 913130. This would also have the side effect of being cumulative: a UA triggering 913130 would also trigger its stricter sibling 913131 (see the subset check sketched after the graph below). There is a real chance the src files will also mark googlebot and friends, so we also need to maintain a list of exclusions. Two of the sources proposed below carry such a list, but we will have to add a list of our own as well, to be double-sure we never list googlebot. I would rather not keep a manual src list, though. If something new pops up, we should redirect it to one of the 3 sources, or to all of them.
Sources for the new holistic source file
Krogza: Apache Ultimate Bad Bot Blocker
JayBizzle: Crawler-Detect
Monperrus: crawler-user-agents
From src list to reconciled list
graph TD;
Krogza:UltimateBadBots-->SrcList
JayBizzle:CrawlerDetect-->SrcList
Monperrus:CrawlerUAs-->SrcList
Krogza:good_bots-->ExclusionList
JayBizzle:Exclusions-->ExclusionList
CRS:ManualExclusionList-->ExclusionList
SrcList-->ReconciledList
ExclusionList-->ReconciledList
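As a rough illustration of the cumulative property mentioned above - every entry of the hand-picked scanner list must also appear in the holistic list - a check along these lines could run whenever the lists are regenerated. The file names are only placeholders, not a decision on the final layout:

```bash
#!/bin/bash
# Sketch of a consistency check for the cumulative rule design:
# every UA in the hand-picked scanner list (913130) must also appear
# in the holistic list (913131). File names are placeholders.
comm -23 <(sort -u scanners-user-agents.data) \
         <(sort -u crawlers-user-agents.data)
# comm -23 prints lines that exist only in the first file;
# any output therefore means the scanner list is not a subset.
```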
|
Excellent write-up.
|
Thank you for the confirmation. I have meanwhile dropped the "There is a certain risk ..." sentence. It was an edit error; there was no additional thought in that broken sentence. I am thinking of a multifaceted curl + grep + sed + cut + awk monster to come up with the SrcList, and then using egrep to apply the regexes found in the exclusion lists to turn this into the ReconciledList. The final result should be consumed with pmFromFile. |
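Not a final implementation, just a sketch of what such a pipeline could look like. All file names are placeholders standing in for locally saved copies of the three sources and the exclusion lists:

```bash
#!/bin/bash
# Sketch of the SrcList -> ReconciledList pipeline (placeholder file names).

# 1. Merge the raw source lists into one candidate list (SrcList):
cat bad-user-agents.list crawler-detect.list crawler-user-agents.list \
  | tr -d '\r' \
  | sed -e '/^[[:space:]]*$/d' -e '/^#/d' \
  | sort -u > srclist.txt

# 2. Merge the exclusion patterns (good bots, Crawler-Detect exclusions,
#    plus our own manual list) and drop every matching entry:
cat good-bots.list crawler-detect-exclusions.regex crs-manual-exclusions.regex \
  | sed -e '/^[[:space:]]*$/d' -e '/^#/d' \
  | sort -u > exclusions.txt
grep -E -i -v -f exclusions.txt srclist.txt > reconciled-list.txt

# 3. The reconciled list is what the new rule would read via pmFromFile:
wc -l srclist.txt reconciled-list.txt
```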
Sounds great! Thanks @dune73! |
Simon Bennetts responded and the response is not very comforting:
He goes on to explain that ZAP changes the UA with every major release, and that it resembles a browser. I think this actually confirms my proposal from above, with 913131 being the new workhorse at PL2 with the automatically generated list, and 913130 being a selection done by hand. |
Update: I have a minimal version covering "Apache ultimate bad bot list" and "Crawler Detect" - 1731 entries - and deployed it on netnea.com for testing purposes. No exclusions yet, but I would like to get an idea of FPs / guidance for manual exclusions. |
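One quick way to get a first idea of FP candidates: grep the generated list for well-known legitimate bots and clients that must never end up in it. Both the file name and the list of "known good" names below are assumptions, meant only as a starting point:

```bash
#!/bin/bash
# Sketch: flag entries in the generated list that look like well-known
# legitimate bots, as candidates for the manual exclusion list.
# File name and the "known good" names are placeholders.
for good in googlebot bingbot duckduckbot applebot yandexbot slurp; do
  grep -i -n -- "$good" crawlers-user-agents.data
done
```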
Still no exclusions, but this monster comes up with a list of >2K user agents:
There is a ton of false positives now, and also entries where I really wonder whether they should be blocked at PL2. Maybe we need a PL3 strict sibling with everything, and then stripped-down versions for PL2 and PL1. That would be:
So the majority would be identified at PL2 (and then again at PL3, given the cumulative mechanics). Every automation UA out there would be detected at PL3 - well, every one, maybe with the exception of the major search engines. |
For the record: I picked up work on this again. It's just a lot of work, so it takes time. |
Interesting list of "verified" bots that could be used to govern the exceptions to the lists posted above: https://radar.cloudflare.com/traffic/verified-bots |
Fixed via #3236 |