feat: scanner overhaul: new rules, new data files by dune73 · Pull Request #3202 · coreruleset/coreruleset · GitHub



Closed
wants to merge 15 commits

Conversation

dune73 (Member) commented Apr 24, 2023

It's been a lot of work, but here is the PR that brings new rules to inspect the user-agent of clients:

We have 3 existing rules focusing on the user agent. 913100 and two PL2 stricter siblings:

  • 913100 PL1 Found User-Agent associated with security scanner, based on scanners-user-agents.data
  • 913101 PL2 Found User-Agent associated with scripting/generic HTTP client, based on scripting-user-agents.data
  • 913102 PL2 Found User-Agent associated with web crawler/bot, based on crawlers-user-agents.data

When we base our rules on existing online keyword lists, the distinction between scripting/generic clients, crawlers/bots and other automated agents is no longer possible. The existing sources do not make this distinction, and where they do group agents, the groups differ from source to source: the taxonomies are incompatible.

I tried to maintain certain levels of badness among the agents, but it's too much manual work. So the new approach is mostly automatic. The core is a list of automated agents that we merge from three different online sources. Out of this full list of automated agents, there is a manual list of security scanners like nmap, zgrab, nikto, etc. That list has to be maintained by hand. It's a replacement for 913100, but it's an overhaul since 913100 has been horribly outdated.
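The merge step described above could be sketched roughly like this (a sketch only; the file names and helper names are hypothetical, not the actual generator script from the PR):

```python
# Sketch of the merge: three online keyword lists are combined into
# one set of automated agents; a hand-maintained subset of security
# scanners is kept separately for the PL1 rule. File names are made up.

def load_list(path):
    """Read one keyword per line; skip blanks and '#' comments."""
    with open(path, encoding="utf-8") as f:
        return {
            line.strip().lower()
            for line in f
            if line.strip() and not line.lstrip().startswith("#")
        }

def merge_sources(paths):
    """Union of all source lists, normalized to lowercase."""
    merged = set()
    for path in paths:
        merged |= load_list(path)
    return merged

# automated = merge_sources(["source1.data", "source2.data", "source3.data"])
# scanners  = load_list("manual-security-scanners.data")  # hand-maintained
```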

At PL2 we use the automated list mentioned above minus a list of acceptable user agents. This list of acceptable user agents comes from two online sources plus a manual list.

Then at PL4 we block all automated agents including the google bot and the let's encrypt agent.

Let me repeat: we no longer distinguish between crawlers, HTTP client libraries and site security audit services like SSLLabs. Either they are benign and popular, in which case they are part of an allow-list at PL2 in order to reduce false positives, or they are not overly popular or not benign, in which case they will trigger at PL2. I attempted to differentiate between scrapers and search engine bots, and to distinguish security service tools, but I eventually gave up. It's a list of 1,900 UAs after all.

At PL4 finally, we trigger on every trace of an automated agent. This will hit GoogleBot as well as Let's Encrypt. We have to make this transparent.

Now the problem: the new manual list of security scanners is a replacement for 913100, but 913101 and 913102 are simply gone. The new rules at PL2 and PL4 are not replacements; they are still stricter siblings to a certain extent, but only to a certain extent. New rule IDs are not obvious.

I see four possible approaches:

  • Fuck you approach: 913100 PL1 stays, 913101 PL2 is replaced with list of non-acceptable agents, 913102 PL2 moves to PL4 to spot all automated agents.
  • Partial move 1: 913100 PL1 stays, 913101/2 disappear. New rules as 913103 PL2 and 913104 PL4
  • Partial move 2: 913100 PL1 stays, 913101/2 disappear. New rules as 913130 PL2 and 913131 PL4
  • Full move: 91310x are removed, new rules as 913130 PL1, 913131 PL2 and 913132 PL4

They all have their merits. For the PR, I opted for the full move option. I am open to discuss this though.

Automated agents sources:

The 3 online sources had to be transformed heavily in order to make them usable with the @pmFromFile operator and in order to avoid near-duplicates.
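That kind of transformation might look roughly like this (assumed normalization steps; the PR's actual scripts may differ):

```python
import re

def normalize(entry):
    """Fold case and spacing variants so near-duplicates collapse
    into a single @pmFromFile keyword."""
    entry = entry.strip().lower()
    return re.sub(r"\s+", " ", entry)

def transform(lines):
    """Normalize and de-duplicate while keeping first-seen order."""
    seen, out = set(), []
    for line in lines:
        key = normalize(line)
        if key and key not in seen:
            seen.add(key)
            out.append(key)
    return out
```

Since @pm matching is case-insensitive anyway, lowercasing the keywords loses nothing and makes duplicates easy to spot.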

Benign agents sources:

After retrieving the online sources and looking through them by hand several times, I also double-checked them against the commercial list of whatsmybrowser.com, a 78GB list of real-world user-agent strings. In fact, about 1/4 of the acceptable user agents are not in the whatsmybrowser database. I'm keeping them on the list nevertheless.

So here are the 3 new data files:

  • user-agents-security-scanners.data (PL1, 32 entries)
  • user-agents-non-acceptable-automated-agents.data (PL2, 1790 entries)
  • user-agents-automated-agents.data (PL4, 1933 entries)

And the new rules:

  • 913130 PL1 Found User-Agent associated with security scanner
  • 913131 PL2 Found User-Agent that is associated with non-acceptable automated user agent
  • 913132 PL4 Found User-Agent that is associated with automated user agent
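In CRS style, the PL1 rule would look roughly like this (a sketch, not the literal rule from the PR; the id, msg and data file name come from the list above, while the rest of the action list is illustrative of CRS conventions):

```apacheconf
SecRule REQUEST_HEADERS:User-Agent "@pmFromFile user-agents-security-scanners.data" \
    "id:913130,\
    phase:1,\
    block,\
    capture,\
    t:none,t:lowercase,\
    msg:'Found User-Agent associated with security scanner',\
    logdata:'Matched Data: %{TX.0} found within %{MATCHED_VAR_NAME}: %{MATCHED_VAR}',\
    severity:'CRITICAL',\
    setvar:'tx.inbound_anomaly_score_pl1=+%{tx.critical_anomaly_score}'"
```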

The PR is not yet ready to be merged. The scripts and manual source files need to be cleaned up and added before it's ready. And then the tests, of course. I'm labeling accordingly.

On Apr 24, 2023, dune73 added the labels 👀 Needs action and ⚠️ do not merge (additional work or discussion is needed despite passing tests).
RedXanadu (Member) commented Apr 24, 2023

Wow! What a list 😀

These changes appear to be large enough to deserve a new set of rule IDs. On the other hand, the meaning of PL 1 rule 913100 seems to have remained the same, so it's also arguable to keep that rule ID.

Both of those options seem sensible, IMO (full move or partial move).

theMiddleBlue (Contributor)

that's great!!

I'm writing a couple of reviews

theMiddleBlue (Contributor) commented Apr 24, 2023

I've created a script to query my production logs (over 9,922,403 entries) for each entry of the file user-agents-non-acceptable-automated-agents.data, looking for legitimate UAs. I'm going through this list and I'll open a review if I find some legit UAs:

[0s] - ELASTICSEARCH Found 309 results for 007ac9 crawler
[0s] - ELASTICSEARCH Found 389 results for 008
[2s] - ELASTICSEARCH Found 6403 results for acebookexternalhit
[2s] - ELASTICSEARCH Found 2 results for adbeat
[4s] - ELASTICSEARCH Found 325 results for ahrefs
[6s] - ELASTICSEARCH Found 47 results for ant.com
[8s] - ELASTICSEARCH Found 3 results for aria2
[11s] - ELASTICSEARCH Found 27 results for axios
[14s] - ELASTICSEARCH Found 2 results for bitlybot
[15s] - ELASTICSEARCH Found 4 results for bloglovin
[23s] - ELASTICSEARCH Found 44 results for coccoc
[26s] - ELASTICSEARCH Found 414 results for cortex
[26s] - ELASTICSEARCH Found 2764 results for craw
[28s] - ELASTICSEARCH Found 217 results for custo
[29s] - ELASTICSEARCH Found 2 results for dcrawl
[31s] - ELASTICSEARCH Found 26 results for disco
[33s] - ELASTICSEARCH Found 10000 results for dotbot
[36s] - ELASTICSEARCH Found 1 results for endo
[37s] - ELASTICSEARCH Found 309 results for e..ventures investment crawler
[37s] - ELASTICSEARCH Found 1 results for exabot
[39s] - ELASTICSEARCH Found 1 results for facebookscraper
[39s] - ELASTICSEARCH Found 309 results for fast enterprise crawler
[42s] - ELASTICSEARCH Found 318 results for fetch
[47s] - ELASTICSEARCH Found 309 results for gluten free crawler
[47s] - ELASTICSEARCH Found 10000 results for go
[51s] - ELASTICSEARCH Found 7 results for google-xrawler
[52s] - ELASTICSEARCH Found 1 results for go!zilla
[52s] - ELASTICSEARCH Found 2 results for grammarly
[55s] - ELASTICSEARCH Found 4 results for heritrix
[58s] - ELASTICSEARCH Found 119 results for httpx
[59s] - ELASTICSEARCH Found 11 results for hubspot
[59s] - ELASTICSEARCH Found 11 results for hubspot
[59s] - ELASTICSEARCH Found 309 results for ias crawler
[64s] - ELASTICSEARCH Found 2 results for ips
[64s] - ELASTICSEARCH Found 2 results for ips-agent
[65s] - ELASTICSEARCH Found 1456 results for iubenda-radar
[66s] - ELASTICSEARCH Found 2 results for jaunt
[66s] - ELASTICSEARCH Found 5 results for java
[67s] - ELASTICSEARCH Found 1 results for jobboerse
[70s] - ELASTICSEARCH Found 1474 results for khttp
[72s] - ELASTICSEARCH Found 1 results for libwww
[73s] - ELASTICSEARCH Found 37 results for liferea
[73s] - ELASTICSEARCH Found 2 results for linkbot
[74s] - ELASTICSEARCH Found 5 results for linkfluence
[77s] - ELASTICSEARCH Found 205 results for lua-resty-http
[78s] - ELASTICSEARCH Found 406 results for magpie-crawler
[78s] - ELASTICSEARCH Found 1172 results for mail
[83s] - ELASTICSEARCH Found 112 results for miniflux
[83s] - ELASTICSEARCH Found 435 results for mixdata dot com
[83s] - ELASTICSEARCH Found 1 results for moblie safari
[84s] - ELASTICSEARCH Found 40 results for monit
[122s] - ELASTICSEARCH Found 3 results for netcraft
[122s] - ELASTICSEARCH Found 309 results for netestate ne crawler
[123s] - ELASTICSEARCH Found 309 results for neticle crawler
[127s] - ELASTICSEARCH Found 40 results for nmap
[127s] - ELASTICSEARCH Found 3 results for not
[129s] - ELASTICSEARCH Found 4 results for nyu
[129s] - ELASTICSEARCH Found 10000 results for obot
[129s] - ELASTICSEARCH Found 1474 results for okhttp
[130s] - ELASTICSEARCH Found 80 results for omsc
[131s] - ELASTICSEARCH Found 3 results for open source rss
[132s] - ELASTICSEARCH Found 25 results for owler
[136s] - ELASTICSEARCH Found 14 results for pinterest.com
[136s] - ELASTICSEARCH Found 4 results for pip
[141s] - ELASTICSEARCH Found 119 results for python-httpx
[141s] - ELASTICSEARCH Found 70 results for python-urllib
[143s] - ELASTICSEARCH Found 2 results for radian6
[146s] - ELASTICSEARCH Found 309 results for rma
[148s] - ELASTICSEARCH Found 309 results for safesearch microdata crawler
[149s] - ELASTICSEARCH Found 3 results for scrapy
[152s] - ELASTICSEARCH Found 3822 results for semrush
[154s] - ELASTICSEARCH Found 1094 results for serpstatbot
[154s] - ELASTICSEARCH Found 149 results for seznam
[156s] - ELASTICSEARCH Found 10000 results for siteexplorer
[157s] - ELASTICSEARCH Found 21 results for slack
[158s] - ELASTICSEARCH Found 48 results for smtbot
[160s] - ELASTICSEARCH Found 64 results for spaziodati
[167s] - ELASTICSEARCH Found 5 results for theoldreader.com
[171s] - ELASTICSEARCH Found 1 results for twingly
[171s] - ELASTICSEARCH Found 651 results for ubermetrics-technologies
[181s] - ELASTICSEARCH Found 1484 results for webmeup-crawler
[182s] - ELASTICSEARCH Found 14 results for webprosbot
[182s] - ELASTICSEARCH Found 14 results for webpros.com
[187s] - ELASTICSEARCH Found 301 results for word
[188s] - ELASTICSEARCH Found 2 results for wpscan
[190s] - ELASTICSEARCH Found 361 results for y!j
[192s] - ELASTICSEARCH Found 131 results for zgrab
[193s] - ELASTICSEARCH Found 2 results for zoombot
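A query helper for such a script might look like this (a sketch; the index field name and the use of a wildcard query are assumptions about theMiddleBlue's setup, and `build_query` is a hypothetical helper):

```python
def build_query(keyword, field="request.headers.user-agent"):
    """Build an Elasticsearch query body that approximates @pm
    semantics: a substring match on the User-Agent field."""
    return {
        "query": {
            "wildcard": {
                # Lowercasing assumes the field is analyzed lowercase.
                field: {"value": f"*{keyword.lower()}*"}
            }
        },
        "size": 0,
        # Cap hit counting, which would explain the repeated
        # "Found 10000 results" lines in the output above.
        "track_total_hits": 10000,
    }
```

The body would then be sent once per keyword, e.g. via the Elasticsearch search API, and the reported total logged next to the keyword.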

dune73 (Member, Author) commented Apr 24, 2023

@theMiddleBlue I would be very interested in stuff that is being flagged at PL2 when it should not be. There is a lot of manual work in that allow-list that separates the PL2 list from the PL4 list.

If you happen to know the UA of a security scanner not included in the PL1 list, then please list it. Happy to expand.

theMiddleBlue (Contributor)

@theMiddleBlue I would be very interested in stuff that is being flagged at PL2 when it should not. There is a lot of manual work in that allow-list that separates the PL2 from the PL4 list.

If you happen to know the UA of a security scanner not included in the PL1 list, then please list it. Happy to expand.

sure, I've just finished reviewing rules/user-agents-non-acceptable-automated-agents.data. Basically, where I've added a review comment, it's because I would not block it at PL2 or there are FPs.

I can do the same with the other lists if needed

dune73 (Member, Author) commented Apr 24, 2023

OK, thanks for the list. Much needed review.

The user-agents-non-acceptable-automated-agents.data is the crucial file, since it's based on a more or less manual review of a 1.9K user-agents file.

semrush, ahrefs and go were meant to be eliminated from the file. Added to the allow-list again. Also added java based on your review.

What other UAs do you think should not be blocked / detected at PL2? I mean zgrab is usually not benign...

theMiddleBlue (Contributor) commented Apr 24, 2023

Based on my very personal POV (running almost everything at PL2), all 49 entries I've commented on here should be removed from the non-acceptable list:

https://github.com/coreruleset/coreruleset/pull/3202/files#diff-eb99983edad246a27f60b464ee0a6bad70ec1fcb6dc6c4233ff0c6101e585e4b

OK, I can't find a way to link the list of reviews on GitHub. Basically, loading the diff for user-agents-non-acceptable-automated-agents.data should show all 49 review comments on it.

dune73 (Member, Author) commented Apr 24, 2023

I can't find your review. All I see is your elastic excerpt and that's much more than 49 items.

(Adding the source files and the generator scripts to the PR as we speak.)

theMiddleBlue (Contributor)

to see them, I click on "Files Changed" tab on this page,
then I scroll until user-agents-non-acceptable-automated-agents.data file
then I click on "load diff"

and you should see this:

[screenshot of the review comments]

dune73 (Member, Author) commented Apr 24, 2023

Negative. I do not see this. All I see is my new data file in green. Also tried to reload. Maybe you photoshopped this. :)

logdata:'Matched Data: %{TX.0} found within %{MATCHED_VAR_NAME}: %{MATCHED_VAR}',\
tag:'application-multi',\
tag:'language-multi',\
tag:'platform-multi',\
tag:'attack-reputation-scripting',\
Review comment (Contributor):

can we still have a "category" tag for this rule? Many integrators benefit from tags for stats and reports

quick-crawler
quiterss
quora link preview
radian6
Review comment (Contributor):

RSS Reader

"Top 50 values of request.headers.user-agent","Count of records"
"R6_CommentReader(www.radian6.com/crawler)",1
"R6_FeedFetcher(www.radian6.com/crawler)",1

theMiddleBlue (Contributor)

Negative. I do not see this. All I see is my new data file in green. Also tried to reload. Maybe you photoshopped this. :)

lol, you were right I think I forgot to click on "send review"
so sorry!

dune73 (Member, Author) commented Apr 26, 2023

@theMiddleBlue Please take a peek at the comments to your review remarks where the issues are still open.

dune73 (Member, Author) commented Apr 26, 2023

Now working on the tests. Here is what happened to the old tests:

913100-1 UA: Havij -> block, no longer blocked by PL1, equivalent to new test 913131-1
913100-2 UA: Arachni -> block, equivalent to new test 913130-1
913100-3 UA: w3af -> block, equivalent to new test 913130-1
913100-4 UA: nessus -> block, equivalent to new test 913130-1
913100-5 UA: urlgrabber -> not block, no problem since we have nothing blocking this
913100-6 UA: Grabber -> block, no longer blocked at PL1, equivalent to new test 913131-1
913100-7 UA: ecairn-grabber -> no block, no problem, since it's not blocked at PL1 anymore

913101-1 UA: libwww-perl -> block
913101-2 UA: OWASP CRS test agent -> no block, replaced by new tests 913131-8 and 913132-8

913102-1 UA: blackwidow -> block, equivalent to new test 913131-1

dune73 (Member, Author) commented Apr 26, 2023

All new tests passed now.

Here are the new tests:

913130
        913130-1 Block Security Scanner
                - nikto
                        Mozilla/5.00 (Nikto/2.1.5) (Evasions:None) (Test:002942)
913131
        913131-1 Block Security Scanner
                - nikto
                        Mozilla/5.00 (Nikto/2.1.5) (Evasions:None) (Test:002942)
        913131-2 Block non-acceptable user agent from JayBizzle List
                - Goose
                        Goose/3.1.6 X-SiteSpeedApp-1
        913131-3 Block non-acceptable user agent from MichaelKrogza List
                - webbandit
                        webbandit/4.xx.0
        913131-4 Block non-acceptable user agent from MontPerrus List
                - RuxitSynthetic
                        Mozilla/5.0 (X11; Ubuntu; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81-1 Safari/537.36 RuxitSynthetic/1.0
        913131-5 Do not block acceptable user agent from CRS list
                - yisouspider
                        Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36
        913131-6 Do not block acceptable user agent from MichaelKrogza list
                - archive.org
                        Mozilla/5.0 (compatible; archive.org_bot +http://www.archive.org/details/archive.org_bot)
        913131-7 Do not block acceptable user agent from Cloudflare list
                - Seznambot
                        Mozilla/5.0 (compatible; SeznamBot/3.2-test3; +http://fulltext.sblog.cz/)
        913131-8 Do not block OWASP CRS test agent

913132
        913132-1 Block Security Scanner
                - nikto
                        Mozilla/5.00 (Nikto/2.1.5) (Evasions:None) (Test:002942)
        913132-2 Block non-acceptable user agent from JayBizzle List
                - Goose
                        Goose/3.1.6 X-SiteSpeedApp-1
        913132-3 Block non-acceptable user agent from MichaelKrogza List
                - webbandit
                        webbandit/4.xx.0
        913132-4 Block non-acceptable user agent from MontPerrus List
                - RuxitSynthetic
                        Mozilla/5.0 (X11; Ubuntu; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81-1 Safari/537.36 RuxitSynthetic/1.0
        913132-5 Block acceptable user agent from CRS list
                - yisouspider
                        Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36
        913132-6 Block acceptable user agent from MichaelKrogza list
                - archive.org
                        Mozilla/5.0 (compatible; archive.org_bot +http://www.archive.org/details/archive.org_bot)
        913132-7 Block acceptable user agent from Cloudflare list
                - Seznambot
                        Mozilla/5.0 (compatible; SeznamBot/3.2-test3; +http://fulltext.sblog.cz/)
        913132-8 Do not block OWASP CRS test agent

@@ -0,0 +1,123 @@
# This list is based on the following list:
# https://radar.cloudflare.com/traffic/verified-bots
Review comment (Contributor):

I'm not sure, but I think integrators can't have this list in their service/products because of the license

https://creativecommons.org/licenses/by-nc/4.0/

Am I wrong?

azurit (Member) commented May 8, 2023

I'm testing rule ID 913131 (data file user-agents-non-acceptable-automated-agents.data) for false positives, but I noticed that some keywords can be removed because they are matched by other keywords (see below).

While it seems OK to remove them for performance reasons, I'm not 100% sure about it, as we will lose some information (for example the keywords icbot and epicbot: both are matched by icbot itself, BUT they match completely different bots). I'm thinking about also generating the data files for @pm from a list of keywords as input (with all information), producing a performance-optimal output (without duplicate matches).

aboundex is in aboundexbot
acoon is in acoonbot
adbeat is in adbeat_bot
aibot is in molokaibot
anyevent is in anyevent-http/
b0t is in siteshooter b0t
bandit is in webbandit
blow is in blowfish
chlooe is in bot-pge.chlooe.com
collector is in feedzcollector
collector is in webimagecollector
collector is in www-collector-e
copier is in webcopier
cosmos is in cosmos4j.feedback
dsearch is in addsearchbot
evil is in devil
extractor is in linkextractorpro
extractor is in websiteextractor
foobot is in infoobot
frontpage is in msfrontpage
fuzz is in fuzz faster
fuzz is in jbrofuzz
fuzz is in wfuzz/
gigablast is in gigablastopensource
google-adwords is in google-adwords-instant
grabber is in eirgrabber
grabber is in pagegrabber
harvest is in nlnz_iaharvester
httrack is in winhttrack
hubspot is in hubspot-link-resolver
hubspot is in hubspot-link-resolver
icbot is in epicbot
icbot is in semanticbot
idbot is in gridbot
idbot is in hybridbot
infegy is in collection@infegy.com
jobboerse is in jobboersebot
lexibot is in alexibot
lighthouse is in chrome-lighthouse
linkbot is in rankactivelinkbot
linkdex is in linkdexbot
magnet is in whynder magnet
mail/ is in polymail/
megaindex is in megaindex.ru
mixnode is in mixnodecache
mr.4x3 is in mr.4x3 powered
netcraft is in netcraftsurveyagent
netresearch is in netresearchserver
ninja is in internet ninja
ninja is in notifyninja
not is in annotate_google
not is in blocknote.net
not is in cispa vulnerability notification
not is in downnotifier
not is in notifixious
not is in notifyninja
npm/ is in pnpm/
oncrawl is in ioncrawl
pageanalyzer is in retrevopageanalyzer
pagethin is in pagething
plumanalytics is in com.plumanalytics
pr-cy.ru is in a.pr-cy.ru
rankactive is in rankactivelinkbot
reaper is in the drop reaper
reaper is in webreaper
re-re is in re-re studio
ripper is in stripper
ripper is in webstripper
ripz is in siteripz
rocketcrawler is in lssrocketcrawler
rssbot is in linqiarssbot
rssbot is in naver blog rssbot
scanbot is in interfaxscanbot
scrapy is in redesscrapy
screaming is in screaming frog seo spider
seobility is in seobilitybot
seokicks is in seokicks-robot
seostar is in seostar..co
siphon is in email siphon
sitemap is in ultimate_sitemap_parser
sonic is in ranksonicsiteauditor
stripper is in webstripper
sucker is in image sucker
sucker is in site sucker
sucker is in sitesucker
sucker is in web sucker
sucker is in websucker
teleport is in teleportpro
trendsmap is in trendsmapresolver
turnitin is in turnitinbot
vigil is in sitevigil
voil is in voilabot
wallpapers is in wallpapershd
webdav is in microsoft-webdav-miniredir
wesee is in wesee:search
whack is in webwhacker
whack is in whacker
whacker is in webwhacker
widow is in blackwidow
xenu is in xenu link sleuth
yacy is in yacybot
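The redundancy check azurit describes boils down to a pairwise substring test (a sketch; the function name is made up):

```python
def redundant_keywords(keywords):
    """For @pm/@pmFromFile matching, a keyword that contains another
    keyword as a substring is redundant: any input matching the longer
    entry already matches the shorter one. Returns (short, long) pairs
    in the 'short is in long' format used above."""
    return [
        (short, long_)
        for short in keywords
        for long_ in keywords
        if short != long_ and short in long_
    ]
```

As azurit notes, dropping the longer entries is lossless for detection but loses the information about which bot actually matched.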

azurit (Member) commented May 12, 2023

Sending info about false positives for rule 913131 (data file user-agents-non-acceptable-automated-agents.data).

I was gathering data on one of my servers for about a week. I got 2800+ matched requests, which I processed by hand, BUT I caught only 136 different keywords from the data file. The false positives are split into groups. I used this format:

keyword (explanation)
 - example1 of matched User-Agent header
 - example2 of matched User-Agent header
 - ...
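Generating that report from (keyword, User-Agent) match pairs is straightforward (a sketch, assuming the pairs were already extracted from the audit log):

```python
from collections import defaultdict

def format_report(matches):
    """Group example User-Agent headers under each matched keyword,
    in the 'keyword / - example' layout described above."""
    grouped = defaultdict(list)
    for keyword, ua in matches:
        if ua not in grouped[keyword]:   # keep each example once
            grouped[keyword].append(ua)
    lines = []
    for keyword in sorted(grouped):
        lines.append(keyword)
        lines.extend(f" - {ua}" for ua in grouped[keyword])
    return "\n".join(lines)
```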

Matches which i consider as FPs for sure

zabbix (monitoring software)

  • User-Agent: Zabbix

not (too generic)

  • User-Agent: Mozilla/5.0 (Linux; Android 12; Mi Note 10 Lite) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Mobile Safari/537.36

mediapartners (Google AdSense Bot)

  • User-Agent: Mediapartners-Google
  • User-Agent: Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/537.36 (KHTML, like Gecko; Mediapartners-Google) Chrome/112.0.5615.142 Mobile Safari/537.36

developers.google (Google web preview renderer)

Chrome-Lighthouse (Google PageSpeed Insights)

  • User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4590.2 Safari/537.36 Chrome-Lighthouse

lighthouse (Google PageSpeed Insights)

  • User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4590.2 Safari/537.36 Chrome-Lighthouse

lcc (too generic)

  • User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E) QQBrowser/6.9.11079.201
  • User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C; .NET4.0E; 360SE)

WhatsApp (link preview for WhatsApp)

  • User-Agent: WhatsApp/2.23.8.76 A

google web preview (web preview renderer for Chrome browser)

  • User-Agent: Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; Google Web Preview) Version/3.1 Safari/525.13

via ggpht.com googleimageproxy (Gmail image openings anonymizer)

  • User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko Firefox/11.0 (via ggpht.com GoogleImageProxy)

microsoft office (MS Office link preview)

  • User-Agent: Microsoft Office/15.0 (Windows NT 6.3; Microsoft Outlook 15.0.5537; Pro)

microsoft outlook (MS Outlook link preview)

  • User-Agent: Microsoft Office/16.0 (Windows NT 10.0; Microsoft Outlook 16.0.16327; Pro)

googledocs (GoogleDocs link preview)

  • User-Agent: Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; +http://docs.google.com/)
  • User-Agent: GoogleDocs

snap url preview service (Snap link preview)

wp rocket (WordPress cache plugin preloader)

  • User-Agent: WP Rocket/Partial_Preload

kinza (alternative web browser)

  • User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36 Kinza/4.7.2

excel/ (MS Excel link preview)

  • User-Agent: Microsoft Office Excel/16.72.409 (Mac OS/12.3; Desktop; sk-SK; NonAppStore; Apple/MacBookAir10,1)

macoutlook/ (Ms Outlook for Mac link preview)

  • User-Agent: MacOutlook/16.72.23043

outlook-ios (MS Outlook for iOS link preview)

  • User-Agent: Outlook-iOS-Android/1.0

minefield (Firefox beta version)

  • User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a1) Gecko/20070308 Minefield/3.0a1

mqqbrowser (alternative web browser, QQBrowser)

  • User-Agent: Mozilla/5.0 (Linux; U; Android 7.0; zh-cn; STF-AL00 Build/HUAWEISTF-AL00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 MQQBrowser/8.6 Mobile Safari/537.36

dragonfly (Konqueror browser running on DragonFly BSD)

  • User-Agent: Mozilla/5.0 (compatible; Konqueror/4.1; DragonFly) KHTML/4.1.4 (like Gecko)

powerpoint/ (MS Powerpoint link preview)

  • User-Agent: Microsoft Office PowerPoint/16.72.409 (Mac OS/12.2.1; Desktop; sk-SK; NonAppStore; Apple/MacBookPro15,2)

discordbot (Discord app link preview)

ptst (SpeedCurve Speed Tester)

  • User-Agent: Mozilla/5.0 (Linux; Android 8.1.0; Moto G (4)) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Mobile Safari/537.36 PTST/230504.140356

leap (Roblox app link preview)

  • User-Agent: Mozilla/5.0 (Machintosh; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/9B176 ROBLOX iOS App 2.445.410643 Hybrid RobloxApp/2.445.41063 (GlobalDist; AppleAppStore)

micromessenger (Wechat link preview)

  • User-Agent: Mozilla/5.0 (Linux; Android 9; MHA-AL00 Build/HUAWEIMHA-AL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/71.0.3578.99 Mobile Safari/537.36 MMWEBID/9772 MicroMessenger/7.0.6.1460(0x27000634) Process/tools NetType/WIFI Language/zh_CN

viber (Viber link preview)

  • User-Agent: Mozilla/5.0 (Linux; Android 10; POT-LX1 Build/HUAWEIPOT-L21; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/85.0.4183.120 Mobile Safari/537.36 Viber/20.0.2.0
  • User-Agent: Mozilla/5.0 (Linux; Android 11; RMX3085 Build/RP1A.200720.011; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/87.0.4280.141 Mobile Safari/537.36 Viber/19.9.4.0

heritrix (Achive.org bot)

archivebot (Wikimedia link checker)

androiddownloadmanager (Download manager app for Android)

  • User-Agent: AndroidDownloadManager/10 (Linux; U; Android 10; MAR-LX1A Build/HUAWEIMAR-L21A)
  • User-Agent: AndroidDownloadManager/5.1 (Linux; U; Android 5.1; Z820 Build/LMY47D)

RSS feed readers

serendeputybot

simplepie

  • User-Agent: WPeMatico SimplePie/1.5.8 (Feed Parser; http://simplepie.org/; Allow like Gecko) Build/20220731093249
  • User-Agent: SimplePie/1.8.0 (Feed Parser; http://simplepie.org/; Allow like Gecko) Build/1683288940

feedburner

feedbot

universalfeedparser

flipboardrss

theoldreader.com

  • User-Agent: Mozilla/5.0 (compatible; theoldreader.com; 2 subscribers; feed-id=f8dce230858b42b5a6e11358)

HTTP libraries

fasthttp

  • User-Agent: fasthttp

faraday v

  • User-Agent: Faraday v0.17.4

python-urllib

  • User-Agent: Python-urllib/2.7

httpclient/

  • User-Agent: Apache-HttpClient/4.5.10 (Java/1.8.0_242)

Microsoft URL Control

  • User-Agent: Microsoft URL Control - 6.00.8862

jigsaw

  • User-Agent: Jigsaw/2.2.5 W3C_CSS_Validator_JFouffa/2.0

w3c_css_validator

  • User-Agent: Jigsaw/2.2.5 W3C_CSS_Validator_JFouffa/2.0

photon/

  • User-Agent: Photon/1.0

Various services

cert.at-statistics-survey

ghost inspector

  • User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0 Safari/537.36 Ghost Inspector

alligator

  • User-Agent: Alert_Alligator_webmonitoring_service_1.6.3_(alertalligator.com)

webmon

  • User-Agent: Alert_Alligator_webmonitoring_service_1.6.3_(alertalligator.com)

backupland

Tools

aria2 (something similar to curl)

  • User-Agent: aria2/1.35.0

dune73 (Member, Author) commented Jun 8, 2023

We agreed in the June meeting that this overhaul would not work, and we are now stripping the UA-based scanner detection down to a PL1 rule covering the most malicious scanners that announce themselves in the UA. All the rest is being kicked out, since we are not able to draw a line between benign, annoying and not-so-benign scanners and bots. Creating a plugin to provide this for those who really want it would be an option, but that is not a priority / depends on a volunteer.

This PR is thus closed.

For future reference, here are two scripts that might be useful for individuals:

Labels: 👀 Needs action · ⚠️ do not merge
4 participants