feat: scanner overhaul: new rules, new data files by dune73 · Pull Request #3202 · coreruleset/coreruleset · GitHub



Closed
wants to merge 15 commits

Conversation

dune73 (Member) commented Apr 24, 2023

It's been a lot of work, but here is the PR that brings new rules to inspect the user-agent of clients:

We have 3 existing rules focusing on the user agent. 913100 and two PL2 stricter siblings:

  • 913100 PL1 Found User-Agent associated with security scanner, based on scanners-user-agents.data
  • 913101 PL2 Found User-Agent associated with scripting/generic HTTP client, based on scripting-user-agents.data
  • 913102 PL2 Found User-Agent associated with web crawler/bot, based on crawlers-user-agents.data

When we base our rules on existing online keyword lists, the distinction between scripting/generic clients, crawlers/bots and other automated agents is no longer possible. The existing sources do not make this distinction, and where they do group agents, the groups differ from source to source: the taxonomies are incompatible.

I tried to maintain certain levels of badness among the agents, but it's too much manual work. So the new approach is mostly automatic. The core is a list of automated agents that we merge from three different online sources. Out of this full list of automated agents, there is a manual list of security scanners like nmap, zgrab, nikto, etc. That list has to be maintained by hand. It's a replacement for 913100, but it's an overhaul since 913100 has been horribly outdated.
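The merge step described above could be sketched roughly like this (a sketch only; the file names and helper names are hypothetical, not the actual generator script from the PR):

```python
# Sketch of the merge: three online keyword lists are combined into
# one set of automated agents; a hand-maintained subset of security
# scanners is kept separately for the PL1 rule. File names are made up.

def load_list(path):
    """Read one keyword per line; skip blanks and '#' comments."""
    with open(path, encoding="utf-8") as f:
        return {
            line.strip().lower()
            for line in f
            if line.strip() and not line.lstrip().startswith("#")
        }

def merge_sources(paths):
    """Union of all source lists, normalized to lowercase."""
    merged = set()
    for path in paths:
        merged |= load_list(path)
    return merged

# automated = merge_sources(["source1.data", "source2.data", "source3.data"])
# scanners  = load_list("manual-security-scanners.data")  # hand-maintained
```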

At PL2 we use the automated list mentioned above minus a list of acceptable user agents. This list of acceptable user agents comes from two online sources plus a manual list.

Then at PL4 we block all automated agents including the google bot and the let's encrypt agent.

Let me repeat: we no longer distinguish between crawlers, HTTP client libraries and site security audit services like SSLLabs. Either they are benign and popular, in which case they are part of an allow-list at PL2 in order to reduce false positives, or they are not overly popular or not benign, in which case they will trigger at PL2. I attempted to differentiate between scrapers and search engine bots, and to distinguish security service tools, but I eventually gave up. It's a list of 1,900 UAs after all.

At PL4 finally, we trigger on every trace of an automated agent. This will hit GoogleBot as well as Let's Encrypt. We have to make this transparent.

Now the problem: the new manual list of security scanners is a replacement for 913100, but 913101 and 913102 are simply gone. The new rules at PL2 and PL4 are not replacements; they are still stricter siblings to a certain extent, but only to a certain extent. New rule IDs are not obvious.

I see four possible approaches:

  • Fuck you approach: 913100 PL1 stays, 913101 PL2 is replaced with list of non-acceptable agents, 913102 PL2 moves to PL4 to spot all automated agents.
  • Partial move 1: 913100 PL1 stays, 913101/2 disappear. New rules as 913103 PL2 and 913104 PL4
  • Partial move 2: 913100 PL1 stays, 913101/2 disappear. New rules as 913130 PL2 and 913131 PL4
  • Full move: 91310x are removed, new rules as 913130 PL1, 913131 PL2 and 913132 PL4

They all have their merits. For the PR, I opted for the full move option. I am open to discuss this though.

Automated agents sources:

The 3 online sources had to be transformed heavily in order to make them usable with the @pmFromFile operator and in order to avoid near-duplicates.
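That kind of transformation might look roughly like this (assumed normalization steps; the PR's actual scripts may differ):

```python
import re

def normalize(entry):
    """Fold case and spacing variants so near-duplicates collapse
    into a single @pmFromFile keyword."""
    entry = entry.strip().lower()
    return re.sub(r"\s+", " ", entry)

def transform(lines):
    """Normalize and de-duplicate while keeping first-seen order."""
    seen, out = set(), []
    for line in lines:
        key = normalize(line)
        if key and key not in seen:
            seen.add(key)
            out.append(key)
    return out
```

Since @pm matching is case-insensitive anyway, lowercasing the keywords loses nothing and makes duplicates easy to spot.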

Benign agents sources:

After retrieving the online sources and looking through them by hand several times, I also double-checked them against the commercial list of whatsmybrowser.com, a 78GB list of real-world user-agent strings. In fact, about 1/4 of the acceptable user agents are not in the whatsmybrowser database. I'm keeping them on the list nevertheless.

So here are the 3 new data files:

  • user-agents-security-scanners.data (PL1, 32 entries)
  • user-agents-non-acceptable-automated-agents.data (PL2, 1790 entries)
  • user-agents-automated-agents.data (PL4, 1933 entries)

And the new rules:

  • 913130 PL1 Found User-Agent associated with security scanner
  • 913131 PL2 Found User-Agent that is associated with non-acceptable automated user agent
  • 913132 PL4 Found User-Agent that is associated with automated user agent
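In CRS style, the PL1 rule would look roughly like this (a sketch, not the literal rule from the PR; the id, msg and data file name come from the list above, while the rest of the action list is illustrative of CRS conventions):

```apacheconf
SecRule REQUEST_HEADERS:User-Agent "@pmFromFile user-agents-security-scanners.data" \
    "id:913130,\
    phase:1,\
    block,\
    capture,\
    t:none,t:lowercase,\
    msg:'Found User-Agent associated with security scanner',\
    logdata:'Matched Data: %{TX.0} found within %{MATCHED_VAR_NAME}: %{MATCHED_VAR}',\
    severity:'CRITICAL',\
    setvar:'tx.inbound_anomaly_score_pl1=+%{tx.critical_anomaly_score}'"
```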

The PR is not yet ready to be merged. The scripts and manual source files need to be cleaned up and added before it's ready. And then the tests, of course. I'm labeling accordingly.

On Apr 24, 2023, dune73 added the labels 👀 Needs action and ⚠️ do not merge (additional work or discussion is needed despite passing tests).
RedXanadu (Member) commented Apr 24, 2023

Wow! What a list 😀

These changes appear to be large enough to deserve a new set of rule IDs. On the other hand, the meaning of PL 1 rule 913100 seems to have remained the same, so it's also arguable to keep that rule ID.

Both of those options seem sensible, IMO (full move or partial move).

theMiddleBlue (Contributor)

that's great!!

I'm writing a couple of reviews

theMiddleBlue (Contributor) commented Apr 24, 2023

I've created a script to query my production logs (over 9,922,403 entries) for each entry of the file user-agents-non-acceptable-automated-agents.data, looking for legitimate UAs. I'm going through this list and I'll open a review if I find some legit UAs:

[0s] - ELASTICSEARCH Found 309 results for 007ac9 crawler
[0s] - ELASTICSEARCH Found 389 results for 008
[2s] - ELASTICSEARCH Found 6403 results for acebookexternalhit
[2s] - ELASTICSEARCH Found 2 results for adbeat
[4s] - ELASTICSEARCH Found 325 results for ahrefs
[6s] - ELASTICSEARCH Found 47 results for ant.com
[8s] - ELASTICSEARCH Found 3 results for aria2
[11s] - ELASTICSEARCH Found 27 results for axios
[14s] - ELASTICSEARCH Found 2 results for bitlybot
[15s] - ELASTICSEARCH Found 4 results for bloglovin
[23s] - ELASTICSEARCH Found 44 results for coccoc
[26s] - ELASTICSEARCH Found 414 results for cortex
[26s] - ELASTICSEARCH Found 2764 results for craw
[28s] - ELASTICSEARCH Found 217 results for custo
[29s] - ELASTICSEARCH Found 2 results for dcrawl
[31s] - ELASTICSEARCH Found 26 results for disco
[33s] - ELASTICSEARCH Found 10000 results for dotbot
[36s] - ELASTICSEARCH Found 1 results for endo
[37s] - ELASTICSEARCH Found 309 results for e..ventures investment crawler
[37s] - ELASTICSEARCH Found 1 results for exabot
[39s] - ELASTICSEARCH Found 1 results for facebookscraper
[39s] - ELASTICSEARCH Found 309 results for fast enterprise crawler
[42s] - ELASTICSEARCH Found 318 results for fetch
[47s] - ELASTICSEARCH Found 309 results for gluten free crawler
[47s] - ELASTICSEARCH Found 10000 results for go
[51s] - ELASTICSEARCH Found 7 results for google-xrawler
[52s] - ELASTICSEARCH Found 1 results for go!zilla
[52s] - ELASTICSEARCH Found 2 results for grammarly
[55s] - ELASTICSEARCH Found 4 results for heritrix
[58s] - ELASTICSEARCH Found 119 results for httpx
[59s] - ELASTICSEARCH Found 11 results for hubspot
[59s] - ELASTICSEARCH Found 11 results for hubspot
[59s] - ELASTICSEARCH Found 309 results for ias crawler
[64s] - ELASTICSEARCH Found 2 results for ips
[64s] - ELASTICSEARCH Found 2 results for ips-agent
[65s] - ELASTICSEARCH Found 1456 results for iubenda-radar
[66s] - ELASTICSEARCH Found 2 results for jaunt
[66s] - ELASTICSEARCH Found 5 results for java
[67s] - ELASTICSEARCH Found 1 results for jobboerse
[70s] - ELASTICSEARCH Found 1474 results for khttp
[72s] - ELASTICSEARCH Found 1 results for libwww
[73s] - ELASTICSEARCH Found 37 results for liferea
[73s] - ELASTICSEARCH Found 2 results for linkbot
[74s] - ELASTICSEARCH Found 5 results for linkfluence
[77s] - ELASTICSEARCH Found 205 results for lua-resty-http
[78s] - ELASTICSEARCH Found 406 results for magpie-crawler
[78s] - ELASTICSEARCH Found 1172 results for mail
[83s] - ELASTICSEARCH Found 112 results for miniflux
[83s] - ELASTICSEARCH Found 435 results for mixdata dot com
[83s] - ELASTICSEARCH Found 1 results for moblie safari
[84s] - ELASTICSEARCH Found 40 results for monit
[122s] - ELASTICSEARCH Found 3 results for netcraft
[122s] - ELASTICSEARCH Found 309 results for netestate ne crawler
[123s] - ELASTICSEARCH Found 309 results for neticle crawler
[127s] - ELASTICSEARCH Found 40 results for nmap
[127s] - ELASTICSEARCH Found 3 results for not
[129s] - ELASTICSEARCH Found 4 results for nyu
[129s] - ELASTICSEARCH Found 10000 results for obot
[129s] - ELASTICSEARCH Found 1474 results for okhttp
[130s] - ELASTICSEARCH Found 80 results for omsc
[131s] - ELASTICSEARCH Found 3 results for open source rss
[132s] - ELASTICSEARCH Found 25 results for owler
[136s] - ELASTICSEARCH Found 14 results for pinterest.com
[136s] - ELASTICSEARCH Found 4 results for pip
[141s] - ELASTICSEARCH Found 119 results for python-httpx
[141s] - ELASTICSEARCH Found 70 results for python-urllib
[143s] - ELASTICSEARCH Found 2 results for radian6
[146s] - ELASTICSEARCH Found 309 results for rma
[148s] - ELASTICSEARCH Found 309 results for safesearch microdata crawler
[149s] - ELASTICSEARCH Found 3 results for scrapy
[152s] - ELASTICSEARCH Found 3822 results for semrush
[154s] - ELASTICSEARCH Found 1094 results for serpstatbot
[154s] - ELASTICSEARCH Found 149 results for seznam
[156s] - ELASTICSEARCH Found 10000 results for siteexplorer
[157s] - ELASTICSEARCH Found 21 results for slack
[158s] - ELASTICSEARCH Found 48 results for smtbot
[160s] - ELASTICSEARCH Found 64 results for spaziodati
[167s] - ELASTICSEARCH Found 5 results for theoldreader.com
[171s] - ELASTICSEARCH Found 1 results for twingly
[171s] - ELASTICSEARCH Found 651 results for ubermetrics-technologies
[181s] - ELASTICSEARCH Found 1484 results for webmeup-crawler
[182s] - ELASTICSEARCH Found 14 results for webprosbot
[182s] - ELASTICSEARCH Found 14 results for webpros.com
[187s] - ELASTICSEARCH Found 301 results for word
[188s] - ELASTICSEARCH Found 2 results for wpscan
[190s] - ELASTICSEARCH Found 361 results for y!j
[192s] - ELASTICSEARCH Found 131 results for zgrab
[193s] - ELASTICSEARCH Found 2 results for zoombot
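A query helper for such a script might look like this (a sketch; the index field name and the use of a wildcard query are assumptions about theMiddleBlue's setup, and `build_query` is a hypothetical helper):

```python
def build_query(keyword, field="request.headers.user-agent"):
    """Build an Elasticsearch query body that approximates @pm
    semantics: a substring match on the User-Agent field."""
    return {
        "query": {
            "wildcard": {
                # Lowercasing assumes the field is analyzed lowercase.
                field: {"value": f"*{keyword.lower()}*"}
            }
        },
        "size": 0,
        # Cap hit counting, which would explain the repeated
        # "Found 10000 results" lines in the output above.
        "track_total_hits": 10000,
    }
```

The body would then be sent once per keyword, e.g. via the Elasticsearch search API, and the reported total logged next to the keyword.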

dune73 (Member, Author) commented Apr 24, 2023

@theMiddleBlue I would be very interested in stuff that is being flagged at PL2 when it should not be. There is a lot of manual work in that allow-list that separates the PL2 list from the PL4 list.

If you happen to know the UA of a security scanner not included in the PL1 list, then please list it. Happy to expand.

theMiddleBlue (Contributor)

@theMiddleBlue I would be very interested in stuff that is being flagged at PL2 when it should not. There is a lot of manual work in that allow-list that separates the PL2 from the PL4 list.

If you happen to know the UA of a security scanner not included in the PL1 list, then please list it. Happy to expand.

sure, I've just finished reviewing rules/user-agents-non-acceptable-automated-agents.data. Basically, where I've added a review comment, it's because I would not block it at PL2 or there are FPs.

I can do the same with the other lists if needed

dune73 (Member, Author) commented Apr 24, 2023

OK, thanks for the list. Much needed review.

The user-agents-non-acceptable-automated-agents.data is the crucial file, since it's based on a more or less manual review of a 1.9K user-agents file.

semrush, ahrefs and go were meant to be eliminated from the file. Added to the allow-list again. Also added java based on your review.

What other UAs do you think should not be blocked / detected at PL2? I mean zgrab is usually not benign...

theMiddleBlue (Contributor) commented Apr 24, 2023

Based on my very personal POV (running almost everything at PL2), all 49 entries I've commented on here should be removed from the non-acceptable list:

https://github.com/coreruleset/coreruleset/pull/3202/files#diff-eb99983edad246a27f60b464ee0a6bad70ec1fcb6dc6c4233ff0c6101e585e4b

OK, I can't find a way to link the list of reviews on GitHub. Basically, loading the diff for user-agents-non-acceptable-automated-agents.data should show all 49 review comments on it.

dune73 (Member, Author) commented Apr 24, 2023

I can't find your review. All I see is your elastic excerpt and that's much more than 49 items.

(Adding the source files and the generator scripts to the PR as we speak.)

theMiddleBlue (Contributor)

to see them, I click on "Files Changed" tab on this page,
then I scroll until user-agents-non-acceptable-automated-agents.data file
then I click on "load diff"

and you should see this:

[screenshot of the review comments]

dune73 (Member, Author) commented Apr 24, 2023

Negative. I do not see this. All I see is my new data file in green. Also tried to reload. Maybe you photoshopped this. :)

logdata:'Matched Data: %{TX.0} found within %{MATCHED_VAR_NAME}: %{MATCHED_VAR}',\
tag:'application-multi',\
tag:'language-multi',\
tag:'platform-multi',\
tag:'attack-reputation-scripting',\
Review comment (Contributor):

can we still have a "category" tag for this rule? Many integrators benefit from tags for stats and reports

quick-crawler
quiterss
quora link preview
radian6
Review comment (Contributor):

RSS Reader

"Top 50 values of request.headers.user-agent","Count of records"
"R6_CommentReader(www.radian6.com/crawler)",1
"R6_FeedFetcher(www.radian6.com/crawler)",1

theMiddleBlue (Contributor)

Negative. I do not see this. All I see is my new data file in green. Also tried to reload. Maybe you photoshopped this. :)

lol, you were right I think I forgot to click on "send review"
so sorry!

dune73 (Member, Author) commented Apr 26, 2023

@theMiddleBlue Please take a peek at the comments to your review remarks where the issues are still open.

dune73 (Member, Author) commented Apr 26, 2023

Now working on the tests. Here is what happened to the old tests:

913100-1 UA: Havij -> block, no longer blocked by PL1, equivalent to new test 913131-1
913100-2 UA: Arachni -> block, equivalent to new test 913130-1
913100-3 UA: w3af -> block, equivalent to new test 913130-1
913100-4 UA: nessus -> block, equivalent to new test 913130-1
913100-5 UA: urlgrabber -> not block, no problem since we have nothing blocking this
913100-6 UA: Grabber -> block, no longer blocked at PL1, equivalent to new test 913131-1
913100-7 UA: ecairn-grabber -> no block, no problem, since it's not blocked at PL1 anymore

913101-1 UA: libwww-perl -> block
913101-2 UA: OWASP CRS test agent -> no block, replaced by new tests 913131-8 and 913132-8

913102-1 UA: blackwidow -> block, equivalent to new test 913131-1

dune73 (Member, Author) commented Apr 26, 2023

All new tests passed now.

Here are the new tests:

913130
        913130-1 Block Security Scanner
                - nikto
                        Mozilla/5.00 (Nikto/2.1.5) (Evasions:None) (Test:002942)
913131
        913131-1 Block Security Scanner
                - nikto
                        Mozilla/5.00 (Nikto/2.1.5) (Evasions:None) (Test:002942)
        913131-2 Block non-acceptable user agent from JayBizzle List
                - Goose
                        Goose/3.1.6 X-SiteSpeedApp-1
        913131-3 Block non-acceptable user agent from MichaelKrogza List
                - webbandit
                        webbandit/4.xx.0
        913131-4 Block non-acceptable user agent from MontPerrus List
                - RuxitSynthetic
                        Mozilla/5.0 (X11; Ubuntu; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81-1 Safari/537.36 RuxitSynthetic/1.0
        913131-5 Do not block acceptable user agent from CRS list
                - yisouspider
                        Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36
        913131-6 Do not block acceptable user agent from MichaelKrogza list
                - archive.org
                        Mozilla/5.0 (compatible; archive.org_bot +http://www.archive.org/details/archive.org_bot)
        913131-7 Do not block acceptable user agent from Cloudflare list
                - Seznambot
                        Mozilla/5.0 (compatible; SeznamBot/3.2-test3; +http://fulltext.sblog.cz/)
        913131-8 Do not block OWASP CRS test agent

913132
        913132-1 Block Security Scanner
                - nikto
                        Mozilla/5.00 (Nikto/2.1.5) (Evasions:None) (Test:002942)
        913132-2 Block non-acceptable user agent from JayBizzle List
                - Goose
                        Goose/3.1.6 X-SiteSpeedApp-1
        913132-3 Block non-acceptable user agent from MichaelKrogza List
                - webbandit
                        webbandit/4.xx.0
        913132-4 Block non-acceptable user agent from MontPerrus List
                - RuxitSynthetic
                        Mozilla/5.0 (X11; Ubuntu; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81-1 Safari/537.36 RuxitSynthetic/1.0
        913132-5 Block acceptable user agent from CRS list
                - yisouspider
                        Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36
        913132-6 Block acceptable user agent from MichaelKrogza list
                - archive.org
                        Mozilla/5.0 (compatible; archive.org_bot +http://www.archive.org/details/archive.org_bot)
        913132-7 Block acceptable user agent from Cloudflare list
                - Seznambot
                        Mozilla/5.0 (compatible; SeznamBot/3.2-test3; +http://fulltext.sblog.cz/)
        913132-8 Do not block OWASP CRS test agent

@@ -0,0 +1,123 @@
# This list is based on the following list:
# https://radar.cloudflare.com/traffic/verified-bots
Review comment (Contributor):

I'm not sure, but I think integrators can't have this list in their service/products because of the license

https://creativecommons.org/licenses/by-nc/4.0/

Am I wrong?

azurit (Member) commented May 8, 2023

I'm testing rule ID 913131 (data file user-agents-non-acceptable-automated-agents.data) for false positives, but I noticed that some keywords can be removed because they are matched by other keywords (see below).

While it seems OK to remove them for performance reasons, I'm not 100% sure about it, as we will lose some information (for example the keywords icbot and epicbot: both are matched by icbot itself, BUT they match completely different bots). I'm thinking about also generating the data files for @pm from a list of keywords as input (with all information), producing a performance-optimal output (without duplicate matches).

aboundex is in aboundexbot
acoon is in acoonbot
adbeat is in adbeat_bot
aibot is in molokaibot
anyevent is in anyevent-http/
b0t is in siteshooter b0t
bandit is in webbandit
blow is in blowfish
chlooe is in bot-pge.chlooe.com
collector is in feedzcollector
collector is in webimagecollector
collector is in www-collector-e
copier is in webcopier
cosmos is in cosmos4j.feedback
dsearch is in addsearchbot
evil is in devil
extractor is in linkextractorpro
extractor is in websiteextractor
foobot is in infoobot
frontpage is in msfrontpage
fuzz is in fuzz faster
fuzz is in jbrofuzz
fuzz is in wfuzz/
gigablast is in gigablastopensource
google-adwords is in google-adwords-instant
grabber is in eirgrabber
grabber is in pagegrabber
harvest is in nlnz_iaharvester
httrack is in winhttrack
hubspot is in hubspot-link-resolver
hubspot is in hubspot-link-resolver
icbot is in epicbot
icbot is in semanticbot
idbot is in gridbot
idbot is in hybridbot
infegy is in collection@infegy.com
jobboerse is in jobboersebot
lexibot is in alexibot
lighthouse is in chrome-lighthouse
linkbot is in rankactivelinkbot
linkdex is in linkdexbot
magnet is in whynder magnet
mail/ is in polymail/
megaindex is in megaindex.ru
mixnode is in mixnodecache
mr.4x3 is in mr.4x3 powered
netcraft is in netcraftsurveyagent
netresearch is in netresearchserver
ninja is in internet ninja
ninja is in notifyninja
not is in annotate_google
not is in blocknote.net
not is in cispa vulnerability notification
not is in downnotifier
not is in notifixious
not is in notifyninja
npm/ is in pnpm/
oncrawl is in ioncrawl
pageanalyzer is in retrevopageanalyzer
pagethin is in pagething
plumanalytics is in com.plumanalytics
pr-cy.ru is in a.pr-cy.ru
rankactive is in rankactivelinkbot
reaper is in the drop reaper
reaper is in webreaper
re-re is in re-re studio
ripper is in stripper
ripper is in webstripper
ripz is in siteripz
rocketcrawler is in lssrocketcrawler
rssbot is in linqiarssbot
rssbot is in naver blog rssbot
scanbot is in interfaxscanbot
scrapy is in redesscrapy
screaming is in screaming frog seo spider
seobility is in seobilitybot
seokicks is in seokicks-robot
seostar is in seostar..co
siphon is in email siphon
sitemap is in ultimate_sitemap_parser
sonic is in ranksonicsiteauditor
stripper is in webstripper
sucker is in image sucker
sucker is in site sucker
sucker is in sitesucker
sucker is in web sucker
sucker is in websucker
teleport is in teleportpro
trendsmap is in trendsmapresolver
turnitin is in turnitinbot
vigil is in sitevigil
voil is in voilabot
wallpapers is in wallpapershd
webdav is in microsoft-webdav-miniredir
wesee is in wesee:search
whack is in webwhacker
whack is in whacker
whacker is in webwhacker
widow is in blackwidow
xenu is in xenu link sleuth
yacy is in yacybot
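The redundancy check azurit describes boils down to a pairwise substring test (a sketch; the function name is made up):

```python
def redundant_keywords(keywords):
    """For @pm/@pmFromFile matching, a keyword that contains another
    keyword as a substring is redundant: any input matching the longer
    entry already matches the shorter one. Returns (short, long) pairs
    in the 'short is in long' format used above."""
    return [
        (short, long_)
        for short in keywords
        for long_ in keywords
        if short != long_ and short in long_
    ]
```

As azurit notes, dropping the longer entries is lossless for detection but loses the information about which bot actually matched.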

azurit (Member) commented May 12, 2023

Sending info about false positives for rule 913131 (data file user-agents-non-acceptable-automated-agents.data).

I was gathering data on one of my servers for about a week. I got 2800+ matched requests, which I processed by hand, BUT I caught only 136 different keywords from the data file. The false positives are split into groups. I used this format:

keyword (explanation)
 - example1 of matched User-Agent header
 - example2 of matched User-Agent header
 - ...
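Generating that report from (keyword, User-Agent) match pairs is straightforward (a sketch, assuming the pairs were already extracted from the audit log):

```python
from collections import defaultdict

def format_report(matches):
    """Group example User-Agent headers under each matched keyword,
    in the 'keyword / - example' layout described above."""
    grouped = defaultdict(list)
    for keyword, ua in matches:
        if ua not in grouped[keyword]:   # keep each example once
            grouped[keyword].append(ua)
    lines = []
    for keyword in sorted(grouped):
        lines.append(keyword)
        lines.extend(f" - {ua}" for ua in grouped[keyword])
    return "\n".join(lines)
```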

Matches which i consider as FPs for sure

zabbix (monitoring software)

  • User-Agent: Zabbix

not (too generic)

  • User-Agent: Mozilla/5.0 (Linux; Android 12; Mi Note 10 Lite) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Mobile Safari/537.36

mediapartners (Google AdSense Bot)

  • User-Agent: Mediapartners-Google
  • User-Agent: Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/537.36 (KHTML, like Gecko; Mediapartners-Google) Chrome/112.0.5615.142 Mobile Safari/537.36

developers.google (Google web preview renderer)

Chrome-Lighthouse (Google PageSpeed Insights)

  • User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4590.2 Safari/537.36 Chrome-Lighthouse

lighthouse (Google PageSpeed Insights)

  • User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4590.2 Safari/537.36 Chrome-Lighthouse

lcc (too generic)

  • User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E) QQBrowser/6.9.11079.201
  • User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C; .NET4.0E; 360SE)

WhatsApp (link preview for WhatsApp)

  • User-Agent: WhatsApp/2.23.8.76 A

google web preview (web preview renderer for Chrome browser)

  • User-Agent: Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; Google Web Preview) Version/3.1 Safari/525.13

via ggpht.com googleimageproxy (Gmail image openings anonymizer)

  • User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko Firefox/11.0 (via ggpht.com GoogleImageProxy)

microsoft office (MS Office link preview)

  • User-Agent: Microsoft Office/15.0 (Windows NT 6.3; Microsoft Outlook 15.0.5537; Pro)

microsoft outlook (MS Outlook link preview)

  • User-Agent: Microsoft Office/16.0 (Windows NT 10.0; Microsoft Outlook 16.0.16327; Pro)

googledocs (GoogleDocs link preview)

  • User-Agent: Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; +http://docs.google.com/)
  • User-Agent: GoogleDocs

snap url preview service (Snap link preview)

wp rocket (WordPress cache plugin preloader)

  • User-Agent: WP Rocket/Partial_Preload

kinza (alternative web browser)

  • User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36 Kinza/4.7.2

excel/ (MS Excel link preview)

  • User-Agent: Microsoft Office Excel/16.72.409 (Mac OS/12.3; Desktop; sk-SK; NonAppStore; Apple/MacBookAir10,1)

macoutlook/ (Ms Outlook for Mac link preview)

  • User-Agent: MacOutlook/16.72.23043

outlook-ios (MS Outlook for iOS link preview)

  • User-Agent: Outlook-iOS-Android/1.0

minefield (Firefox beta version)

  • User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a1) Gecko/20070308 Minefield/3.0a1

mqqbrowser (alternative web browser, QQBrowser)

  • User-Agent: Mozilla/5.0 (Linux; U; Android 7.0; zh-cn; STF-AL00 Build/HUAWEISTF-AL00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 MQQBrowser/8.6 Mobile Safari/537.36

dragonfly (Konqueror browser running on DragonFly BSD)

  • User-Agent: Mozilla/5.0 (compatible; Konqueror/4.1; DragonFly) KHTML/4.1.4 (like Gecko)

powerpoint/ (MS Powerpoint link preview)

  • User-Agent: Microsoft Office PowerPoint/16.72.409 (Mac OS/12.2.1; Desktop; sk-SK; NonAppStore; Apple/MacBookPro15,2)

discordbot (Discord app link preview)

ptst (SpeedCurve Speed Tester)

  • User-Agent: Mozilla/5.0 (Linux; Android 8.1.0; Moto G (4)) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Mobile Safari/537.36 PTST/230504.140356

leap (Roblox app link preview)

  • User-Agent: Mozilla/5.0 (Machintosh; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/9B176 ROBLOX iOS App 2.445.410643 Hybrid RobloxApp/2.445.41063 (GlobalDist; AppleAppStore)

micromessenger (Wechat link preview)

  • User-Agent: Mozilla/5.0 (Linux; Android 9; MHA-AL00 Build/HUAWEIMHA-AL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/71.0.3578.99 Mobile Safari/537.36 MMWEBID/9772 MicroMessenger/7.0.6.1460(0x27000634) Process/tools NetType/WIFI Language/zh_CN

viber (Viber link preview)

  • User-Agent: Mozilla/5.0 (Linux; Android 10; POT-LX1 Build/HUAWEIPOT-L21; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/85.0.4183.120 Mobile Safari/537.36 Viber/20.0.2.0
  • User-Agent: Mozilla/5.0 (Linux; Android 11; RMX3085 Build/RP1A.200720.011; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/87.0.4280.141 Mobile Safari/537.36 Viber/19.9.4.0

heritrix (Achive.org bot)

archivebot (Wikimedia link checker)

androiddownloadmanager (Download manager app for Android)

  • User-Agent: AndroidDownloadManager/10 (Linux; U; Android 10; MAR-LX1A Build/HUAWEIMAR-L21A)
  • User-Agent: AndroidDownloadManager/5.1 (Linux; U; Android 5.1; Z820 Build/LMY47D)

RSS feed readers

serendeputybot

simplepie

  • User-Agent: WPeMatico SimplePie/1.5.8 (Feed Parser; http://simplepie.org/; Allow like Gecko) Build/20220731093249
  • User-Agent: SimplePie/1.8.0 (Feed Parser; http://simplepie.org/; Allow like Gecko) Build/1683288940

feedburner

feedbot

universalfeedparser

flipboardrss

theoldreader.com

  • User-Agent: Mozilla/5.0 (compatible; theoldreader.com; 2 subscribers; feed-id=f8dce230858b42b5a6e11358)

HTTP libraries

fasthttp

  • User-Agent: fasthttp

faraday v

  • User-Agent: Faraday v0.17.4

python-urllib

  • User-Agent: Python-urllib/2.7

httpclient/

  • User-Agent: Apache-HttpClient/4.5.10 (Java/1.8.0_242)

Microsoft URL Control

  • User-Agent: Microsoft URL Control - 6.00.8862

jigsaw

  • User-Agent: Jigsaw/2.2.5 W3C_CSS_Validator_JFouffa/2.0

w3c_css_validator

  • User-Agent: Jigsaw/2.2.5 W3C_CSS_Validator_JFouffa/2.0

photon/

  • User-Agent: Photon/1.0

Various services

cert.at-statistics-survey

ghost inspector

  • User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0 Safari/537.36 Ghost Inspector

alligator

  • User-Agent: Alert_Alligator_webmonitoring_service_1.6.3_(alertalligator.com)

webmon

  • User-Agent: Alert_Alligator_webmonitoring_service_1.6.3_(alertalligator.com)

backupland

Tools

aria2 (something similar to curl)

  • User-Agent: aria2/1.35.0

dune73 (Member, Author) commented Jun 8, 2023

We agreed in the June meeting that this overhaul would not work, and we are now stripping the UA-based scanner detection down to a PL1 rule covering the most malicious scanners that announce themselves in the UA. All the rest is being kicked out, since we are not able to draw a line between benign, annoying and not-so-benign scanners and bots. Creating a plugin to provide this for those who really want it would be an option, but that is not a priority / depends on a volunteer.

This PR is thus closed.

For future reference, here are two scripts that might be useful for individuals:

Labels: 👀 Needs action · ⚠️ do not merge
4 participants