feat: scanner overhaul: new rules, new data files #3202
Conversation
Wow! What a list 😀 These changes appear to be large enough to deserve a new set of rule IDs. On the other hand, the meaning of PL1 rule 913100 seems to have remained the same, so it's also arguable to keep that rule ID. Both of those options seem sensible, IMO (full move or partial move).
that's great!! I'm writing a couple of reviews
I've created a script to query my production (over 9,922,403 logs) for each entry of the file
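A minimal sketch of such a per-entry query, assuming an Elasticsearch backend behind the logs, could look like this; the host, index and data file names are placeholders, and the field name is taken from the user-agent excerpt further down the thread:

```python
#!/usr/bin/env python3
# Hedged sketch: count how many production log entries match each keyword
# from a CRS .data file. Assumes an Elasticsearch backend; host, index and
# data file names are placeholders and need to be adapted.
import requests

ES_URL = "http://localhost:9200"          # placeholder
INDEX = "weblogs-*"                       # placeholder
FIELD = "request.headers.user-agent"      # field name as seen in the excerpt below

def count_matches(keyword: str) -> int:
    query = {"query": {"match_phrase": {FIELD: keyword}}}
    r = requests.post(f"{ES_URL}/{INDEX}/_count", json=query, timeout=30)
    r.raise_for_status()
    return r.json()["count"]

with open("scanners-user-agents.data") as f:   # placeholder data file name
    for line in f:
        keyword = line.strip()
        if not keyword or keyword.startswith("#"):
            continue
        print(f"{count_matches(keyword)}\t{keyword}")
```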
@theMiddleBlue I would be very interested in stuff that is being flagged at PL2 when it should not. There is a lot of manual work in that allow-list that separates the PL2 from the PL4 list. If you happen to know the UA of a security scanner not included in the PL1 list, then please list it. Happy to expand.
sure, I've just finished reviewing. I can do the same with the other lists if needed
OK, thanks for the list. Much needed review. The
What other UAs do you think should not be blocked / detected at PL2? I mean
Based on my really personal POV (running almost everything at PL2), all the 49 entries I've commented. Ok, I can't find a way to link the list of reviews on GitHub. Basically loading the diff for
I can't find your review. All I see is your elastic excerpt and that's much more than 49 items. (Adding the source files and the generator scripts to the PR as we speak.)
Negative. I do not see this. All I see is my new data file in green. Also tried to reload. Maybe you photoshopped this. :)
logdata:'Matched Data: %{TX.0} found within %{MATCHED_VAR_NAME}: %{MATCHED_VAR}',\
tag:'application-multi',\
tag:'language-multi',\
tag:'platform-multi',\
tag:'attack-reputation-scripting',\
can we still have a "category" tag for this rule? Many integrators benefit from tags for stats and reports
quick-crawler
quiterss
quora link preview
radian6
RSS Reader
"Top 50 values of request.headers.user-agent","Count of records"
"R6_CommentReader(www.radian6.com/crawler)",1
"R6_FeedFetcher(www.radian6.com/crawler)",1
lol, you were right. I think I forgot to click on "send review"
@theMiddleBlue Please take a peek at the comments on your review remarks where the issues are still open.
Now working on the tests. Here is what happened to the old tests:
913100-1 UA: Havij -> block, no longer blocked by PL1, equivalent to new test 913131-1
913101-1 UA: libwww-perl -> block
913102-1 UA: blackwidow -> block, equivalent to new test 913131-1
All new tests passed now. Here are the new tests:
@@ -0,0 +1,123 @@
# This list is based on the following list:
# https://radar.cloudflare.com/traffic/verified-bots |
I'm not sure, but I think integrators can't include this list in their services/products because of the license
https://creativecommons.org/licenses/by-nc/4.0/
Am I wrong?
I'm testing rule ID . While it seems to be OK to remove them because of performance, I'm not 100% sure about it, as we will lose some information (for example keywords
Sending info about false positives for rule . I was gathering data on one of my servers for about a week; I got 2800+ matched requests, which I processed by hand, but I caught only 136 different keywords from the data file. False positives are split into groups. I used this format:
Matches which I consider as FPs for sure
zabbix (monitoring software)
not (too generic)
mediapartners (Google AdSense Bot)
developers.google (Google web preview renderer)
Chrome-Lighthouse (Google PageSpeed Insights)
lighthouse (Google PageSpeed Insights)
lcc (too generic)
WhatsApp (link preview for WhatsApp)
google web preview (web preview renderer for Chrome browser)
via ggpht.com googleimageproxy (Gmail image openings anonymizer)
microsoft office (MS Office link preview)
microsoft outlook (MS Outlook link preview)
googledocs (GoogleDocs link preview)
snap url preview service (Snap link preview)
wp rocket (WordPress cache plugin preloader)
kinza (alternative web browser)
excel/ (MS Excel link preview)
macoutlook/ (MS Outlook for Mac link preview)
outlook-ios (MS Outlook for iOS link preview)
minefield (Firefox beta version)
mqqbrowser (alternative web browser, QQBrowser)
dragonfly (Konqueror browser running on DragonFly BSD)
powerpoint/ (MS Powerpoint link preview)
discordbot (Discord app link preview)
ptst (SpeedCurve Speed Tester)
leap (Roblox app link preview)
micromessenger (Wechat link preview)
viber (Viber link preview)
heritrix (Archive.org bot)
archivebot (Wikimedia link checker)
androiddownloadmanager (Download manager app for Android)
RSS feed readers
serendeputybot
simplepie
feedburner
feedbot
universalfeedparser
flipboardrss
theoldreader.com
HTTP libraries
fasthttp
faraday v
python-urllib
httpclient/
Microsoft URL Control
jigsaw
w3c_css_validator
photon/
Various services
cert.at-statistics-survey
ghost inspector
alligator
webmon
backupland
Tools
aria2 (something similar to
We agreed in the June meeting that this overhaul would not work, and we are now stripping the UA-based scanner detection down to the most malicious scanners announcing themselves in the UA, in a PL1 rule. All the rest is being kicked out, since we are not able to draw a line between benign, annoying and not-so-benign scanners and bots. Creating a plugin to provide this for those who really want it would be an option, but that is not a priority / depends on a volunteer. This PR is thus closed. For future reference, here are two scripts that might be useful for individuals:
It's been a lot of work, but here is the PR that brings new rules to inspect the user-agent of clients:
We have 3 existing rules focusing on the user agent. 913100 and two stricter PL2 siblings:
When we switch to basing our rules on existing online keyword lists, the distinction between scripting/generic, crawler/bots and other automated agents is no longer possible. The existing sources do not make this distinction. And where they do group their entries, the groups differ: incompatible taxonomies between sources, etc.
I tried to maintain certain levels of badness among the agents, but it's too much manual work. So the new approach is mostly automatic. The core is a list of automated agents that we merge from three different online sources. Out of this full list of automated agents, we carve out a manual list of security scanners like nmap, zgrab, nikto, etc. That list has to be maintained by hand. It's a replacement for 913100, but it's an overhaul, since 913100 had become horribly outdated.
At PL2 we use the automated list mentioned above, minus a list of acceptable user agents. This list of acceptable user agents has two online sources and a manual list.
Then at PL4 we block all automated agents, including the Google bot and the Let's Encrypt agent.
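A minimal sketch of how the PL2 list could be derived from the two lists described above (file names are placeholders; the actual generator scripts are not shown in this thread):

```python
#!/usr/bin/env python3
# Hedged sketch: build the PL2 data file by removing the acceptable
# (allow-listed) user agents from the merged list of automated agents.
# The PL4 rule would simply use the full merged list.
# File names are placeholders, not the ones used in the PR.

def load(path: str) -> set[str]:
    with open(path) as f:
        return {line.strip().lower() for line in f
                if line.strip() and not line.startswith("#")}

automated = load("automated-user-agents-full.data")    # merged from 3 online sources
acceptable = load("acceptable-user-agents.data")        # 2 online sources + manual list

pl2 = sorted(automated - acceptable)

with open("automated-user-agents-pl2.data", "w") as f:
    f.write("\n".join(pl2) + "\n")

print(f"{len(automated)} automated, {len(acceptable)} acceptable, {len(pl2)} left at PL2")
```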
Let me repeat: we no longer distinguish between crawlers, HTTP client libraries and site security audit services like SSLLabs. Either they are benign and popular, in which case they are part of an allow-list at PL2 in order to reduce false positives, or they are not overly popular or not benign, in which case they trigger at PL2. I attempted to differentiate between scrapers and search engine bots first, and to distinguish security service tools, but I gave up eventually. It's a list of 1900 UAs after all.
At PL4 finally, we trigger on every trace of an automated agent. This will hit GoogleBot as well as Let's Encrypt. We have to make this transparent.
Now the problem: the new manual list of security scanners is a replacement for 913100, but 913101 and 913102 are simply gone. The new rules at PL2 and PL4 are not replacements. They are still stricter siblings to a certain extent, but only to a certain extent. New rule IDs are not obvious.
I see four possible approaches:
They all have their merits. For the PR, I opted for the full move option. I am open to discussing this, though.
Automated agents sources:
The 3 online sources had to be transformed heavily in order to make them usable with the @pmFromFile operator and in order to avoid near-duplicates.
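The concrete transformations are not spelled out in this thread, so the following is only an illustrative sketch of the kind of normalization and near-duplicate removal that @pmFromFile lists typically need; the file names and the specific rules are assumptions:

```python
#!/usr/bin/env python3
# Hedged sketch: normalize UA keywords from several online sources so they
# work as @pmFromFile phrases and do not contain near-duplicates.
# The concrete steps (lowercasing, stripping versions, substring folding)
# are assumptions; the PR's real transformation scripts may differ.
import re

def normalize(entry: str) -> str:
    entry = entry.strip().lower()
    entry = re.sub(r"/[\d.]+$", "", entry)      # drop trailing version numbers
    entry = re.sub(r"\s+", " ", entry)          # collapse whitespace
    return entry

def fold_substrings(entries: set[str]) -> list[str]:
    # @pm matches phrases anywhere in the input, so an entry that contains
    # an already-kept shorter entry is redundant and can be dropped.
    keep = []
    for e in sorted(entries, key=len):
        if not any(k in e for k in keep):
            keep.append(e)
    return sorted(keep)

sources = ["source-a.txt", "source-b.txt", "source-c.txt"]   # placeholder file names
merged = set()
for path in sources:
    with open(path) as f:
        merged.update(normalize(l) for l in f if l.strip() and not l.startswith("#"))

for entry in fold_substrings(merged):
    print(entry)
```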
Benign agents sources:
After retrieving the online sources and looking through them by hand several times, I also double-checked them against the commercial list of whatsmybrowser.com, a 78GB list of real-world user-agent strings. In fact, about 1/4 of the acceptable user agents are not in the whatsmybrowser database. I'm keeping them on the list nevertheless.
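For reference, a cross-check like that could be sketched as follows, assuming the corpus is available as one user-agent string per line; the file names are placeholders and this is not the script used for the PR:

```python
#!/usr/bin/env python3
# Hedged sketch: stream a large corpus of real-world user-agent strings and
# report which allow-list entries never occur in it. Slow but simple;
# file names are placeholders.

with open("acceptable-user-agents.data") as f:
    keywords = {l.strip().lower() for l in f if l.strip() and not l.startswith("#")}

seen = set()
with open("real-world-user-agents.txt", errors="replace") as corpus:
    for ua in corpus:
        ua = ua.lower()
        for kw in keywords - seen:
            if kw in ua:
                seen.add(kw)

for kw in sorted(keywords - seen):
    print(f"not found in corpus: {kw}")
```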
So here are the 3 new data files:
And the new rules:
The PR is not yet ready to be merged. The scripting and the manual source files need to be cleaned up and added before it's ready. And then the tests, of course. I'm labeling accordingly.