8000 Proposal for a new plugin: Machine Learning on ModSecurity by fggillie · Pull Request #2067 · coreruleset/coreruleset · GitHub

Proposal for a new plugin: Machine Learning on ModSecurity #2067


Closed
wants to merge 1 commit into from

Conversation

fggillie
@fggillie fggillie commented May 7, 2021

This PR is a proposal for a plugin to ease the integration of Machine Learning (ML) in ModSecurity.

How it works

The core idea is to use ML in combination with the CRS, by having a double-check on suspicious requests. For this reason, ML is triggered only for requests for which the anomaly score exceeds the threshold. This is performed by chaining rule 949110 with a "ML rule".

The "ML rule" is a Lua script calling the ML model running on a server in an external container/pod. This obviously adds latency due to communication overhead but has the advantage of only loading and instantiating the ML model once at server start and not for each incoming request.
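The chaining described above could look roughly like the following sketch. This is a hypothetical illustration, not the PR's actual rule text: the rule ID 949110 comes from the description above, but the script path, message, and exact actions are assumptions.

```apache
# Hypothetical sketch: block only when the CRS anomaly threshold is
# reached AND the external ML service confirms the request is an attack.
SecRule TX:ANOMALY_SCORE "@ge %{tx.inbound_anomaly_score_threshold}" \
    "id:949110,\
    phase:2,\
    deny,\
    t:none,\
    msg:'Inbound Anomaly Score Exceeded, confirmed by ML',\
    chain"
    # The chained Lua rule only matches (and thus blocks) when the
    # ML model running in the external container also says "attack".
    SecRuleScript "/etc/crs/plugins/ml-client.lua"
```

Because the Lua script only runs after the threshold check matches, the ML service is contacted solely for already-suspicious requests, which keeps the added latency off the hot path.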

Attached is an example of a server running the ML model that can be reached by the Lua script. It uses the Flask library, which is definitely not the fastest option.
dummy_app.txt
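Since the attachment itself is not reproduced here, the following is a minimal stand-in for what such a server could look like. The attachment uses Flask; this sketch uses only the Python standard library, and the `/`-route payload fields and the toy scoring logic are illustrative assumptions, not the contents of dummy_app.txt.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features: dict) -> int:
    """Toy stand-in for the ML model: flag requests whose args contain
    an obviously suspicious token. A real model would be loaded once at
    server start, which is the whole point of the external server."""
    blob = json.dumps(features.get("args", {}))
    return 1 if "<script" in blob.lower() else 0

class MLHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON feature payload sent by the Lua client.
        length = int(self.headers.get("Content-Length", 0))
        features = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps({"block": predict(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port: int = 5000) -> None:
    # The model would be loaded here, once, before serving requests.
    HTTPServer(("0.0.0.0", port), MLHandler).serve_forever()
```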

What needs to be discussed

I'd be glad to discuss the following (and any other suggestion you may have) with you:

  • With this framework, the anomaly score is set but can be ignored if ML decides not to block the request, which can cause confusion in the logs
  • This framework covers the case where we want to call ML for suspicious requests only. Maybe we would like something more general?
  • Is there a way and would it be interesting to parallelize the ML and CRS execution?

@azurit
Member
azurit commented May 7, 2021

My very first, maybe dumb, question: What is ML?

@airween
Contributor
airween commented May 7, 2021

My very first, maybe dumb, question: What is ML?

same question from me :)

@fggillie fggillie changed the title Proposal for a new plugin: ML on ModSecurity Proposal for a new plugin: Machine Learning on ModSecurity May 7, 2021
@fggillie
Author
fggillie commented May 7, 2021

ML = machine learning, oops :)
I've fixed the title of the PR.

@dune73
Member
dune73 commented May 7, 2021

Thank you for your submission @fggillie. That was fast.

For the record: I asked @fggillie to submit her work (-> EPFL Master Thesis) here so it could be polished into an official plugin. We do not yet have a proper process to discuss / refine new plugins, so I thought it fitting to draft the PR here; once we deem it done, we create a separate repo for the new plugin.

This is a good base for a start, and it has been a long-standing desire to take the pain out of combining Machine Learning and the CRS. The problem is that people need to learn ModSec/CRS first before they can concentrate on ML, despite their interest being with ML.

As listed above, there are several open questions around this functionality.

Here are a few thoughts

The extension to 949110 (that we need to pluginize somehow) sets the terminal tx.inbound_anomaly_score, yet in a chained rule it decides it is not good enough to actually block the request, since the ML rig counseled against it. It's like combining CRS with an external / 3rd party rule (set) and we're only blocking when both agree it should be blocked. This is a means to reduce false positives: It is an AND connection between CRS AND ML.

This is an interesting concept, yet it is a new concept and we have to think it through. As Floriane mentioned, it will lead to situations where the anomaly threshold is reached, yet the request is not blocked. In fact, the way it is done in the PR now, there won't even be a trace in the log in such a case.

An alternative way to call the external ML would be to integrate a scoring rule that does the ML call. That would then not be used to fight false positives, but would simply be an additional detection rule that scores. I think the plugin should allow this as well, or even primarily this option. That would result in an OR connection between CRS and ML: CRS scores, or the new plugin scores, and if one or the combination of the two results in a high anomaly score, we have a hit.

Performance-wise, this is a big difference, though, and it is possible that executing Lua for every request is too heavy. (Thought: only execute ML for certain requests, like POST requests, or those requests where the anomaly threshold is within reach. There is no need to execute ML when the anomaly score is 0, this is the last rule to execute, and it can only score 5 with 10 being the limit. In this situation you should only execute the ML rule if the score is already greater than or equal to 5, since only then is there the potential to actually hit or exceed the anomaly threshold.)
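The short-circuit described in the parenthetical thought above is simple arithmetic. As a sketch (the function name and signature are illustrative; in CRS terms, the threshold lives in tx.inbound_anomaly_score_threshold):

```python
def should_call_ml(current_score: int, max_remaining: int, threshold: int) -> bool:
    """Only call the external ML service when the request can still reach
    the anomaly threshold: the current score plus the maximum number of
    points the remaining rules (including the ML rule) can add must be
    able to hit or exceed the threshold."""
    return current_score + max_remaining >= threshold

# Example from the comment above: threshold 10, the ML rule runs last
# and can add at most 5 points.
should_call_ml(0, 5, 10)  # False: 0 + 5 can never reach 10, skip ML
should_call_ml(5, 5, 10)  # True: 5 + 5 can hit the threshold
```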

A big potential of ML is to take the context / the session into consideration when examining a request. It is possible to do this with ModSecurity, but it's really tedious. With ML, however, you can follow the flow of the application really easily. Piece of cake. However, the way the Lua script presents itself, it forwards neither the client's cookies nor the client's IP address. So that would be a useful addition.

It might be interesting from a performance perspective to parallelize the execution. Right now, we branch into the lua script after the threshold is reached. Maybe we could branch very early in the request, execution continues and then in 949110, we poll for the result of the ML call.
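The "branch early, poll late" idea could be sketched with a thread pool: fire the ML request as soon as the request features are available, let rule processing continue, and only wait for the verdict at the blocking decision. This is an illustrative sketch only; `fetch_ml_verdict` and the hard-coded scores are placeholder assumptions, not ModSecurity APIs.

```python
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

def fetch_ml_verdict(features: dict) -> int:
    # Placeholder for the HTTP call to the external ML service;
    # returns 1 for "attack", 0 for "benign".
    return 0

def handle_request(features: dict) -> bool:
    # Early in the transaction: start the ML call in the background.
    future = executor.submit(fetch_ml_verdict, features)
    # ... CRS rules run here and accumulate the anomaly score ...
    anomaly_score, threshold = 7, 5  # illustrative values
    if anomaly_score >= threshold:
        # Equivalent of polling for the result in 949110: block only if
        # both CRS and ML agree (the AND combination discussed above).
        return future.result(timeout=1.0) == 1
    return False
```

The latency of the ML round trip is then hidden behind the CRS rule evaluation instead of being added on top of it.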

We ought to look into the scanning of responses too, since there is substantial potential for ML to detect data leakages; those cases will differ significantly from the standard RESPONSE_BODY.

@azurit
Member
azurit commented May 7, 2021

I think most of the problems you mentioned can be easily resolved inside a Lua script.

@azurit
Member
azurit commented May 7, 2021

Are we talking about any concrete ML solution or is it supposed to be some kind of generic ML integration?

@dune73
Member
dune73 commented May 7, 2021

Ideally generic. What we see so far is generic. One could then accompany it with a reference use case in the form of a blog post / tutorial.

@fzipi
Member
fzipi commented Aug 2, 2021

JFYI: we have a similar project that would generate a generic plugin by extending ModSecurity with a new operator. Nothing production-ready yet, but I'm expecting to have something in the next quarter.

@vloup
vloup commented Nov 8, 2021

My 2 cents on this PR that I saw way too late.

client.lua could use cjson to handle the JSON encoding. You just need to ensure that lua-cjson is available (on top of lua-socket). The inside of the Lua main function could look like this:

-- client.lua
local ltn12 = require("ltn12")
local http = require("socket.http")
local cjson = require("cjson")

local url = 'http://ml-server-name:5000/'
function main()
  local method = m.getvar("REQUEST_METHOD")
  local path = m.getvar("REQUEST_FILENAME")
  local hour = m.getvar("TIME_HOUR")
  local day = m.getvar("TIME_DAY")
  local args = m.getvars("ARGS")

  local args_dict = {}
  -- transform the args array into a string following JSON format
  if args ~= nil then
    for k, v in pairs(args) do
      local name = v["name"]
      local value = v["value"]
      value = value:gsub('"', "$#$") -- not a fan of this, but if it depends on the external service, don't change it.
      args_dict[name] = value
    end
  end
  end
  local args_str = cjson.encode(args_dict)

--... rest of the function stays identical...
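For reference, the same transformation expressed in Python (a sketch to show the resulting payload; the `$#$` substitution mirrors the gsub above and is a quirk of the external service, not a general JSON requirement):

```python
import json

def encode_args(args: list) -> str:
    """Mirror of the Lua loop: collapse ModSecurity's ARGS collection
    (a list of name/value pairs) into a JSON object string."""
    args_dict = {}
    for item in args:
        args_dict[item["name"]] = item["value"].replace('"', "$#$")
    return json.dumps(args_dict)

encode_args([{"name": "q", "value": 'say "hi"'}])
# → '{"q": "say $#$hi$#$"}'
```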

@dune73
Member
dune73 commented Nov 8, 2021

Thank you for chiming in. Are you working on ML as well?

@vloup
vloup commented Nov 20, 2021

@dune73 No. @fggillie plugged her work into the ModSecurity instance I manage, and it lacked cjson; since then, I have installed this Lua dependency. It just took me six months to discover this PR and mention this change.

@dune73
Member
dune73 commented Nov 22, 2021

Thanks for the info, @vloup. Care contacting me via DM? folini@netnea.com

@franbuehler
Contributor

@deepshikha-s created the Machine Learning Integration Plugin as a CRS plugin in a Google Summer of Code 2022 project.
The public and available work has been integrated into this referenced plugin. Unfortunately, the model used in this PR here was not based on public data.

This PR can be closed. All work has been integrated into the referenced CRS Machine Learning Integration Plugin.
