-
-
Notifications
You must be signed in to change notification settings - Fork 407
feat: PHP functions generator #3273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
097440b
to
1938a9e
Compare
Thank you very much. Very good progress here. A few remarks. frequency stats fileI would prefer not to commit 933160The second item in your bullet list is not clear to me. I think that anything marked as English word in 933161 should be added manually to 933160, but this is probably what you had in mind. 933161Agreed 933150 and 933151The limit of 100 is completely arbitrary. Adjusting it seems feasible to me. I'd aim for a similar size as we have now for 933150. Am I getting you right, that your description brings the PHP function list that would be removed from 933150 if we raise the frequency limit from 100 to 150 and then further down to 180? |
Do you have an update here for the meeting tonight? We should really get this done now. |
1938a9e
to
9715299
Compare
Here is an update and a sort of recap raising key questions for each rule: frequency stats file
I agree, I added it to the draft PR only if anyone wanted to play with it. I will remove it before marking the PR as ready for review unless further decisions
The
933160
Are we sure about this? 933161
933150 and 933151
|
frequency stats fileAs of this writing, I think somebody should run the script every couple of month. The latency is acceptable. Setting up multiple tokens could be seen as misuse. I would rather refrain from that approach. 933160Correcting my remark: Anything in 933161 and on the high-risk list should be added. But that's a given anyways. Forget my remark. I see an overlap between current 933150 and 933160. 933160 contains non-English words, stuff that would better suit 933150, even if it does not pass the frequency check. So how about adding a 2nd source to 933150: A manual list of high-risk php functions not being an English word and not necessarily being used frequently? This would clean up 933160 and the entire rule group I think. 933161Yes, the new spell.sh has been merged in June. #3242 933160 items should be added to 933161as well. See mermaid. 933150 and 933151The drastic change of 933150 can be avoided via the proposed 2nd manual source for this rule. This file would then take on those function currently in the 933150 source file, add the stuff that no longer matches 933160 but is still considered dangerous. The frequency can not be the only measurement. We could add a 3rd separate rule family for the risky functions, but it's getting tedious, better is to combine this with 933150 (-> frequently used function names plus highly risky function names; both groups not being English words). I have this in mind:
|
Latest commit:
Next steps / elements to be considered:
|
Responses to your questions
I do not understand. Please elaborate.
I confirm, every function name removed from
Got you. I see how this would help. If we do not get this improvement for this PR, can we still make it work with the existing/new manual lists? If yes, please go ahead and we'll keep it in mind for future consolidation.
Please do
Happy to help as soon as the conceptual discussions are over and rules fully implemented. 933150 (wordlist)
Test situationAre all the previous tests passing? Thanks god we have a lot of tests, namely for 933160 (44) and also quite a few for 933150 (16). Apparently a lot of the tests have to move to a different rule like the keywords. Maybe we need a map of old test-ID and new test-ID. Having this here in the conversation would be good enough, I guess. |
Sorry, I messed up the filenames. I meant ✅ enforced alphabetic order |
I think there is going to be a consolidation project after CRS v4 comes out where we try to integrate all these automated word lists into the CRS toolchain. All work into developing an advanced scheme to minimize work in the future will have to be re-evaluated and streamlined. Likely being thrown out in favor of a new approach. Personally, I would rather leave it dumb and describe the individual files, their role and their location. But you do the work, so you get to decide. |
I agree that we don't need it. However, I'd like for the lists and assembly files to include the script parameters in a header comment, so that when they are regenerated, it is clear which parameters were used.
I think you should just try to run the toolchain and then report success or failure. I don't fully understand what you are doing to the rules. From the comments in the rules file I gather that:
From that I gather that the symbols in all rules should be mutually exclusive (in the script you include symbols from 933160 in 933161). If this is not the case, then we need to update the rule comment (which needs an update anyway, IMO). I tried to follow your discussion but it already took me quite some time to just read what you wrote. I trust that you've thought this through :) |
BTW: I appreciate the enormous amount of work you put into this! 🙇 |
Here is a full recap (it might still change):
Edit: a bad looking, but updated mermaid is the following: flowchart TD
idlist((Manual list \n of high-risk \n English words))
id933160[Rule 933160 PL1]
idlist---->id2
id1{Is word an English word? \n WD + Extended}
id1--yes-->id2
id0((PHP Function \n names from \n PHP source code))
id0---->id1
id2{Is it on the manual high-risk list?}
idlistfunc((Manual list \n of high-risk \n PHP Funcs))
id3{Is it a frequently used function name?}
id933161[Rule 933161 PL3]
id933150[Rule 933150 PL1]
id933151[Rule 933151 PL2]
id1--no-->id3
id2--yes-->id933160
id2--yes+no-->id933161
idlistfunc---->id933150
id3--yes-->id933150
id3--no-->id933151
|
Thanks for your concerns @theseion. You are quite right, this is very complicated and the approach changed during the discussions. The problem we discovered was that the existing rules were not doing what they were describing to do. So this is not only about automation, but also overhauling existing rules, hence the move of high-risk function names from 933160 to 933150 (double checked, nothing was lost). The new distinction with regex 93316x for English words and 93315x for non-English words seems useful to me. 933150 and 933151 are covering all existing PHP function names and they are distinguished by the manually assessed risk as well as the new idea introduced by @jptosso to use frequency of use as a guidance. What helps us is the good coverage of tests for 933150 and 933160. So any grave omission should be caught. @M4tteoP : Thank you for your good writeup of the rule's functionality above. This is a good base for the new description of the rules. I can write the full description on this base if you want. Do you have a solution for the problem with PHP function |
006ee2e
to
eec943d
Compare
We still had two problems:
To address both of them at once, I went for implementing the extended spell.sh, I hope that the implementation is kind o what you had in mind (cc @theseion). Adding to this list partial english words and common words like I squashed commits and rebased to make it easier to review and to address a conflict with the just merged #2302 (affecting
Dealing with tests, I found several duplicated tests for
I think it is mostly ready, trying to deal with the other failing tests now |
8985b32
to
31514ed
Compare
Conversion list of tests: There is one left failing test: If we are in a hurry to merge this PR, I would not consider it a blocker. Edit: test commented out. |
31514ed
to
ce68ffd
Compare
Wow, you've really put in a lot of effort here! Great job! I'm just a bit worried about the regex causing a bunch of false positives. It's kind of a bummer since it even catches non-existent PHP function names. Like, for example, $ curl -s -A 'test' 'http://sandbox.coreruleset.org/?a=preventivo123' \
-H 'x-backend: nginx' \
-H 'x-format-output: txt-matched-rules' \
-H 'x-crs-paranoia-level: 1' \
-H 'x-crs-version: pullrequest-3273'
933150 PL1 PHP Injection Attack: High-Risk PHP Function Name Found
949110 PL1 Inbound Anomaly Score Exceeded (Total Score: 5) Allowing these sorts of false positives to linger at PL1 would turn into an absolute nightmare, and you know how I always mention that we need to be more stringent with injection context syntax. Remember, there's no way we can safeguard against something like I believe we could include a solution to tackle the most notable false positives by appending word boundaries IMO, this would be just a workaround. I think we should match an injection syntax before matching PHP function, at least at lower PLs. |
I do not think we are accepting FPs here. We are doing all we can within this architecture to avoid them. The basic strategy is: PHP functions names that do not exist in the English language are checked via PM and PHP function names that do exist in the English language either as word or as component of an English word are checked via regex. What is the advantage of regex over pm when we make sure that the pm only gets those keywords that do not appear in the English language (and as we found out thanks to you: as snippets within words in the English language). I think this really addresses the core of the FP problem with the PM approach. pm has the advantage that it's easier to read, easier to write and allegedly faster as well. If we stuff everything into a regex (like with the unix commands), we end up with a huge rule. Doable, but I would prefer to avoid it. Finally, this is one of the roadblocks for CRS v4. Development is dragging along and we can't seem to finish things. I'd rather merge an optimized version of this than change the architecture before the goal and start over. If we optimize it and it still does not work, then I will agree with you. |
For the record, I went through the php-function-names-933150.data in the PR and asked chatgpt for English words containing certain snippets where I was not sure. I added |
I can't agree @dune73, For example: experience curl -v 'http://sandbox.coreruleset.org/?a=explosion' \
-H 'x-backend: nginx' \
-H 'x-format-output: txt-matched-rules' \
-H 'x-crs-paranoia-level: 1' \
-H 'x-crs-version: pullrequest-3273'
933150 PL1 PHP Injection Attack: High-Risk PHP Function Name Found
949110 PL1 Inbound Anomaly Score Exceeded (Total Score: 5) I can find numerous examples similar to this one: Here we're blocking $ curl -s 'http://sandbox.coreruleset.org/?a=preview' -H 'x-backend: nginx' -H 'x-format-output: txt-matched-rules' -H 'x-crs-paranoia-level: 1' -H 'x-crs-version: pullrequest-3273'
933150 PL1 PHP Injection Attack: High-Risk PHP Function Name Found
949110 PL1 Inbound Anomaly Score Exceeded (Total Score: 5) Furthermore, we're unintentionally blocking non-PHP functions, which is not our intended outcome. |
Instead of continuing the conversation ad nauseam, @theMiddleBlue and I hopped on a call. We agreed on the course of action laid down in #3273 (comment). This should address known or expected FPs. If more FPs pop up with 933150, we plan to shift those keywords over to 933160 via the manual extended list. This includes stuff before we merge this PR and stuff that pops up during RC or after the release. Redesigning the four rule of this PR, maybe consolidating into 1-2 regex rules could be an option. But @theMiddleBlue and I agreed this would throw us back at least another month. And we would be trusting our pattern identifying a php function call via the use of parenthesis etc. And we never know if there is a bypass in this regard. The pm operator in 933150 has its merits even if it is more prone to FP. @M4tteoP promised to deliver an updated PR during the weekend. @theMiddleBlue and I will review immediately afterwards, namely @theMiddleBlue will try out additional payloads to see if he can provoke more FPs. If things are back to green, I hope to be able to merge before the Monday meeting. |
fbb03e3
to
81e79d4
Compare
Done, the following terms have been moved out from
Considering that PTAL 🙏 |
38ee793
to
d5486b2
Compare
For the record: There is a probability > 0 that random payloads (-> base64!) will hit 933150. The longer the keywords the less likely this is. Yet we have several 5 letter keywords in the list ( |
… updates php-high-risk-functions.txt
6fb0fdf
to
26208a3
Compare
Continues #3228.
Opening this draft PR to show the current status of the work and get some feedback based on the diffs on
933161.ra
,php-function-names-933150.data
, andphp-function-names-933151.data
files.The Script
frequencylist.txt
, still committed it if anyone wants to play with it (Run command example./php-dictionary-creator.sh --frequencylist frequencylist.txt --verbose
).933161.ra
requires running the toolchain. The script currently only prints a reminder about it. If we feel that it is in the scope of this script, this call could be also automated.Rule 933160
933160
. 🚧 It requires some manual work to update it.933160
the PHP functions interpreted as English words (933161
) into933160
but it does not seem right to me considering that933161
is the stricter sibling of the two. We should maybe do it the other way around (adding to933161
also the manual list we have in933160
)Rule 933161
spell.sh extended
933160
list has also to be included in933161
Rules 933150 and 933151
933150
would become way bigger with lots of933151
words moved into933150
. The decision about putting a word under one or the other is based on theFREQUENCY_LIMIT
, set by default at100
. 🚧 It might have to be fine-tuned.For example, setting it at
150
moves out of933150
:180
further moves out: