feat(933151): update PHP fn to latest version #3228
Conversation
Thanks @jptosso for this one! It was a long-standing list that needed updating. The approach you took is also reproducible, which makes this a real gem. Let's get more people involved. @theMiddleBlue @airween @dune73 @theseion What do you think about the list splitting? The idea could also be ported to cover the first ~40 functions used in the top PHP web shells with code available.
The test is failing because we somehow classify functions like
Great stuff @jptosso!
I don't understand the question. What would we have after splitting? Excluding
This is very cool. Thank you @jptosso. I'm also not understanding the splitting question. But I think there is some discussion due about how this greatly improved rule blends in with the existing PHP function name rules, namely the following matrix:
I think you are trying to stay in the right quadrant, but I am not sure the automatic dictionary cuts it. We probably have to take snippets of words into consideration. Probably we need an option to define manual allow- and deny-lists. But that aside: great work. Thank you very much.
Co-authored-by: Max Leske <th3s3ion@gmail.com>
Signed-off-by: Felipe Zipitria <felipe.zipitria@owasp.org>
Signed-off-by: Felipe Zipitria <felipe.zipitria@owasp.org>
Signed-off-by: Felipe Zipitria <felipe.zipitria@owasp.org>
Not a requirement, but I think it would be useful to have a Dockerfile + script to build and run. Not all of us use a *nix environment, and you wouldn't have to explain that `apt install dict` is required.
That chained rule could even be improved to not just find any opening parenthesis but match something like
Another option could be to take out the partial English words like
I like all the proposed solutions. As for the proposed docker to create rules: isn't this something that goes beyond this rule alone? I'd rather not take such a fundamental decision for an individual rule, or we end up with very heterogeneous build instructions.
Not necessarily. You can treat a Docker container as part of a script: you let something run inside the container, grab the output and throw the container away. It's like running a script on another host that has capabilities you don't have on your own machine. I wasn't proposing a complete containerized infrastructure for scripting lists.
Sure, I see how docker would be isolated, but I still think it's overkill if all you want to do is avoid having to explain it.
My concern is not so much docker vs. native. The real question is how to avoid doing manual work unless strictly necessary. If things stay manual, the PL1 lists will take another 10 years to update, and we want to avoid that situation.
I think you are generally right, @fzipi. That should be the main concern. Yet I think a unix box can be assumed. It's not the
I'll be updating the Python script, no worries. Let's swarm over the function-splitting problem.
Co-authored-by: Max Leske <th3s3ion@gmail.com>
filter_dict.py should probably check for the existence of the dict binary instead of hiding all errors.
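Not the actual fix, but a minimal Python sketch of what such a check might look like, assuming filter_dict.py shells out to the `dict` client; the function name and error message here are illustrative:

```python
import shutil
import sys

def require_dict_binary() -> None:
    """Abort early with a clear message if the `dict` client is not on PATH."""
    if shutil.which("dict") is None:
        sys.exit(
            "error: the 'dict' binary was not found on PATH. "
            "Install it first, e.g. 'apt install dict' on Debian/Ubuntu."
        )

if __name__ == "__main__":
    require_dict_binary()
    # ... the rest of the filtering logic would run here ...
```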
Getting 401s when running bot.py despite having a working TOKEN. Any clue?
Will take a look and reply.
OK, I think I got this. @jptosso proposes 3 scripts to automate 933151. I think we can use the scripts to create all the rules in the 933150/51/60/61 rule group. New scripts in use:
The existing rules take risk and existence in the English language into consideration, all manually. This new approach automates the English-language check and brings frequency of use on GitHub to the table. This is very useful. I think we cannot let go of the risk aspect completely, but we can limit it to the English-language terms that are also PHP function names. That list is relatively small, and new high-risk additions are rare or nonexistent. With this approach, I think we can catch all 4 PHP function name cases. Here is a mermaid diagram of what I have in mind:

```mermaid
flowchart TD
    id0((PHP Function \n names from \n PHP source code))
    idlist((Manual list \n of high-risk \n English words))
    id1{Is word an English word?}
    id2{Is it on the manual high-risk list?}
    id3{Is it a frequently used function name?}
    id933160[Rule 933160]
    id933161[Rule 933161]
    id933150[Rule 933150]
    id933151[Rule 933151]
    id0---->id1
    id1--yes-->id2
    id1--no-->id3
    id2--yes-->id933160
    id2--yes+no-->id933161
    id3--yes-->id933150
    id3--no-->id933151
    idlist---->id2
```
Simplifying the setup a bit. New mermaid:

```mermaid
flowchart TD
    idlist((Manual list \n of high-risk \n English words))
    id933160[Rule 933160\nregex]
    idlist------>id933160
    id0((PHP Function \n names from \n PHP source code))
    id1{Is word an English word?}
    id2{Is it a frequently\nused function\nname?}
    id933161[Rule 933161\nregex]
    id933150[Rule 933150\npmFromFile]
    id933151[Rule 933151\npmFromFile]
    id0---->id1
    id1--yes--->id933161
    id1--no-->id2
    id2--yes-->id933150
    id2--no-->id933151
```
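To make the simplified recipe concrete, here is a hedged Python sketch of how the splitting could be automated. The inputs `english_words` and `frequent_names` are placeholders for whatever the WordNet check and the GitHub frequency counts actually produce; this is not the real script:

```python
from typing import Iterable

def split_php_functions(
    php_functions: Iterable[str],
    english_words: set[str],    # e.g. names that pass a WordNet/aspell lookup
    frequent_names: set[str],   # e.g. names seen often in GitHub code search
) -> dict[str, list[str]]:
    """Split PHP function names into the buckets used by rules 933161/933150/933151.

    Rule 933160 is not derived from this list at all: it stays a manually
    maintained regex of high-risk English words.
    """
    buckets: dict[str, list[str]] = {"933161": [], "933150": [], "933151": []}
    for name in php_functions:
        if name in english_words:
            # English words need the stricter regex rule to limit false positives.
            buckets["933161"].append(name)
        elif name in frequent_names:
            buckets["933150"].append(name)   # frequently used -> pmFromFile list
        else:
            buckets["933151"].append(name)   # the long tail -> pmFromFile list
    return buckets
```

The 933150/933151 buckets would then be written out as pmFromFile data files, while the 933161 names feed a generated regex; 933160 remains the manually curated high-risk regex.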
@fzipi and @M4tteoP: I still can't get the token thing to work properly. The script fails for me. Does this work for you? If I can get it working, I'll be happy to come up with a cleaner base script, and Matteo can then polish it into a new PR. I'm also looking at 933160 by hand. The script should be able to output the 933161, 933150 and 933151 sources.
Let me try it...
Signed-off-by: Felipe Zipitria <felipe.zipitria@owasp.org>
Signed-off-by: Felipe Zipitria <felipe.zipitria@owasp.org>
@dune73 Ok, cleaned up a bit.
It works for me with the token now; hopefully it is cleaner.
PR for
OK. I got it. There was a problem with my API token; solved now. Thank you for the cleanup. The script is running as we speak, but I think the filenames are botched. gen.sh cds into the PHP source and then wants to run the filter and bot scripts as local scripts. I created a symlink, so it works, but is this only me?
For the record: the script checking for occurrences of PHP functions across GitHub takes 16+ hours here. I guess that's the cost of doing business. Correct?
Yes. This is because of the GitHub rate limit. We can only make about 1 query per second, I think; there is a sleep in the code for that reason. So the total time in minutes is roughly the number of words to check divided by 60.
Got you. Thanks. Makes sense. I'm currently reimplementing bot.py. I'm aiming to save results into a local file with a timestamp. That way the script can run daily, and the update via GitHub calls only happens after the cache times out.
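A rough Python sketch of what such timestamp-based caching could look like; the cache file name, the one-day threshold and the `fetch` callback are illustrative placeholders, not bot.py's actual interface:

```python
import json
import time
from pathlib import Path

CACHE_FILE = Path("gh_counts.json")   # assumed cache location
MAX_AGE = 24 * 3600                   # refresh entries older than one day

def load_cache() -> dict:
    return json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}

def save_cache(cache: dict) -> None:
    CACHE_FILE.write_text(json.dumps(cache, indent=2))

def cached_count(name: str, fetch, cache: dict) -> int:
    """Return the GitHub occurrence count for `name`, querying only when the
    cached value is missing or older than MAX_AGE."""
    entry = cache.get(name)
    if entry and time.time() - entry["ts"] < MAX_AGE:
        return entry["count"]
    count = fetch(name)               # the actual GitHub query lives in bot.py
    cache[name] = {"count": count, "ts": time.time()}
    time.sleep(1)                     # stay under the ~1 request/second limit
    return count
```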
Existing 933160 is described as follows:
But this is how it actually looks:
Maybe this is only me, but I do not see how 10000
Here is the existing 933161:
I'm now a bit stuck, but I think we should move to the new recipe outlined above. Running the new spell.sh (the WordNet-based spell.sh) against the PHP functions is meant to give us 933161. Here is that list:
This is substantially smaller than the existing 933160, but also substantially smaller than the existing 933161. I like how the automatic list does not come with
What do we do?
I agree. We might have to start maintaining a list of words we want to treat as common. These words will be hard to detect automatically since they are domain-specific, but they would lead to false positives. Examples: basename, unset, syslog, symlink. We could then check against that list as well in the
Ah, what you propose is like making the spell.sh check two-fold: first against the English language via WordNet (or aspell, ispell, whatever), and second against a custom list of widespread terms that are not official English but are used casually and are also keywords in other domains (so not only PHP, but bash, JS, whatever). I would suggest incorporating this into spell.sh and exposing it by default, with accompanying command-line options to choose only the first or only the second check as described above.
Yes, precisely.
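spell.sh itself is a shell script, but for consistency with the other sketches here is a hedged Python illustration of the proposed two-fold check; the file names, option names and defaults are made up for the example:

```python
import argparse

def load_words(path: str) -> set[str]:
    with open(path, encoding="utf-8") as fh:
        return {line.strip().lower() for line in fh if line.strip()}

def classify(names, dictionary_words, common_terms, mode="both"):
    """Yield function names considered word-like under the selected check(s).

    mode: 'dict' = English dictionary only, 'common' = custom common-term
    list only, 'both' = either check matches.
    """
    for name in names:
        in_dict = name.lower() in dictionary_words
        in_common = name.lower() in common_terms
        if (mode == "dict" and in_dict) or (mode == "common" and in_common) \
                or (mode == "both" and (in_dict or in_common)):
            yield name

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Two-stage word check (sketch)")
    parser.add_argument("functions", help="file with one PHP function name per line")
    parser.add_argument("--dictionary", default="english-words.txt")
    parser.add_argument("--common-list", default="common-terms.txt")
    parser.add_argument("--mode", choices=["dict", "common", "both"], default="both")
    args = parser.parse_args()
    for match in classify(
        load_words(args.functions),
        load_words(args.dictionary),
        load_words(args.common_list),
        args.mode,
    ):
        print(match)
```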
This PR adds a new list of PHP functions based on @fzipi's command

```sh
grep -o --no-file -R 'ZEND_FUNCTION(.*)' | cut -f2 -d\( | cut -f1 -d\) | sort > input.txt
```

to extract a list of functions from the PHP source code. This PR includes instructions and scripts to build this list using the following criteria:
Apparently, the previous criterion was to ignore functions that are not unsafe (i.e. to list only the unsafe ones), whereas this list includes them all.
Fixes #2685.