10000 feat: PHP functions generator by M4tteoP · Pull Request #3273 · coreruleset/coreruleset · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

feat: PHP functions generator #3273

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Sep 4, 2023
Merged

Conversation

M4tteoP
Copy link
Member
@M4tteoP M4tteoP commented Jul 28, 2023

Continues #3228.

Opening this draft PR to show the current status of the work and get some feedback based on the diffs on 933161.ra, php-function-names-933150.data, and php-function-names-933151.data files.

The Script

  • The script takes about 5 hours to retrieve all the frequencies. I don't think we want to merge the frequencylist.txt, still committed it if anyone wants to play with it (Run command example ./php-dictionary-creator.sh --frequencylist frequencylist.txt --verbose).
  • Updating 933161.ra requires running the toolchain. The script currently only prints a reminder about it. If we feel that it is in the scope of this script, this call could be also automated.

Rule 933160

  • The script currently does basically nothing about 933160. 🚧 It requires some manual work to update it.
  • I was taking into account adding to 933160 the PHP functions interpreted as English words (933161) into 933160 but it does not seem right to me considering that 933161 is the stricter sibling of the two. We should maybe do it the other way around (adding to 933161 also the manual list we have in 933160)

Rule 933161

  • As already reported by Christian, currently this list is way smaller than the current one. But it will be increased by:
    • 🚧 Running the script with the spell.sh extended
    • If 933160 list has also to be included in 933161

Rules 933150 and 933151

  • 933150 would become way bigger with lots of 933151words moved into 933150. The decision about putting a word under one or the other is based on the FREQUENCY_LIMIT, set by default at 100. 🚧 It might have to be fine-tuned.
    For example, setting it at 150 moves out of 933150:
-collator_compare
-curl_pause
-curl_share_init
-curl_share_setopt
-curl_share_strerror
-curl_unescape
-gmp_binomial
-imageaffine
-imageavif
-imagegetclip
-inflate_get_read_len
-ldap_exop_refresh
-ldap_parse_exop
-locale_get_display_language
-locale_get_display_name
-locale_get_display_region
-mysqli_get_client_stats
-mysqli_refresh
-mysqli_stmt_more_results
-mysqli_stmt_next_result
-net_get_interfaces
-normalizer_get_raw_decomposition
-openssl_cms_sign
-openssl_get_curve_names
-openssl_pkcs7_read
-openssl_pkey_derive
-openssl_x509_verify
-pcntl_unshare
-pg_lo_truncate
-phpdbg_get_executable
-pspell_config_data_dir
-pspell_config_dict_dir
-pspell_config_repl
-pspell_new_config
-realpath_cache_get
-sapi_windows_cp_conv
-sapi_windows_cp_get
-sapi_windows_set_ctrl_handler
-socket_addrinfo_bind
-socket_addrinfo_explain
-sodium_crypto_aead_aes256gcm_keygen
-timezone_location_get

180 further moves out:

-curl_share_close
-curl_share_errno
-datefmt_set_timezone
-dns_check_record
-enchant_broker_dict_exists
-enchant_broker_free
-enchant_broker_free_dict
-enchant_broker_get_error
-enchant_broker_set_dict_path
-enchant_broker_set_ordering
-enchant_dict_add_to_session
-enchant_dict_check
-enchant_dict_get_error
-enchant_dict_store_replacement
-fdatasync
-get_mangled_object_vars
-gmp_popcount
-gmp_rootrem
-hash_hmac_algos
-imageaffinematrixconcat
-imageaffinematrixget
-imagecolorclosesthwb
-imagecropauto
-imageopenpolygon
-imagesetinterpolation
-imap_fetchmime
-imap_gc
-imap_getacl
-imap_getsubscribed
-imap_lsub
-imap_rfc822_write_address
-imap_set_quota
-imap_setacl
-imap_undelete
-inflate_add
-inflate_get_status
-inflate_init
-json_validate
-ldap_exop
-ldap_parse_reference
-locale_get_default
-locale_get_primary_language
-locale_get_region
-msg_queue_exists
-msg_set_queue
-msgfmt_format
-mysqli_get_connection_stats
-mysqli_get_links_stats
-mysqli_get_warnings
-mysqli_savepoint
-mysqli_stmt_attr_set
-mysqli_stmt_error_list
-mysqli_savepoint
-mysqli_stmt_attr_set
-mysqli_stmt_error_list
-normalizer_is_normalized
-openssl_get_cert_locations
-openssl_pbkdf2
-password_algos
-pcntl_setpriority
-pg_escape_identifier
-phpdbg_start_oplog
-posix_access
-posix_mknod
-posix_setrlimit
-pspell_add_to_personal
-pspell_config_create
-pspell_config_personal
-pspell_save_wordlist
-readline_callback_handler_install
-readline_callback_handler_remove
-readline_callback_read_char
-snmp2_getnext
-snmp2_walk
-snmp3_get
-snmp3_real_walk
-snmp_read_mib
-snmp_set_oid_output_format
-snmpgetnext
-socket_addrinfo_connect
-sodium_base642bin
-sodium_bin2base64
-sodium_crypto_aead_aes256gcm_decrypt
-sodium_crypto_aead_aes256gcm_encrypt
-sodium_crypto_aead_chacha20poly1305_encrypt
-sodium_crypto_aead_chacha20poly1305_ietf_decrypt
-sodium_crypto_aead_chacha20poly1305_ietf_encrypt
-sodium_crypto_aead_xchacha20poly1305_ietf_decrypt
-sodium_crypto_aead_xchacha20poly1305_ietf_encrypt
-sodium_crypto_box_keypair_from_secretkey_and_publickey
-sodium_crypto_box_open
-sodium_crypto_box_seal_open
-sodium_crypto_box_secretkey
-sodium_crypto_pwhash_str_verify
-sodium_crypto_sign_publickey_from_secretkey
-sodium_crypto_sign_seed_keypair
-sodium_crypto_stream
-stream_supports_lock
-timezone_version_get
-variant_cat
-variant_int
-variant_mul
-variant_sub
-zend_leak_variable
-zend_test_func

@M4tteoP M4tteoP force-pushed the php_dict_creator branch from 097440b to 1938a9e Compare July 28, 2023 15:12
@dune73
Copy link
Member
dune73 commented Aug 2, 2023

Thank you very much. Very good progress here.

A few remarks.

frequency stats file

I would prefer not to commit frequencylist.txt, but 5h is a very long time. What options do you see?

933160

The second item in your bullet list is not clear to me. I think that anything marked as English word in 933161 should be added manually to 933160, but this is probably what you had in mind.

933161

Agreed

933150 and 933151

The limit of 100 is completely arbitrary. Adjusting it seems feasible to me. I'd aim for a similar size as we have now for 933150.

Am I getting you right, that your description brings the PHP function list that would be removed from 933150 if we raise the frequency limit from 100 to 150 and then further down to 180?

@dune73
Copy link
Member
dune73 commented Aug 7, 2023

Do you have an update here for the meeting tonight? We should really get this done now.

@M4tteoP
Copy link
Member Author
M4tteoP commented Aug 7, 2023

Here is an update and a sort of recap raising key questions for each rule:

frequency stats file

I would prefer not to commit frequencylist.txt

I agree, I added it to the draft PR only if anyone wanted to play with it. I will remove it before marking the PR as ready for review unless further decisions

5h is a very long time. What options do you see?

The /search/code endpoint has stricter limits compared to other endpoints: This endpoint requires you to authenticate and limits you to 10 requests per minute..

  1. How frequently do we expect the script to run?
  2. I might try to circle over multiple tokens 😅

933160

  • 🚧 It is a manual list, I honestly don't know how to manually update it. As stated here, most of the words seem not to be natural English words, but removing them all seems too drastic.

I think that anything marked as English word in 933161 should be added manually to 933160

Are we sure about this? 933161 is the stricter sibling (PL3) of 933160 I was expecting 933161 to contain 933160 words, not vice-versa.

933161

  • 🚧 Run with the extended spell.sh. Is it available? I might have missed some updates on that point.
  • 🚧 Agree on the above-mentioned point ( 933160 into 933161 or vice-versa)

933150 and 933151

  • Default frequency limit moved to 90000. With this value, we have the following number of entries:
Rule Before After
933150 44 56
933151 1304 2213
  • ⚠️ What is worth to be noted is how 933150 entries drastically changed. With a total of 42 Removals and 55 Additions, only 2 entries are still in this file (call_user_func and file_get_contents). See diff.
    Are we sure that the High Risk can just be correlated to the high frequency, or we have also to consider some manual entries for 933150?

@dune73
Copy link
Member
dune73 commented Aug 8, 2023

frequency stats file

As of this writing, I think somebody should run the script every couple of month. The latency is acceptable. Setting up multiple tokens could be seen as misuse. I would rather refrain from that approach.

933160

Correcting my remark: Anything in 933161 and on the high-risk list should be added. But that's a given anyways. Forget my remark.

I see an overlap between current 933150 and 933160. 933160 contains non-English words, stuff that would better suit 933150, even if it does not pass the frequency check.

So how about adding a 2nd source to 933150: A manual list of high-risk php functions not being an English word and not necessarily being used frequently? This would clean up 933160 and the entire rule group I think.

933161

Yes, the new spell.sh has been merged in June. #3242

933160 items should be added to 933161as well. See mermaid.

933150 and 933151

The drastic change of 933150 can be avoided via the proposed 2nd manual source for this rule. This file would then take on those function currently in the 933150 source file, add the stuff that no longer matches 933160 but is still considered dangerous.

The frequency can not be the only measurement. We could add a 3rd separate rule family for the risky functions, but it's getting tedious, better is to combine this with 933150 (-> frequently used function names plus highly risky function names; both groups not being English words).

I have this in mind:

  • 933150 wordlist: High-risk + frequently used function names not being English words
  • 933160 regex: English words as function names

@M4tteoP
Copy link
Member Author
M4tteoP commented Aug 9, 2023

Latest commit:

  1. Adds php-high-risk-functions.txt. A manual list of high-risk functions becomes the second source to 933150.
  2. php-high-risk-functions.txt is made of lots of 933160 entries, that are now moved from 933160 to 933150. They are non-English words that we still want to detect at PL1.
  3. Removes leftovers about 933150 in the script. It is a manual list.
  4. Adds code to iterate over 933160 entries and add them to the stricter sibling 933161 if not already present.
  5. Adds code to iterate over php-high-risk-functions.txt and adds them to 933150.

Next steps / elements to be considered:

  • We might want to create another file alongside php-high-risk-functions.txt and let the script generate 933150.ra 933160.ra from it. The two files with manual lists would be closer, and these two .ra files would become fully generated.
  • Double check 933160.ra and php-high-risk-functions.txt entries
  • The spell checker based on WordNet is already used by this PR. With extended spell.sh I meant the spell.sh script with also the hardcoded list of terms added to the output of spell.sh suggested by @theseion. The examples were basename, unset, syslog, symlink. Now, not being matched by spell.sh:
    • basename: moved from 933161 to 933150 (high frequency) (PL3 -> PL1).
    • unset: removed. It is not found anymore as a PHP function 🧐.
    • syslog: moved from 933161 to 933151 (PL3 -> PL2).
    • symlink: moved from 933161 to 933151 (PL3 -> PL2).
  • remove / gitignore frequencylist.txt
  • Update rules description - clearly state that some files are generated automatically

@dune73
Copy link
Member
dune73 commented Aug 10, 2023

Responses to your questions

We might want to create another file alongside ...

I do not understand. Please elaborate.

Double check 933160.ra and php-high-risk-functions.txt entries

I confirm, every function name removed from 933160.ra is now part of php-high-risk-functions.txt.

The spell checker based on WordNet ...

Got you. I see how this would help. If we do not get this improvement for this PR, can we still make it work with the existing/new manual lists? If yes, please go ahead and we'll keep it in mind for future consolidation.

remove / gitignore frequencylist.txt

Please do

Update rules description

Happy to help as soon as the conceptual discussions are over and rules fully implemented.

933150 (wordlist)

  • There is a new problem: PHP function names that are not English words, but they are part of English words. Example: dir (part of English word directory). It's currently part of php-function-names-933150.data. We can not have that at PL1. Do you have an idea how to sort this out?

Test situation

Are all the previous tests passing? Thanks god we have a lot of tests, namely for 933160 (44) and also quite a few for 933150 (16). Apparently a lot of the tests have to move to a different rule like the keywords. Maybe we need a map of old test-ID and new test-ID. Having this here in the conversation would be good enough, I guess.

@M4tteoP
Copy link
Member Author
M4tteoP commented Aug 10, 2023

I do not understand. Please elaborate.

Sorry, I messed up the filenames. I meant 933160.ra not 933150.ra.
933160.ra is a manual list that we have to maintain, just like the newly created php-high-risk-functions.txt. I was evaluating the idea of creating a file called high-risk-english-words.txt keeping it inside the same folder of php-high-risk-functions.txt and letting the script generating 933160.ra by just adding the prefix and the whole content of high-risk-english-words.txt. It would permit us to 8000 keep both lists that have to be maintained inside the php-dictionary-gen folder.

✅ enforced alphabetic order
✅ removed frequencylist.txt

@dune73
Copy link
Member
dune73 commented Aug 10, 2023

I think there is going to be a consolidation project after CRS v4 comes out where we try to integrate all these automated word lists into the CRS toolchain. All work into developing an advanced scheme to minimize work in the future will have to be re-evaluated and streamlined. Likely being thrown out in favor of a new approach.

Personally, I would rather leave it dumb and describe the individual files, their role and their location.

But you do the work, so you get to decide.

@theseion
Copy link
Contributor

The script takes about 5 hours to retrieve all the frequencies. I don't think we want to merge the frequencylist.txt, still committed it if anyone wants to play with it (Run command example ./php-dictionary-creator.sh --frequencylist frequencylist.txt --verbose).

I agree that we don't need it. However, I'd like for the lists and assembly files to include the script parameters in a header comment, so that when they are regenerated, it is clear which parameters were used.

Updating 933161.ra requires running the toolchain. The script currently only prints a reminder about it. If we feel that it is in the scope of this script, this call could be also automated.

I think you should just try to run the toolchain and then report success or failure.

I don't fully understand what you are doing to the rules. From the comments in the rules file I gather that:

  • 933150 includes symbols often used in attacks
  • 933160 includes frequently used symbols
  • 933151 includes low frequency symbols
  • 933161 should include symbols excluded from the other 3 rules due to FP risk

From that I gather that the symbols in all rules should be mutually exclusive (in the script you include symbols from 933160 in 933161). If this is not the case, then we need to update the rule comment (which needs an update anyway, IMO).

I tried to follow your discussion but it already took me quite some time to just read what you wrote. I trust that you've thought this through :)

@theseion
Copy link
Contributor

BTW: I appreciate the enormous amount of work you put into this! 🙇

@M4tteoP
Copy link
Member Author
M4tteoP commented Aug 10, 2023

I don't fully understand what you are doing to the rules.

Here is a full recap (it might still change):

  • 933160 (PL1, regex): List of high-risk English (and english-like) words
    • source: manual list of entries added to 933160.ra
    • note: Most of the 933160 entries are now moved to 933150 (via php-high-risk-functions.txt) being words now compatible with an English context.
  • 933161 (PL3, regex): List of PHP function names detected as English words by spell.sh
    • source: PHP source code, automated by the script (php-dictionary-creator.sh)
    • source2: 933161 also contains 933160 entries
  • 933150 (PL1, parallel match): List of PHP functions NOT detected as English words with HIGH frequency + PHP high-risk functions
    • source: PHP source code, automated by the script
    • source2: manual list of high risk functions (php-high-risk-functions.txt) all extracted from the old 933160 entries.
  • 933151 (PL2, parallel match): List of PHP functions NOT detected as English words with LOW frequency
    • source: PHP source code, automated by the script
    • note: mutually exclusive with 933150

Edit: a bad looking, but updated mermaid is the following:

flowchart TD
    idlist((Manual list \n of high-risk \n English words))
    id933160[Rule 933160 PL1]
    idlist---->id2
    id1{Is word an English word? \n WD + Extended}
    id1--yes-->id2
    id0((PHP Function \n names from \n PHP source code))
    id0---->id1
    id2{Is it on the manual high-risk list?}
    idlistfunc((Manual list \n of high-risk \n PHP Funcs))
    id3{Is it a frequently used function name?}
    id933161[Rule 933161 PL3]
    id933150[Rule 933150 PL1]
    id933151[Rule 933151 PL2]
    id1--no-->id3
    id2--yes-->id933160
    id2--yes+no-->id933161
    idlistfunc---->id933150
    id3--yes-->id933150
    id3--no-->id933151
Loading

@dune73
Copy link
Member
dune73 commented Aug 11, 2023

Thanks for your concerns @theseion. You are quite right, this is very complicated and the approach changed during the discussions. The problem we discovered was that the existing rules were not doing what they were describing to do. So this is not only about automation, but also overhauling existing rules, hence the move of high-risk function names from 933160 to 933150 (double checked, nothing was lost). The new distinction with regex 93316x for English words and 93315x for non-English words seems useful to me. 933150 and 933151 are covering all existing PHP function names and they are distinguished by the manually assessed risk as well as the new idea introduced by @jptosso to use frequency of use as a guidance.

What helps us is the good coverage of tests for 933150 and 933160. So any grave omission should be caught.

@M4tteoP : Thank you for your good writeup of the rule's functionality above. This is a good base for the new description of the rules. I can write the full description on this base if you want.
Also: Do you need anything else to finish the PR?

Do you have a solution for the problem with PHP function dir within English word directory mentioned above?

@M4tteoP M4tteoP changed the title [WIP] PHP Functions generator feat: PHP functions generator Aug 11, 2023
@M4tteoP
Copy link
Member Author
M4tteoP commented Aug 11, 2023

We still had two problems:

  1. Lack of extended spell.sh that was leading to entries moving from 933161 to 933150 or 933151 (that means also moving from PL3 to PL1/PL2
  2. problem with PHP function dir within English word directory. They were part of 933161 but now moved to 933150 (PL1):

To address both of them at once, I went for implementing the extended spell.sh, I hope that the implementation is kind o what you had in mind (cc @theseion). Adding to this list partial english words and common words like basename permits to not evaluate their frequency (typically very high) and end up in 933150, but they rather take the other path of the mermaid and they end up in 933161, were they used to be.

I squashed commits and rebased to make it easier to review and to address a conflict with the just merged #2302 (affecting 933161 regex). Please take a look at:

  • eec943d if you want to see/review the spell.sh extended implementation. Any suggestion about elements to be added is very welcomed.

Dealing with tests, I found several duplicated tests for 933150. Please take a look at:

I think it is mostly ready, trying to deal with the other failing tests now

@M4tteoP M4tteoP force-pushed the php_dict_creator branch 4 times, most recently from 8985b32 to 31514ed Compare August 11, 2023 22:27
@M4tteoP
Copy link
Member Author
M4tteoP commented Aug 11, 2023

Conversion list of tests:
933160-2 --> 933150-17
933160-25 --> 933150-18
933160-26 --> 933150-19
933160-27 --> 933150-20
933160-28 --> 933150-21
933160-29 --> 933150-22
933160-30 --> 933150-23
933160-31 --> 933150-24
933160-32 --> 933150-25
933160-33 --> 933150-26
933160-34 --> 933150-27
933160-35 --> 933150-28
933160-36 --> 933150-29
933160-37 --> 933150-30
933160-38 --> 933150-31
933160-39 --> 933150-32
933160-40 --> 933160-24
933160-41 --> 933160-25
933160-42 --> 933160-26
933160-43 --> 933160-27
933160-44 --> 933160-28

There is one left failing test: 933161-4. It is one that I fixed because the payload was a duplicate, but the description provided what I think the test was meant to match (dl/*foo*/()).
it seems that dl function in recent PHP versions is not allowed in web environments (ref1, ref2), but I found the occurrence in the PHP repo not inside ZEND_FUNCTION(), but PHP_FUNCTION().
I'm mainly wondering if we should look for also PHP_FUNCTION(*).
I'm running out of time for this week, but I will give it a go and see how many missing functions will be found.
If you have more knowledge about the PHP repo, please let me know.

If we are in a hurry to merge this PR, I would not consider it a blocker.

Edit: test commented out.

@M4tteoP M4tteoP marked this pull request as ready for review August 11, 2023 23:06
@theMiddleBlue
Copy link
Contributor

Wow, you've really put in a lot of effort here! Great job!

I'm just a bit worried about the regex causing a bunch of false positives. It's kind of a bummer since it even catches non-existent PHP function names.

Like, for example, preventivo123 (which means "estimate/quotation" in Italian). There's no injection stuff going on, just a simple [a-z0-9]+ string, but it still ends up matching 933150 at PL1 - PHP Injection Attack: High-Risk PHP Function Name Found.

$ curl -s -A 'test' 'http://sandbox.coreruleset.org/?a=preventivo123' \
   -H 'x-backend: nginx' \
   -H 'x-format-output: txt-matched-rules' \
   -H 'x-crs-paranoia-level: 1' \
   -H 'x-crs-version: pullrequest-3273'

933150 PL1 PHP Injection Attack: High-Risk PHP Function Name Found
949110 PL1 Inbound Anomaly Score Exceeded (Total Score: 5)

Allowing these sorts of false positives to linger at PL1 would turn into an absolute nightmare, and you know how I always mention that we need to be more stringent with injection context syntax. Remember, there's no way we can safeguard against something like eval($_GET[foo]), especially when we're talking about PL1.

I believe we could include a solution to tackle the most notable false positives by appending word boundaries \b after the PHP function name. What do you think about it?

IMO, this would be just a workaround. I think we should match an injection syntax before matching PHP function, at least at lower PLs.

@theMiddleBlue theMiddleBlue added the ⚠️ do not merge Additional work or discussion is needed despite passing tests label Aug 22, 2023
@theMiddleBlue theMiddleBlue self-assigned this Aug 22, 2023
@dune73
Copy link
Member
dune73 commented Sep 1, 2023

I do not think we are accepting FPs here. We are doing all we can within this architecture to avoid them. The basic strategy is: PHP functions names that do not exist in the English language are checked via PM and PHP function names that do exist in the English language either as word or as component of an English word are checked via regex.

What is the advantage of regex over pm when we make sure that the pm only gets those keywords that do not appear in the English language (and as we found out thanks to you: as snippets within words in the English language). I think this really addresses the core of the FP problem with the PM approach.

pm has the advantage that it's easier to read, easier to write and allegedly faster as well. If we stuff everything into a regex (like with the unix commands), we end up with a huge rule. Doable, but I would prefer to avoid it.

Finally, this is one of the roadblocks for CRS v4. Development is dragging along and we can't seem to finish things. I'd rather merge an optimized version of this than change the architecture before the goal and start over. If we optimize it and it still does not work, then I will agree with you.

@dune73
Copy link
Member
dune73 commented Sep 1, 2023

For the record, I went through the php-function-names-933150.data in the PR and asked chatgpt for English words containing certain snippets where I was not sure. I added unlink and substr to he list above because of unlinked and substring. They seem to be used outside of IT contexts too.

@theMiddleBlue
Copy link
Contributor

I can't agree @dune73, pm is not the right operator for doing what we're trying to do here.
pm operator, is inadvertently matching English words containing PHP function names.

For example: exp PHP function is included in a tons of English words like

experience
expertise
exploration
exponential
expenditure
etc...

curl -v 'http://sandbox.coreruleset.org/?a=explosion' \
   -H 'x-backend: nginx' \
   -H 'x-format-output: txt-matched-rules' \
   -H 'x-crs-paranoia-level: 1' \
   -H 'x-crs-version: pullrequest-3273'

933150 PL1 PHP Injection Attack: High-Risk PHP Function Name Found
949110 PL1 Inbound Anomaly Score Exceeded (Total Score: 5)

I can find numerous examples similar to this one: prev or ord are other examples

Here we're blocking ?a=preview at PL1!

$ curl -s 'http://sandbox.coreruleset.org/?a=preview' -H 'x-backend: nginx' -H 'x-format-output: txt-matched-rules' -H 'x-crs-paranoia-level: 1' -H 'x-crs-version: pullrequest-3273'

933150 PL1 PHP Injection Attack: High-Risk PHP Function Name Found
949110 PL1 Inbound Anomaly Score Exceeded (Total Score: 5)

Furthermore, we're unintentionally blocking non-PHP functions, which is not our intended outcome.

@dune73
Copy link
Member
dune73 commented Sep 1, 2023

Instead of continuing the conversation ad nauseam, @theMiddleBlue and I hopped on a call.

We agreed on the course of action laid down in #3273 (comment).

This should address known or expected FPs. If more FPs pop up with 933150, we plan to shift those keywords over to 933160 via the manual extended list. This includes stuff before we merge this PR and stuff that pops up during RC or after the release.

Redesigning the four rule of this PR, maybe consolidating into 1-2 regex rules could be an option. But @theMiddleBlue and I agreed this would throw us back at least another month. And we would be trusting our pattern identifying a php function call via the use of parenthesis etc. And we never know if there is a bypass in this regard. The pm operator in 933150 has its merits even if it is more prone to FP.

@M4tteoP promised to deliver an updated PR during the weekend. @theMiddleBlue and I will review immediately afterwards, namely @theMiddleBlue will try out additional payloads to see if he can provoke more FPs. If things are back to green, I hope to be able to merge before the Monday meeting.

@M4tteoP M4tteoP force-pushed the php_dict_creator branch 2 times, most recently from fbb03e3 to 81e79d4 Compare September 3, 2023 20:54
@M4tteoP
Copy link
Member Author
M4tteoP commented Sep 3, 2023

Done, the following terms have been moved out from 933150 and they are now part of 933160 and 933161 (being a stricter sibling).

  • exp
  • ord
  • prev
  • stat
  • substr
  • system
  • unlink

Considering that system is back to 933160, also the related tests have been moved back. The conversion list of tests has been updated accordingly.

PTAL 🙏

@dune73
Copy link
Member
dune73 commented Sep 4, 2023

For the record: There is a probability > 0 that random payloads (-> base64!) will hit 933150. The longer the keywords the less likely this is. Yet we have several 5 letter keywords in the list (mkdir, fopen, fputs, rtrim) as well as the 4 letter is_a. While _ is not in the character set for base64, the pure letter words are. We are currently OK with 5 letter words, though. If we start to see random hits on 933150 because of base64 or some other random input, we might have to shift all those keywords out or we we really re-design this entire rule group.

@theMiddleBlue theMiddleBlue marked this pull request as ready for review September 4, 2023 12:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0