Description
Describe the bug
We need to agree on a consistent way to reference multi-byte characters in regular expression patterns (and we can enforce this via rules-check.py).
Our current approach to detecting multi-byte characters (i.e. Unicode code points above \xFF) is not working as intended in the few rules where we do this. Simply putting the multi-byte UTF-8 character into a regular expression pattern in a rule file causes false positives with non-Latin scripts (e.g. rule 942430: see #3284 for a real user's false positive example).
Example: Let's say that we want to match using the pattern [abc’] (that last character is Unicode character U+2019, "RIGHT SINGLE QUOTATION MARK"):
@rx [abc’]
That pattern is saved to the rule file as (byte for byte):
[abc\xE2\x80\x99]
and is seemingly interpreted one byte at a time (so the multi-byte character loses its meaning). As a result, any content containing, for example, the byte \xE2 will match, and that byte appears in the UTF-8 encoding of many, many Unicode characters.
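The byte-at-a-time behaviour can be reproduced outside the WAF. The following is a minimal sketch using Python's `re` module as a stand-in for the rule engine's regex library (an assumption: the real engines are PCRE or RE2 based, and the helper names here are hypothetical). It compares a bytes pattern, which treats each UTF-8 byte as a separate class member, with a character-aware pattern.

```python
import re

# The character class exactly as it ends up in the rule file: raw UTF-8 bytes.
# "\u2019".encode("utf-8") is b"\xe2\x80\x99", so the class has six members:
# a, b, c, 0xE2, 0x80 and 0x99.
BYTE_PATTERN = re.compile("[abc\u2019]".encode("utf-8"))

# The same class interpreted per character: a, b, c and U+2019.
CHAR_PATTERN = re.compile("[abc\u2019]")


def matches_bytewise(text: str) -> bool:
    """Match against the UTF-8 bytes of the input, one byte at a time."""
    return BYTE_PATTERN.search(text.encode("utf-8")) is not None


def matches_charwise(text: str) -> bool:
    """Match against the decoded characters of the input."""
    return CHAR_PATTERN.search(text) is not None


if __name__ == "__main__":
    # U+0419 (Cyrillic) encodes as D0 99, U+20AC (euro sign) as E2 82 AC.
    for sample in ["\u2019", "\u0419", "\u20ac", "xyz"]:
        print(repr(sample), matches_bytewise(sample), matches_charwise(sample))
```

In this sketch the Cyrillic letter and the euro sign both match byte-wise, because they share a byte with U+2019, but neither matches character-wise.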
- Possible approach 1:
SecRule ARGS "@rx (*UTF8)[abc\x{2019}]"
- Probably PCRE-specific? Coraza and other engines might hate it. Without the UTF8 'verb' it isn't possible to use \x{2019} (limited to max of \x{ff}).
- Possible approach 2:
SecRule ARGS "@rx [abc]|%u2019" ... t:none,t:utf8toUnicode...
- The portable option. We could agree to always use t:utf8toUnicode for any rules that need to match Unicode characters above \xFF.
  - Team Coraza confirms that t:utf8toUnicode has been implemented, so this approach is Coraza-friendly too.
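To illustrate why approach 2 avoids the byte-overlap problem, here is a rough Python sketch of the idea. The function `utf8_to_unicode` is a hypothetical approximation of what t:utf8toUnicode does, not the engine's implementation: it assumes non-ASCII characters are rewritten as `%u` followed by four hex digits, that the hex digits are lowercase, and it makes no attempt to handle code points above U+FFFF faithfully.

```python
import re


def utf8_to_unicode(text: str) -> str:
    """Approximate t:utf8toUnicode (assumed behaviour, see lead-in):
    keep ASCII as-is and rewrite every other character as %uhhhh."""
    return "".join(
        ch if ord(ch) < 0x80 else "%u{:04x}".format(ord(ch))
        for ch in text
    )


# The pattern from approach 2: ASCII members stay in the class,
# the multi-byte character is matched via its %u form.
PATTERN = re.compile(r"[abc]|%u2019")


def rule_matches(value: str) -> bool:
    """Apply the assumed transformation, then the ASCII-only pattern."""
    return PATTERN.search(utf8_to_unicode(value)) is not None


if __name__ == "__main__":
    print(utf8_to_unicode("it\u2019s"))  # U+2019 becomes %u2019
    print(rule_matches("\u2019"))        # right single quotation mark
    print(rule_matches("\u0419"))        # Cyrillic letter, becomes %u0419
```

Because the pattern only ever sees ASCII after the transformation, a character that merely shares a UTF-8 byte with U+2019 (such as U+0419) no longer matches in this sketch.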
Further discussion that was had on Slack: https://owasp.slack.com/archives/CBKGH8A5P/p1694012219559839