Fix regex patterns that look for multi-byte characters

Describe the bug

We need to agree on a consistent way to reference multi-byte characters in regular expression patterns (which we can then enforce via rules-check.py).

Our current approach to detecting multi-byte characters (i.e. Unicode code points above \xFF) is not working as intended in the few rules where we do this. Simply putting the multi-byte UTF-8 character into a regular expression pattern in a rule file causes false positives with non-Latin scripts (e.g. rule 942430: see #3284 for a real user's false positive example).

Example: Let's say that we want to match using the pattern [abc’] (that last character is Unicode character U+2019, "RIGHT SINGLE QUOTATION MARK"):

@rx [abc’]

That pattern, as saved to the rule file, is saved as (byte for byte):

[abc\xE2\x80\x99]

and is seemingly interpreted one byte at a time (so the multi-byte character loses its meaning). As a result, any content containing, for example, the byte \xE2 will match, and that byte appears in the UTF-8 encoding of many, many Unicode characters.
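The byte-at-a-time behaviour described above can be reproduced outside of any WAF engine. The following is a minimal sketch using Python's `re` module as a stand-in; it is not how ModSecurity or Coraza evaluate `@rx`, and the helper names are hypothetical. It compares a bytes pattern (the three bytes of U+2019 land in the character class as separate members) with a pattern that works on code points.

```python
import re

# Byte-wise pattern: the class holds a, b, c and the three separate bytes
# 0xE2, 0x80, 0x99 (the UTF-8 encoding of U+2019).
BYTE_PATTERN = re.compile(rb"[abc\xE2\x80\x99]")

# Code-point pattern: the class holds a, b, c and the single character U+2019.
CHAR_PATTERN = re.compile("[abc\u2019]")


def matches_bytewise(text: str) -> bool:
    """Hypothetical helper: match against the raw UTF-8 bytes of the input."""
    return BYTE_PATTERN.search(text.encode("utf-8")) is not None


def matches_by_code_point(text: str) -> bool:
    """Hypothetical helper: match against decoded code points."""
    return CHAR_PATTERN.search(text) is not None


if __name__ == "__main__":
    # Cyrillic "р" (U+0440) is encoded as D1 80, so it shares the byte 0x80.
    for sample in ["it\u2019s", "привет", "hello"]:
        print(sample, matches_bytewise(sample), matches_by_code_point(sample))
```

In this sketch the Cyrillic word matches byte-wise only because one of its letters happens to share the byte 0x80 with U+2019, which is the kind of false positive reported for non-Latin scripts.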

Possible approach 1: SecRule ARGS "@rx (*UTF8)[abc\x{2019}]"
- Probably PCRE-specific? Coraza and other engines may not support it. Without the (*UTF8) 'verb' it isn't possible to use \x{2019} (escapes are limited to a maximum of \x{ff}).
Possible approach 2: SecRule ARGS "@rx [abc]|%u2019" ... t:none,t:utf8toUnicode...
- The portable option. We could agree to always use t:utf8toUnicode for any rules that need to match Unicode characters above \xFF.
  - Team Coraza confirms that t:utf8toUnicode has been implemented, so this approach is Coraza-friendly too.
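To illustrate approach 2, here is a rough sketch in Python of the idea: rewrite each non-ASCII character as a `%uXXXX` token first, then match against that ASCII-only text. The function `utf8_to_unicode_sketch` is a hypothetical approximation written for this example, not the engine's implementation of `t:utf8toUnicode`; in particular, the lowercase hex digits and the handling of code points above U+FFFF are assumptions.

```python
import re


def utf8_to_unicode_sketch(text: str) -> str:
    """Approximate the transformation: keep ASCII, emit %uXXXX for the rest.

    Assumption: hex digits are lowercase and zero-padded to four places.
    Code points above U+FFFF would produce more than four digits here.
    """
    return "".join(
        ch if ord(ch) < 0x80 else f"%u{ord(ch):04x}" for ch in text
    )


# Same alternation as in approach 2: ASCII members stay in the class,
# the multi-byte character is matched via its %u token.
PATTERN = re.compile(r"[abc]|%u2019")

if __name__ == "__main__":
    for sample in ["it\u2019s", "привет"]:
        transformed = utf8_to_unicode_sketch(sample)
        print(transformed, PATTERN.search(transformed) is not None)
```

After the transformation the pattern only ever sees ASCII, so U+2019 is matched as a whole token rather than as three independent bytes. One thing this sketch makes visible is that the hex digits of the tokens become part of the text being matched, so patterns combined with this transformation would need to be written with that in mind.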

Further discussion that was had on Slack: https://owasp.slack.com/archives/CBKGH8A5P/p1694012219559839
