Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Regex::searchAll behaviour wrt empty capturing groups #2393

Closed

Conversation

WGH-
Copy link
Contributor

@WGH- WGH- commented Sep 5, 2020

Previously, searchAll would stop search when it encountered an empty
matching group in any position. This means that, for example,
regular expression "(a)(b?)(c)" would match string "ac", but the
resulting group list would be ["ac", "a"].

After this change, the resulting list for the aforementioned regular
expression becomes ["ac", "a", "", "c"] like it should've been.

Additionally, this also changes behaviour for multiple matches. For
example, when "aaa00bbb" is matched by "[a-z]*", previously only "aaa"
would be returned. Now the matching list is ["aaa", "", "", "bbb", ""].

The old behaviour was confusing and almost certainly a bug. The new
behaviour is the same as in Python's re.findall.

For reference, though, Go does it somewhat differently: empty matches
at the end of non-empty matches are ignored, so in Go above example is
["aaa", "", "bbb"] instead.

This is the root cause of issue #2336 which has been already fixed by
replacing searchAll call there with a new function.

Cherry-picked from #2012

Previously, searchAll would stop search when it encountered an empty
matching group in any position. This means that, for example,
regular expression "(a)(b?)(c)" would match string "ac", but the
resulting group list would be ["ac", "a"].

After this change, the resulting list for the aforementioned regular
expression becomes ["ac", "a", "", "c"] like it should've been.

Additionally, this also changes behaviour for multiple matches. For
example, when "aaa00bbb" is matched by "[a-z]*", previously only "aaa"
would be returned. Now the matching list is ["aaa", "", "", "bbb", ""].

The old behaviour was confusing and almost certainly a bug. The new
behaviour is the same as in Python's re.findall.

For reference, though, Go does it somewhat differently: empty matches
at the end of non-empty matches are ignored, so in Go above example is
["aaa", "", "bbb"] instead.

This is the root cause of issue owasp-modsecurity#2336 which has been already fixed by
replacing searchAll call there with a new function.
@zimmerle
Copy link
Contributor

zimmerle commented Sep 8, 2020

@martinhsv you may want to have a look at this one as well.

@martinhsv
Copy link
Contributor

You are quite right that the underlying problem with searchAll, that was one of the initial triggers for creating the new function now used by the rx operator, still exists in searchAll for that other functionality (other than the rx operator) that still use it.

However, I believe there is a correctness issue with the particular solution in this pull request.

Consider a scenario where the regex is "^(|abc|def)" and the input is "abc".

In this case there should be two successful full matches. The first match should be an empty string. The second match should be "abc".

With the code in this pull request, the first match of an empty string will occur successfully, but then the code will advance the offset, and on the second iteration of the loop, "abc" can no longer be produced as a match.

(Note: There is a new searchGlobal function is another recent pull request (to support a new rxGlobal operator). I had a notion to implement that and then, after some stabilization time, look to migrate the remaining users of searchAll to the new function.
You may wish to wait for that process to play out.)

@WGH-
Copy link
Contributor Author

WGH- commented Sep 18, 2020

Interesting, I never thought about that. It's also interesting to note that Python used to have "my" behaviour (re.findall(r"^(|abc|def)", "abc") == [""]) until Python 3.7, where it was changed to allow non-empty matches right after empty match (re.findall(r"^(|abc|def)", "abc") == ["", "abc"]). python/cpython@70d56fb

@zimmerle
Copy link
Contributor

Hi @WGH- I am closing this and #2394 in favor of #2396. Please let me know if I am missing something.

@zimmerle zimmerle closed this Oct 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants