It seems that the regex method return empty string inside a loop or a spark UDF · Issue #2 · ermanh/trieregex · GitHub
It seems that the regex method return empty string inside a loop or a spark UDF #2
Open
@davidegazze

Description

Hi,
TrieRegEx has really helped me improve my regex performance.
However, when I use TrieRegEx inside a Spark UDF, the regex() method sometimes returns an empty string.

To simplify, I found that the same behaviour occurs when I instantiate TrieRegEx inside a loop:

# Without del
from trieregex import TrieRegEx

i = 0
VALUES = ["REGEX1", "REGEX2"]
while i < 20:
    TRIE_VALUES = TrieRegEx(*VALUES)
    i = i + 1
    if len(TRIE_VALUES.regex()) < 1:
        print(f"ERROR on loop i:{i}")
        print(f"TRIE_VALUES: '{TRIE_VALUES.regex()}' (len: {len(TRIE_VALUES.regex())})")
        break

I get:

ERROR on loop i:5 # Where the number can change
TRIE_VALUES: '' (len: 0)

My workaround for this case is to add a del like this:

# With del (workaround)
from trieregex import TrieRegEx

i = 0
VALUES = ["REGEX1", "REGEX2"]
while i < 20:
    TRIE_VALUES = TrieRegEx(*VALUES)
    i = i + 1
    if len(TRIE_VALUES.regex()) < 1:
        print(f"ERROR on loop i:{i}")
        print(f"TRIE_VALUES: '{TRIE_VALUES.regex()}' (len: {len(TRIE_VALUES.regex())})")
        break
    del TRIE_VALUES

With the code above, it works well.
However, if I use TrieRegEx inside a pandas UDF, I hit the same bug.

My pandas UDF looks something like this:

def trieregex_udf(df):
    # Read source
    values = ...  # read_values()
    # Build the trie and its regex inside the UDF (this is where the bug shows up)
    trie = TrieRegEx(*values)
    regex = trie.regex()
    # Apply regex to DF
    output = ...
    return output

output = df.groupby("id").applyInPandas(trieregex_udf, schema="v string").toPandas()

Sometimes trie.regex() returns an empty string.
The problem seems to occur only when TrieRegEx is instantiated inside the UDF; if I build the regex outside and pass the resulting string in, everything works well.
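
For anyone hitting the same thing, here is a minimal sketch of the "build once outside, pass the result in" approach described above. To keep it runnable without Spark or trieregex installed, it makes two assumptions: build_pattern is a hypothetical stand-in for TrieRegEx(*values).regex() (it just builds a plain alternation), and a plain list of strings stands in for the pandas DataFrame each group would receive. It also adds a guard so an empty pattern fails loudly rather than silently matching everything.

```python
import re
from typing import Callable, Iterable, List


def build_pattern(words: Iterable[str]) -> str:
    """Hypothetical stand-in for TrieRegEx(*words).regex(): a plain alternation."""
    pattern = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
    if not pattern:
        # Fail loudly instead of carrying an empty regex into the matcher.
        raise ValueError("empty pattern: refusing to build a matcher")
    return pattern


def make_matcher(pattern: str) -> Callable[[List[str]], List[bool]]:
    """Return a function that closes over a pattern built once, outside the UDF."""
    if not pattern:
        raise ValueError("empty pattern passed to make_matcher")
    compiled = re.compile(rf"\b(?:{pattern})\b")

    def matcher(rows: List[str]) -> List[bool]:
        # Only the precompiled regex is used here; nothing is rebuilt per group.
        return [compiled.search(row) is not None for row in rows]

    return matcher


# Build the pattern a single time, then reuse it for every group of rows.
PATTERN = build_pattern(["REGEX1", "REGEX2"])
match_rows = make_matcher(PATTERN)
print(match_rows(["has REGEX1 inside", "nothing here"]))
```

In the Spark case the same idea would mean calling trie.regex() once before groupby("id").applyInPandas(...) and letting the UDF close over the resulting string, which matches the variant reported above as working. This only sidesteps the empty-string result; it does not explain why it happens.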
