Hi,
TrieRegEx has really helped me improve my regex performance.
However, when I use TrieRegEx inside a Spark UDF, it sometimes returns an empty string.
To simplify, I found that the same behaviour shows up if I use TrieRegEx inside a loop:
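from trieregex import TrieRegEx

# Without del
i = 0
VALUES = ["REGEX1", "REGEX2"]
while i < 20:
    # Build a fresh trie on every iteration
    TRIE_VALUES = TrieRegEx(*VALUES)
    i = i + 1
    # An empty pattern means the bug was triggered
    if len(TRIE_VALUES.regex()) < 1:
        print(f"ERROR on loop i:{i}")
        print(f"TRIE_VALUES: '{TRIE_VALUES.regex()}' (len: {len(TRIE_VALUES.regex())})")
        break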
I get:
ERROR on loop i:5 # the number varies between runs
TRIE_VALUES: '' (len: 0)
My workaround for this case is to add a del, like this:
from trieregex import TrieRegEx

# With del
i = 0
VALUES = ["REGEX1", "REGEX2"]
while i < 20:
    TRIE_VALUES = TrieRegEx(*VALUES)
    i = i + 1
    if len(TRIE_VALUES.regex()) < 1:
        print(f"ERROR on loop i:{i}")
        print(f"TRIE_VALUES: '{TRIE_VALUES.regex()}' (len: {len(TRIE_VALUES.regex())})")
        break
    # Explicitly drop the instance before the next iteration
    del TRIE_VALUES
With the code above, the loop works well.
However, if I use TrieRegEx inside a pandas UDF, I hit the same bug.
My pandas UDF is something like this:
def trieregex_udf(df):
    # Read source
    values = ...  # read_values()
    # Build the trie and its regex inside the UDF
    trie = TrieRegEx(*values)
    regex = trie.regex()
    # Apply regex to DF
    output = ...
    return output
output = df.groupby("id").applyInPandas(trieregex_udf, schema="v string").toPandas()
Sometimes trie.regex() returns an empty string.
The problem seems to occur only when TrieRegEx is instantiated inside the UDF. If I build the regex outside and pass the resulting string in, everything works well.
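For anyone hitting the same thing, here is a minimal sketch of that second workaround: build the pattern once, outside the per-group function, and let the function close over the resulting string. It is only an illustration under assumptions. `build_pattern` is a hypothetical stand-in for `TrieRegEx(*VALUES).regex()` (a plain escaped alternation, so the sketch runs without trieregex), `make_matcher` and `match_rows` are made-up names, and the rows are a plain list rather than a pandas DataFrame.

```python
import re


def build_pattern(values):
    # Hypothetical stand-in for TrieRegEx(*values).regex(): an escaped
    # alternation, longest alternatives first.
    ordered = sorted(values, key=len, reverse=True)
    return "|".join(re.escape(v) for v in ordered)


def make_matcher(pattern):
    # Factory: the returned function closes over the ready-made pattern
    # string, so nothing is rebuilt on each call.
    compiled = re.compile(pattern)

    def match_rows(rows):
        # Assumption: rows is a list of strings; a real pandas UDF would
        # take and return a DataFrame instead.
        return [bool(compiled.search(row)) for row in rows]

    return match_rows


VALUES = ["REGEX1", "REGEX2"]
PATTERN = build_pattern(VALUES)   # built once, outside the UDF
apply_fn = make_matcher(PATTERN)

print(apply_fn(["has REGEX1 here", "nothing"]))  # prints [True, False]
```

In the Spark case, the idea would be to compute the pattern on the driver and hand a function shaped like `match_rows` (adapted to DataFrames) to applyInPandas, so that no TrieRegEx instance is created inside the UDF.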