Regex for longest matching sequence between two strings
I've searched Google for my use case but didn't find anything useful.
I am not an expert in regular expressions, so I would appreciate any help from the community.
Question:
Given a text file, I want to capture the longest string between two substrings (a prefix and a suffix) using regex. Note that those two substrings will always be at the start of a line in the text. Please see the examples below.
Substrings:
prefixes = ['Item 1', 'Item 1a', 'Item 1b']
suffixes = ['Item 2', 'Item 2a', 'Item 2b']
Example 1:
Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2 ....
Item 1 ....
Item 2
Item 1a ....
....
....
....
....
Item 2b ....
Expected Result:
Item 1a ....
....
....
....
....
Why this result?
Because the prefix Item 1a and the suffix Item 2b enclose the longest string in the text of all the prefix-suffix pairs.
Example 2:
Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2
....
Item 1 ....
Item 2
Item 1a ....
....
....
....
.... Item 2b
....
Expected result:
Item 1 ....
....
....
Why this result?
This is the longest string between a prefix-suffix pair where both the prefix and the suffix start at the beginning of a line. Note that there is another pair (Item 1a - Item 2b), but since Item 2b does not come at the beginning of a line, we cannot consider that sequence.
What I have tried with regex:
I have tried the regex below for each prefix-suffix pair in my lists above, but it didn't work.
import re

# One pattern per prefix/suffix pair; without re.DOTALL, '.' does not cross newlines
regexs = [r'^' + re.escape(pre) + '(.*?)' + re.escape(suf) for pre in prefixes for suf in suffixes]
for regex in regexs:
    re.findall(regex, text, re.MULTILINE)
What I have tried using non-regex (Python string functions):
def extract_longest_match(text, prefixes, suffixes):
    longest_match = ''
    for line in text.splitlines():
        if line.startswith(tuple(prefixes)):
            beg_index = text.index(line)
            for suf in suffixes:
                end_index = text.find(suf, beg_index+len(line))
                match = text[beg_index:end_index]
                if len(match) > len(longest_match):
                    longest_match = match
    return longest_match
Am I missing anything?
python regex string pattern-matching string-matching
Regex can't match the longest substring. Use the regex or a non-regex solution to find the substrings and find the longest using the common language means.
– Wiktor Stribiżew
Nov 19 at 21:06
Yes, that is a possibility to check length of string returned with regex. But do you know how to get text between prefix and suffix where both prefix and suffix start at the left side of sentence? Is my regex correct?
– sgokhales
Nov 19 at 21:07
@WiktorStribiżew Regex can match the longest substring. The problem is that with (.*?) OP is explicitly using a non-greedy regex. Should probably only be (.*) instead of (.*?).
– quant
Nov 19 at 21:07
@quant That .* won't yield the longest substring, that is just matching from the leftmost occurrence of the leading delimiter till the rightmost occurrence of the trailing delimiter. And that is not the same. It is a common confusion of greediness and longest/shortest substring extraction.
– Wiktor Stribiżew
Nov 19 at 21:09
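The difference between greediness and "longest" described in the comment above can be seen in a small sketch. The sample text and the single-letter delimiters `A` and `B` are made up for illustration and are not from the question.

```python
import re

# Hypothetical sample: two delimited regions of different lengths.
text = "A x B\nA yyyyy B\n"

# Greedy: runs from the leftmost 'A' to the rightmost 'B', swallowing both regions.
greedy = re.findall(r"A(.*)B", text, re.S)

# Lazy: each match stops at the nearest 'B', giving one candidate per region.
lazy = re.findall(r"A(.*?)B", text, re.S)

# The longest candidate is then picked in ordinary Python, not by the regex.
longest = max(lazy, key=len)

print(greedy)
print(lazy)
print(repr(longest))
```

Here the greedy pattern returns a single string spanning both regions, while the lazy pattern returns two candidates from which `max` selects `' yyyyy '`.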
@WiktorStribiżew What am I missing here then? What's the correct regex to match string between prefix and suffix. I can then later check length and maintain largest string in a variable.
– sgokhales
Nov 19 at 21:19
edited Nov 19 at 22:10
asked Nov 19 at 21:03
sgokhales
1 Answer
accepted
You need to:
- Build a regex that matches strings from the leftmost starting delimiter to the leftmost trailing delimiter (see Match text between two strings with regular expression)
- Make sure the delimiters are matched at line start positions only
- Make sure the . matches line break chars by using re.DOTALL or equivalent options (see Python regex, matching pattern over multiple lines)
- Make sure the regex matches overlapping substrings (see Python regex find all overlapping matches)
- Find all matches in the text (see How can I find all matches to a regular expression in Python?)
- Get the longest one (see Python's most efficient way to choose longest string in list?).
Python demo:
import re
s="""Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2 ....
Item 1 ....
Item 2
Item 1a ....
....
....
....
....
Item 2b ...."""
prefixes = ['Item 1', 'Item 1a', 'Item 1b']
suffixes = ['Item 2', 'Item 2a', 'Item 2b']
rx = r"(?=^((?:{}).*?^(?:{})))".format("|".join(prefixes), "|".join(suffixes))
# Or, a version with word boundaries:
# rx = r"(?=^((?:{})\b.*?^(?:{})\b))".format("|".join(prefixes), "|".join(suffixes))
all_matches = re.findall(rx, s, re.S | re.M)
print(max(all_matches, key=len))
Output:
Item 1a ....
....
....
....
....
Item 2
The regex looks like
(?sm)(?=^((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b)))
With word boundaries:
(?sm)(?=^((?:Item 1|Item 1a|Item 1b)\b.*?^(?:Item 2|Item 2a|Item 2b)\b))
See the regex demo.
Details
- (?sm) - re.S and re.M flags
- (?=^((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b))) - a positive lookahead that matches at any location that is immediately followed by a sequence of patterns:
  - ^ - start of a line
  - ((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b)) - Group 1 (this value is returned by re.findall)
    - (?:Item 1|Item 1a|Item 1b) - any of the items in the alternation (probably, it makes sense to add a \b word boundary after the ) here)
    - .*? - any 0+ chars, as few as possible
    - ^ - start of a line
    - (?:Item 2|Item 2a|Item 2b) - any alternative from the list (probably, it also makes sense to add a \b word boundary after the ) here).
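As a rough sketch of how the same idea might be applied to Example 2 from the question: the helper name `candidates_between` and the variable `example_2` are hypothetical, and closing the capturing group before the suffix (so the suffix is left out of the result, as in the question's expected output) is a variation of mine, not the regex given above. `re.escape` is added on the assumption that delimiters might contain regex metacharacters.

```python
import re

def candidates_between(text, prefixes, suffixes):
    """Return every prefix-to-suffix span, excluding the suffix itself."""
    # re.escape guards against metacharacters in the delimiters;
    # \b stops 'Item 1' from matching the start of 'Item 1a'.
    pre = "|".join(map(re.escape, prefixes))
    suf = "|".join(map(re.escape, suffixes))
    # The lookahead consumes nothing, so overlapping candidates are all found.
    rx = r"(?=^((?:{})\b.*?)^(?:{})\b)".format(pre, suf)
    return re.findall(rx, text, re.S | re.M)

prefixes = ['Item 1', 'Item 1a', 'Item 1b']
suffixes = ['Item 2', 'Item 2a', 'Item 2b']

# Example 2 from the question: 'Item 2b' is not at the start of a line.
example_2 = """Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2
....
Item 1 ....
Item 2
Item 1a ....
....
....
....
.... Item 2b
....
"""

spans = candidates_between(example_2, prefixes, suffixes)
longest = max(spans, key=len)
print(longest)
```

On this input the `Item 1a` block yields no candidate, because no later line starts with a suffix, so the three-line `Item 1` block is the longest span.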
Thank you for your quick solution. One small issue I see on my test example (not shown here) is that it also matches suffixes that come anywhere in the string. I want to check for prefix-suffix pairs where both the prefix and the suffix come at the beginning of a line. Sorry if my example in the question misled you to believe otherwise. :-)
– sgokhales
Nov 19 at 22:05
I've edited my question with example 2. So maybe that example 2 will help to explain my use-case better :-)
– sgokhales
Nov 19 at 22:10
Awesome. Thank you very much!
– sgokhales
Nov 19 at 22:15
@sgokhales Perhaps, you actually want rx = r"(?=^((?:{})\b.*?^(?:{})\b))".format("|".join(prefixes), "|".join(suffixes)) to match the delimiters as whole words. See this Python demo.
– Wiktor Stribiżew
Nov 19 at 22:15
With your above regex in the comments, it throws me an error: max() arg is an empty sequence. Perhaps, it didn't find any match.
– sgokhales
Nov 19 at 22:17
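The `max() arg is an empty sequence` message is the ValueError that `max` raises when it is given an empty list, which is what `re.findall` returns when nothing matches. The thread does not show the input that triggered it, so the text `no_pair` below is a made-up case. This is a minimal sketch of guarding against it, assuming Python 3.4+ where `max` accepts a `default` argument.

```python
import re

prefixes = ['Item 1', 'Item 1a', 'Item 1b']
suffixes = ['Item 2', 'Item 2a', 'Item 2b']
rx = r"(?=^((?:{})\b.*?^(?:{})\b))".format("|".join(prefixes), "|".join(suffixes))

# Hypothetical input in which no suffix starts a line, so nothing matches.
no_pair = "Item 1 ....\n.... Item 2\n"

all_matches = re.findall(rx, no_pair, re.S | re.M)

# default=None avoids the ValueError and lets the caller detect "no match".
longest = max(all_matches, key=len, default=None)
print(all_matches, longest)
```

Checking whether `all_matches` is empty before calling `max` would work equally well; `default` just keeps it to one line.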
edited Nov 19 at 22:22
answered Nov 19 at 21:58
Wiktor Stribiżew