Regex for longest matching sequence between two strings

I've searched Google for my use case but didn't find anything useful.

I am not an expert in regular expressions, so I would appreciate it if anybody in the community could help.

Question:

Given a text file, I want to capture the longest string between two substrings (a prefix and a suffix) using regex. Note that those two substrings will always be at the start of a line in the text. Please see the examples below.

Substrings:

prefixes = ['Item 1', 'Item 1a', 'Item 1b']
suffixes = ['Item 2', 'Item 2a', 'Item 2b']




Example 1:

Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2 ....
Item 1 ....
Item 2
Item 1a ....
....
....
....
....
Item 2b ....

Expected Result:

Item 1a ....
....
....
....
....

Why this result?

Because the prefix Item 1a and the suffix Item 2b enclose the longest string in the text among all prefix-suffix pairs.



Example 2:

Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2
....
Item 1 ....
Item 2
Item 1a ....
....
....
....
.... Item 2b
....

Expected result:

Item 1 ....
....
....

Why this result?

Because this is the largest string between a prefix-suffix pair where both the prefix and the suffix start at the beginning of a line. Note that there is another pair (Item 1a-Item 2b), but since Item 2b does not come at the beginning of a line, that sequence cannot be considered.



What I have tried with regex:

I tried the regex below for each prefix-suffix pair from the lists above, but it didn't work.

import re

regexs = [r'^' + re.escape(pre) + '(.*?)' + re.escape(suf) for pre in prefixes for suf in suffixes]
for regex in regexs:
    re.findall(regex, text, re.MULTILINE)


What I have tried using non-regex (Python string functions):

def extract_longest_match(text, prefixes, suffixes):
    longest_match = ''
    for line in text.splitlines():
        if line.startswith(tuple(prefixes)):
            beg_index = text.index(line)
            for suf in suffixes:
                end_index = text.find(suf, beg_index+len(line))
                match = text[beg_index:end_index]
                if len(match) > len(longest_match):
                    longest_match = match
    return longest_match


Am I missing anything?










  • Regex can't match the longest substring. Use a regex or a non-regex solution to find the substrings, then pick the longest one using ordinary language features.
    – Wiktor Stribiżew
    Nov 19 at 21:06

  • Yes, checking the length of the strings returned by the regex is a possibility. But do you know how to get the text between a prefix and a suffix where both start at the beginning of a line? Is my regex correct?
    – sgokhales
    Nov 19 at 21:07

  • @WiktorStribiżew Regex can match the longest substring. The problem is that with (.*?) OP is explicitly using a non-greedy regex. Should probably only be (.*) instead of (.*?).
    – quant
    Nov 19 at 21:07

  • @quant That .* won't yield the longest substring; it just matches from the leftmost occurrence of the leading delimiter to the rightmost occurrence of the trailing delimiter. That is not the same thing. Confusing greediness with longest/shortest substring extraction is common.
    – Wiktor Stribiżew
    Nov 19 at 21:09

  • @WiktorStribiżew What am I missing here, then? What's the correct regex to match the string between a prefix and a suffix? I can then check the lengths and keep the largest string in a variable.
    – sgokhales
    Nov 19 at 21:19
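
The greedy-versus-longest point in the comments above can be illustrated with a small sketch. The sample text and the one-letter delimiters A and B are made up for the illustration and are not from the question; the idea is only that neither quantifier compares candidates against each other.

```python
import re

# Hypothetical sample: three "A ... B" blocks of different lengths.
text = "A x B\nA yyyyy B\nA z B"

# Greedy: a single match from the first "A" to the last "B" in the text.
greedy = re.findall(r"A.*B", text, re.S)

# Lazy: each "A" is paired with the nearest following "B".
lazy = re.findall(r"A.*?B", text, re.S)

# Neither quantifier ranks candidates; the longest block is picked afterwards.
longest = max(lazy, key=len)

print(greedy)
print(lazy)
print(longest)
```

Here the greedy pattern swallows the whole text as one match, while the lazy pattern yields the three separate blocks, from which the longest is then chosen in plain Python.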

python regex string pattern-matching string-matching

edited Nov 19 at 22:10

asked Nov 19 at 21:03
sgokhales

1 Answer (accepted)

You need to

  • Build a regex that matches strings from the leftmost starting delimiter to the leftmost trailing delimiter (see Match text between two strings with regular expression)
  • Make sure the delimiters are matched at line start positions only
  • Make sure the . matches line break chars by using re.DOTALL or an equivalent option (see Python regex, matching pattern over multiple lines)
  • Make sure the regex matches overlapping substrings (see Python regex find all overlapping matches)
  • Find all matches in the text (see How can I find all matches to a regular expression in Python?)
  • Get the longest one (see Python's most efficient way to choose longest string in list?).


Python demo:

import re

s = """Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2 ....
Item 1 ....
Item 2
Item 1a ....
....
....
....
....
Item 2b ...."""
prefixes = ['Item 1', 'Item 1a', 'Item 1b']
suffixes = ['Item 2', 'Item 2a', 'Item 2b']
# The lookahead wraps the capturing group so that overlapping candidates are all found
rx = r"(?=^((?:{}).*?^(?:{})))".format("|".join(prefixes), "|".join(suffixes))
# Or, a version with word boundaries:
# rx = r"(?=^((?:{})\b.*?^(?:{})\b))".format("|".join(prefixes), "|".join(suffixes))
all_matches = re.findall(rx, s, re.S | re.M)
print(max(all_matches, key=len))


Output:



Item 1a ....
....
....
....
....
Item 2


The regex looks like



(?sm)(?=^((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b)))


With word boundaries

(?sm)(?=^((?:Item 1|Item 1a|Item 1b)\b.*?^(?:Item 2|Item 2a|Item 2b)\b))


See the regex demo.



Details

  • (?sm) - re.S and re.M flags
  • (?=^((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b))) - a positive lookahead that matches at any location that is immediately followed by a sequence of patterns:
    • ^ - start of a line
    • ((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b)) - Group 1 (this value is returned by re.findall)
    • (?:Item 1|Item 1a|Item 1b) - any of the items in the alternation (it probably makes sense to add a \b word boundary after the ) here)
    • .*? - any 0+ chars, as few as possible
    • ^ - start of a line
    • (?:Item 2|Item 2a|Item 2b) - any alternative from the list (it probably also makes sense to add a \b word boundary after the ) here).
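
To see why the pattern is wrapped in a lookahead, here is a small sketch comparing a consuming pattern with its lookahead-wrapped form. The three-line sample text and the variable names (plain, wrapped) are made up for the illustration; the patterns are simplified to a single prefix and a single suffix.

```python
import re

# Hypothetical sample: two prefix lines share the same closing suffix line.
text = "Item 1 a\nItem 1 b\nItem 2 c"
flags = re.S | re.M

# Consuming pattern: once a match is found, scanning resumes after it,
# so the second "Item 1" line never gets to start a match of its own.
plain = r"^Item 1\b.*?^Item 2\b"
plain_matches = re.findall(plain, text, flags)

# Lookahead-wrapped pattern: each match is zero-width, so every line
# start is tried and group 1 holds one candidate per prefix line.
wrapped = r"(?=(" + plain + r"))"
wrapped_matches = re.findall(wrapped, text, flags)

print(len(plain_matches))
print(wrapped_matches)
```

The consuming pattern reports one block, while the wrapped one reports both the outer block and the one nested inside it, which is what lets max(..., key=len) choose among all candidates.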








  • Thank you for your quick solution. One small issue I see on my test example (not shown here) is that it also matches suffixes that appear anywhere in the string. I want to consider only prefix-suffix pairs where both the prefix and the suffix come at the beginning of a line. Sorry if my example in the question misled you to believe otherwise. :-)
    – sgokhales
    Nov 19 at 22:05

  • I've edited my question to add example 2. Maybe example 2 will help explain my use case better :-)
    – sgokhales
    Nov 19 at 22:10

  • Awesome. Thank you very much!
    – sgokhales
    Nov 19 at 22:15

  • @sgokhales Perhaps you actually want rx = r"(?=^((?:{})\b.*?^(?:{})\b))".format("|".join(prefixes), "|".join(suffixes)) to match the delimiters as whole words. See this Python demo.
    – Wiktor Stribiżew
    Nov 19 at 22:15

  • With your regex from the comment above, I get an error: max() arg is an empty sequence. Perhaps it didn't find any match.
    – sgokhales
    Nov 19 at 22:17
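
The max() error in the last comment is what Python raises when re.findall returns an empty list. A minimal sketch that guards against this is shown below. It is a variation on the answer's pattern, not the answer's own code: the helper name longest_between is hypothetical, and escaping the delimiters with re.escape, moving the suffix outside the capturing group (so the result stops before the suffix line, as in the question's expected results) and passing default="" to max are all assumptions added here.

```python
import re

def longest_between(text, prefixes, suffixes):
    """Return the longest block that starts with a line-initial prefix and
    ends just before the next line-initial suffix, or '' if there is none."""
    pre = "|".join(map(re.escape, prefixes))
    suf = "|".join(map(re.escape, suffixes))
    # Lookahead so that overlapping candidates are all reported; group 1
    # stops before the suffix, which is only checked, not captured.
    rx = re.compile(r"(?=^((?:{})\b.*?)^(?:{})\b)".format(pre, suf), re.S | re.M)
    candidates = [m.group(1).rstrip("\n") for m in rx.finditer(text)]
    # default="" avoids "max() arg is an empty sequence" when nothing matches.
    return max(candidates, key=len, default="")

prefixes = ['Item 1', 'Item 1a', 'Item 1b']
suffixes = ['Item 2', 'Item 2a', 'Item 2b']

# Example 2 from the question: "Item 2b" is not at the start of a line.
example_2 = """Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2
....
Item 1 ....
Item 2
Item 1a ....
....
....
....
.... Item 2b
...."""

print(longest_between(example_2, prefixes, suffixes))
```

On example 2 this prints the three-line block starting at the second Item 1, since the Item 1a block has no line-initial suffix after it and therefore produces no candidate.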

Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53382579%2fregex-for-longest-matching-sequence-between-two-strings%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
1
down vote



accepted










You need to




  • Build a regex that matches strings from the leftmost starting delimiter to the leftmost trailing delimiter (see Match text between two strings with regular expression)

  • Make sure the delimiters are matches at the line start positions only

  • Make sure the . matches the line break chars by using re.DOTALL or equivalent options (see Python regex, matching pattern over multiple lines)

  • Make sure the regex matches overlapping substrings (see Python regex find all overlapping matches)

  • Find all matches in the text (see How can I find all matches to a regular expression in Python?)

  • Get the longest one (see Python's most efficient way to choose longest string in list?).


Python demo:



import re
s="""Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2 ....
Item 1 ....
Item 2
Item 1a ....
....
....
....
....
Item 2b ...."""
prefixes = ['Item 1', 'Item 1a', 'Item 1b']
suffixes = ['Item 2', 'Item 2a', 'Item 2b']
rx = r"(?=^((?:{}).*?^(?:{})))".format("|".join(prefixes), "|".join(suffixes))
# Or, a version with word boundaries:
# rx = r"(?=^((?:{})b.*?^(?:{})b))".format("|".join(prefixes), "|".join(suffixes))
all_matches = re.findall(rx, s, re.S | re.M)
print(max(all_matches, key=len))


Output:



Item 1a ....
....
....
....
....
Item 2


The regex looks like



(?sm)(?=^((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b)))


With word boundaries



(?sm)(?=^((?:Item 1|Item 1a|Item 1b)b.*?^(?:Item 2|Item 2a|Item 2b)b))


See the regex demo.



Details





  • (?sm) - re.S and re.M flags


  • (?=^((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b))) - a positive lookahead that matches at any location that is immediately followed with a sequence of patterns:



    • ^ - start of a line


    • ((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b)) - Group 1 (this value is returned with re.findall)


    • (?:Item 1|Item 1a|Item 1b) - any of the items in the alternation (probably, it makes sense to add b word boundary after ) here)


    • .*? - any 0+ chars, as few as possible


    • ^ - start of a line


    • (?:Item 2|Item 2a|Item 2b) - any alternative from the list (probably, it also makes sense to add b word boundary after ) here).








share|improve this answer























  • Thank you for your quick solution. One small issueI see on my test example (not shown here) is it also matches any suffixes which comes anywhere in the string. I want to check for such prefix-suffix pair where both prefix and suffix comes at the beginning of the line. Sorry if my example in the question mislead you to believe otherwise. :-)
    – sgokhales
    Nov 19 at 22:05












  • I've edited my question with example 2. So maybe that example 2 will help to explain my use-case better :-)
    – sgokhales
    Nov 19 at 22:10










  • Awesome. Thank you very much!
    – sgokhales
    Nov 19 at 22:15










  • @sgokhales Perhaps, you actually want rx = r"(?=^((?:{})b.*?^(?:{})b))".format("|".join(prefixes), "|".join(suffixes)) to match the delimiters as whole words. See this Python demo.
    – Wiktor Stribiżew
    Nov 19 at 22:15












  • With your above regex in the comments, it throws me an error: max() arg is an empty sequence. Perhaps, it didn't find any match.
    – sgokhales
    Nov 19 at 22:17















up vote
1
down vote



accepted










You need to




  • Build a regex that matches strings from the leftmost starting delimiter to the leftmost trailing delimiter (see Match text between two strings with regular expression)

  • Make sure the delimiters are matches at the line start positions only

  • Make sure the . matches the line break chars by using re.DOTALL or equivalent options (see Python regex, matching pattern over multiple lines)

  • Make sure the regex matches overlapping substrings (see Python regex find all overlapping matches)

  • Find all matches in the text (see How can I find all matches to a regular expression in Python?)

  • Get the longest one (see Python's most efficient way to choose longest string in list?).


Python demo:



import re
s="""Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2 ....
Item 1 ....
Item 2
Item 1a ....
....
....
....
....
Item 2b ...."""
prefixes = ['Item 1', 'Item 1a', 'Item 1b']
suffixes = ['Item 2', 'Item 2a', 'Item 2b']
rx = r"(?=^((?:{}).*?^(?:{})))".format("|".join(prefixes), "|".join(suffixes))
# Or, a version with word boundaries:
# rx = r"(?=^((?:{})b.*?^(?:{})b))".format("|".join(prefixes), "|".join(suffixes))
all_matches = re.findall(rx, s, re.S | re.M)
print(max(all_matches, key=len))


Output:



Item 1a ....
....
....
....
....
Item 2


The regex looks like



(?sm)(?=^((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b)))


With word boundaries



(?sm)(?=^((?:Item 1|Item 1a|Item 1b)b.*?^(?:Item 2|Item 2a|Item 2b)b))


See the regex demo.



Details





  • (?sm) - re.S and re.M flags


  • (?=^((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b))) - a positive lookahead that matches at any location that is immediately followed with a sequence of patterns:



    • ^ - start of a line


    • ((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b)) - Group 1 (this value is returned with re.findall)


    • (?:Item 1|Item 1a|Item 1b) - any of the items in the alternation (probably, it makes sense to add b word boundary after ) here)


    • .*? - any 0+ chars, as few as possible


    • ^ - start of a line


    • (?:Item 2|Item 2a|Item 2b) - any alternative from the list (probably, it also makes sense to add b word boundary after ) here).








share|improve this answer























  • Thank you for your quick solution. One small issueI see on my test example (not shown here) is it also matches any suffixes which comes anywhere in the string. I want to check for such prefix-suffix pair where both prefix and suffix comes at the beginning of the line. Sorry if my example in the question mislead you to believe otherwise. :-)
    – sgokhales
    Nov 19 at 22:05












  • I've edited my question with example 2. So maybe that example 2 will help to explain my use-case better :-)
    – sgokhales
    Nov 19 at 22:10










  • Awesome. Thank you very much!
    – sgokhales
    Nov 19 at 22:15










  • @sgokhales Perhaps, you actually want rx = r"(?=^((?:{})b.*?^(?:{})b))".format("|".join(prefixes), "|".join(suffixes)) to match the delimiters as whole words. See this Python demo.
    – Wiktor Stribiżew
    Nov 19 at 22:15












  • With your above regex in the comments, it throws me an error: max() arg is an empty sequence. Perhaps, it didn't find any match.
    – sgokhales
    Nov 19 at 22:17













up vote
1
down vote



accepted







up vote
1
down vote



accepted






You need to




  • Build a regex that matches strings from the leftmost starting delimiter to the leftmost trailing delimiter (see Match text between two strings with regular expression)

  • Make sure the delimiters are matches at the line start positions only

  • Make sure the . matches the line break chars by using re.DOTALL or equivalent options (see Python regex, matching pattern over multiple lines)

  • Make sure the regex matches overlapping substrings (see Python regex find all overlapping matches)

  • Find all matches in the text (see How can I find all matches to a regular expression in Python?)

  • Get the longest one (see Python's most efficient way to choose longest string in list?).


Python demo:



import re
s="""Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2 ....
Item 1 ....
Item 2
Item 1a ....
....
....
....
....
Item 2b ...."""
prefixes = ['Item 1', 'Item 1a', 'Item 1b']
suffixes = ['Item 2', 'Item 2a', 'Item 2b']
rx = r"(?=^((?:{}).*?^(?:{})))".format("|".join(prefixes), "|".join(suffixes))
# Or, a version with word boundaries:
# rx = r"(?=^((?:{})b.*?^(?:{})b))".format("|".join(prefixes), "|".join(suffixes))
all_matches = re.findall(rx, s, re.S | re.M)
print(max(all_matches, key=len))


Output:



Item 1a ....
....
....
....
....
Item 2


The regex looks like



(?sm)(?=^((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b)))


With word boundaries



(?sm)(?=^((?:Item 1|Item 1a|Item 1b)b.*?^(?:Item 2|Item 2a|Item 2b)b))


See the regex demo.



Details





  • (?sm) - re.S and re.M flags


  • (?=^((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b))) - a positive lookahead that matches at any location that is immediately followed with a sequence of patterns:



    • ^ - start of a line


    • ((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b)) - Group 1 (this value is returned with re.findall)


    • (?:Item 1|Item 1a|Item 1b) - any of the items in the alternation (probably, it makes sense to add b word boundary after ) here)


    • .*? - any 0+ chars, as few as possible


    • ^ - start of a line


    • (?:Item 2|Item 2a|Item 2b) - any alternative from the list (probably, it also makes sense to add b word boundary after ) here).








share|improve this answer














edited Nov 19 at 22:22

























answered Nov 19 at 21:58









Wiktor Stribiżew













  • Thank you for your quick solution. One small issue I see on my test example (not shown here) is that it also matches suffixes that come anywhere in the string. I want to check only for prefix-suffix pairs where both the prefix and the suffix come at the beginning of a line. Sorry if the example in my question misled you to believe otherwise. :-)
    – sgokhales
    Nov 19 at 22:05












  • I've edited my question with example 2. So maybe that example 2 will help to explain my use-case better :-)
    – sgokhales
    Nov 19 at 22:10










  • Awesome. Thank you very much!
    – sgokhales
    Nov 19 at 22:15










  • @sgokhales Perhaps you actually want rx = r"(?=^((?:{})\b.*?^(?:{})\b))".format("|".join(prefixes), "|".join(suffixes)) to match the delimiters as whole words. See this Python demo.
    – Wiktor Stribiżew
    Nov 19 at 22:15












  • With the regex from your comment above, I get an error: max() arg is an empty sequence. Perhaps it didn't find any match.
    – sgokhales
    Nov 19 at 22:17
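The error in the last comment is what `max()` raises when `re.findall` returns an empty list. The thread does not say why nothing matched in that case, so the following is only a defensive sketch: it passes `default=None` to `max()` so the caller can handle the no-match case. The helper name `longest_match_or_none` is made up for illustration. One thing that may be worth checking is that the pattern is a raw string, since `"\b"` in a normal Python string is a backspace character rather than a word boundary.

```python
import re

def longest_match_or_none(rx, text):
    # Hypothetical helper: returns None instead of raising ValueError
    # when the pattern finds nothing.
    matches = re.findall(rx, text, re.S | re.M)
    return max(matches, key=len, default=None)

# Raw string, so \b reaches the regex engine as a word boundary.
rx = r"(?=^((?:Item 1)\b.*?^(?:Item 2)\b))"

print(longest_match_or_none(rx, "no delimiters here"))    # None
print(longest_match_or_none(rx, "Item 1 a\nItem 2 b"))
```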



































