Regex for longest matching sequence between two strings

I searched Google for my use case but didn't find anything useful.

I am not an expert in regular expressions, so I would appreciate any help from the community.

Question:

Given a text file, I want to capture the longest string between two substrings (a prefix and a suffix) using regex. Note that both substrings always appear at the start of a line. Please see the examples below.

Substrings:

prefixes = ['Item 1', 'Item 1a', 'Item 1b']

suffixes = ['Item 2', 'Item 2a', 'Item 2b']

Example 1:

Item 1 ....

Item 2 ....

Item 1 ....

....

....

Item 2 ....

Item 1 ....

Item 2

Item 1a ....

....

....

....

....

Item 2b ....

Expected Result:

Item 1a ....

....

....

....

....

Why this result?

Because the prefix Item 1a and the suffix Item 2b enclose the longest string of all the prefix-suffix pairs in the text.

Example 2:

Item 1 ....

Item 2 ....

Item 1 ....

....

....

Item 2

....
Item 1 ....

Item 2

Item 1a ....
....

....

....

.... Item 2b

....

Expected result:

Item 1 ....

....

....

Why this result?

Because this is the longest string between a prefix-suffix pair in which both the prefix and the suffix start at the beginning of a line. Note that there is another pair (Item 1a-Item 2b), but since Item 2b does not appear at the beginning of a line, that sequence cannot be considered.

What I have tried with regex:

I tried the regex below for each prefix-suffix pair in my lists above, but it didn't work.

regexs = [r'^' + re.escape(pre) + '(.*?)' + re.escape(suf) for pre in prefixes for suf in suffixes]

for regex in regexs:
    re.findall(regex, text, re.MULTILINE)

What I have tried using non-regex (Python string functions):

def extract_longest_match(text, prefixes, suffixes):
    longest_match = ''
    for line in text.splitlines():
        if line.startswith(tuple(prefixes)):
            beg_index = text.index(line)
            for suf in suffixes:
                end_index = text.find(suf, beg_index+len(line))
                match = text[beg_index:end_index]
                if len(match) > len(longest_match):
                    longest_match = match
    return longest_match

Am I missing anything?

edited Nov 19 at 22:10

asked Nov 19 at 21:03

sgokhales

35k reputation (badges: 26 gold, 105 silver, 138 bronze)


Regex can't match the longest substring. Use the regex or a non-regex solution to find the substrings and find the longest using the common language means.
– Wiktor Stribiżew
Nov 19 at 21:06

Yes, checking the length of the strings returned by the regex is a possibility. But do you know how to get the text between a prefix and a suffix where both start at the beginning of a line? Is my regex correct?
– sgokhales
Nov 19 at 21:07

@WiktorStribiżew Regex can match the longest substring. The problem is that with (.*?) OP is explicitly using a non-greedy regex. Should probably only be (.*) instead of (.*?).
– quant
Nov 19 at 21:07

@quant That .* won't yield the longest substring, that is just matching from the leftmost occurrence of the leading delimiter till the rightmost occurrence of trailing delimiter. And that is not the same. It is a common confusion of greediness and longest/shortest substring extraction.
– Wiktor Stribiżew
Nov 19 at 21:09

@WiktorStribiżew What am I missing here then? What's the correct regex to match the string between a prefix and a suffix? I can then check the lengths and keep the longest string in a variable.
– sgokhales
Nov 19 at 21:19
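
To illustrate the greedy-versus-longest distinction raised in the comments above, here is a minimal sketch. The START/END delimiters and the sample text are made up for the illustration and do not come from the question. Greedy .* yields a single span from the leftmost START to the rightmost END, while lazy .*? yields one candidate per pair, and the longest candidate is then picked with ordinary Python.

```python
import re

# Hypothetical sample: two START...END pairs of different lengths.
text = "START a END\nSTART bbbbbb END\n"

# Greedy: runs from the leftmost START to the rightmost END (one big match).
greedy = re.findall(r"START.*END", text, re.S)

# Lazy: each match stops at the nearest END, giving one candidate per pair.
lazy = re.findall(r"START.*?END", text, re.S)

# Choosing the longest candidate is done in Python, not by the regex.
longest = max(lazy, key=len, default="")

print(greedy)
print(lazy)
print(longest)
```

The greedy result is not "the longest pair"; it simply swallows both pairs at once, which is the confusion the comment describes.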

python regex string pattern-matching string-matching


1 Answer

Accepted answer (score: 1)

You need to:

- Build a regex that matches strings from the leftmost starting delimiter to the leftmost trailing delimiter (see Match text between two strings with regular expression)
- Make sure the delimiters are matched at line start positions only
- Make sure the . matches line break chars by using re.DOTALL or an equivalent option (see Python regex, matching pattern over multiple lines)
- Make sure the regex matches overlapping substrings (see Python regex find all overlapping matches)
- Find all matches in the text (see How can I find all matches to a regular expression in Python?)
- Get the longest one (see Python's most efficient way to choose longest string in list?).

Python demo:

import re

s="""Item 1 ....

Item 2 ....

Item 1 ....

....

....

Item 2 ....

Item 1 ....

Item 2

Item 1a ....

....

....

....

....

Item 2b ...."""

prefixes = ['Item 1', 'Item 1a', 'Item 1b']
suffixes = ['Item 2', 'Item 2a', 'Item 2b']

rx = r"(?=^((?:{}).*?^(?:{})))".format("|".join(prefixes), "|".join(suffixes))
# Or, a version with word boundaries:
# rx = r"(?=^((?:{})\b.*?^(?:{})\b))".format("|".join(prefixes), "|".join(suffixes))

all_matches = re.findall(rx, s, re.S | re.M)
print(max(all_matches, key=len))

Output:

Item 1a ....

....

....

....

....

Item 2

The regex looks like

(?sm)(?=^((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b)))

With word boundaries

(?sm)(?=^((?:Item 1|Item 1a|Item 1b)\b.*?^(?:Item 2|Item 2a|Item 2b)\b))

See the regex demo.
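
As a quick check of the word-boundary variant, the sketch below runs it on an input shaped like Example 2 from the question. The blank lines of the example are dropped for brevity, which is an assumption about the real data. The point is that the Item 1a block produces no candidate, because its Item 2b does not sit at the start of a line.

```python
import re

# Condensed stand-in for Example 2 (blank lines omitted; an assumption).
s = "\n".join([
    "Item 1 ....",
    "Item 2 ....",
    "Item 1 ....",
    "....",
    "....",
    "Item 2",
    "....",
    "Item 1 ....",
    "Item 2",
    "Item 1a ....",
    "....",
    "....",
    ".... Item 2b",  # this suffix is NOT at the start of its line
    "....",
])

prefixes = ['Item 1', 'Item 1a', 'Item 1b']
suffixes = ['Item 2', 'Item 2a', 'Item 2b']

# Same pattern as the word-boundary version, written as a raw string.
rx = r"(?=^((?:{})\b.*?^(?:{})\b))".format("|".join(prefixes), "|".join(suffixes))

all_matches = re.findall(rx, s, re.S | re.M)
longest = max(all_matches, key=len)
print(longest)
```

Note that each captured value includes the trailing delimiter itself, so it would still need to be stripped if only the text between the delimiters is wanted.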

Details

(?sm) - re.S and re.M flags

(?=^((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b))) - a positive lookahead that matches at any location that is immediately followed by this sequence of patterns:
- ^ - start of a line
- ((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b)) - Group 1 (this is the value that re.findall returns), consisting of:
  - (?:Item 1|Item 1a|Item 1b) - any of the items in the alternation (it probably makes sense to add a \b word boundary after the closing ) here)
  - .*? - any 0+ chars, as few as possible
  - ^ - start of a line
  - (?:Item 2|Item 2a|Item 2b) - any alternative from the list (it probably also makes sense to add a \b word boundary after the closing ) here).
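
To see what the lookahead wrapper changes, the sketch below compares a consuming pattern with the same pattern placed inside (?=(...)). The three-line input and the shortened alternations are made up for the illustration. A consuming match swallows the text up to the suffix, so a prefix line inside that region cannot start another match. The zero-width lookahead lets re.findall report a candidate at every qualifying line start.

```python
import re

# Hypothetical input: a second prefix line sits inside the first candidate.
text = "Item 1 x\nItem 1a y\nItem 2 z"

# Consuming version: prefix at line start, lazily up to a suffix at line start.
plain = r"^(?:Item 1|Item 1a)\b.*?^(?:Item 2)\b"

# Same pattern inside a capturing group within a lookahead (zero-width match).
wrapped = r"(?=(" + plain + r"))"

consuming = re.findall(plain, text, re.S | re.M)
overlapping = re.findall(wrapped, text, re.S | re.M)

print(consuming)    # one candidate
print(overlapping)  # two candidates, the second nested in the first
```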

edited Nov 19 at 22:22

answered Nov 19 at 21:58

Wiktor Stribiżew

303k reputation (badges: 16 gold, 123 silver, 199 bronze)

Thank you for your quick solution. One small issue I see on my test example (not shown here) is that it also matches suffixes that appear anywhere in the string. I want to find prefix-suffix pairs where both the prefix and the suffix come at the beginning of a line. Sorry if the example in my question misled you to believe otherwise. :-)
– sgokhales
Nov 19 at 22:05

I've edited my question with example 2. So maybe that example 2 will help to explain my use-case better :-)
– sgokhales
Nov 19 at 22:10

Awesome. Thank you very much!
– sgokhales
Nov 19 at 22:15

@sgokhales Perhaps, you actually want rx = r"(?=^((?:{})\b.*?^(?:{})\b))".format("|".join(prefixes), "|".join(suffixes)) to match the delimiters as whole words. See this Python demo.
– Wiktor Stribiżew
Nov 19 at 22:15

With the regex from your comment above, I get an error: max() arg is an empty sequence. Perhaps it didn't find any match.
– sgokhales
Nov 19 at 22:17
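
The max() error in the last comment is what Python raises whenever re.findall returns an empty list; the thread does not say why nothing matched in that case. One thing worth checking is that \b is written inside a raw string, since in a normal string literal "\b" is a backspace character. A defensive sketch follows, with longest_between as a hypothetical helper name. It escapes the delimiters with re.escape, tries longer alternatives first, and returns an empty string instead of raising when there is no match.

```python
import re

def longest_between(text, prefixes, suffixes):
    """Return the longest prefix-to-suffix span, or '' if nothing matches.

    Hypothetical helper; the returned span includes both delimiters.
    """
    # Escape each delimiter and put longer alternatives first,
    # so 'Item 1a' is tried before 'Item 1'.
    pre = "|".join(re.escape(p) for p in sorted(prefixes, key=len, reverse=True))
    suf = "|".join(re.escape(x) for x in sorted(suffixes, key=len, reverse=True))
    # Raw string so that \b stays a word boundary rather than a backspace.
    rx = r"(?=^((?:{})\b.*?^(?:{})\b))".format(pre, suf)
    # default='' avoids "max() arg is an empty sequence" when nothing matches.
    return max(re.findall(rx, text, re.S | re.M), key=len, default="")

prefixes = ['Item 1', 'Item 1a', 'Item 1b']
suffixes = ['Item 2', 'Item 2a', 'Item 2b']

sample = "Item 1 ....\nItem 2 ....\nItem 1a ....\n....\n....\nItem 2b ...."
print(longest_between(sample, prefixes, suffixes))
print(repr(longest_between("no delimiters here", prefixes, suffixes)))
```

The default argument of max() requires Python 3.4 or later.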
