Python text extraction from letter


I want to extract a certain part of a letter from a txt file with Python. The beginning and the end are marked by clear expressions (letter_begin / letter_end). My problem is that the "recording" of the text needs to start at the very first occurrence of any item in the letter_begin list and end at the very last occurrence of any item in the letter_end list (plus a 3-line buffer). I want to write the output text to a file. Here is my sample text and my code so far:

sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide this report to our shareholders and fellow shareholders. we thank you for your continued support.
Best regards,
Douglas - Director


Other random text in this lines """

letter_begin = ["dear", "to our shareholders", "fellow shareholders"]
letter_end = ["best regards", "respectfully submitted", "thank you for your continued support"]



with open(filename, 'r', encoding="utf-8") as infile, open("xyz.txt", mode='w', encoding="utf-8") as f:
    text = infile.read()
    lines = text.strip().split("\n")
    target_start_idx = None
    target_end_idx = None
    for index, line in enumerate(lines):
        line = line.lower()
        if any(beg in line for beg in letter_begin):
            target_start_idx = index
            continue
        if any(end in line for end in letter_end):
            target_end_idx = index + 3
            break

    if target_start_idx is not None:
        target = "\n".join(lines[target_start_idx : target_end_idx])
        f.write(str(target))

My desired output should be:

output = """Dear Shareholders: We are pleased to provide this report to our shareholders and fellow shareholders. we thank you for your continued support.
Best regards,
Douglas - Director

"""

asked Nov 20 at 13:47

Dominik Scheld

python nlp text-extraction


2 Answers

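Your loop gives you the last occurrence of an opening sequence, because target_start_idx is overwritten each time a line matches.

You should separate the read part into two loops, like this:

with open(filename, 'r', encoding="utf-8") as infile:
    text = infile.read()
    lines = text.strip().split("\n")
    target_start_idx = None
    target_end_idx = None

    # First loop: stop at the first line that contains an opening phrase.
    for index, line in enumerate(lines):
        line = line.lower()
        if any(beg in line for beg in letter_begin):
            target_start_idx = index
            break

    # Second loop: scan every line, so the last closing phrase wins.
    for index, line in enumerate(lines):
        line = line.lower()  # lowercase here too, otherwise "Best regards" never matches
        if any(end in line for end in letter_end):
            target_end_idx = index + 3

In this way, the first loop exits at the first occurrence of an opening sequence, while the second loop runs over every line, so target_end_idx ends up at the last closing sequence.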
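
As a rough sketch, the two passes can be wrapped in one function that also handles the case where no marker is found. The function name extract_letter and the buffer_lines parameter are hypothetical and not part of the original code. The end index follows the index + 3 convention used above: the slice end is exclusive, so the closing line and the two lines after it are kept.

```python
def extract_letter(text, begin_phrases, end_phrases, buffer_lines=3):
    """Return the text from the first opening phrase to the last closing
    phrase (plus a small buffer), or None if either marker is missing.

    Hypothetical helper: phrases are expected in lower case.
    """
    lines = text.split("\n")
    lowered = [line.lower() for line in lines]

    # First opening phrase: stop at the first matching line.
    start = next(
        (i for i, line in enumerate(lowered)
         if any(p in line for p in begin_phrases)),
        None,
    )

    # Last closing phrase: keep overwriting so the final match wins.
    end = None
    for i, line in enumerate(lowered):
        if any(p in line for p in end_phrases):
            end = i

    if start is None or end is None or end < start:
        return None
    # Same convention as "index + 3": the slice end is exclusive.
    return "\n".join(lines[start:end + buffer_lines])


sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide this report to our shareholders and fellow shareholders. we thank you for your continued support.
Best regards,
Douglas - Director


Other random text in this lines """

letter_begin = ["dear", "to our shareholders", "fellow shareholders"]
letter_end = ["best regards", "respectfully submitted", "thank you for your continued support"]

result = extract_letter(sample_text, letter_begin, letter_end)
print(result)
```

If the result is not None, it can be written to the output file with f.write(result), as in the question.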

edited Nov 20 at 14:12

answered Nov 20 at 14:08

Yakov Dan


Thanks a lot! Your solution captures the first occurrence in both cases, correct? But I want the last occurrence in the second case (letter_end). Any idea how to do this?
– Dominik Scheld
Nov 20 at 14:11

Yes, the idea is that the second loop should keep running until all lines have been compared against letter_end. I've edited the code to reflect this.
– Yakov Dan
Nov 20 at 14:13
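
One possible alternative to scanning every line, shown here only as a sketch: walk the lines backwards and stop at the first hit, which is by construction the last occurrence. The helper name last_match_index is hypothetical, and the short lines list is made up for illustration.

```python
def last_match_index(lines, phrases):
    """Index of the last line containing any phrase (case-insensitive),
    or None if no line matches. Hypothetical helper for illustration."""
    for index in range(len(lines) - 1, -1, -1):
        if any(p in lines[index].lower() for p in phrases):
            return index
    return None


letter_end = ["best regards", "respectfully submitted", "thank you for your continued support"]
lines = [
    "Dear Shareholders: we thank you for your continued support.",
    "Best regards,",
    "Douglas - Director",
    "",
]

end_idx = last_match_index(lines, letter_end)
print(end_idx)
```

Both the first and the second line contain a closing phrase here; the backward scan reports the second one.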


What about regular expressions?

import re

sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide this report to our shareholders
and fellow shareholders. we thank you for your continued support.
Best regards,
Douglas - Director


Other random text in this lines """

letter_begin = ["dear", "to our shareholders", "fellow shareholders"]
letter_end = ["best regards", "respectfully submitted", "thank you for your continued support"]

"""
# read from a file
with open(input_file, 'r') as fo:
    input_text = fo.read()
"""

input_text = sample_text

# \b restricts matches to whole words; re.escape guards against regex metacharacters in the phrases
begin_pattern = re.compile(r'\b(?:%s)\b' % '|'.join(map(re.escape, letter_begin)), flags=re.IGNORECASE)
end_pattern = re.compile(r'\b(?:%s)\b' % '|'.join(map(re.escape, letter_end)), flags=re.IGNORECASE)

# start of the first opening phrase (finditer yields every match with its position)
idx_begin = min(m.start() for m in begin_pattern.finditer(input_text))

# end of the last closing phrase
idx_end = max(m.end() for m in end_pattern.finditer(input_text))

# add 3 lines, assuming that there are always enough extra lines
extra_lines = '\n'.join(input_text[idx_end:].split('\n')[:4])
output_text = input_text[idx_begin:idx_end] + extra_lines

print(output_text)
# Dear Shareholders: We are pleased to provide this report to our shareholders
# and fellow shareholders. we thank you for your continued support.
# Best regards,
# Douglas - Director
#
#

"""
# write to a new file
with open(output_file, 'w') as fo:
    fo.write(output_text)
"""
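
A hedged sketch of the same regex idea wrapped in a function that returns None instead of raising when a marker is missing. The name extract_letter_regex and the buffer_lines parameter are hypothetical. The last lines also illustrate one difference from a plain substring test: with \b, "dear" does not match inside a longer word such as "endearing".

```python
import re


def extract_letter_regex(text, begin_phrases, end_phrases, buffer_lines=3):
    """Regex variant: first opening phrase to last closing phrase, plus the
    rest of the closing line and `buffer_lines` further lines.
    Hypothetical helper; returns None if a marker is missing."""
    begin_re = re.compile(
        r"\b(?:%s)\b" % "|".join(map(re.escape, begin_phrases)), re.IGNORECASE
    )
    end_re = re.compile(
        r"\b(?:%s)\b" % "|".join(map(re.escape, end_phrases)), re.IGNORECASE
    )

    first = begin_re.search(text)          # leftmost opening phrase
    ends = list(end_re.finditer(text))     # all closing phrases, in order
    if first is None or not ends or ends[-1].end() <= first.start():
        return None

    idx_end = ends[-1].end()
    tail = text[idx_end:].split("\n")[:buffer_lines + 1]
    return text[first.start():idx_end] + "\n".join(tail)


sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide this report to our shareholders
and fellow shareholders. we thank you for your continued support.
Best regards,
Douglas - Director


Other random text in this lines """

letter_begin = ["dear", "to our shareholders", "fellow shareholders"]
letter_end = ["best regards", "respectfully submitted", "thank you for your continued support"]

result = extract_letter_regex(sample_text, letter_begin, letter_end)
print(result)

# Word boundaries: a substring test fires inside "endearing", the regex does not.
line = "An endearing tale"
substring_hit = any(p in line.lower() for p in letter_begin)
regex_hit = re.search(r"\bdear\b", line, re.IGNORECASE) is not None
print(substring_hit, regex_hit)
```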

answered Nov 20 at 15:15

Vlad
