Python text extraction from letter - index











I want to extract a certain part of a letter from a txt file with Python. The beginning and the end are marked by clear beginning / ending expressions (letter_begin / letter_end). My problem is that the "recording" of the text needs to start at the very first occurrence of any item in the letter_begin list and end at the very last occurrence of any item in the letter_end list (plus a 3-line buffer). I want to write the output text to a file. Here is my sample text and my code so far:



sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide this report to our shareholders and fellow shareholders. we thank you for your continued support.
Best regards,
Douglas - Director


Other random text in this lines """

letter_begin = ["dear", "to our shareholders", "fellow shareholders"]
letter_end = ["best regards", "respectfully submitted", "thank you for your continued support"]

with open(filename, 'r', encoding="utf-8") as infile, open("xyz.txt", mode='w', encoding="utf-8") as f:
    text = infile.read()
    lines = text.strip().split("\n")
    target_start_idx = None
    target_end_idx = None
    for index, line in enumerate(lines):
        line = line.lower()
        if any(beg in line for beg in letter_begin):
            target_start_idx = index
            continue
        if any(end in line for end in letter_end):
            target_end_idx = index + 3
            break

    if target_start_idx is not None:
        target = "\n".join(lines[target_start_idx : target_end_idx])
        f.write(str(target))


My desired output should be:

output = """Dear Shareholders: We are pleased to provide this report to our shareholders and fellow shareholders. we thank you for your continued support.
Best regards,
Douglas - Director

"""









Tags: python, nlp, text-extraction






asked Nov 20 at 13:47 by Dominik Scheld

2 Answers






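Your loop gives you the last occurrence of an opening sequence, because target_start_idx is overwritten every time a later line matches.

You should separate the read part into two loops, like this:

with open(filename, 'r', encoding="utf-8") as infile:
    text = infile.read()

lines = text.strip().split("\n")
target_start_idx = None
target_end_idx = None

# First loop: stop at the first line that contains an opening phrase.
for index, line in enumerate(lines):
    line = line.lower()
    if any(beg in line for beg in letter_begin):
        target_start_idx = index
        break

# Second loop: no break, so the last matching closing line wins.
for index, line in enumerate(lines):
    line = line.lower()
    if any(end in line for end in letter_end):
        target_end_idx = index + 3

In this way, the first loop exits at the first occurrence of an opening sequence, while the second loop runs over all lines, so target_end_idx ends up pointing at the last occurrence of a closing sequence.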
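
For reference, here is a minimal end-to-end sketch of the two-loop idea, run on an in-memory copy of the question's sample text rather than a file. The function name `extract_letter` and the `buffer` parameter are illustrative assumptions and do not appear in the original code.

```python
def extract_letter(text, begin_phrases, end_phrases, buffer=3):
    """Return the lines from the first opening phrase to the last closing phrase.

    `buffer` is added to the index of the last closing line to form the
    slice end, mirroring `index + 3` in the code above.
    """
    lines = text.strip().split("\n")
    start = None
    end = None

    # First opening phrase: stop as soon as one is found.
    for index, line in enumerate(lines):
        if any(p in line.lower() for p in begin_phrases):
            start = index
            break

    # Last closing phrase: keep scanning, later matches overwrite earlier ones.
    for index, line in enumerate(lines):
        if any(p in line.lower() for p in end_phrases):
            end = index + buffer

    if start is None or end is None:
        return None  # no opening phrase or no closing phrase found
    return "\n".join(lines[start:end])


sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide this report to our shareholders and fellow shareholders. we thank you for your continued support.
Best regards,
Douglas - Director


Other random text in this lines """

letter_begin = ["dear", "to our shareholders", "fellow shareholders"]
letter_end = ["best regards", "respectfully submitted", "thank you for your continued support"]

result = extract_letter(sample_text, letter_begin, letter_end)
print(result)
```

Returning `None` when a phrase is missing keeps the caller from slicing with a half-defined range; the result can then be written out with an ordinary `open(..., "w", encoding="utf-8")` block.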















answered Nov 20 at 14:08 by Yakov Dan (edited Nov 20 at 14:12)













• Thanks a lot! Your solution captures the first occurrence in both cases, correct? But I want the last occurrence in the second case (letter_end). Any idea how to do this?
  – Dominik Scheld
  Nov 20 at 14:11

• Yes, the idea is that the loop should continue running until all lines are compared against letter_end. I've edited the code to reflect this.
  – Yakov Dan
  Nov 20 at 14:13
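
An alternative to comparing every line is to walk the lines from the end and stop at the first hit, which is the last occurrence in document order. The helper name `last_match_index` is hypothetical, and the three-line list is a shortened stand-in for the real file; this is only a sketch.

```python
def last_match_index(lines, phrases):
    """Return the index of the last line containing any phrase, or None."""
    # Walk backwards so the first hit is the last occurrence in the text.
    for index in range(len(lines) - 1, -1, -1):
        if any(p in lines[index].lower() for p in phrases):
            return index
    return None


lines = [
    "Dear Shareholders: we thank you for your continued support.",
    "Best regards,",
    "Douglas - Director",
]
letter_end = ["best regards", "respectfully submitted", "thank you for your continued support"]

print(last_match_index(lines, letter_end))  # 1
```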


















What about regular expressions?

import re

sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide this report to our shareholders
and fellow shareholders. we thank you for your continued support.
Best regards,
Douglas - Director


Other random text in this lines """

letter_begin = ["dear", "to our shareholders", "fellow shareholders"]
letter_end = ["best regards", "respectfully submitted", "thank you for your continued support"]

"""
# read from a file
with open(input_file, 'r') as fo:
    input_text = fo.read()
"""

input_text = sample_text

begin_pattern = re.compile(r'\b(?:%s)\b' % '|'.join(letter_begin), flags=re.IGNORECASE)
end_pattern = re.compile(r'\b(?:%s)\b' % '|'.join(letter_end), flags=re.IGNORECASE)

# start of the first opening phrase
match_begin = list(begin_pattern.finditer(input_text))
idx_begin = min(m.start() for m in match_begin)

# end of the last closing phrase (finditer also sees repeated phrases)
match_end = list(end_pattern.finditer(input_text))
idx_end = max(m.end() for m in match_end)

# add 3 lines, assuming that there are always enough extra lines
extra_lines = '\n'.join(input_text[idx_end:].split('\n')[:4])
output_text = input_text[idx_begin:idx_end] + extra_lines

print(output_text)
# Dear Shareholders: We are pleased to provide this report to our shareholders
# and fellow shareholders. we thank you for your continued support.
# Best regards,
# Douglas - Director
#
#

"""
# write to a new file
with open(output_file, 'w') as fo:
    fo.write(output_text)
"""
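
The regex idea can also be wrapped in a function that escapes the phrases and handles text without a match. The name `extract_span`, the `extra_lines` parameter and the short test string are assumptions made for this sketch, not part of the answer.

```python
import re


def extract_span(text, begin_phrases, end_phrases, extra_lines=3):
    """Slice `text` from the first opening phrase to the last closing phrase.

    The rest of the closing line plus `extra_lines` following lines are kept.
    Returns None if either kind of phrase is missing.
    """
    # re.escape guards against phrases that contain regex metacharacters.
    begin_re = re.compile(
        r"\b(?:%s)\b" % "|".join(map(re.escape, begin_phrases)), re.IGNORECASE
    )
    end_re = re.compile(
        r"\b(?:%s)\b" % "|".join(map(re.escape, end_phrases)), re.IGNORECASE
    )

    first_begin = begin_re.search(text)        # leftmost opening phrase
    end_matches = list(end_re.finditer(text))  # all closing phrases, in order
    if first_begin is None or not end_matches:
        return None

    idx_begin = first_begin.start()
    idx_end = end_matches[-1].end()

    # Remainder of the closing line, then `extra_lines` more lines.
    tail = "\n".join(text[idx_end:].split("\n")[: extra_lines + 1])
    return text[idx_begin:idx_end] + tail


text = (
    "intro\n"
    "Dear all, thank you for your continued support.\n"
    "Best regards,\n"
    "Douglas\n"
    "\n"
    "\n"
    "footer"
)
extracted = extract_span(
    text, ["dear"], ["best regards", "thank you for your continued support"]
)
print(extracted)
```

Using `search` for the opening phrase and the last `finditer` match for the closing phrase avoids a second lookup by matched string, so repeated closing phrases are handled by position rather than by text.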
















answered Nov 20 at 15:15 by Vlad





























