Email parsing test dataset












I am evaluating email parsing libraries for an Elixir/Erlang project and am trying to figure out which one is "best", or whether I should build my own. The criterion I am using for "best" is which library is the most RFC compliant.

The problem I am facing is that (unsurprisingly) each library has its own tests, so if I want to compare apples to apples I need to run them all against the same tests.

Is there a collection of test emails available that I can use for evaluation? Or am I better off copying tests from a more active Java/Ruby/Python library?

Tags: email, elixir, email-parsing

asked Nov 23 '18 at 16:20 by Tyler

2 Answers

I don't think you will find a complete test suite for e-mail parsing in Elixir, but it would be a very nice project to work on.

If I were starting a project like that, I would probably take the tests from an existing library, evaluate how complete they are against the RFC, and build a generic way to run them against any library.
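
One way to picture the "generic way to run tests against any library" idea is a small adapter harness. The sketch below is in Python purely for illustration (an Elixir version would follow the same shape); the names `stdlib_adapter`, `run_suite` and `CASES` are hypothetical, and Python's built-in `email` package stands in for a real candidate library.

```python
from email import message_from_bytes
from typing import Callable, Dict, List, Tuple

# A "summary" is a small set of comparable facts extracted from a parsed message.
Summary = Dict[str, object]
Adapter = Callable[[bytes], Summary]


def stdlib_adapter(raw: bytes) -> Summary:
    """Adapter for Python's built-in email package (stand-in for a candidate library)."""
    msg = message_from_bytes(raw)
    return {
        "subject": msg["Subject"],
        "content_type": msg.get_content_type(),
        "part_count": sum(1 for _ in msg.walk()),
    }


def run_suite(adapters: Dict[str, Adapter],
              cases: List[Tuple[bytes, Summary]]) -> Dict[str, int]:
    """Return the number of passing cases per adapter; a crash counts as a failure."""
    scores = {name: 0 for name in adapters}
    for raw, expected in cases:
        for name, adapter in adapters.items():
            try:
                if adapter(raw) == expected:
                    scores[name] += 1
            except Exception:
                pass  # a parser that raises simply does not score on this case
    return scores


# One shared fixture: raw message bytes plus the summary every parser should produce.
CASES = [
    (b"Subject: hello\r\nContent-Type: text/plain\r\n\r\nbody\r\n",
     {"subject": "hello", "content_type": "text/plain", "part_count": 1}),
]

print(run_suite({"stdlib": stdlib_adapter}, CASES))
```

Each library under evaluation gets its own adapter that maps its parse result onto the same summary shape, so the fixtures only have to be written once.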



DockYard/elixir-mail/blob/master/test/mail/parsers/rfc_2822_test.exs can be a good starting point for you.


answered Nov 23 '18 at 16:44 by Marcos Tapajós


I have a collection of mbox files that I use for testing MIME parsers:

https://github.com/jstedfast/MimeKit/tree/master/UnitTests/TestData/mbox



That link is a directory containing a few *.mbox.txt files and their equivalent summary files (which are just some metadata about each message that should be easy to get from the message once the parser has parsed it from the mbox).
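
As a rough illustration of the mbox-plus-summary idea (this is not the MimeKit summary format), the Python sketch below writes a tiny two-message mbox to a temporary file and derives one summary row per message with the standard `mailbox` module. The sample data and the `summarize` helper are made up for the example.

```python
import mailbox
import os
import tempfile

# A made-up two-message mbox; the second message is a one-part multipart.
SAMPLE_MBOX = (
    "From alice@example.com Thu Jan  1 00:00:00 2004\n"
    "From: alice@example.com\n"
    "Subject: first\n"
    "\n"
    "hello\n"
    "\n"
    "From bob@example.com Thu Jan  1 00:01:00 2004\n"
    "From: bob@example.com\n"
    "Subject: second\n"
    "Content-Type: multipart/mixed; boundary=\"b\"\n"
    "\n"
    "--b\n"
    "Content-Type: text/plain\n"
    "\n"
    "part one\n"
    "--b--\n"
)


def summarize(path):
    """Return one (from, subject, [content types]) tuple per message in the mbox."""
    box = mailbox.mbox(path, create=False)
    try:
        return [
            (msg["From"], msg["Subject"], [p.get_content_type() for p in msg.walk()])
            for msg in box
        ]
    finally:
        box.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.mbox")
    with open(path, "w", newline="\n") as fh:
        fh.write(SAMPLE_MBOX)
    SUMMARY = summarize(path)

for row in SUMMARY:
    print(row)
```

Comparing rows like these against a stored, known-good summary is one way to turn an mbox into a regression test for any parser.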



There are also some *.html files, which are just the extracted HTML message bodies used for testing the logic that figures out which body part is the actual message body. You can probably ignore those, as they're not really about RFC compliance.

The main mbox to look at and use is the jwz.mbox.txt file - that's the mbox file I got from Jamie Zawinski of Netscape Mail fame back in the early 2000s for testing Netscape Mail's parser.



simple.mbox.txt is a very short mbox of 3 messages with nested multiparts using different sets of boundary markers. The second and third messages are the two most likely to break parsers (the first might break random MIME parsers written by newbies on SourceForge or GitHub, but nothing seriously written). The second message has all of its nested multiparts using boundary="x", which will break parsers that don't use a boundary stack. The third message has nested multiparts that all use an empty-string boundary (i.e. boundary="").
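
To make the boundary-stack point concrete, here is a minimal Python sketch under one plausible reading of the term: every open multipart pushes its boundary, and each body line is compared only against the innermost one. The sample message and the `outline` function are invented for illustration; this is not MimeKit's implementation, and it ignores preambles, folded headers and the empty-boundary case.

```python
from email.parser import HeaderParser

# Invented sample: the outer and the inner multipart both use boundary="x".
SAME_BOUNDARY = """\
Content-Type: multipart/mixed; boundary="x"

--x
Content-Type: multipart/alternative; boundary="x"

--x
Content-Type: text/plain

inner text
--x--
--x
Content-Type: text/html

<p>outer sibling</p>
--x--
""".splitlines()


def outline(lines):
    """Return (depth, content_type) per entity, matching only the innermost boundary."""
    stack = []        # boundaries of the multiparts currently open, innermost last
    result = []
    headers = []      # header lines of the entity currently being read
    in_headers = True
    for line in lines:
        if in_headers:
            if line.strip():
                headers.append(line)
                continue
            # Blank line: the header block is complete, so record the entity.
            entity = HeaderParser().parsestr("\n".join(headers) + "\n")
            result.append((len(stack), entity.get_content_type()))
            if entity.get_content_maintype() == "multipart":
                boundary = entity.get_boundary()
                if boundary is not None:
                    stack.append(boundary)
            headers, in_headers = [], False
        elif stack and line.rstrip() == "--" + stack[-1] + "--":
            stack.pop()            # closing delimiter ends the innermost multipart
        elif stack and line.rstrip() == "--" + stack[-1]:
            in_headers = True      # delimiter starts the next part of the innermost multipart
    return result


for depth, ctype in outline(SAME_BOUNDARY):
    print("  " * depth + ctype)
```

A parser that keeps only a single "current boundary" has no way to tell that the first `--x--` closes the inner multipart rather than the outer one, which is the failure the second message is designed to expose.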



Then there's content-length.mbox.txt, for testing that the parser properly handles Content-Length headers.
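
Why Content-Length matters in an mbox: if a body contains an unescaped line starting with "From ", a splitter that only looks for "From " lines will cut the message in two. The Python sketch below shows one hedged way to honour the header, trusting it only when it lands on the end of the file or on the next envelope line. The `split_mbox` function and the sample data are assumptions made for this example, not taken from any particular library.

```python
import re


def split_mbox(data: bytes):
    """Split mbox bytes into messages, trusting Content-Length when it is plausible."""
    messages = []
    pos = 0
    while pos < len(data):
        if not data.startswith(b"From ", pos):
            raise ValueError(f"expected a 'From ' line at offset {pos}")
        start = data.index(b"\n", pos) + 1            # skip the "From " envelope line
        header_end = data.find(b"\n\n", start)
        if header_end == -1:                          # headers only, no body
            messages.append(data[start:])
            break
        body_start = header_end + 2
        match = re.search(rb"^Content-Length:[ \t]*(\d+)[ \t]*$",
                          data[start:header_end + 1], re.IGNORECASE | re.MULTILINE)
        end = None
        if match:
            candidate = body_start + int(match.group(1))
            rest = data[candidate:].lstrip(b"\n")
            # Trust the header only if it lands on EOF or on the next envelope line.
            if candidate <= len(data) and (not rest or rest.startswith(b"From ")):
                end = candidate
        if end is None:
            # Fallback: scan for the next line that starts with "From ".
            nxt = data.find(b"\nFrom ", body_start - 1)
            end = len(data) if nxt == -1 else nxt + 1
        messages.append(data[start:end])
        pos = end
        while pos < len(data) and data[pos:pos + 1] == b"\n":
            pos += 1                                  # skip blank separator lines
    return messages


# The first body contains an unescaped "From " line on purpose.
BODY = b"line one\nFrom here on, this is still body text\n"
MBOX = (
    b"From a@example.com Thu Jan  1 00:00:00 2004\n"
    b"Subject: one\n"
    b"Content-Length: " + str(len(BODY)).encode() + b"\n"
    b"\n" + BODY +
    b"\n"
    b"From b@example.com Thu Jan  1 00:01:00 2004\n"
    b"Subject: two\n"
    b"\n"
    b"bye\n"
)

print(len(split_mbox(MBOX)), "messages with Content-Length honoured")
print(len(MBOX.split(b"\nFrom ")), "chunks from a naive split")
```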



unmunged.mbox.txt looks like it was accidentally committed. I think I wrote it to see what Thunderbird did with Content-Length headers and unmunged From lines.

Anyway, to see how I generated the output for the summary files, you can check out https://github.com/jstedfast/MimeKit/blob/master/UnitTests/MimeParserTests.cs#L624

Helper methods such as DumpMimeTree are all defined above that method in the file.

I've got a very similar test suite for my C MIME parser as well (if you'd rather read C than C#): https://github.com/jstedfast/gmime/blob/master/tests/test-parser.c

Additional Thoughts:

One thing to keep in mind when evaluating MIME parsers is that you don't really want strict RFC compliance in parsing, because that means a lot of messages will fail to parse. What you really want is a library that handles as much brokenness as possible on input while outputting new messages that conform strictly to the RFCs (as much as possible, anyway).

While those mbox files should help make sure that the parsers you test are at least robust enough to handle them, that's not necessarily the be-all and end-all of testing.



One of the next things I do when evaluating a MIME parser is check how it parses address headers. Does it do something stupid like splitting the header value on commas? If so, it's out. I would say it had better use a tokenizer approach, or it's probably not even worth considering.
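
A quick way to see why comma-splitting fails, sketched in Python with the standard library's `email.utils.getaddresses` standing in for a tokenizing parser. The header value is invented for the example.

```python
from email.utils import getaddresses

# A quoted display name may legally contain a comma.
HEADER = '"Doe, John" <john@example.com>, Jane Roe <jane@example.com>'

# Naive approach: split on every comma, including the one inside the quotes.
naive = [piece.strip() for piece in HEADER.split(",")]

# Tokenizing approach: quoted strings are kept intact.
parsed = getaddresses([HEADER])

print(naive)
print(parsed)
```

The naive split produces three fragments for two mailboxes, while the tokenizing parser returns two (name, address) pairs. The same header makes a handy probe for any library under evaluation.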



The same goes for RFC 2047 decoding.
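
For readers who have not met RFC 2047 encoded-words, here is a small Python probe using the standard `email.header` module. The `decode` helper is just a convenience wrapper written for this example; the second input checks one of the classic pitfalls, namely that whitespace between two adjacent encoded-words must not appear in the decoded text.

```python
from email.header import decode_header, make_header


def decode(value: str) -> str:
    """Decode any RFC 2047 encoded-words in a header value into a Unicode string."""
    return str(make_header(decode_header(value)))


# A single quoted-printable encoded-word.
single = decode("=?utf-8?q?Caf=C3=A9?=")

# Two adjacent encoded-words (base64, then quoted-printable) separated by a space.
adjacent = decode("=?utf-8?b?Q2Fmw6k=?= =?utf-8?q?_au_lait?=")

print(single)
print(adjacent)
```

Feeding the same inputs to each candidate library and comparing the decoded strings is a cheap first check before moving on to the nastier cases described in the linked posts.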



Here's a rant I wrote back in 2013 when I was in your position looking for a reasonably good MIME parser for C#/.NET: https://jeffreystedfast.blogspot.com/2013/09/time-for-rant-on-mime-parsers.html

This links back to an earlier post I had written, which is a rant about why decoding headers (RFC 2047) is hard to get right: https://jeffreystedfast.blogspot.com/2013/08/why-decoding-rfc2047-encoded-headers-is.html

I guess the problem with trying to evaluate a MIME parser/email library is that you kind of need to be intimately familiar with the specifications in order to have much confidence in evaluating them beyond the simple "can it parse my random set of messages?"

I hope that this has been helpful, but... yeah, if your experience is anything like mine was back in 2013 looking for a decent C# parser, you're going to need to write your own. Just please, please, please read and follow the specs if you do, because otherwise you just end up giving other email devs nightmares.

edited Nov 24 '18 at 14:58

answered Nov 24 '18 at 12:36 by jstedfast