Why are there different size limitations in Watson NLC for training (1024 chars) and for production (2048...





IBM Watson Natural Language Classifier (NLC) limits the text values in the training set to 1024 characters:
https://console.bluemix.net/docs/services/natural-language-classifier/using-your-data.html#training-limits .



However, the trained model can then classify any text of at most 2048 characters:
https://console.bluemix.net/apidocs/natural-language-classifier#classify-a-phrase .



This difference confuses me: as far as I know, the same pre-processing should be applied in both the training phase and the production phase. So if I have to truncate the training data at 1024 chars, I would also truncate at 1024 chars in production.



Is my reasoning correct? Should I truncate the text in production at 1024 chars (as I think I should) or at 2048 chars (perhaps because 1024 chars are too few)?
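To make the two options concrete, here is a minimal sketch of the truncation I have in mind. The helper `cap` and the constant names are hypothetical, and the sketch assumes the limits are counted in Python string characters, which may not match how the service counts them.

```python
# Limits as stated in the documentation linked above.
TRAINING_LIMIT = 1024        # cap on text values in the training set
CLASSIFICATION_LIMIT = 2048  # cap on text sent for classification


def cap(text, limit):
    """Keep only the first `limit` characters of `text` (hypothetical helper)."""
    return text[:limit]


review = "x" * 3000  # stand-in for a long production text

# Option 1: same cut as in training.
same_as_training = cap(review, TRAINING_LIMIT)

# Option 2: give the classifier as much text as it accepts.
as_much_as_allowed = cap(review, CLASSIFICATION_LIMIT)
```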



Thank you in advance!

















      ibm-watson nl-classifier
















      asked Nov 26 '18 at 15:44









      Rosa

























          1 Answer














          I recently had the same question, and one of the answers on an article clarified it:




          Currently, the limits are set at 1024 for training and 2048 for
          testing/classification. The 1024 limit may require some curation of
          the training data prior to training. Most organizations who require
          larger character limits for their data end up chunking their input
          text into 1024 chunks. Additionally, in use cases with data similar to
          the Airbnb reviews, the primary category can typically be assessed
          within the first 2048 characters since there is often a lot of noise
          in lengthy reviews.




          Here's the link to the article
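The chunking mentioned in the quote could look roughly like the sketch below. It is only an illustration under assumptions: `chunk_text` is a hypothetical helper, the limit is counted in Python string characters, and preferring a break at a space is a design choice of this sketch, not something the article prescribes.

```python
def chunk_text(text, limit=1024):
    """Split `text` into chunks of at most `limit` characters.

    Breaks at the last space at or before the limit when there is one, so
    words are not cut in half; falls back to a hard cut for a single token
    longer than `limit`.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    chunks = []
    rest = text.strip()
    while len(rest) > limit:
        # Last space whose index is <= limit, so the chunk fits in `limit`.
        cut = rest.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no usable space: hard cut
        chunks.append(rest[:cut].rstrip())
        rest = rest[cut:].lstrip()
    if rest:
        chunks.append(rest)
    return chunks


long_review = "word " * 500  # about 2500 characters
pieces = chunk_text(long_review)
```

Each element of `pieces` then fits the 1024-character training limit and can be used as a separate training example or classified on its own.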






























          • @Rosa Post your questions here. I will try to answer to the best of my knowledge.

            – Vidyasagar Machupalli
            Nov 28 '18 at 10:46











          • Thank you! This link shows 2 scenarios, but neither of them resolves my doubts. If 2048 chars are enough to capture the relevant information but 1024 chars are not, I will end up with a classifier trained on irrelevant text, so using 2048 chars at test time would perform badly because of the 1024-char limit during training. If neither 2048 nor 1024 chars are enough, I could indeed chunk the original input text into smaller pieces, but merging all the resulting classes into one final class is not trivial. What do you think?

            – Rosa
            Nov 28 '18 at 10:57













          • The last slide of the best practices presentation talks about decomposing a large dataset and then using NLC on top of it. Here's the link to the presentation: ibm.com/watson/assets-watson/pdf/…

            – Vidyasagar Machupalli
            Nov 28 '18 at 12:06











          • Hi, thank you. The case in which 1024 chars are not enough is now clear: we need to apply business rules or meta-classifiers that combine the sentence-level results into a single final result. However, when 1024 chars are already enough to train a good classifier, I am still wondering whether to cut at 1024 or 2048 in production: since I train the model on text of at most 1024 chars, shouldn't I also cut at 1024 in production? Or is it better to give Watson as much information as possible and cut at 2048?

            – Rosa
            Nov 28 '18 at 15:48











          • I would say stick with 2048 for testing/classification (production).

            – Vidyasagar Machupalli
            Nov 29 '18 at 0:19
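The merging step discussed in the comments above can be sketched as follows. This is a minimal illustration under assumptions: each chunk is assumed to have been classified already, with its result available as a list of (class_name, confidence) pairs; `merge_chunk_results` is a hypothetical helper, not part of the Watson API; and averaging confidences is just one possible business rule.

```python
from collections import defaultdict


def merge_chunk_results(chunk_results):
    """Combine per-chunk class confidences into one final class.

    `chunk_results` holds one entry per chunk; each entry is a list of
    (class_name, confidence) pairs. The rule used here is to average the
    confidence of each class over all chunks and pick the highest average.
    """
    if not chunk_results:
        raise ValueError("need at least one chunk result")
    totals = defaultdict(float)
    for result in chunk_results:
        for class_name, confidence in result:
            totals[class_name] += confidence
    n_chunks = len(chunk_results)
    averages = {name: total / n_chunks for name, total in totals.items()}
    top_class = max(averages, key=averages.get)
    return top_class, averages


# Made-up confidences for three chunks of one long text.
example = [
    [("a", 0.9), ("b", 0.1)],
    [("a", 0.4), ("b", 0.6)],
    [("a", 0.3), ("b", 0.7)],
]
top_class, averages = merge_chunk_results(example)
```

With these made-up numbers, averaging selects class "a" even though two of the three chunks individually prefer "b", so the choice between averaging, majority voting, or a trained meta-classifier is worth validating on held-out data.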
























          answered Nov 27 '18 at 8:20









          Vidyasagar Machupalli













