Why are there different size limitations in Watson NLC for training (1024 chars) and for production (2048...
IBM Watson Natural Language Classifier (NLC) limits the text values in the training set to 1024 characters:
https://console.bluemix.net/docs/services/natural-language-classifier/using-your-data.html#training-limits .
However, the trained model can then classify any text of at most 2048 characters:
https://console.bluemix.net/apidocs/natural-language-classifier#classify-a-phrase .
This difference confuses me: I have always understood that the same pre-processing should be applied in both the training phase and the production phase. So, if I have to truncate the training data at 1024 chars, I would do the same in production.
Is my reasoning correct? Should I truncate the text in production at 1024 chars (as I think I should) or at 2048 chars (perhaps because 1024 chars are too few)?
Thank you in advance!
ibm-watson nl-classifier
asked Nov 26 '18 at 15:44
Rosa
1 Answer
I recently had the same question, and one of the answers on an article clarified it:
Currently, the limits are set at 1024 for training and 2048 for
testing/classification. The 1024 limit may require some curation of
the training data prior to training. Most organizations who require
larger character limits for their data end up chunking their input
text into 1024 chunks. Additionally, in use cases with data similar to
the Airbnb reviews, the primary category can typically be assessed
within the first 2048 characters since there is often a lot of noise
in lengthy reviews.
Here's the link to the article
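The chunking mentioned in the quote can be sketched as follows. This is only a minimal illustration, not how IBM or the article implements it: the function name `chunk_text` and the choice to break on spaces are assumptions, and the default of 1024 simply mirrors the training limit quoted above.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 1024) -> List[str]:
    """Split text into pieces of at most max_chars characters.

    Hypothetical helper: breaks at the last space inside the window
    when there is one, otherwise falls back to a hard cut.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks: List[str] = []
    remaining = text.strip()
    while len(remaining) > max_chars:
        # Last space at an index <= max_chars, so the chunk fits the limit.
        cut = remaining.rfind(" ", 0, max_chars + 1)
        if cut <= 0:
            cut = max_chars  # no usable space: cut mid-word
        chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks
```

Each returned chunk could then be used as a separate training text or sent to the classifier on its own. Breaking on spaces avoids splitting words, at the cost of chunks that are slightly shorter than the limit.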
@Rosa Post your questions here. I will try to answer to the best of my knowledge.
– Vidyasagar Machupalli
Nov 28 '18 at 10:46
Thank you! This link shows two scenarios, but neither resolves my doubt. If 2048 chars are enough to capture the relevant information but 1024 chars are not, I will end up with a classifier trained on irrelevant text, so using 2048 chars at test time would still perform poorly because of the 1024-char limit during training. If neither 2048 nor 1024 chars are enough, I could indeed chunk the original input text into smaller pieces, but merging all the resulting classes into one final class is not trivial. What do you think about that?
– Rosa
Nov 28 '18 at 10:57
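To make the merging problem concrete, here is one simple rule, sketched under stated assumptions: each chunk's result is available as a dictionary of class name to confidence, and longer chunks should count for more. The function `merge_chunk_predictions` and the length-weighted average are illustrative choices, not something NLC provides; a voting scheme or a trained meta-classifier would be equally valid alternatives.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def merge_chunk_predictions(
    chunk_results: List[Tuple[int, Dict[str, float]]]
) -> Tuple[str, Dict[str, float]]:
    """Combine per-chunk confidences into one final class.

    chunk_results holds (chunk_length, {class_name: confidence}) pairs.
    This input shape is an assumption made for the sketch.
    """
    totals: Dict[str, float] = defaultdict(float)
    total_weight = 0
    for length, confidences in chunk_results:
        total_weight += length
        for name, confidence in confidences.items():
            # Weight each chunk's confidence by the chunk's length.
            totals[name] += length * confidence
    if total_weight == 0 or not totals:
        raise ValueError("no chunk predictions to merge")
    averaged = {name: score / total_weight for name, score in totals.items()}
    top_class = max(averaged, key=averaged.get)
    return top_class, averaged
```

Weighting by length keeps a short trailing chunk from overriding the bulk of the text, but whether that suits a given dataset is something to check on held-out examples.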
Reading through the best practices presentation, the last slide talks about decomposing a large dataset and then using NLC on top of it. Here's the link to the presentation: ibm.com/watson/assets-watson/pdf/…
– Vidyasagar Machupalli
Nov 28 '18 at 12:06
Hi, thank you. The case in which 1024 chars are not enough is now clear: we need to apply business rules or meta-classifiers that combine the sentence-level results into one final result. However, if 1024 chars are already enough to train a good classifier, I am still wondering whether I should truncate at 1024 or 2048 in production: since I train the model on text of at most 1024 chars, shouldn't I also truncate at 1024 in production? Or is it better to give Watson as much information as possible and truncate at 2048?
– Rosa
Nov 28 '18 at 15:48
I would say stick with 2048 for testing/classification (production).
– Vidyasagar Machupalli
Nov 29 '18 at 0:19
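The 1024 versus 2048 choice can also be checked empirically on a labeled hold-out set rather than decided in the abstract. The sketch below assumes a hypothetical `classify` callable that wraps whatever client sends text to the trained classifier and returns the top class name; nothing here is part of the NLC API itself.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def accuracy_by_cutoff(
    classify: Callable[[str], str],
    examples: List[Tuple[str, str]],
    cutoffs: Iterable[int] = (1024, 2048),
) -> Dict[int, float]:
    """Measure accuracy when inputs are truncated at each cutoff.

    classify is a hypothetical wrapper around the deployed classifier;
    examples are (text, expected_class) pairs held out from training.
    """
    if not examples:
        raise ValueError("examples must not be empty")
    results: Dict[int, float] = {}
    for cutoff in cutoffs:
        correct = sum(
            1 for text, label in examples if classify(text[:cutoff]) == label
        )
        results[cutoff] = correct / len(examples)
    return results
```

If the two cutoffs score about the same, truncating at 1024 keeps training and production consistent at no cost; if 2048 scores higher, the extra context is helping despite the mismatch.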
answered Nov 27 '18 at 8:20
Vidyasagar Machupalli