Why are there different size limitations in Watson NLC for training (1024 chars) and for production (2048...

IBM Watson Natural Language Classifier (NLC) limits the text values in the training set to 1024 characters:
https://console.bluemix.net/docs/services/natural-language-classifier/using-your-data.html#training-limits .

However the trained model can then classify every text whose length is at most 2048 characters:
https://console.bluemix.net/apidocs/natural-language-classifier#classify-a-phrase .

This difference confuses me. As far as I know, the same pre-processing should be applied in both the training phase and the production phase, so if I have to cap the training data at 1024 chars, I would apply the same cap in production.

Is my reasoning correct? Should I cap the text in production at 1024 chars (as I think I should) or at 2048 chars (perhaps because 1024 chars are too few)?

Thank you in advance!

asked Nov 26 '18 at 15:44

Rosa

102

ibm-watson nl-classifier

1 Answer

I recently had the same question, and one of the answers on an article clarified it:

Currently, the limits are set at 1024 for training and 2048 for
testing/classification. The 1024 limit may require some curation of
the training data prior to training. Most organizations who require
larger character limits for their data end up chunking their input
text into 1024 chunks. Additionally, in use cases with data similar to
the Airbnb reviews, the primary category can typically be assessed
within the first 2048 characters since there is often a lot of noise
in lengthy reviews.

Here's the link to the article
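The chunking mentioned in the quote could look roughly like the sketch below. It is only an illustration, not something taken from the article or the NLC documentation: `chunk_text` is a hypothetical helper, it breaks on whitespace so words stay whole, and it assumes the 1024 limit is counted the way Python's `len` counts characters.

```python
def chunk_text(text: str, limit: int = 1024) -> list[str]:
    """Split text into chunks of at most `limit` characters, breaking on whitespace."""
    chunks: list[str] = []
    current = ""
    for word in text.split():
        # A single token longer than the limit cannot fit anywhere, so hard-split it.
        while len(word) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:limit])
            word = word[limit:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= limit:
            current = candidate
        else:
            # Adding this word would exceed the limit: close the chunk and start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be used as one training text (or sent to the classifier on its own), at the cost of having to combine the per-chunk results afterwards.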

answered Nov 27 '18 at 8:20

Vidyasagar Machupalli

@Rosa Post your questions here. I will try to answer to the best of my knowledge.

– Vidyasagar Machupalli
Nov 28 '18 at 10:46

Thank you! This link shows 2 scenarios, but neither of them clarifies my doubts. If 2048 chars are enough to retrieve the relevant information but 1024 chars are not, I will end up with a classifier trained on irrelevant text, so using 2048 chars during testing would perform badly because of the 1024-char limitation during training. If neither 2048 nor 1024 chars are enough, I could indeed chunk the original input text into smaller pieces, but it is not trivial to merge all the resulting classes into one final class. What do you think about that?

– Rosa
Nov 28 '18 at 10:57

Reading through the best practices presentation: the last slide talks about decomposing a large dataset and then using NLC on top of it. Here's the link to the presentation: ibm.com/watson/assets-watson/pdf/…

– Vidyasagar Machupalli
Nov 28 '18 at 12:06

Hi, thank you. The case in which 1024 chars are not enough is now clear: we need to apply some business rules or meta-classifiers that combine the sentence-level results into one single final result. However, when 1024 chars are already enough to train a good classifier, I am still wondering whether I should cut at 1024 or 2048 in production: since I train the model on text of at most 1024 chars, shouldn't I also cut at 1024 in production? Or is it better to give Watson as much info as possible and so cut at 2048?

– Rosa
Nov 28 '18 at 15:48

I would say stick with 2048 for testing/classification (production)

– Vidyasagar Machupalli
Nov 29 '18 at 0:19
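
As a rough illustration of the "business rules" idea from the comments above, one simple rule is to average the per-class confidences over all chunks of a text and keep the top class. This is only a sketch under assumptions: `merge_chunk_results` is a hypothetical helper, and it assumes each chunk's classification has already been reduced to a plain mapping from class name to confidence. Averaging is just one possible rule; majority voting or a trained meta-classifier would be alternatives.

```python
def merge_chunk_results(chunk_results: list[dict[str, float]]) -> tuple[str, float]:
    """Average per-class confidences across chunks and return the top class."""
    if not chunk_results:
        raise ValueError("need at least one chunk result")
    totals: dict[str, float] = {}
    for result in chunk_results:
        for label, confidence in result.items():
            totals[label] = totals.get(label, 0.0) + confidence
    # A class missing from a chunk's result counts as confidence 0 for that chunk.
    averaged = {label: total / len(chunk_results) for label, total in totals.items()}
    best = max(averaged, key=lambda label: averaged[label])
    return best, averaged[best]
```

For example, passing the results of three chunks of one long review returns a single `(class_name, averaged_confidence)` pair for the whole review.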
