Re-processing data for Elasticsearch with a new pipeline
I have an ELK-stack server that is being used to analyse Apache web log data. We're loading all of the logs, going back several years, with the goal of examining some application-specific trends over that period.
The data-processing pipeline is still being tweaked, since this is the first time anyone has looked at this data in detail, and some people are still deciding how they want it processed.
Some changes have been suggested. While they're easy enough to make in the Logstash pipeline for new, incoming data, I'm not sure how to apply them to the data that's already in Elasticsearch. Loading the current data set took several days, and quite a bit more data has been added since, so re-processing everything through the modified Logstash pipeline would probably take even longer.
What's the best way to apply these changes to data that has already been ingested into Elasticsearch? In the early stages of testing this set-up, I would simply delete the index and rebuild from scratch, but that was with very limited data sets; with the amount of data in use here, I'm not sure that's feasible. Is there a better way?
elasticsearch logstash
asked Nov 21 '18 at 21:48
FrustratedWithFormsDesigner
1 Answer
Set up an ingest pipeline and use the Reindex API to move data from the current index to a new index, with the pipeline configured on the destination index.
Ingest Node
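As a rough sketch of what this looks like (the index, pipeline, and field names here are hypothetical, not from the question): you define an ingest pipeline via `PUT _ingest/pipeline/<name>`, then reference it in the `dest.pipeline` of a `POST _reindex` request. The bodies would be something like:

```python
import json

# Hypothetical ingest pipeline holding only the updated processing steps.
# PUT this body to _ingest/pipeline/apache-log-updates
pipeline = {
    "description": "Updated processing for already-ingested Apache logs",
    "processors": [
        # Example processor: parse an existing user-agent field.
        {"user_agent": {"field": "agent", "ignore_missing": True}}
    ],
}

# POST this body to _reindex. Documents are pulled from the old index,
# run through the pipeline, and written to the new index.
reindex = {
    "source": {"index": "apache-logs"},
    "dest": {"index": "apache-logs-v2", "pipeline": "apache-log-updates"},
}

print(json.dumps(pipeline, indent=2))
print(json.dumps(reindex, indent=2))
```

Reindexing runs entirely inside the cluster, which is why it tends to be much faster than re-parsing the raw log files through Logstash.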
It sounds like this would process all the existing data through the updated pipeline, populate a new index, and then I'd drop the old one when the operation is finished. I guess this is better than reloading all of the log files from scratch. Would it be much faster? I was hoping for a way to update the index in-place, but that's probably because I'm used to doing that in relational databases. ;)
– FrustratedWithFormsDesigner
Nov 22 '18 at 15:44
...or should I create a special pipeline that only has the updates, and reindex the existing data through that, while new data goes through the regular (and newly updated) Logstash pipeline?
– FrustratedWithFormsDesigner
Nov 22 '18 at 15:59
Another issue is that some of the changes to the Logstash pipeline involve the Aggregate filter plugin, and it doesn't look like there is an equivalent Ingest Node processor.
– FrustratedWithFormsDesigner
Nov 22 '18 at 17:00
Yep, it will be much faster. And yes, create a pipeline that only has the updates and use it while reindexing; new data will continue to use your Logstash pipeline. Unfortunately, while ingest processors cover most use cases, they are not as powerful as Logstash pipelines, so things like aggregation aren't available.
– ben5556
Nov 22 '18 at 18:47
To keep the same index name with the updated data, after reindexing you can take a snapshot of the new index and restore it to the old index name. The new index can be deleted after the restore.
– ben5556
Nov 22 '18 at 18:49
answered Nov 22 '18 at 0:31
ben5556