Indexing only specific domains with Solr and Nutch
I want to crawl a website with Nutch and then index it with Solr.
The website has the following structure:
Homepage: example.com
Documents I want to index: subdomain.example.com/{some_number}.html
To "discover" all these documents I start from example.com/discover which has a list of many documents that I want.
So what I have now is:
In my regex-urlfilter.txt I restrict the crawl to documents from example.com, and this works perfectly.
I index with Solr and everything works well. I use the following command:
./$nutch/bin/crawl -i -s $nutch/urls/ $nutch/ 5
What I want now is to ONLY index the documents that are in the format: subdomain.example.com/{some_number}.html, ignoring everything else (i.e. I don't want to index example.com/discover)
I guess this is done by changing some configuration in Solr, since it's the indexing part.
solr web-crawler nutch
asked Nov 22 '18 at 11:44
Gregory Wullimann
1 Answer
In this case, the configuration can be done on the Nutch side, filtering the documents before they're sent to Solr.
If you only want to "index" (meaning that you want to fetch and parse all the links, but store in Solr only the ones that match the regex), you can use the index-jexl-filter plugin. With this plugin, you can write a small JEXL script that checks whether the URL of a document matches your regex; if it does, the document is sent to Solr.
The script could be something like this (configured in your nutch-site.xml file):
url =~ "^https?://[a-z]+\.example\.com/(\d+)\.html"
url is a default primitive available in the JEXL context. You can find more info about this at https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1755-L1771
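Before running a crawl, it may help to sanity-check the pattern against a few sample URLs. The following is only a rough sketch in Python, not Nutch or JEXL code: it assumes the pattern is meant with escaped dots and \d+ for the number, and that the JEXL =~ operator behaves like a whole-string match. The function name and the sample URLs are made up for illustration.

```python
import re

# Assumption: the backslash-escaped form of the pattern from the answer.
# Python's `re` stands in for the Java regex engine here; for a pattern
# this simple the two should agree.
DOC_URL = re.compile(r"^https?://[a-z]+\.example\.com/(\d+)\.html")


def would_be_indexed(url: str) -> bool:
    """Return True when the whole URL matches the document pattern."""
    # fullmatch mimics a whole-string match (assumed behaviour of `=~`).
    return DOC_URL.fullmatch(url) is not None


if __name__ == "__main__":
    # Hypothetical sample URLs following the structure in the question.
    samples = [
        "http://subdomain.example.com/12345.html",
        "https://subdomain.example.com/7.html",
        "http://example.com/discover",
        "http://subdomain.example.com/about.html",
    ]
    for url in samples:
        print(url, would_be_indexed(url))
```

If a URL you expect to be indexed comes back False here, the pattern is likely the problem rather than the plugin configuration.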
If by "index" you really meant crawling only the URLs that match your regex (a URL that doesn't match is neither fetched nor parsed), then you can use the same regex-urlfilter.txt to define the desired format. Keep in mind that with this approach you would need to run the crawl again.
Thanks for answering! By indexing I meant storing in Solr, so the JEXL filter should do the trick. I'm totally new to Nutch and Solr, so I don't fully understand. In my Nutch configuration file I add the index.jexl.filter property, and in value I should put the URL expression like you said, url =~ ....? I tried it this way, crawling again as well, but unwanted documents are indexed anyway.
– Gregory Wullimann
Nov 22 '18 at 14:20
You also need to enable the index-jexl-filter plugin. To do this, add it to the value of plugin.includes. For instance, you can just add it to the end.
– Jorge Luis
Nov 22 '18 at 16:16
Thanks! Now it works as I wanted. Meanwhile I also found another solution, which is a little bit worse. Instead of crawling and indexing immediately (-i option), I only crawled; then, before indexing (using the -filter option), I changed regex-urlfilter.txt so that the regex matches only the documents I wanted to be indexed.
– Gregory Wullimann
Nov 23 '18 at 15:51
Yep, that should work as well. Nice that you found another option. Usually I deal with continuously running crawls, so I jumped directly to the plugin system 😁
– Jorge Luis
Nov 23 '18 at 16:23
answered Nov 22 '18 at 13:09
Jorge Luis