Indexing only specific domains with Solr and Nutch

I want to crawl a website with Nutch and then index it with Solr.

The website has the following structure:

Homepage: example.com

Documents I want to index: subdomain.example.com/{some_number}.html

To "discover" all these documents I start from example.com/discover which has a list of many documents that I want.

So what I have now is:

In my regex-urlfilter.txt I restrict the crawl to documents from example.com, and this works perfectly.

I index with Solr and everything works well. I use the following command:

./$nutch/bin/crawl -i -s $nutch/urls/ $nutch/ 5

What I want now is to index ONLY the documents whose URLs have the format subdomain.example.com/{some_number}.html, ignoring everything else (for example, I don't want to index example.com/discover).

I guess this is done by changing some configuration in Solr, since it's the indexing part.

asked Nov 22 '18 at 11:44

Gregory Wullimann

solr web-crawler nutch

1 Answer

In this case, the configuration can be done on the Nutch side, filtering the documents before they are sent to Solr.

If you only want to "index" selectively (meaning that you want to fetch and parse all the links, but store in Solr only the ones that match the regex), you can use the index-jexl-filter plugin. With this plugin, you write a small JEXL script that checks whether the URL of a document matches your regex; only if it does is the document sent to Solr.

The script, configured in your nutch-site.xml file, could be something like:

url =~ "^https?://[a-z]+\.example\.com/(\d+)\.html"

url is a default primitive available on the JEXL context. You can find more info about this on https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1755-L1771
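Before putting the expression into nutch-site.xml, it can help to sanity-check the pattern against a few sample URLs. The following is only a minimal sketch in Python, not part of Nutch. It assumes the intended regex uses \d+ for the numeric part and escaped dots, and it assumes that the JEXL =~ operator requires the whole URL to match (as Java's String.matches does), which is why re.fullmatch is used. The function name would_be_indexed and the sample URLs are hypothetical.

```python
import re

# Assumed intended pattern: digits for the document number, literal dots.
# re.ASCII keeps \d limited to 0-9, as in Java's default regex behaviour.
DOC_URL = re.compile(r"^https?://[a-z]+\.example\.com/(\d+)\.html", re.ASCII)


def would_be_indexed(url: str) -> bool:
    """Return True if the whole URL matches the document pattern."""
    return DOC_URL.fullmatch(url) is not None


if __name__ == "__main__":
    # Hypothetical sample URLs, for illustration only.
    samples = [
        "http://subdomain.example.com/12345.html",
        "https://subdomain.example.com/7.html",
        "http://example.com/discover",
        "http://subdomain.example.com/about.html",
    ]
    for url in samples:
        print(url, would_be_indexed(url))
```

Under these assumptions, only the numbered .html pages on a subdomain pass the check, while example.com/discover is rejected.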

If by "index" you really meant crawling only the URLs that match your regex (a URL that doesn't match will be neither fetched nor parsed), then you can use the same regex-urlfilter.txt to define the desired format. Keep in mind that with this approach you would need to run the crawl again.

answered Nov 22 '18 at 13:09

Jorge Luis


Thanks for answering! By indexing I meant storing in Solr, so the JEXL filter should do the trick. I'm totally new to Nutch and Solr, so I don't fully understand yet. In my Nutch configuration file I add the index.jexl.filter property, and as its value I should put the expression you gave, url =~ ....? I tried this and crawled again, but unwanted documents are still indexed.

– Gregory Wullimann
Nov 22 '18 at 14:20

You also need to enable the index-jexl-filter plugin. To do this, add it to the value of plugin.includes; for instance, you can just append it to the end.

– Jorge Luis
Nov 22 '18 at 16:16
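To make "add it to the end" concrete: plugin.includes is a regular expression over plugin names, so appending means adding one more |-separated alternative. The sketch below is a small Python helper, not a Nutch tool. The function name ensure_plugin is hypothetical and the sample value is illustrative only, so copy the real value from your own nutch-default.xml or nutch-site.xml.

```python
import re

PLUGIN = "index-jexl-filter"


def ensure_plugin(includes: str, plugin: str = PLUGIN) -> str:
    """Append `plugin` to a plugin.includes value unless the regex already matches it."""
    if re.fullmatch(includes, plugin):
        return includes
    return f"{includes}|{plugin}"


if __name__ == "__main__":
    # Illustrative value only (an assumption); use the one from your own configuration.
    current = "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr"
    print(ensure_plugin(current))
```

The returned string is what would go into the value of the plugin.includes property in nutch-site.xml.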

Thanks! Now it works as I wanted. Meanwhile I also found another solution, which is a little bit worse. Instead of crawling and indexing immediately (-i option), I only crawled; then, before indexing (using the -filter option), I changed regex-urlfilter.txt so that its regex matches only the documents I want indexed.

– Gregory Wullimann
Nov 23 '18 at 15:51

Yep, that should work as well. Nice that you found another option. Usually I deal with continuously running crawls, so I jumped directly to the plugin system 😁

– Jorge Luis
Nov 23 '18 at 16:23
