nutch urls not fetched











I am trying to crawl some URLs from a website on this domain:



https://foo.foofoo.com


However, Nutch does not fetch certain URLs such as the ones below. It generates them for fetching but then skips them:



https://foo.foofoo.com/foo/foo/foo/foo-a-foo-foofoo-foo-foo-foofoo-foo-foofoo
https://foo.foofoo.com/foo/00550000006yDdKAAU/foofoo/foo-foo-foo-foofoo-foo-foo
https://foo.foofoo.com/foo/foo/foo/foofoo-foo-foofoo-foo-foo/foofoo-a-foo-foofoo-foofoo?foo_id=foo-fi-and-foo-fafoo-fa


Only a few URLs, like this one, get fetched:



https://foo.foofoo.com/en/foofoo


Here is my regex-urlfilter file, which I use to restrict the crawl to English pages:



-^(file|ftp|mailto):
-^https?://foo.foofoo.com/(de|ja|fr|es-MX|pt-BR)
+^https?://foo.foofoo.com


Any ideas, please?
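
One way to rule the filter in or out is to replay its rules against the URLs from the question. The sketch below is only an approximation: it uses Python's `re` module as a stand-in for the Java regular expressions Nutch uses, and it assumes the documented regex-urlfilter behaviour that the first matching rule decides and that a URL matching no rule is rejected. The helper name `accepts` and the `/design` path are hypothetical.

```python
import re

# Rules copied from the regex-urlfilter file in the question, in file order.
# Each entry is (sign, pattern): "-" rejects, "+" accepts.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"^https?://foo.foofoo.com/(de|ja|fr|es-MX|pt-BR)"),
    ("+", r"^https?://foo.foofoo.com"),
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in rules:
        # Assumption: a rule applies when its pattern is found anywhere
        # in the URL (the patterns here are anchored with ^ anyway).
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: treated as rejected


URLS = [
    "https://foo.foofoo.com/foo/foo/foo/foo-a-foo-foofoo-foo-foo-foofoo-foo-foofoo",
    "https://foo.foofoo.com/foo/00550000006yDdKAAU/foofoo/foo-foo-foo-foofoo-foo-foo",
    "https://foo.foofoo.com/foo/foo/foo/foofoo-foo-foofoo-foo-foo/foofoo-a-foo-foofoo-foofoo?foo_id=foo-fi-and-foo-fafoo-fa",
    "https://foo.foofoo.com/en/foofoo",
    "https://foo.foofoo.com/de/foofoo",
]

for url in URLS:
    print("accept" if accepts(url) else "reject", url)
```

Under these assumptions the three skipped URLs pass the filter just like the `/en/` one, which suggests looking elsewhere for the cause. Note also that the language rule has no trailing boundary, so a hypothetical path such as `/design` would be rejected along with `/de/`.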










  • What crawl depth do you use?
    – Quent
    2 days ago










  • Try these rules instead:
      -^(?:https?://)?foo.foofoo.com/(?:de|ja|fr|es-MX|pt-BR)
      +^(?:https?://)?foo.foofoo.com(?:/.*|.*)
    – Quent
    2 days ago












  • Still the same result, although I think your rules are written better than mine. The depth is 10.
    – Oppa pi
    2 days ago















java regex filter web-crawler nutch






edited 2 days ago





















asked 2 days ago









Oppa pi

1 Answer
accepted










After removing some plugins that I did not need for my use case, everything worked again. The plugins I removed are nutch-extensionpoints, parse-text and query-(basic|site|url).






  • Cool. I use the following plugin list and it works for me, so you may not need to remove that many:
      protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata|more|tika)|query-(basic|site|url)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)
    – Quent
    2 days ago










  • I see. Thank you, @Quent! :)
    – Oppa pi
    2 days ago
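
The plugin list in the comment above has the shape of a value for Nutch's `plugin.includes` property, which is a regular expression over plugin ids. As a rough sketch, the snippet below checks which ids such an expression selects. It uses Python's `re` in place of Java's regex engine and assumes the whole plugin id must match the expression; the helper name `is_included` is hypothetical.

```python
import re

# Plugin list copied from the comment above, used here as a plugin.includes value.
PLUGIN_INCLUDES = (
    r"protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)"
    r"|index-(basic|anchor|metadata|more|tika)|query-(basic|site|url)"
    r"|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    """Return True if the whole plugin id matches the includes expression."""
    # Assumption: the id must match in full, not just contain a match.
    return re.fullmatch(includes, plugin_id) is not None


# The three plugins named in the answer, plus two from the comment's list.
for plugin_id in [
    "nutch-extensionpoints",
    "parse-text",
    "query-basic",
    "parse-html",
    "urlfilter-regex",
]:
    print(plugin_id, "included" if is_included(plugin_id) else "not included")
```

Under that assumption, the list in the comment leaves out `parse-text` and `nutch-extensionpoints` but still selects the `query-(basic|site|url)` plugins, which is the difference from the answer that the comment points out.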











answered 2 days ago









Oppa pi
