nutch urls not fetched

I am trying to crawl some URLs from a local website on this domain:
https://foo.foofoo.com
However, I cannot fetch certain URLs such as the ones below. Nutch generates them for fetching but then skips them, so they never get fetched:
https://foo.foofoo.com/foo/foo/foo/foo-a-foo-foofoo-foo-foo-foofoo-foo-foofoo
https://foo.foofoo.com/foo/00550000006yDdKAAU/foofoo/foo-foo-foo-foofoo-foo-foo
https://foo.foofoo.com/foo/foo/foo/foofoo-foo-foofoo-foo-foo/foofoo-a-foo-foofoo-foofoo?foo_id=foo-fi-and-foo-fafoo-fa
Only a few URLs like this one get fetched:
https://foo.foofoo.com/en/foofoo
Here is my regex-urlfilter file, which I use to fetch only English pages:
-^(file|ftp|mailto):
-^https?://foo.foofoo.com/(de|ja|fr|es-MX|pt-BR)
+^https?://foo.foofoo.com
Any ideas, please?
java regex filter web-crawler nutch
What depth do you use?
– Quent
2 days ago
Try this:
-^(?:https?://)?foo.foofoo.com/(?:de|ja|fr|es-MX|pt-BR)
+^(?:https?://)?foo.foofoo.com(?:/.*|.*)
– Quent
2 days ago
Still the same, though I think your rules are written better than mine. The depth is 10.
– Oppa pi
2 days ago
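The rules posted in this thread can be checked outside Nutch. The sketch below is an assumption-laden stand-in, not Nutch's own code: it assumes the filter walks the rules top to bottom, lets the first rule whose regex is found anywhere in the URL decide, and rejects a URL that no rule matches. The class name UrlFilterCheck, the accepts helper, and the two URLs under /de/ and /details/ are hypothetical and only serve the illustration.

```java
import java.util.regex.Pattern;

class UrlFilterCheck {
    // Rules exactly as posted in the question.
    static final String[] ORIGINAL_RULES = {
        "-^(file|ftp|mailto):",
        "-^https?://foo.foofoo.com/(de|ja|fr|es-MX|pt-BR)",
        "+^https?://foo.foofoo.com"
    };

    // Rules as suggested in the comments, one rule per line.
    static final String[] SUGGESTED_RULES = {
        "-^(?:https?://)?foo.foofoo.com/(?:de|ja|fr|es-MX|pt-BR)",
        "+^(?:https?://)?foo.foofoo.com(?:/.*|.*)"
    };

    // Assumed semantics: the first rule whose regex is found in the URL decides
    // ('+' accepts, '-' rejects); a URL matching no rule is rejected.
    static boolean accepts(String[] rules, String url) {
        for (String rule : rules) {
            boolean accept = rule.charAt(0) == '+';
            Pattern pattern = Pattern.compile(rule.substring(1));
            if (pattern.matcher(url).find()) {
                return accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] urls = {
            // URLs from the question.
            "https://foo.foofoo.com/foo/foo/foo/foo-a-foo-foofoo-foo-foo-foofoo-foo-foofoo",
            "https://foo.foofoo.com/foo/00550000006yDdKAAU/foofoo/foo-foo-foo-foofoo-foo-foo",
            "https://foo.foofoo.com/en/foofoo",
            // Hypothetical URLs, added only to exercise the reject rule.
            "https://foo.foofoo.com/de/foofoo",
            "https://foo.foofoo.com/details/foofoo"
        };
        for (String url : urls) {
            System.out.println(accepts(ORIGINAL_RULES, url) + " "
                + accepts(SUGGESTED_RULES, url) + " " + url);
        }
    }
}
```

Under these assumptions both rule sets accept the skipped URLs from the question, which fits the report that swapping the rules changed nothing. One thing the check does surface: the language rule has no trailing slash, so it also rejects any path that merely starts with de, ja or fr, such as the hypothetical /details/ URL.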
edited 2 days ago
asked 2 days ago


Oppa pi
1 Answer
accepted
After I removed some plugins that were useless for my use case, everything worked again. The plugins I removed are nutch-extensionpoints, parse-text and query-(basic|site|url).
Cool. For me, the following plugin list works, so maybe you do not need to remove as many:
protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata|more|tika)|query-(basic|site|url)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)
– Quent
2 days ago
I see. Thank you @Quent ! :)
– Oppa pi
2 days ago
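The plugin list quoted in the comment above reads like a value for Nutch's plugin.includes property, which is a regular expression over plugin ids. As a rough way to see which ids such an expression would enable, the sketch below assumes a plugin is included only when its whole id matches the expression. That matching rule is an assumption about Nutch's behaviour, and the class name PluginIncludesCheck and the isIncluded helper are hypothetical.

```java
import java.util.regex.Pattern;

class PluginIncludesCheck {
    // Plugin list quoted from the comment above, used here as a plugin.includes-style regex.
    static final String INCLUDES =
        "protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)"
        + "|index-(basic|anchor|metadata|more|tika)|query-(basic|site|url)"
        + "|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)";

    // Assumption: a plugin is enabled only if its entire id matches the expression.
    static boolean isIncluded(String includes, String pluginId) {
        return Pattern.compile(includes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] pluginIds = {
            "urlfilter-regex", "parse-html", "query-basic",  // listed in the comment
            "parse-text", "nutch-extensionpoints"            // removed in the answer
        };
        for (String id : pluginIds) {
            System.out.println(id + " included: " + isIncluded(INCLUDES, id));
        }
    }
}
```

With this expression, parse-text and nutch-extensionpoints come out as not included, while the query-(basic|site|url) plugins that the answer removed are still listed, which matches the remark that fewer plugins may need to go.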
answered 2 days ago


Oppa pi