nutch urls not fetched
I am trying to crawl some URLs from a local website on this domain:
https://foo.foofoo.com
However, Nutch does not fetch certain URLs such as the ones below. It generates them for fetching but then skips them:
https://foo.foofoo.com/foo/foo/foo/foo-a-foo-foofoo-foo-foo-foofoo-foo-foofoo
https://foo.foofoo.com/foo/00550000006yDdKAAU/foofoo/foo-foo-foo-foofoo-foo-foo
https://foo.foofoo.com/foo/foo/foo/foofoo-foo-foofoo-foo-foo/foofoo-a-foo-foofoo-foofoo?foo_id=foo-fi-and-foo-fafoo-fa
Only a few URLs, such as this one, get fetched:
https://foo.foofoo.com/en/foofoo
Here is my regex-urlfilter file, which I use to fetch only English pages:
-^(file|ftp|mailto):
-^https?://foo.foofoo.com/(de|ja|fr|es-MX|pt-BR)
+^https?://foo.foofoo.com
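To check whether these rules are what drops the URLs, the filter can be simulated outside Nutch. The following is a minimal sketch, not Nutch code: it assumes that the regex URL filter applies rules top to bottom, that the first rule whose pattern is found anywhere in the URL decides the outcome, and that a URL matching no rule is rejected. The class name `UrlFilterSketch` and the method `accepts` are hypothetical.

```java
import java.util.regex.Pattern;

public class UrlFilterSketch {
    // The three rules from the question, in file order.
    static final String[] RULES = {
        "-^(file|ftp|mailto):",
        "-^https?://foo.foofoo.com/(de|ja|fr|es-MX|pt-BR)",
        "+^https?://foo.foofoo.com"
    };

    // Assumption: first matching rule wins, patterns are searched with find(),
    // and a URL that matches no rule is rejected.
    static boolean accepts(String url) {
        for (String rule : RULES) {
            boolean accept = rule.charAt(0) == '+';
            Pattern pattern = Pattern.compile(rule.substring(1));
            if (pattern.matcher(url).find()) {
                return accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://foo.foofoo.com/foo/foo/foo/foo-a-foo-foofoo-foo-foo-foofoo-foo-foofoo",
            "https://foo.foofoo.com/foo/00550000006yDdKAAU/foofoo/foo-foo-foo-foofoo-foo-foo",
            "https://foo.foofoo.com/en/foofoo",
            "https://foo.foofoo.com/de/foofoo"
        };
        for (String url : urls) {
            System.out.println((accepts(url) ? "+ " : "- ") + url);
        }
    }
}
```

Under these assumptions the skipped URLs pass the filter, which suggests the rules themselves are not the cause. One thing to note: the language rule has no trailing delimiter, so it would also reject any path that merely begins with `de`, `ja` or `fr`.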
Any brilliant idea, please?
java regex filter web-crawler nutch
What depth do you use?
– Quent
2 days ago
Try this:
-^(?:https?://)?foo.foofoo.com/(?:de|ja|fr|es-MX|pt-BR)
+^(?:https?://)?foo.foofoo.com(?:/.*|.*)
– Quent
2 days ago
Still the same, but I think your version is better written than mine. The depth is 10.
– Oppa pi
2 days ago
edited 2 days ago
asked 2 days ago
Oppa pi
1 Answer
accepted
After I removed some plugins that were useless for my use case, everything worked again. The plugins were nutch-extensionpoints, parse-text and query-(basic|site|url).
Cool. For me, the following works, so maybe you do not need to remove as many:
protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata|more|tika)|query-(basic|site|url)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)
– Quent
2 days ago
I see. Thank you @Quent ! :)
– Oppa pi
2 days ago
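To see which plugins a list like the one in the comment above would enable, the expression can be tested against plugin ids directly. This is a rough sketch that assumes Nutch treats the plugin list as a single regular expression that must match a plugin id in full. That is an assumption about Nutch's behavior, and the class name `PluginIncludesSketch` and the method `included` are hypothetical.

```java
import java.util.regex.Pattern;

public class PluginIncludesSketch {
    // The plugin list quoted in the comment, used as one regular expression.
    static final Pattern INCLUDES = Pattern.compile(
        "protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)"
        + "|index-(basic|anchor|metadata|more|tika)|query-(basic|site|url)"
        + "|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)");

    // Assumption: a plugin is enabled only if its whole id matches the expression.
    static boolean included(String pluginId) {
        return INCLUDES.matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {
            "parse-html", "parse-text", "nutch-extensionpoints",
            "query-basic", "protocol-httpclient", "protocol-http"
        };
        for (String id : ids) {
            System.out.println(id + " -> " + (included(id) ? "included" : "not included"));
        }
    }
}
```

Under that assumption, this list already leaves out parse-text and nutch-extensionpoints, two of the plugins the answer removed, while still matching the query-(basic|site|url) plugins.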
answered 2 days ago
Oppa pi