Nutch URLs not fetched

I am trying to crawl some URLs from a local website on this domain:

https://foo.foofoo.com

However, I cannot get it to work for specific URLs like the ones below. Nutch generates them for fetching but then skips them, so they never get fetched:

https://foo.foofoo.com/foo/foo/foo/foo-a-foo-foofoo-foo-foo-foofoo-foo-foofoo

https://foo.foofoo.com/foo/00550000006yDdKAAU/foofoo/foo-foo-foo-foofoo-foo-foo

https://foo.foofoo.com/foo/foo/foo/foofoo-foo-foofoo-foo-foo/foofoo-a-foo-foofoo-foofoo?foo_id=foo-fi-and-foo-fafoo-fa

Only a few URLs, like this one, get fetched:

https://foo.foofoo.com/en/foofoo

Here is my regex-urlfilter file, which I use to fetch only the English pages:

-^(file|ftp|mailto):

-^https?://foo.foofoo.com/(de|ja|fr|es-MX|pt-BR)

+^https?://foo.foofoo.com

Any ideas, please?
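To check whether the filter itself could be rejecting these URLs, the rules can be replayed outside Nutch. The sketch below is only an approximation: it assumes the documented behaviour of regex-urlfilter (the first matching rule decides, and a URL that matches no rule is ignored), it uses Python's re module in place of Java's regex engine, and the helper name accepts is hypothetical.

```python
import re

# Rules copied from the regex-urlfilter file above, as (sign, pattern) pairs.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"^https?://foo.foofoo.com/(de|ja|fr|es-MX|pt-BR)"),
    ("+", r"^https?://foo.foofoo.com"),
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches `url` is a '+' rule."""
    for sign, pattern in rules:
        # Assumption: a rule applies when the pattern is found anywhere in the URL.
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule is ignored.
    return False


if __name__ == "__main__":
    urls = [
        "https://foo.foofoo.com/foo/foo/foo/foo-a-foo-foofoo-foo-foo-foofoo-foo-foofoo",
        "https://foo.foofoo.com/foo/00550000006yDdKAAU/foofoo/foo-foo-foo-foofoo-foo-foo",
        "https://foo.foofoo.com/en/foofoo",
    ]
    for url in urls:
        print(accepts(url), url)
```

Under these assumptions, the URLs from the question pass the filter, which suggests the skipping happens somewhere else. Note also that the exclusion rule has no trailing slash, so it would reject an English path such as /design as well.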

edited 2 days ago

asked 2 days ago

Oppa pi

New contributor

What depth do you use?
– Quent
2 days ago

Try this:
-^(?:https?://)?foo.foofoo.com/(?:de|ja|fr|es-MX|pt-BR)
+^(?:https?://)?foo.foofoo.com(?:/.*|.*)
– Quent
2 days ago

Still the same, though I think your version is better written than mine. The depth is 10.
– Oppa pi
2 days ago


java regex filter web-crawler nutch


1 Answer

accepted

After removing some plugins that I do not need for my use case, everything worked again. The plugins I removed are nutch-extensionpoints, parse-text and query-(basic|site|url).

answered 2 days ago

Oppa pi

New contributor

Cool. For me, the following works, so maybe you do not need to remove as much: protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata|more|tika)|query-(basic|site|url)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)
– Quent
2 days ago

I see. Thank you, @Quent! :)
– Oppa pi
2 days ago
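The list in the comment above has the shape of a plugin.includes value, which Nutch treats as a regular expression over plugin ids. A rough way to see which plugins such an expression would select is sketched below. It assumes that a plugin is included only when its whole id matches the expression, it uses Python's re module in place of Java's regex engine, and the helper name is_included is hypothetical.

```python
import re

# Expression quoted in the comment above (assumed to be a plugin.includes value).
PLUGIN_INCLUDES = (
    r"protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)|"
    r"index-(basic|anchor|metadata|more|tika)|query-(basic|site|url)|"
    r"summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    """Return True if the whole plugin id matches the includes expression."""
    # Assumption: the id must match in full, not just contain a match.
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    for plugin_id in ["nutch-extensionpoints", "parse-text", "query-basic", "parse-html"]:
        print(plugin_id, is_included(plugin_id))
```

Under that assumption, the expression leaves out nutch-extensionpoints and parse-text, the first two plugins named in the answer, but still includes query-basic, query-site and query-url.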

