Indexing only specific domains with Solr and Nutch












I want to crawl a website with Nutch and then index it with Solr.



The website has the following structure:



Homepage: example.com



Documents I want to index: subdomain.example.com/{some_number}.html



To "discover" all these documents, I start from example.com/discover, which lists many of the documents that I want.



So what I have now is:



In my regex-urlfilter.txt I restricted the crawl to documents from example.com, and this works perfectly.



I index with Solr and everything works well. I use the following command:



./$nutch/bin/crawl -i -s $nutch/urls/ $nutch/ 5



What I want now is to index ONLY the documents whose URLs have the format subdomain.example.com/{some_number}.html, ignoring everything else (e.g. I don't want to index example.com/discover).



I guess this is done by changing some configuration in Solr, since it's the indexing part.










      solr web-crawler nutch






      asked Nov 22 '18 at 11:44









      Gregory Wullimann
























          1 Answer
          In this case, the configuration can be done on the Nutch side, filtering the documents before they're sent to Solr.



          If you only want to "index" (meaning that you want to fetch and parse all the links, but store in Solr only the ones that match the regex), you can use the index-jexl-filter plugin. With this plugin, you write a small JEXL script that checks whether the URL of a document matches your regex; only if it does is the document sent to Solr.



          The script, configured in your nutch-site.xml file, could be something like:



          url =~ "^https?://[a-z]+\.example\.com/(\d+)\.html"




          • url is a default primitive available in the JEXL context. You can find more information about this at https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1755-L1771
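
          For reference, here is a minimal sketch of how this might look in nutch-site.xml. The property names index.jexl.filter and plugin.includes come from this thread; the plugin list inside plugin.includes is only an assumed example (take the value your Nutch version already uses and append |index-jexl-filter to it). The regex is written with \d and escaped dots, which is my reading of what the expression is meant to match.

```xml
<!-- nutch-site.xml: a sketch; merge these properties into your existing configuration -->
<configuration>
  <!-- Enable the plugin. The list below is an assumed example, not the
       definitive default: keep your current value and append |index-jexl-filter. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|index-jexl-filter</value>
  </property>

  <!-- If this expression evaluates to false, the document is not indexed. -->
  <property>
    <name>index.jexl.filter</name>
    <value>url =~ "^https?://[a-z]+\.example\.com/(\d+)\.html"</value>
  </property>
</configuration>
```

          Without the plugin.includes entry the filter expression is never evaluated, so every fetched document would still be sent to Solr.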


          If by "index" you really meant crawling only the URLs that match your regex (URLs that don't match are neither fetched nor parsed), then you can define the desired format in the same regex-urlfilter.txt. Keep in mind that with this approach you would need to run the crawl again.






          answered Nov 22 '18 at 13:09









          Jorge Luis













          • Thanks for answering! By indexing I meant storing in Solr, so the JEXL filter should do the trick. I'm totally new to Nutch and Solr, so I don't fully understand yet. In my Nutch configuration file I add the index.jexl.filter property, and as its value I put the expression you gave, url =~ ....? I tried this and crawled again, but unwanted documents are still indexed.

            – Gregory Wullimann
            Nov 22 '18 at 14:20








          • You also need to enable the index-jexl-filter plugin. To do this, add it to the value of plugin.includes; for instance, you can just append it to the end.

            – Jorge Luis
            Nov 22 '18 at 16:16













          • Thanks! Now it works as I wanted. Meanwhile, I also found another solution, which is a little bit worse. Instead of crawling and indexing immediately (-i option), I only crawled; then, before indexing with the -filter option, I changed regex-urlfilter.txt so that its regex matches only the documents I wanted to be indexed.

            – Gregory Wullimann
            Nov 23 '18 at 15:51













          • Yep, that should work as well. Nice that you found another option. I usually deal with continuously running crawls, so I jumped directly to the plugin system 😁

            – Jorge Luis
            Nov 23 '18 at 16:23


















