Sunday, 12 April 2015

More about Nutch

Nutch title length


To index longer titles, increase the value of the indexer.max.title.length property in nutch-default.xml, or set the same property in nutch-site.xml to override nutch-default.xml.

By default it is 100:

<property>
  <name>indexer.max.title.length</name>
  <value>10000</value>
  <description>The maximum number of characters of a title that are indexed.
 </description>
</property>
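Conceptually, the indexer simply cuts off any title longer than this limit. A minimal Python sketch of that behaviour (illustrative only; the property name is real, but the function below is not Nutch's actual code):

```python
# Illustrative sketch of how a max-title-length limit behaves.
# DEFAULT_MAX_TITLE_LENGTH mirrors the 100-character default in nutch-default.xml.
DEFAULT_MAX_TITLE_LENGTH = 100

def truncate_title(title, max_length=DEFAULT_MAX_TITLE_LENGTH):
    """Keep at most max_length characters of a page title."""
    return title[:max_length]

long_title = "x" * 250
print(len(truncate_title(long_title)))         # truncated to the default limit
print(len(truncate_title(long_title, 10000)))  # a raised limit keeps it all
```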

Setting the language in Nutch


<property>
    <name>http.accept.language</name>
    <value>en-us,en-gb,en</value>
    <description>Value of the "Accept-Language" request header field.
        This allows selecting non-English language as default one to retrieve.
        It is a useful setting for search engines build for certain national group.
    </description>
</property>
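In effect, this property sets the Accept-Language header on every fetch. The equivalent header can be illustrated in Python (Nutch does this internally; the snippet only shows what the header looks like on a request, and example.com is a placeholder):

```python
import urllib.request

# Build a request carrying the same Accept-Language value Nutch would send.
req = urllib.request.Request("http://example.com/")
req.add_header("Accept-Language", "en-us,en-gb,en")

# urllib stores header keys capitalized, hence "Accept-language".
print(req.get_header("Accept-language"))
```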


Is it possible to fetch only pages from some specific domains?


Adding a few regular expressions to the regex-urlfilter.txt file works, but a list with thousands of regular expressions would slow the system down excessively.
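For a handful of domains, a regex-urlfilter.txt rule of the usual "+^http..." form is enough. A quick Python check of how such a pattern matches (the rule and example.com are placeholders, not from any real filter file):

```python
import re

# Hypothetical regex-urlfilter.txt-style rule limiting the crawl to
# example.com and its subdomains.
pattern = re.compile(r"^https?://([a-z0-9-]+\.)*example\.com/")

urls = [
    "http://example.com/page",
    "https://news.example.com/feed",
    "http://other.org/page",
]
for url in urls:
    print(url, bool(pattern.match(url)))
```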

Alternatively, we can set db.ignore.external.links to 'true' and inject only seeds from the wanted domains. The crawl will then stay within those domains and never follow external links.

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
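The effect of db.ignore.external.links can be sketched as a host comparison on each outlink (an illustrative sketch, not Nutch's actual code):

```python
from urllib.parse import urlparse

def keep_outlink(page_url, outlink, ignore_external=True):
    """Drop outlinks whose host differs from the source page's host,
    mimicking db.ignore.external.links=true (illustrative sketch)."""
    if not ignore_external:
        return True
    return urlparse(page_url).hostname == urlparse(outlink).hostname

print(keep_outlink("http://example.com/a", "http://example.com/b"))  # same host
print(keep_outlink("http://example.com/a", "http://other.org/b"))    # external
```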

For secure (HTTPS) URLs, add the protocol-httpclient plugin.


Edit nutch-site.xml

<property>
  <name>plugin.includes</name>
 <value>protocol-httpclient|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
 <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
 </property>
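plugin.includes is a regular expression matched against plugin directory names, so each plugin you need must be matched by some alternative and stray spaces inside the value will break the match. A simplified Python illustration of that matching (Nutch's real plugin loading is more involved than a single fullmatch):

```python
import re

# The plugin.includes value above, as a plain regex.
includes = (r"protocol-httpclient|protocol-http|urlfilter-regex|"
            r"parse-(html|tika)|index-(basic|anchor)|"
            r"urlnormalizer-(pass|regex|basic)|scoring-opic")
pattern = re.compile(includes)

# Plugins whose names fully match the regex are loaded; others are excluded.
for plugin in ["protocol-httpclient", "parse-tika", "feed"]:
    print(plugin, bool(pattern.fullmatch(plugin)))
```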

Ontology plugin with nutch


Download ontology plugin from

https://gitorious.org/discovered/repo/source/329ff64e9d7295aff108f85e9a8103f5e5f8f398:src/plugin/ontology

Copy the plugin folders into the corresponding locations of the Apache Nutch source tree: /src/plugin, /src/java and /src/test.

Edit the nutch-site.xml file:
<property>
  <name>extension.ontology.extension-name</name>
  <value>org.apache.nutch.ontology.jena.OntologyImpl</value>
  <description>Loads the Ontology plugin</description>
</property>
<property>
  <name>extension.ontology.urls</name>
  <value>file:///usr/local/apache-nutch-2.2.1/src/plugin/ontology/sample/</value>
  <description>Shows the owl file</description>
</property>
<property>
  <name>plugin.includes</name>
  <value>ontology|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

Edit the build.xml file in /src/plugin and add:

<ant dir="ontology" target="deploy"/>

After that, run:

ant clean
ant runtime

RSS feeds in Nutch

If you want to crawl RSS feeds, you only have to do two things.

1. Add the feed plugin to plugin.includes in your nutch-site.xml:
<property>
  <name>plugin.includes</name>
 <value>feed|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
 <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

2. In parse-plugins.xml, map the RSS MIME types to the feed plugin:
<mimeType name="application/rss+xml">
    <plugin id="feed" />
</mimeType>
<mimeType name="text/xml">
    <plugin id="feed" />
    <plugin id="parse-tika" />
</mimeType>
After that you will be able to crawl RSS feeds.
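The parse-plugins.xml mapping above is essentially a MIME-type-to-parser dispatch table. A Python sketch of the idea (the plugin ids are the real ones from the config; the dict lookup is illustrative, since Nutch's actual lookup lives in its ParserFactory):

```python
# Illustrative dispatch table mirroring the parse-plugins.xml mapping above.
mime_to_plugins = {
    "application/rss+xml": ["feed"],
    "text/xml": ["feed", "parse-tika"],
}

def parsers_for(mime_type):
    """Return the parser plugin ids registered for a MIME type."""
    return mime_to_plugins.get(mime_type, [])

print(parsers_for("application/rss+xml"))  # the feed plugin handles RSS
print(parsers_for("text/html"))            # no mapping registered here
```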