Sunday, 12 April 2015

More about Nutch

Nutch title length


To index longer titles, increase the value of the indexer.max.title.length property in nutch-default.xml, or set the same property in nutch-site.xml to override nutch-default.xml.

By default it is 100:

<property>
  <name>indexer.max.title.length</name>
  <value>10000</value>
  <description>The maximum number of characters of a title that are indexed.
 </description>
</property>
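Conceptually, the indexer simply cuts off any title longer than this limit. A minimal Python sketch of that behaviour (illustrative only; the property name is real, but the function below is not Nutch's actual code):

```python
# Illustrative sketch of how a max-title-length limit behaves.
# DEFAULT_MAX_TITLE_LENGTH mirrors the 100-character default in nutch-default.xml.
DEFAULT_MAX_TITLE_LENGTH = 100

def truncate_title(title, max_length=DEFAULT_MAX_TITLE_LENGTH):
    """Keep at most max_length characters of a page title."""
    return title[:max_length]

long_title = "x" * 250
print(len(truncate_title(long_title)))         # truncated to the default limit
print(len(truncate_title(long_title, 10000)))  # a raised limit keeps it all
```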

Setting the language in Nutch


<property>
    <name>http.accept.language</name>
    <value>en-us,en-gb,en</value>
    <description>Value of the "Accept-Language" request header field.
        This allows selecting non-English language as default one to retrieve.
        It is a useful setting for search engines build for certain national group.
    </description>
</property>
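In effect, this property sets the Accept-Language header on every fetch. The equivalent header can be illustrated in Python (Nutch does this internally; the snippet only shows what the header looks like on a request, and example.com is a placeholder):

```python
import urllib.request

# Build a request carrying the same Accept-Language value Nutch would send.
req = urllib.request.Request("http://example.com/")
req.add_header("Accept-Language", "en-us,en-gb,en")

# urllib stores header keys capitalized, hence "Accept-language".
print(req.get_header("Accept-language"))
```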


Is it possible to fetch only pages from some specific domains?


Adding a few regular expressions to the regex-urlfilter.txt file works, but a list with thousands of regular expressions would slow the system down excessively.
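For a handful of domains, a regex-urlfilter.txt rule of the usual "+^http..." form is enough. A quick Python check of how such a pattern matches (the rule and example.com are placeholders, not from any real filter file):

```python
import re

# Hypothetical regex-urlfilter.txt-style rule limiting the crawl to
# example.com and its subdomains.
pattern = re.compile(r"^https?://([a-z0-9-]+\.)*example\.com/")

urls = [
    "http://example.com/page",
    "https://news.example.com/feed",
    "http://other.org/page",
]
for url in urls:
    print(url, bool(pattern.match(url)))
```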

Alternatively, we can set db.ignore.external.links to 'true' and inject only seeds from the wanted domains. The crawl will then stay within those domains and never follow external links.

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
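The effect of db.ignore.external.links can be sketched as a host comparison on each outlink (an illustrative sketch, not Nutch's actual code):

```python
from urllib.parse import urlparse

def keep_outlink(page_url, outlink, ignore_external=True):
    """Drop outlinks whose host differs from the source page's host,
    mimicking db.ignore.external.links=true (illustrative sketch)."""
    if not ignore_external:
        return True
    return urlparse(page_url).hostname == urlparse(outlink).hostname

print(keep_outlink("http://example.com/a", "http://example.com/b"))  # same host
print(keep_outlink("http://example.com/a", "http://other.org/b"))    # external
```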

For secure (HTTPS) URLs, add the protocol-httpclient plugin.


Edit nutch-site.xml

<property>
  <name>plugin.includes</name>
 <value>protocol-httpclient|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
 <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
 </property>
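plugin.includes is a regular expression matched against plugin directory names, so each plugin you need must be matched by some alternative and stray spaces inside the value will break the match. A simplified Python illustration of that matching (Nutch's real plugin loading is more involved than a single fullmatch):

```python
import re

# The plugin.includes value above, as a plain regex.
includes = (r"protocol-httpclient|protocol-http|urlfilter-regex|"
            r"parse-(html|tika)|index-(basic|anchor)|"
            r"urlnormalizer-(pass|regex|basic)|scoring-opic")
pattern = re.compile(includes)

# Plugins whose names fully match the regex are loaded; others are excluded.
for plugin in ["protocol-httpclient", "parse-tika", "feed"]:
    print(plugin, bool(pattern.fullmatch(plugin)))
```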

Ontology plugin with nutch


Download ontology plugin from

https://gitorious.org/discovered/repo/source/329ff64e9d7295aff108f85e9a8103f5e5f8f398:src/plugin/ontology

Copy the plugin folders into the corresponding locations of the Apache Nutch source tree: /src/plugin, /src/java and /src/test.

Edit the nutch-site.xml file:
<property>
  <name>extension.ontology.extension-name</name>
  <value>org.apache.nutch.ontology.jena.OntologyImpl</value>
  <description>Loads the Ontology plugin</description>
</property>
<property>
  <name>extension.ontology.urls</name>
  <value>file:///usr/local/apache-nutch-2.2.1/src/plugin/ontology/sample/</value>
  <description>Shows the owl file</description>
</property>
<property>
  <name>plugin.includes</name>
  <value>ontology|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

Edit the build.xml file in /src/plugin and add:

<ant dir="ontology" target="deploy"/>

After that, run:

ant clean
ant runtime

RSS feeds in Nutch

If you want to crawl RSS feeds, you only have to do two things.

1. Add the feed plugin to plugin.includes in your nutch-site.xml:
<property>
  <name>plugin.includes</name>
 <value>feed|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
 <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

2. In parse-plugins.xml, map the RSS MIME types to the feed plugin:
<mimeType name="application/rss+xml">
    <plugin id="feed" />
</mimeType>
<mimeType name="text/xml">
    <plugin id="feed" />
    <plugin id="parse-tika" />
</mimeType>
After that you will be able to crawl RSS feeds.
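The parse-plugins.xml mapping above is essentially a MIME-type-to-parser dispatch table. A Python sketch of the idea (the plugin ids are the real ones from the config; the dict lookup is illustrative, since Nutch's actual lookup lives in its ParserFactory):

```python
# Illustrative dispatch table mirroring the parse-plugins.xml mapping above.
mime_to_plugins = {
    "application/rss+xml": ["feed"],
    "text/xml": ["feed", "parse-tika"],
}

def parsers_for(mime_type):
    """Return the parser plugin ids registered for a MIME type."""
    return mime_to_plugins.get(mime_type, [])

print(parsers_for("application/rss+xml"))  # the feed plugin handles RSS
print(parsers_for("text/html"))            # no mapping registered here
```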