Nutch title length
For long titles, increase the value of the indexer.max.title.length parameter in nutch-default.xml, or set it in nutch-site.xml to override nutch-default.xml.
By default it is 100.
<property>
<name>indexer.max.title.length</name>
<value>10000</value>
<description>The maximum number of characters of a title that are indexed.
</description>
</property>
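As a minimal sketch of the override approach, a nutch-site.xml might look like the following. It assumes the usual <configuration> root element of Nutch configuration files and a file located at conf/nutch-site.xml; the value 200 is only an illustrative choice, not a recommendation.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: properties set here override nutch-default.xml -->
<configuration>
  <property>
    <name>indexer.max.title.length</name>
    <!-- illustrative value (assumption); pick a limit that suits your titles -->
    <value>200</value>
  </property>
</configuration>
```

Keeping the change in nutch-site.xml leaves nutch-default.xml untouched, which makes local overrides easier to track.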
Nutch set language
<property>
<name>http.accept.language</name>
<value>en-us,en-gb,en</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>
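For a crawl aimed at a non-English audience, the same property could be set along these lines. This is only a sketch: the German-first preference order is a hypothetical example, not a value taken from the original notes.

```xml
<property>
  <name>http.accept.language</name>
  <!-- hypothetical preference order: German first, English as a fallback -->
  <value>de-de,de,en</value>
</property>
```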
Is it possible to fetch only pages from some specific domains?
Adding regular expressions to the regex-urlfilter.txt file might work, but a list with thousands of regular expressions would slow the system down excessively.
Alternatively, set db.ignore.external.links to 'true' and inject seeds from the required domains. The crawl will then stay within those domains and never start following external links.
<property>
<name>db.ignore.external.links</name>
<value>true</value>
<description>If true, outlinks leading from a page to external hosts
will be ignored. This is an effective way to limit the crawl to include
only initially injected hosts, without creating complex URLFilters.
</description>
</property>
For secure (HTTPS) URLs, add the protocol-httpclient plugin.
Edit nutch-site.xml:
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|parse-tika|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
Ontology plugin with Nutch
Download the ontology plugin from
https://gitorious.org/discovered/repo/source/329ff64e9d7295aff108f85e9a8103f5e5f8f398:src/plugin/ontology
Copy the downloaded folders into the corresponding locations in the Apache Nutch source tree: /src/plugin, /src/java and /src/test.
Edit the nutch-site.xml file:
<property>
<name>extension.ontology.extension-name</name>
<value>org.apache.nutch.ontology.jena.OntologyImpl</value>
<description>Loads the Ontology plugin</description>
</property>
<property>
<name>extension.ontology.urls</name>
<value>file:///usr/local/apache-nutch-2.2.1/src/plugin/ontology/sample/</value>
<description>Shows the owl file</description>
</property>
<property>
<name>plugin.includes</name>
<value>ontology|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
Edit the build.xml file in /src/plugin and add the following line:
<ant dir="ontology" target="deploy"/>
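As a rough sketch of where this line might sit, the excerpt below assumes that /src/plugin/build.xml groups its per-plugin entries inside a target named deploy. The neighbouring index-basic and parse-html entries are shown purely for illustration; the actual list in your checkout will differ.

```xml
<!-- src/plugin/build.xml (excerpt; layout assumed, not copied from the source) -->
<target name="deploy">
  <!-- existing plugin entries, for example: -->
  <ant dir="index-basic" target="deploy"/>
  <ant dir="parse-html" target="deploy"/>
  <!-- new entry so the ontology plugin is built and deployed -->
  <ant dir="ontology" target="deploy"/>
</target>
```

If the file lists plugins under other targets as well, such as clean, a matching entry there may be worth adding so that the plugin is handled consistently.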
After that, run the following commands:
ant clean
ant runtime