We can integrate Nutch with Boilerpipe through Tika, because the Tika parser plugin already bundles the Boilerpipe jar.
The Tika parser source lives in the Nutch tree at apache-nutch-2.2.1/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java, and the bundled jar sits in the plugin folder at apache-nutch-2.2.1/runtime/local/plugins/parse-tika/boilerpipe-1.1.0.jar.
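To see what that bundled jar provides, here is a minimal standalone sketch of the Tika/Boilerpipe wiring, assuming tika-core, tika-parsers, and the boilerpipe jar are on the classpath. BoilerpipeContentHandler and ArticleExtractor are real Tika and Boilerpipe classes; the main() harness and the example URL are only for illustration and are not Nutch's actual TikaParser code.

import java.io.InputStream;
import java.net.URL;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.sax.BodyContentHandler;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeDemo {
    public static void main(String[] args) throws Exception {
        InputStream in = new URL("http://venturebeat.com/").openStream();
        try {
            BodyContentHandler body = new BodyContentHandler();
            // BoilerpipeContentHandler routes the parsed page's text
            // through Boilerpipe's ArticleExtractor, which strips
            // navigation, ads, and other boilerplate before the text
            // reaches the body handler.
            BoilerpipeContentHandler handler =
                new BoilerpipeContentHandler(body, ArticleExtractor.INSTANCE);
            new AutoDetectParser().parse(in, handler, new Metadata());
            System.out.println(body.toString());
        } finally {
            in.close();
        }
    }
}

Running this prints just the main article text of the page; that is the same filtering we now want Nutch to apply during parsing.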
Edit conf/nutch-site.xml and add the following properties:
<property>
  <name>tika.boilerpipe</name>
  <value>true</value>
</property>
<property>
  <name>tika.boilerpipe.extractor</name>
  <value>ArticleExtractor</value>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
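The TikaParser patched for Boilerpipe support reads these two properties to decide whether, and with which extractor, to run Boilerpipe. The snippet below is only a hedged sketch of how the tika.boilerpipe.extractor value can be resolved to an extractor instance by name; the lookup class and method are illustrative, while the de.l3s.boilerpipe.extractors package and its classes are real.

import de.l3s.boilerpipe.BoilerpipeExtractor;

public final class ExtractorLookup {
    // Resolve a property value such as "ArticleExtractor" to the
    // matching Boilerpipe extractor. Most stock extractors live in
    // de.l3s.boilerpipe.extractors and expose a public static
    // INSTANCE field.
    public static BoilerpipeExtractor forName(String name) throws Exception {
        Class<?> clazz = Class.forName("de.l3s.boilerpipe.extractors." + name);
        return (BoilerpipeExtractor) clazz.getField("INSTANCE").get(null);
    }
}

So the property value must match a class name from that package; ArticleExtractor is the usual choice for news-style pages, while DefaultExtractor is a more general-purpose option.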
Next, make sure HTML and XHTML pages are routed to parse-tika in conf/parse-plugins.xml:
<mimeType name="text/html">
  <plugin id="parse-tika" />
</mimeType>
<mimeType name="application/xhtml+xml">
  <plugin id="parse-tika" />
</mimeType>
Check a URL using the following command:
$ bin/nutch parsechecker -dumpText [url] > check.log
For example:
$ bin/nutch parsechecker -dumpText http://venturebeat.com/ > check.log
This creates a dump file, check.log, containing the boilerplate-filtered content of the URL.
When we then crawl and index the data with Nutch, Solr, and HBase, we can verify that the text stored in the HBase table is filtered.
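As a spot check, something like the following can read the parsed text back out of HBase. This is a hedged sketch: it assumes Nutch 2.x's default Gora table name, webpage (a crawl id may be prefixed to it), and the stock gora-hbase-mapping.xml, where the parsed text field maps to column family p, qualifier c; adjust both if your mapping differs.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckFilteredText {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // "webpage" is the assumed default Nutch 2.x table name.
        HTable table = new HTable(conf, "webpage");
        try {
            Scan scan = new Scan();
            // p:c is where the stock mapping stores the parsed text
            // (assumption; check your gora-hbase-mapping.xml).
            scan.addColumn(Bytes.toBytes("p"), Bytes.toBytes("c"));
            ResultScanner results = table.getScanner(scan);
            for (Result r : results) {
                byte[] value = r.getValue(Bytes.toBytes("p"), Bytes.toBytes("c"));
                String text = value == null ? "(no parsed text)" : Bytes.toString(value);
                // Print the row key and the first 200 chars of the text;
                // with Boilerpipe enabled this should be article text only.
                System.out.println(Bytes.toString(r.getRow()) + " => "
                    + text.substring(0, Math.min(200, text.length())));
            }
            results.close();
        } finally {
            table.close();
        }
    }
}

The same check can be done interactively in the hbase shell with scan 'webpage', {COLUMNS => 'p:c'}.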