We can integrate Nutch with Boilerpipe through Tika, because the Tika parser plugin already bundles the Boilerpipe jar.
The Tika parser source lives in the Nutch tree at apache-nutch-2.2.1/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java, and the bundled jar sits in the plugin folder at apache-nutch-2.2.1/runtime/local/plugins/parse-tika/boilerpipe-1.1.0.jar.
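To see what that bundled jar provides, here is a minimal standalone sketch of the Tika/Boilerpipe wiring, assuming tika-core, tika-parsers, and the boilerpipe jar are on the classpath. BoilerpipeContentHandler and ArticleExtractor are real Tika and Boilerpipe classes; the main() harness and the example URL are only for illustration and are not Nutch's actual TikaParser code.

import java.io.InputStream;
import java.net.URL;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.sax.BodyContentHandler;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeDemo {
    public static void main(String[] args) throws Exception {
        InputStream in = new URL("http://venturebeat.com/").openStream();
        try {
            BodyContentHandler body = new BodyContentHandler();
            // BoilerpipeContentHandler routes the parsed page's text
            // through Boilerpipe's ArticleExtractor, which strips
            // navigation, ads, and other boilerplate before the text
            // reaches the body handler.
            BoilerpipeContentHandler handler =
                new BoilerpipeContentHandler(body, ArticleExtractor.INSTANCE);
            new AutoDetectParser().parse(in, handler, new Metadata());
            System.out.println(body.toString());
        } finally {
            in.close();
        }
    }
}

Running this prints just the main article text of the page; that is the same filtering we now want Nutch to apply during parsing.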
Edit conf/nutch-site.xml and add the following properties:
<property>
  <name>tika.boilerpipe</name>
  <value>true</value>
</property>
<property>
  <name>tika.boilerpipe.extractor</name>
  <value>ArticleExtractor</value>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
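The TikaParser patched for Boilerpipe support reads these two properties to decide whether, and with which extractor, to run Boilerpipe. The snippet below is only a hedged sketch of how the tika.boilerpipe.extractor value can be resolved to an extractor instance by name; the lookup class and method are illustrative, while the de.l3s.boilerpipe.extractors package and its classes are real.

import de.l3s.boilerpipe.BoilerpipeExtractor;

public final class ExtractorLookup {
    // Resolve a property value such as "ArticleExtractor" to the
    // matching Boilerpipe extractor. Most stock extractors live in
    // de.l3s.boilerpipe.extractors and expose a public static
    // INSTANCE field.
    public static BoilerpipeExtractor forName(String name) throws Exception {
        Class<?> clazz = Class.forName("de.l3s.boilerpipe.extractors." + name);
        return (BoilerpipeExtractor) clazz.getField("INSTANCE").get(null);
    }
}

So the property value must match a class name from that package; ArticleExtractor is the usual choice for news-style pages, while DefaultExtractor is a more general-purpose option.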
Next, make sure HTML and XHTML pages are routed to parse-tika in conf/parse-plugins.xml:
<mimeType name="text/html">
  <plugin id="parse-tika" />
</mimeType>
<mimeType name="application/xhtml+xml">
  <plugin id="parse-tika" />
</mimeType>
Check a URL using the following command:
$ bin/nutch parsechecker -dumpText [url] > check.log
For example:
$ bin/nutch parsechecker -dumpText http://venturebeat.com/ > check.log
This creates a dump file, check.log, containing the boilerplate-filtered content of the URL.
When we then crawl and index the data with Nutch, Solr, and HBase, we can verify that the text stored in the HBase table is filtered.
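As a spot check, something like the following can read the parsed text back out of HBase. This is a hedged sketch: it assumes Nutch 2.x's default Gora table name, webpage (a crawl id may be prefixed to it), and the stock gora-hbase-mapping.xml, where the parsed text field maps to column family p, qualifier c; adjust both if your mapping differs.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckFilteredText {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // "webpage" is the assumed default Nutch 2.x table name.
        HTable table = new HTable(conf, "webpage");
        try {
            Scan scan = new Scan();
            // p:c is where the stock mapping stores the parsed text
            // (assumption; check your gora-hbase-mapping.xml).
            scan.addColumn(Bytes.toBytes("p"), Bytes.toBytes("c"));
            ResultScanner results = table.getScanner(scan);
            for (Result r : results) {
                byte[] value = r.getValue(Bytes.toBytes("p"), Bytes.toBytes("c"));
                String text = value == null ? "(no parsed text)" : Bytes.toString(value);
                // Print the row key and the first 200 chars of the text;
                // with Boilerpipe enabled this should be article text only.
                System.out.println(Bytes.toString(r.getRow()) + " => "
                    + text.substring(0, Math.min(200, text.length())));
            }
            results.close();
        } finally {
            table.close();
        }
    }
}

The same check can be done interactively in the hbase shell with scan 'webpage', {COLUMNS => 'p:c'}.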