Monday, 21 July 2014

Boilerpipe support with Nutch, Solr and HBase



Nutch can be integrated with Boilerpipe through Tika, because the parse-tika plugin already ships with the Boilerpipe jar.

Apache Tika is already present in the Nutch source tree: /apache-nutch-2.2.1/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java

The plugins folder contains the Boilerpipe jar: apache-nutch-2.2.1/runtime/local/plugins/parse-tika/boilerpipe-1.1.0.jar
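
Before changing any configuration, it can help to confirm that the jar really is in the plugin folder. A minimal sketch, assuming the Nutch tree is unpacked in the current directory; the `NUTCH_HOME` variable and the `find_boilerpipe_jar` helper are hypothetical names introduced only for illustration:

```shell
#!/bin/sh
# NUTCH_HOME is an assumed variable; it defaults to the directory name used in this post.
NUTCH_HOME="${NUTCH_HOME:-apache-nutch-2.2.1}"

# Hypothetical helper: print any boilerpipe jar found in the parse-tika plugin folder.
# Returns non-zero when no jar matches.
find_boilerpipe_jar() {
  ls "$1"/runtime/local/plugins/parse-tika/boilerpipe-*.jar 2>/dev/null
}

if ! find_boilerpipe_jar "$NUTCH_HOME"; then
  echo "no boilerpipe jar found under $NUTCH_HOME" >&2
fi
```

If nothing is printed apart from the warning, the runtime may not have been built yet, or the plugin folder may be in a different location.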

Edit nutch-site.xml

<property>
  <name>tika.boilerpipe</name>
  <value>true</value>
</property>
<property>
  <name>tika.boilerpipe.extractor</name>
  <value>ArticleExtractor</value>
</property>

<property>
  <name>plugin.includes</name>
  <value>parse-tika|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

Edit parse-plugins.xml

<mimeType name="text/html">
  <plugin id="parse-tika" />
</mimeType>

<mimeType name="application/xhtml+xml">
  <plugin id="parse-tika" />
</mimeType>

Check a URL using the following command:

$ bin/nutch parsechecker -dumpText [url] > check.log

For example:

$ bin/nutch parsechecker -dumpText http://venturebeat.com/ > check.log

This creates a dump file, check.log, which stores the filtered content of the URL.
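
A quick way to judge the result is to look at how much text survived the extraction. A rough sketch using only standard tools, assuming check.log is in the current directory; `summarize_dump` is a hypothetical helper, not part of Nutch:

```shell
#!/bin/sh
# Hypothetical helper: report the word count and the first lines of a parsechecker dump.
summarize_dump() {
  # $1: path to the dump file written by parsechecker
  if [ ! -s "$1" ]; then
    echo "$1 is missing or empty" >&2
    return 1
  fi
  # tr strips the padding some wc implementations put before the count
  echo "words: $(wc -w < "$1" | tr -d ' ')"
  head -n 20 "$1"
}

summarize_dump check.log || true
```

If navigation menus, footers or similar page furniture still show up in the first lines, the Boilerpipe settings are probably not being picked up.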
          



When we crawl and index data using Nutch, Solr and HBase, we can check that the data stored in the HBase table is filtered as well.
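
One way to look at the stored text is to scan a single row from the HBase shell. A sketch under stated assumptions: the table name `webpage` and the column `p:c` (parsed text) are taken from Nutch's default gora-hbase-mapping.xml and may differ in your setup, for example when a crawl id is used; `scan_statement` is a hypothetical helper:

```shell
#!/bin/sh
# Assumptions: table "webpage" and column "p:c" follow the default
# gora-hbase-mapping.xml; adjust both if your mapping differs.

# Hypothetical helper: build an HBase shell scan statement.
scan_statement() {
  # $1: table name, $2: column (family:qualifier), $3: maximum number of rows
  printf "scan '%s', {COLUMNS => ['%s'], LIMIT => %s}\n" "$1" "$2" "$3"
}

if command -v hbase >/dev/null 2>&1; then
  # Feed the statement to the HBase shell on standard input.
  scan_statement webpage p:c 1 | hbase shell
else
  # hbase is not on PATH: just print the statement so it can be pasted into the shell.
  scan_statement webpage p:c 1
fi
```

Limiting the scan to one row keeps the output readable; raise the limit to spot-check more pages.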






1 comment:

  1. I tried out the steps. It is not working. The text is still having the invalid content. Are there any additional steps required other than this?
