Sunday, 12 April 2015

RSS feeds in Nutch

If you want to crawl RSS feeds with Nutch, you only have to do two things:

1. Add the feed plugin to plugin.includes in your nutch-site.xml:
<property>
  <name>plugin.includes</name>
  <value>parse-tika|feed|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
 <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
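
To see which plugins a plugin.includes value selects, here is a small Python sketch. It is not Nutch code: the helper name is_included is made up for illustration, and it assumes a plugin is enabled only when its whole id matches the expression.

```python
import re

# Value of plugin.includes from the property above, written on a single line.
PLUGIN_INCLUDES = (
    "parse-tika|feed|protocol-http|urlfilter-regex|parse-(html|tika)|"
    "index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic"
)


def is_included(plugin_id, pattern=PLUGIN_INCLUDES):
    """Hypothetical helper. Assumption: the whole plugin id must match."""
    return re.fullmatch(pattern, plugin_id) is not None


# Print each candidate plugin id and whether the expression selects it.
for plugin_id in ["feed", "parse-html", "index-basic", "protocol-httpclient"]:
    print(plugin_id, is_included(plugin_id))
```

Under that assumption, feed is selected, while protocol-httpclient is not, because only protocol-http is listed.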

2. In parse-plugins.xml, map the feed MIME types to the feed plugin:
<mimeType name="application/rss+xml">
  <plugin id="feed" />
</mimeType>
<mimeType name="text/xml">
  <plugin id="feed" />
  <plugin id="parse-tika" />
</mimeType>
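
As a quick sanity check of the mapping, the sketch below reads the two mimeType entries with Python's standard XML parser and lists the plugins for each MIME type in the order they are written. The parse-plugins wrapper element is added here only so the fragment parses on its own, and the function name plugins_by_mime_type is made up for this example. I am assuming the order of the plugin elements is the order in which Nutch tries the parsers.

```python
import xml.etree.ElementTree as ET

# The two entries from above, wrapped in a root element so they parse standalone.
FRAGMENT = """
<parse-plugins>
  <mimeType name="application/rss+xml">
    <plugin id="feed" />
  </mimeType>
  <mimeType name="text/xml">
    <plugin id="feed" />
    <plugin id="parse-tika" />
  </mimeType>
</parse-plugins>
"""


def plugins_by_mime_type(xml_text):
    """Hypothetical helper: map each MIME type to its plugin ids, in file order."""
    root = ET.fromstring(xml_text)
    return {
        mime.get("name"): [plugin.get("id") for plugin in mime.findall("plugin")]
        for mime in root.findall("mimeType")
    }


mapping = plugins_by_mime_type(FRAGMENT)
for mime_type, plugin_ids in mapping.items():
    print(mime_type, "->", plugin_ids)
```

For text/xml the feed plugin is listed first and parse-tika second.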
     
After that, Nutch will be able to crawl RSS feeds.
