Sunday, 22 June 2014

Big-Data (Apache-Nutch)

Apache Nutch 2.2.1 with HBase 0.90.4 and Solr 4.8.1



Installation on Mac

About Apache Nutch:

Apache Nutch is an open source web crawler written in Java. With it, we can find web page hyperlinks in an automated manner, reduce maintenance work such as checking for broken links, and create a copy of all the visited pages to search over.

Features
    Fetching and parsing are done separately by default, which reduces the risk of an error in parsing corrupting the fetch stage of a crawl.
    Plugins have been overhauled as a direct result of the removal of the legacy Lucene dependency for indexing and search.
    The number of plugins for processing various document types shipped with Nutch has been refined. Plain text, XML, OpenDocument (OpenOffice.org), Microsoft Office (Word, Excel, PowerPoint), PDF, RTF, and MP3 (ID3 tags) are all now parsed by the Tika plugin. The only parser plugins shipped with Nutch now are Feed (RSS/Atom), HTML, Ext, JavaScript, SWF, Tika & ZIP.
    MapReduce
    Distributed filesystem (via Hadoop)
    Link-graph database
    NTLM authentication

About Apache Solr:
Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.
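As a quick illustration of those REST-like APIs (assuming Solr is running locally with the default collection1 core, as in the setup later in this post), a query can be issued with nothing more than curl:

$ curl "http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true"

This returns all indexed documents as JSON; swapping wt=json for wt=xml returns the same results as XML.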

Features:
    Advanced Full-Text Search Capabilities
    Optimized for High Volume Web Traffic
    Standards Based Open Interfaces - XML, JSON and HTTP
    Comprehensive HTML Administration Interfaces
    Server statistics exposed over JMX for monitoring
    Linearly scalable, auto index replication, auto failover and recovery
    Near Real-time indexing
    Flexible and Adaptable with XML configuration
    Extensible Plugin Architecture

About Apache HBase:

HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection). HBase is a type of "NoSQL" database.
Features:
    Linear and modular scalability.
    Strictly consistent reads and writes.
    Automatic and configurable sharding of tables
    Automatic failover support between RegionServers.
    Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
    Easy to use Java API for client access.
    Block cache and Bloom Filters for real-time queries.
    Query predicate push down via server side Filters
    Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options
    Extensible jruby-based (JIRB) shell
    Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX

Installation

HBase Installation steps:

Download HBase 0.90.4 (hbase-0.90.4.tar.gz) from the Apache HBase release archives.
Untar file
$ tar -vxf hbase-0.90.4.tar.gz
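
The config paths below refer to /usr/local/Hbase, so (assuming that location, which is simply a convenient choice) move the extracted folder there first:

$ sudo mv hbase-0.90.4 /usr/local/Hbase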

Change /usr/local/Hbase/conf/hbase-site.xml as below:

<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///usr/local/hbase</value>
</property>
<property>
   <name>hbase.zookeeper.quorum</name>
   <value>localhost</value>
</property>
</configuration>

Add JAVA_HOME to /usr/local/Hbase/conf/hbase-env.sh

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home
export HBASE_OPTS="-Djava.security.krb5.realm= -Djava.security.krb5.kdc="
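
On a Mac, the JDK path can be confirmed (rather than hard-coding the version shown above) with:

$ /usr/libexec/java_home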

Start HBase:

$ ./bin/start-hbase.sh

Check that HBase installed correctly:

$ ./bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.90.4, r1150278, Sun Jul 24 15:53:29 PDT 2011

Create table

hbase(main):001:0> create 'test', 'cf'
0 row(s) in 0.4340 seconds

Check table

hbase(main):002:0> list 'test'
TABLE                                                                          
test                                                                            
1 row(s) in 0.0580 seconds

Put data on table

hbase(main):003:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.2130 seconds

hbase(main):004:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0140 seconds

hbase(main):005:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0130 seconds

Check records of table

hbase(main):006:0> scan 'test'
ROW                   COLUMN+CELL                                              
 row1                 column=cf:a, timestamp=1403154436134, value=value1       
 row2                 column=cf:b, timestamp=1403154448918, value=value2       
 row3                 column=cf:c, timestamp=1403154456718, value=value3       
3 row(s) in 0.0910 seconds
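
A single row can also be read back with get (a quick sanity check; this command is not from the original session):

hbase(main):007:0> get 'test', 'row1'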


hbase(main):008:0> exit

Stop HBase:

$ ./bin/stop-hbase.sh




Apache Nutch Installation steps:

Download apache-nutch-2.2.1-src.tar.gz from the Apache Nutch download archives.

Extract apache-nutch-2.2.1-src.tar.gz file

Move the extracted folder to /usr/local/:

$ mv apache-nutch-2.2.1 /usr/local/

Edit the /usr/local/apache-nutch-2.2.1/conf/nutch-site.xml file:

<configuration>

    <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.hbase.store.HBaseStore</value>
        <description>Default class for storing data</description>
    </property>
    <property>
        <name>http.agent.name</name>
        <value>NutchCrawler</value>
    </property>
    <property>
        <name>http.robots.agents</name>
        <value>NutchCrawler,*</value>
    </property>
</configuration>

Edit the /usr/local/apache-nutch-2.2.1/conf/hbase-site.xml file:

<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///usr/local/hbase</value>
</property>
<property>
   <name>hbase.zookeeper.quorum</name>
   <value>localhost</value>
</property>


<property>
   <name>hbase.zookeeper.property.clientPort</name>
   <value>2181</value>
</property>
</configuration>

Edit /usr/local/apache-nutch-2.2.1/conf/gora.properties:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

Uncomment the gora-hbase dependency in /usr/local/apache-nutch-2.2.1/ivy/ivy.xml:

<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

Edit /usr/local/apache-nutch-2.2.1/conf/regex-urlfilter.txt and add a pattern for the seed URL:
+^http://work-at-google.com
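
If the crawl should stay on that domain but follow any path under it, a broader but still domain-restricted pattern (my variation, not from the original post) could be used instead:

+^http://([a-z0-9]*\.)*work-at-google\.com/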

Run ant

$ ant clean
$ ant runtime

This will create a runtime folder inside the apache-nutch-2.2.1 folder:

/usr/local/apache-nutch-2.2.1/runtime

In /usr/local/apache-nutch-2.2.1/runtime/local, create a urls directory with a seed file:
$ mkdir urls
$ echo "http://work-at-google.com" > urls/seed.txt


Set JAVA_HOME:

$ export JAVA_HOME="$(/usr/libexec/java_home)"

Crawl with Nutch

$ bin/nutch inject urls
$ bin/nutch generate -topN 5
$ bin/nutch fetch -all
$ bin/nutch parse -all
$ bin/nutch updatedb
$ bin/nutch readdb
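
The commands above make up a single crawl round (inject seeds the database once, and readdb just inspects it); to crawl beyond the seed page, the generate/fetch/parse/updatedb cycle can simply be repeated. A minimal sketch of such a loop (my own wrapper, not part of Nutch), run from runtime/local:

$ for round in 1 2 3; do
>   bin/nutch generate -topN 5
>   bin/nutch fetch -all
>   bin/nutch parse -all
>   bin/nutch updatedb
> done

The bundled bin/crawl script used below does essentially the same thing and also pushes the results to Solr.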


After running these steps, Nutch creates a 'webpage' table in HBase and stores all the crawl and fetch data there.


Apache Solr Installation steps:

Install Solr via Homebrew:

$ brew install solr

Start Solr:

$ cd /usr/local/Cellar/solr/4.8.1/libexec/example/
$ java -jar start.jar

Check that Solr is running at:

http://localhost:8983/solr/admin/


Now feed the crawled data to Solr with Nutch:

$ bin/nutch solrindex http://localhost:8983/solr/ -all

Alternatively, use the bundled crawl script (arguments: seed directory, crawl ID, Solr URL, number of rounds):

$ bin/crawl urls/seed.txt testCrawl localhost:8983/solr/ 2


After running this command, it creates a 'testCrawl_webpage' table in HBase and stores all the crawl data there.


Now we can search the indexed data in Solr:

http://localhost:8983/solr/#/collection1/query
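
The query UI above calls Solr's select handler, so the same search can be run from the command line; content is the field name used by the Solr schema that ships with Nutch (adjust if your schema differs):

$ curl "http://localhost:8983/solr/collection1/select?q=content:google&wt=json&indent=true"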



Apache Nutch 2.x Commands:

                        $ bin/nutch readdb
                         (Read/dump crawl db)
Usage: WebTableReader (-stats | -url [url] | -dump <out_dir> [-regex regex])
                      [-crawlId <id>] [-content] [-headers] [-links] [-text]
    -crawlId <id>  - the id to prefix the schemas to operate on,
                     (default: storage.crawl.id)
    -stats [-sort] - print overall statistics to System.out
    [-sort]        - list status sorted by host
    -url <url>     - print information on <url> to System.out
    -dump <out_dir> [-regex regex] - dump the webtable to a text file in
                     <out_dir>
    -content       - dump also raw content
    -headers       - dump protocol headers
    -links         - dump links
    -text          - dump extracted text
    [-regex]       - filter on the URL of the webtable entry



                        $ bin/nutch inject
                         (Inject new urls into the database)
                         Usage: InjectorJob <url_dir> [-crawlId <id>]

                        $ bin/nutch hostinject
                         (Inject new urls into the hostdatabase)

                       $ bin/nutch generate
                        (Generate new segments to fetch from crawldb)
 Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]
    -topN <N>      - number of top URLs to be selected, default is Long.MAX_VALUE
    -crawlId <id>  - the id to prefix the schemas to operate on,
                    (default: storage.crawl.id)
    -noFilter      - do not activate the filter plugin to filter the url, default is true
    -noNorm        - do not activate the normalizer plugin to normalize the url, default is true
    -adddays       - Adds numDays to the current time to facilitate crawling urls already fetched sooner then db.default.fetch.interval. Default value is 0.


                       $ bin/nutch fetch
                       (Fetch a segment's pages)
Usage: FetcherJob (<batchId> | -all) [-crawlId <id>] [-threads N] [-resume] [-numTasks N]
       <batchId>     - crawl identifier returned by Generator, or -all for all
                    generated batchId-s
       -crawlId <id> - the id to prefix the schemas to operate on,
                    (default: storage.crawl.id)
       -threads N    - number of fetching threads per task
       -resume       - resume interrupted job
       -numTasks N   - if N > 0 then use this many reduce tasks for fetching
                    (default: mapred.map.tasks)
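
For example, to fetch all generated batches with 10 fetching threads per task (10 being an arbitrary choice):

$ bin/nutch fetch -all -threads 10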


                       $ bin/nutch parse
                      (Parse a segment's pages)
Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
    <batchId>     - symbolic batch ID created by Generator
    -crawlId <id> - the id to prefix the schemas to operate on,
                    (default: storage.crawl.id)
    -all          - consider pages from all crawl jobs
    -resume       - resume a previous incomplete job
    -force        - force re-parsing even if a page is already parsed

                      $ bin/nutch updatedb
                      (Update crawldb after fetching)

                     $ bin/nutch updatehostdb
                     (Update hostdb after fetching)

                     $ bin/nutch elasticindex
(Run the elastic search indexer on parsed batches)

                      $ bin/nutch solrindex
                     (Run the solr indexer on parsed segments and linkdb)
                     Usage: SolrIndexerJob <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]
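
For example, to push the pages crawled with the crawl script above (crawl ID testCrawl) into the local Solr instance:

$ bin/nutch solrindex http://localhost:8983/solr/ -all -crawlId testCrawl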

                       $ bin/nutch parsechecker
                       (Checks the parser for a given url)
$   bin/nutch plugin
(Loads a plugin and run one of its classes main())

$ bin/nutch NutchServer
(run a (local) Nutch server on a user defined port)
usage: NutchServer [-help] [-log <loging level>] [-port] [-stop <force>]
 -help                 Show this help
 -log <loging level>   Select a logging level for the
                       NutchServer.ALL|CONFIG|FINER|FINEST|INFO|OFF|SEVERE
                       |WARNING
 -port                 Use port for restful API
 -stop <force>         Stop running nutch server. Force stops server
                       despite running jobs
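
For example, to start the server on a specific port (8081 is an arbitrary choice here):

$ bin/nutch NutchServer -port 8081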

$ bin/nutch junit
             (Runs the given JUnit test)

           $  bin/nutch  CLASSNAME
           (run the class named CLASSNAME)

2 comments:

  1. Do you have the steps for crawling RSS feeds using Nutch? Please let me know.

     Reply: Please check
     http://best4dev.blogspot.in/2015/04/rss-feeds-in-nutch.html