Sunday, 22 June 2014

Big-Data (Apache-Nutch)

Apache Nutch 2.2.1 with HBase 0.90.4 and Solr 4.8.1



Installation on Mac

About Apache Nutch:

Apache Nutch is an open source web crawler written in Java. With it, we can find web page hyperlinks in an automated manner, reduce maintenance work such as checking for broken links, and create a copy of all the visited pages to search over.

Features
    Fetching and parsing are done separately by default, which reduces the risk of an error in parsing corrupting the fetch stage of a crawl.
    Plugins have been overhauled as a direct result of the removal of the legacy Lucene dependency for indexing and search.
    The number of plugins for processing various document types shipped with Nutch has been refined. Plain text, XML, OpenDocument (OpenOffice.org), Microsoft Office (Word, Excel, PowerPoint), PDF, RTF, and MP3 (ID3 tags) are all now parsed by the Tika plugin. The only parser plugins shipped with Nutch now are Feed (RSS/Atom), HTML, Ext, JavaScript, SWF, Tika & ZIP.
    MapReduce
    Distributed filesystem (via Hadoop)
    Link-graph database
    NTLM authentication

About Apache Solr:
Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.
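As a quick illustration of those REST-like APIs (assuming Solr is running locally with the default collection1 core, as in the setup later in this post), a query can be issued with nothing more than curl:

$ curl "http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true"

This returns all indexed documents as JSON; swapping wt=json for wt=xml returns the same results as XML.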

Features:
    Advanced Full-Text Search Capabilities
    Optimized for High Volume Web Traffic
    Standards Based Open Interfaces - XML, JSON and HTTP
    Comprehensive HTML Administration Interfaces
    Server statistics exposed over JMX for monitoring
    Linearly scalable, auto index replication, auto failover and recovery
    Near Real-time indexing
    Flexible and Adaptable with XML configuration
    Extensible Plugin Architecture

About Apache HBase:

HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection). HBase is a type of "NoSQL" database.
Features:
    Linear and modular scalability.
    Strictly consistent reads and writes.
    Automatic and configurable sharding of tables
    Automatic failover support between RegionServers.
    Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
    Easy to use Java API for client access.
    Block cache and Bloom Filters for real-time queries.
    Query predicate push down via server side Filters
    Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options
    Extensible jruby-based (JIRB) shell
    Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX

Installation

HBase Installation steps:

Download HBase 0.90.4 (hbase-0.90.4.tar.gz) from the Apache HBase release archives.
Untar file
$ tar -vxf hbase-0.90.4.tar.gz
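
The config paths below refer to /usr/local/Hbase, so (assuming that location, which is simply a convenient choice) move the extracted folder there first:

$ sudo mv hbase-0.90.4 /usr/local/Hbase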

Change /usr/local/Hbase/conf/hbase-site.xml as below:

<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///usr/local/hbase</value>
</property>
<property>
   <name>hbase.zookeeper.quorum</name>
   <value>localhost</value>
</property>
</configuration>

Add JAVA_HOME to /usr/local/Hbase/conf/hbase-env.sh

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home
export HBASE_OPTS="-Djava.security.krb5.realm= -Djava.security.krb5.kdc="
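
On a Mac, the JDK path can be confirmed (rather than hard-coding the version shown above) with:

$ /usr/libexec/java_home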

Start HBase:

$ ./bin/start-hbase.sh

Check that HBase installed correctly:

$ ./bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.90.4, r1150278, Sun Jul 24 15:53:29 PDT 2011

Create table

hbase(main):001:0> create 'test', 'cf'
0 row(s) in 0.4340 seconds

Check table

hbase(main):002:0> list 'test'
TABLE                                                                          
test                                                                            
1 row(s) in 0.0580 seconds

Put data on table

hbase(main):003:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.2130 seconds

hbase(main):004:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0140 seconds

hbase(main):005:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0130 seconds

Check records of table

hbase(main):006:0> scan 'test'
ROW                   COLUMN+CELL                                              
 row1                 column=cf:a, timestamp=1403154436134, value=value1       
 row2                 column=cf:b, timestamp=1403154448918, value=value2       
 row3                 column=cf:c, timestamp=1403154456718, value=value3       
3 row(s) in 0.0910 seconds
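
A single row can also be read back with get (a quick sanity check; this command is not from the original session):

hbase(main):007:0> get 'test', 'row1'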


hbase(main):008:0> exit

Stop HBase:

$ ./bin/stop-hbase.sh




Apache Nutch Installation steps:

Download apache-nutch-2.2.1-src.tar.gz from the Apache Nutch download archives.

Extract apache-nutch-2.2.1-src.tar.gz file

Move the extracted folder to /usr/local/:

$ mv apache-nutch-2.2.1 /usr/local/

Edit the /usr/local/apache-nutch-2.2.1/conf/nutch-site.xml file:

<configuration>

    <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.hbase.store.HBaseStore</value>
        <description>Default class for storing data</description>
    </property>
    <property>
        <name>http.agent.name</name>
        <value>NutchCrawler</value>
    </property>
    <property>
        <name>http.robots.agents</name>
        <value>NutchCrawler,*</value>
    </property>
</configuration>

Edit the /usr/local/apache-nutch-2.2.1/conf/hbase-site.xml file:

<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///usr/local/hbase</value>
</property>
<property>
   <name>hbase.zookeeper.quorum</name>
   <value>localhost</value>
</property>


<property>
   <name>hbase.zookeeper.property.clientPort</name>
   <value>2181</value>
</property>
</configuration>

Edit /usr/local/apache-nutch-2.2.1/conf/gora.properties:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

Uncomment the gora-hbase dependency in /usr/local/apache-nutch-2.2.1/ivy/ivy.xml:

<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

Edit /usr/local/apache-nutch-2.2.1/conf/regex-urlfilter.txt and add a pattern for the seed URL:
+^http://work-at-google.com
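
If the crawl should stay on that domain but follow any path under it, a broader but still domain-restricted pattern (my variation, not from the original post) could be used instead:

+^http://([a-z0-9]*\.)*work-at-google\.com/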

Run ant

$ ant clean
$ ant runtime

This will create a runtime folder inside the apache-nutch-2.2.1 folder:

/usr/local/apache-nutch-2.2.1/runtime

In /usr/local/apache-nutch-2.2.1/runtime/local, create a urls directory with a seed file:
$ mkdir urls
$ echo "http://work-at-google.com" > urls/seed.txt


Set JAVA_HOME:

$ export JAVA_HOME="$(/usr/libexec/java_home)"

Crawl with Nutch

$ bin/nutch inject urls
$ bin/nutch generate -topN 5
$ bin/nutch fetch -all
$ bin/nutch parse -all
$ bin/nutch updatedb
$ bin/nutch readdb
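
The commands above make up a single crawl round (inject seeds the database once, and readdb just inspects it); to crawl beyond the seed page, the generate/fetch/parse/updatedb cycle can simply be repeated. A minimal sketch of such a loop (my own wrapper, not part of Nutch), run from runtime/local:

$ for round in 1 2 3; do
>   bin/nutch generate -topN 5
>   bin/nutch fetch -all
>   bin/nutch parse -all
>   bin/nutch updatedb
> done

The bundled bin/crawl script used below does essentially the same thing and also pushes the results to Solr.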


After running these steps, Nutch creates a 'webpage' table in HBase and stores all the crawl and fetch data there.


Apache Solr Installation steps:

Install Solr via Homebrew:

$ brew install solr

Start Solr:

$ cd /usr/local/Cellar/solr/4.8.1/libexec/example/
$ java -jar start.jar

Check that Solr is running at:

http://localhost:8983/solr/admin/


Now feed the crawled data to Solr with Nutch:

$ bin/nutch solrindex http://localhost:8983/solr/ -all

Alternatively, use the bundled crawl script (arguments: seed directory, crawl ID, Solr URL, number of rounds):

$ bin/crawl urls/seed.txt testCrawl localhost:8983/solr/ 2


After running this command, it creates a 'testCrawl_webpage' table in HBase and stores all the crawl data there.


Now we can search the indexed data in Solr:

http://localhost:8983/solr/#/collection1/query
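
The query UI above calls Solr's select handler, so the same search can be run from the command line; content is the field name used by the Solr schema that ships with Nutch (adjust if your schema differs):

$ curl "http://localhost:8983/solr/collection1/select?q=content:google&wt=json&indent=true"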



Apache Nutch 2.x Commands:

                        $ bin/nutch readdb
                         (Read/dump crawl db)
Usage: WebTableReader (-stats | -url [url] | -dump <out_dir> [-regex regex])
                      [-crawlId <id>] [-content] [-headers] [-links] [-text]
    -crawlId <id>  - the id to prefix the schemas to operate on,
                     (default: storage.crawl.id)
    -stats [-sort] - print overall statistics to System.out
    [-sort]        - list status sorted by host
    -url <url>     - print information on <url> to System.out
    -dump <out_dir> [-regex regex] - dump the webtable to a text file in
                     <out_dir>
    -content       - dump also raw content
    -headers       - dump protocol headers
    -links         - dump links
    -text          - dump extracted text
    [-regex]       - filter on the URL of the webtable entry



                        $ bin/nutch inject
                         (Inject new urls into the database)
                         Usage: InjectorJob <url_dir> [-crawlId <id>]

                        $ bin/nutch hostinject
                         (Inject new urls into the hostdatabase)

                       $ bin/nutch generate
                        (Generate new segments to fetch from crawldb)
 Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]
    -topN <N>      - number of top URLs to be selected, default is Long.MAX_VALUE
    -crawlId <id>  - the id to prefix the schemas to operate on,
                    (default: storage.crawl.id)
    -noFilter      - do not activate the filter plugin to filter the url, default is true
    -noNorm        - do not activate the normalizer plugin to normalize the url, default is true
    -adddays       - Adds numDays to the current time to facilitate crawling urls already fetched sooner then db.default.fetch.interval. Default value is 0.


                       $ bin/nutch fetch
                       (Fetch a segment's pages)
Usage: FetcherJob (<batchId> | -all) [-crawlId <id>] [-threads N] [-resume] [-numTasks N]
       <batchId>     - crawl identifier returned by Generator, or -all for all
                    generated batchId-s
       -crawlId <id> - the id to prefix the schemas to operate on,
                    (default: storage.crawl.id)
       -threads N    - number of fetching threads per task
       -resume       - resume interrupted job
       -numTasks N   - if N > 0 then use this many reduce tasks for fetching
                    (default: mapred.map.tasks)
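
For example, to fetch all generated batches with 10 fetching threads per task (10 being an arbitrary choice):

$ bin/nutch fetch -all -threads 10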


                       $ bin/nutch parse
                      (Parse a segment's pages)
Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
    <batchId>     - symbolic batch ID created by Generator
    -crawlId <id> - the id to prefix the schemas to operate on,
                    (default: storage.crawl.id)
    -all          - consider pages from all crawl jobs
    -resume       - resume a previous incomplete job
    -force        - force re-parsing even if a page is already parsed

                      $ bin/nutch updatedb
                      (Update crawldb after fetching)

                     $ bin/nutch updatehostdb
                     (Update hostdb after fetching)

                     $ bin/nutch elasticindex
(Run the elastic search indexer on parsed batches)

                      $ bin/nutch solrindex
                     (Run the solr indexer on parsed segments and linkdb)
                     Usage: SolrIndexerJob <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]
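
For example, to push the pages crawled with the crawl script above (crawl ID testCrawl) into the local Solr instance:

$ bin/nutch solrindex http://localhost:8983/solr/ -all -crawlId testCrawl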

                       $ bin/nutch parsechecker
                       (Checks the parser for a given url)
$   bin/nutch plugin
(Loads a plugin and run one of its classes main())

$ bin/nutch NutchServer
(run a (local) Nutch server on a user defined port)
usage: NutchServer [-help] [-log <loging level>] [-port] [-stop <force>]
 -help                 Show this help
 -log <loging level>   Select a logging level for the
                       NutchServer.ALL|CONFIG|FINER|FINEST|INFO|OFF|SEVERE
                       |WARNING
 -port                 Use port for restful API
 -stop <force>         Stop running nutch server. Force stops server
                       despite running jobs
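
For example, to start the server on a specific port (8081 is an arbitrary choice here):

$ bin/nutch NutchServer -port 8081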

$ bin/nutch junit
             (Runs the given JUnit test)

           $  bin/nutch  CLASSNAME
           (run the class named CLASSNAME)

2 comments:

  1. Do you have the steps for crawling RSS feeds using Nutch? Please let me know.

     Reply: Please check
     http://best4dev.blogspot.in/2015/04/rss-feeds-in-nutch.html