Apache-Nutch-2.2.1 with
Hbase-0.90.4, Solr-4.8.1
Installation on Mac
About Apache Nutch:
Apache Nutch is an open source Web crawler written in Java. With it, we can find Web page hyperlinks in an automated manner, reduce maintenance work such as checking for broken links, and create a copy of all visited pages for searching over.
Features
• Fetching and parsing are done separately by default; this reduces the risk of an error during parsing corrupting the fetch stage of a crawl.
• Plugins have been overhauled as a direct result of removal of legacy Lucene dependency for indexing and search.
• The number of plugins for processing various document types being shipped with Nutch has been refined. Plain text, XML, OpenDocument (OpenOffice.org), Microsoft Office (Word, Excel, Powerpoint), PDF, RTF, MP3 (ID3 tags) are all now parsed by the Tika plugin. The only parser plugins shipped with Nutch now are Feed (RSS/Atom), HTML, Ext, JavaScript, SWF, Tika & ZIP.
• Distributed filesystem (via Hadoop)
• Link-graph database
• NTLM authentication
About Apache Solr:
Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.
Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.
Features:
• Advanced Full-Text Search Capabilities
• Optimized for High Volume Web Traffic
• Standards Based Open Interfaces - XML, JSON and HTTP
• Comprehensive HTML Administration Interfaces
• Server statistics exposed over JMX for monitoring
• Linearly scalable, auto index replication, auto failover and recovery
• Near Real-time indexing
• Flexible and Adaptable with XML configuration Extensible Plugin Architecture
About Apache HBase:
HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection). HBase is a type of "NoSQL" database.
Features:
• Linear and modular scalability.
• Strictly consistent reads and writes.
• Automatic and configurable sharding of tables
• Automatic failover support between RegionServers.
• Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
• Easy to use Java API for client access.
• Block cache and Bloom Filters for real-time queries.
• Query predicate push down via server side Filters
• Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options
• Extensible jruby-based (JIRB) shell
• Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX
Installation
HBase Installation steps:
Download HBase 0.90.4 (hbase-0.90.4.tar.gz) from the Apache download archive.
Untar the file:
$ tar -vxf hbase-0.90.4.tar.gz
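For reference, the full download and install sequence might look like this (a sketch that assumes the standard Apache archive layout; verify the URL before use, and note the folder is renamed to /usr/local/Hbase to match the paths used below):
$ curl -O http://archive.apache.org/dist/hbase/hbase-0.90.4/hbase-0.90.4.tar.gz
$ tar -xvf hbase-0.90.4.tar.gz
$ sudo mv hbase-0.90.4 /usr/local/Hbase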
Edit /usr/local/Hbase/conf/hbase-site.xml as below:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///usr/local/hbase</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
</configuration>
Add JAVA_HOME to /usr/local/Hbase/conf/hbase-env.sh
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home
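# The empty krb5 settings below are a common workaround for the
# "Unable to load realm info from SCDynamicStore" warning HBase prints on OS X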
export HBASE_OPTS="-Djava.security.krb5.realm= -Djava.security.krb5.kdc="
Start HBase:
$ ./bin/start-hbase.sh
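To confirm the daemon actually started, jps (shipped with the JDK) should show an HMaster process:
$ jps    # look for an HMaster entry; if it is missing, check the logs/ directory under /usr/local/Hbase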
Check that HBase installed correctly:
$ ./bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.90.4, r1150278, Sun Jul 24 15:53:29 PDT 2011
Create table
hbase(main):001:0> create 'test', 'cf'
0 row(s) in 0.4340 seconds
Check table
hbase(main):002:0> list 'test'
TABLE
test
1 row(s) in 0.0580 seconds
Put data on table
hbase(main):003:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.2130 seconds
hbase(main):004:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0140 seconds
hbase(main):005:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0130 seconds
Check records of table
hbase(main):006:0> scan 'test'
ROW COLUMN+CELL
row1 column=cf:a, timestamp=1403154436134, value=value1
row2 column=cf:b, timestamp=1403154448918, value=value2
row3 column=cf:c, timestamp=1403154456718, value=value3
3 row(s) in 0.0910 seconds
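Optionally, read a single row back and remove the test table when you are done; at the HBase shell prompt (prompt numbers and timings will vary):
get 'test', 'row1'
disable 'test'
drop 'test'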
hbase(main):008:0> exit
Stop HBase:
$ ./bin/stop-hbase.sh
Apache Nutch Installation steps:
Download apache-nutch-2.2.1-src.tar.gz from the Apache Nutch download page.
Extract the apache-nutch-2.2.1-src.tar.gz file.
Move the extracted folder to /usr/local:
$ mv apache-nutch-2.2.1 /usr/local/
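For reference, the download and extract steps might look like this (a sketch assuming the Apache archive layout; verify the URL against the Nutch download page):
$ curl -O http://archive.apache.org/dist/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz
$ tar -xvf apache-nutch-2.2.1-src.tar.gz
$ mv apache-nutch-2.2.1 /usr/local/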
Edit the /usr/local/apache-nutch-2.2.1/conf/nutch-site.xml file:
<configuration>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>http.agent.name</name>
<value>NutchCrawler</value>
</property>
<property>
<name>http.robots.agents</name>
<value>NutchCrawler,*</value>
</property>
</configuration>
Edit the /usr/local/apache-nutch-2.2.1/conf/hbase-site.xml file:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///usr/local/hbase</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
</configuration>
Edit /usr/local/apache-nutch-2.2.1/conf/gora.properties:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
Uncomment the gora-hbase dependency in /usr/local/apache-nutch-2.2.1/ivy/ivy.xml:
<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
Edit /usr/local/apache-nutch-2.2.1/conf/regex-urlfilter.txt and add your seed URL pattern:
+^http://work-at-google.com
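Rules in this file are evaluated top to bottom and the first match wins, so a real crawl usually keeps the default skip rules above the domain rule. A hedged example (adjust the domain to your own seed):
# skip URLs with common image/binary extensions (as in the default filter)
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|exe|EXE|jpeg|JPEG|bmp|BMP)$
# accept pages under the seed domain
+^http://([a-z0-9]*\.)*work-at-google\.com/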
Run ant
$ ant clean
$ ant runtime
This creates a runtime folder inside the apache-nutch-2.2.1 directory:
/usr/local/apache-nutch-2.2.1/runtime
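A quick sanity check on the build output (the exact listing may vary slightly by version):
$ ls /usr/local/apache-nutch-2.2.1/runtime              # should list deploy and local
$ ls /usr/local/apache-nutch-2.2.1/runtime/local/bin    # should contain the nutch and crawl scripts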
Change into the /usr/local/apache-nutch-2.2.1/runtime/local directory and create a urls directory containing a seed file:
$ mkdir urls
$ echo "http://work-at-google.com" > urls/seed.txt
Set JAVA_HOME and run one crawl cycle:
$ export JAVA_HOME="$(/usr/libexec/java_home)"
$ bin/nutch inject urls
$ bin/nutch generate -topN 5
$ bin/nutch fetch -all
$ bin/nutch parse -all
$ bin/nutch updatedb
After running these steps, Nutch creates a webpage table in HBase that stores all the crawl and fetch data.
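These five commands make up one crawl round. A sketch of running several rounds by hand from runtime/local (roughly what the crawl script below automates, minus the Solr indexing step):
$ bin/nutch inject urls
$ for i in 1 2 3; do bin/nutch generate -topN 5; bin/nutch fetch -all; bin/nutch parse -all; bin/nutch updatedb; done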
Install Solr using Homebrew:
$ brew install solr
Start Solr
$ cd /usr/local/Cellar/solr/4.8.1/libexec/example/
$ java -jar start.jar
Check that Solr is running at:
http://localhost:8983/solr/admin/
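You can also check from the command line (a sketch using Solr's core admin API):
$ curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"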
Now feed the crawled data to Solr with Nutch:
$ bin/nutch solrindex http://localhost:8983/solr/ -all
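If solrindex fails with unknown-field errors, one common fix (a hedged sketch, assuming the Homebrew paths above and that your Nutch build ships conf/schema-solr4.xml) is to install Nutch's Solr 4 schema into the example core and restart Solr:
$ cp /usr/local/apache-nutch-2.2.1/conf/schema-solr4.xml \
    /usr/local/Cellar/solr/4.8.1/libexec/example/solr/collection1/conf/schema.xml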
Alternatively, run the whole cycle with the crawl script:
$ bin/crawl urls/seed.txt testCrawl localhost:8983/solr/ 2
After running this command, Nutch creates a testCrawl_webpage table in HBase that stores all the crawl data.
Now we can search over the indexed data in Solr:
http://localhost:8983/solr/#/collection1/query
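Or query directly over HTTP (collection1 is the default example core):
$ curl "http://localhost:8983/solr/collection1/select?q=*:*&wt=json&rows=5"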
Apache Nutch 2.x Commands:
$ bin/nutch readdb
(Read/dump crawl db)
Usage: WebTableReader (-stats | -url [url] | -dump <out_dir> [-regex regex])
[-crawlId <id>] [-content] [-headers] [-links] [-text]
-crawlId <id> - the id to prefix the schemas to operate on,
(default: storage.crawl.id)
-stats [-sort] - print overall statistics to System.out
[-sort] - list status sorted by host
-url <url> - print information on <url> to System.out
-dump <out_dir> [-regex regex] - dump the webtable to a text file in
<out_dir>
-content - dump also raw content
-headers - dump protocol headers
-links - dump links
-text - dump extracted text
[-regex] - filter on the URL of the webtable entry
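For example, to print overall crawl statistics or dump the webtable with extracted text (the output directory name here is just an example):
$ bin/nutch readdb -stats
$ bin/nutch readdb -dump crawl_dump -text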
$ bin/nutch inject
(Inject new urls into the database)
Usage: InjectorJob <url_dir> [-crawlId <id>]
$ bin/nutch hostinject
(Inject new urls into the host database)
$ bin/nutch generate
(Generate new segments to fetch from crawldb)
Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]
-topN <N> - number of top URLs to be selected, default is Long.MAX_VALUE
-crawlId <id> - the id to prefix the schemas to operate on,
(default: storage.crawl.id)
-noFilter - do not activate the filter plugin to filter the url, default is true
-noNorm - do not activate the normalizer plugin to normalize the url, default is true
-adddays - Adds numDays to the current time to facilitate crawling urls already fetched sooner then db.default.fetch.interval. Default value is 0.
$ bin/nutch fetch
(Fetch a segment's pages)
Usage: FetcherJob (<batchId> | -all) [-crawlId <id>] [-threads N] [-resume] [-numTasks N]
<batchId> - crawl identifier returned by Generator, or -all for all
generated batchId-s
-crawlId <id> - the id to prefix the schemas to operate on,
(default: storage.crawl.id)
-threads N - number of fetching threads per task
-resume - resume interrupted job
-numTasks N - if N > 0 then use this many reduce tasks for fetching
(default: mapred.map.tasks)
$ bin/nutch parse
(Parse a segment's pages)
Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
<batchId> - symbolic batch ID created by Generator
-crawlId <id> - the id to prefix the schemas to operate on,
(default: storage.crawl.id)
-all - consider pages from all crawl jobs
-resume - resume a previous incomplete job
-force - force re-parsing even if a page is already parsed
$ bin/nutch updatedb
(Update crawldb after fetching)
$ bin/nutch updatehostdb
(Update hostdb after fetching)
$ bin/nutch elasticindex
(Run the elastic search indexer on parsed batches)
$ bin/nutch solrindex
(Run the solr indexer on parsed segments and linkdb)
Usage: SolrIndexerJob <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]
$ bin/nutch parsechecker
(Checks the parser for a given url)
$ bin/nutch plugin
(Loads a plugin and run one of its classes main())
$ bin/nutch NutchServer
(run a (local) Nutch server on a user defined port)
usage: NutchServer [-help] [-log <loging level>] [-port] [-stop <force>]
-help Show this help
-log <loging level> Select a logging level for the
NutchServer.ALL|CONFIG|FINER|FINEST|INFO|OFF|SEVERE
|WARNING
-port Use port for restful API
-stop <force> Stop running nutch server. Force stops server
despite running jobs
$ bin/nutch junit
(Runs the given JUnit test)
$ bin/nutch CLASSNAME
(run the class named CLASSNAME)
Comment: Do you have the steps for crawling RSS feeds using Nutch? Please let me know.
Reply: Please check http://best4dev.blogspot.in/2015/04/rss-feeds-in-nutch.html