Content mining is the mining, extraction and integration of useful data, information and knowledge from Web page content. The heterogeneity and lack of structure that permeates much of the ever-expanding information sources on the World Wide Web, such as hypertext documents, makes automated discovery, organization, and management of Web-based information difficult. Traditional search and indexing tools of the Internet and the World Wide Web, such as Lycos, Alta Vista, WebCrawler, ALIWEB [6], MetaCrawler, and others, provide some comfort to users, but they do not generally provide structural information, nor do they categorize, filter, or interpret documents. In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent web agents, and to extend database and data mining techniques to provide a higher level of organization for the semi-structured data available on the web. The agent-based approach to web mining involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize web-based information.
Web content mining can be approached from two different points of view: the Information Retrieval view and the Database view. From the Information Retrieval view, most research uses the bag-of-words model, which is based on statistics about single words in isolation, to represent unstructured text, taking the single words found in the training corpus as features. For semi-structured data, the work exploits the HTML structure inside documents, and some of it also uses the hyperlink structure between documents, for document representation. From the Database view, in order to achieve better information management and querying on the web, mining tries to infer the structure of a web site so that the site can be transformed into a database.
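As a rough illustration of the bag-of-words idea, the same effect can be sketched with standard Unix tools: split a document into words, throw away order and structure, and count how often each word occurs. (document.txt is just a hypothetical input file here.)

# lower-case the text, split on non-letter characters, count each distinct word
tr -cs '[:alpha:]' '\n' < document.txt | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn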
Quickscrape
quickscrape is designed to enable large-scale content mining. Scrapers are defined in separate JSON files that follow a defined structure (ScraperJSON). This has important benefits:
- No programming required! Non-programmers can make scrapers using a text editor and a web browser with an element inspector (e.g. Chrome).
- Large collections of scrapers can be maintained to retrieve similar sets of information from different pages. For example: newspapers or academic journals.
- Any other software supporting the same format could use the same scraper definitions.
quickscrape is being developed to allow the community early access to the technology that will drive ContentMine, such as ScraperJSON and our Node.js scraping library thresher.

Installation
The simplest way to install Node.js on OS X is to go to http://nodejs.org/download/, then download and run the Mac OS X Installer.
Alternatively, if you use the excellent Homebrew package manager, simply run:
brew update
brew install node

Then you can install quickscrape:

sudo npm install --global --unsafe-perm quickscrape
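To check that everything installed correctly, both of these commands should print a version number:

node --version
quickscrape --version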
Run quickscrape --help from the command line to get help:
Usage: quickscrape [options]

Options:
-h, --help output usage information
-V, --version output the version number
-u, --url <url> URL to scrape
-r, --urllist <path> path to file with list of URLs to scrape (one per line)
-s, --scraper <path> path to scraper definition (in JSON format)
-d, --scraperdir <path> path to directory containing scraper definitions (in JSON format)
-o, --output <path> where to output results (directory will be created if it doesn't exist)
-r, --ratelimit <int> maximum number of scrapes per minute (default 3)
-l, --loglevel <level> amount of information to log (silent, verbose, info*, data, warn, error)
Extract data from a single URL with a predefined scraper
First, you'll want to grab some pre-cooked definitions:
git clone https://github.com/ContentMine/journal-scrapers.git
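The journal-scrapers repository contains ScraperJSON definitions for a number of journals, including the peerj.json scraper used below; you can list the available definitions with:

ls journal-scrapers/scrapers/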
In /usr/local/lib/node_modules/quickscrape/bin/quickscrape.js, replace

var ScraperBox = thresher.scraperbox;

with

var ScraperBox = thresher.ScraperBox;

Now just run quickscrape:
quickscrape \
--url https://peerj.com/articles/384 \
--scraper journal-scrapers/scrapers/peerj.json \
--output peerj-384
Check the output directory peerj-384: it contains a folder named https_peerj.com_articles_384. Open that folder and list its contents:

$ ls
fulltext.xml rendered.html results.json
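results.json holds the elements captured by the scraper as JSON, while fulltext.xml and rendered.html should be the downloaded article XML and the rendered page (the exact set of files depends on the scraper definition). You can inspect the captured data directly:

cat peerj-384/https_peerj.com_articles_384/results.json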
Scraping a list of URLs
Create a file urls.txt:
http://www.mdpi.com/1420-3049/19/2/2042/htm
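The file simply lists one URL per line; if you prefer, you can create it from the shell:

echo 'http://www.mdpi.com/1420-3049/19/2/2042/htm' > urls.txt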
We want to extract basic metadata, PDFs, and all figures with captions. We can write a simple ScraperJSON scraper to do that and save it as molecules_figures.json:

{
"url": "mdpi",
"elements": {
"dc.source": {
"selector": "//meta[@name='dc.source']",
"attribute": "content"
},
"figure_img": {
"selector": "//div[contains(@id, 'fig')]/div/img",
"attribute": "src",
"download": true
},
"figure_caption": {
"selector": "//div[contains(@class, 'html-fig_description')]"
},
"fulltext_pdf": {
"selector": "//meta[@name='citation_pdf_url']",
"attribute": "content",
"download": true
},
"fulltext_html": {
"selector": "//meta[@name='citation_fulltext_html_url']",
"attribute": "content",
"download": true
},
"title": {
"selector": "//meta[@name='citation_title']",
"attribute": "content"
},
"author": {
"selector": "//meta[@name='citation_author']",
"attribute": "content"
},
"date": {
"selector": "//meta[@name='citation_date']",
"attribute": "content"
},
"doi": {
"selector": "//meta[@name='citation_doi']",
"attribute": "content"
},
"volume": {
"selector": "//meta[@name='citation_volume']",
"attribute": "content"
},
"issue": {
"selector": "//meta[@name='citation_issue']",
"attribute": "content"
},
"firstpage": {
"selector": "//meta[@name='citation_firstpage']",
"attribute": "content"
},
"description": {
"selector": "//meta[@name='description']",
"attribute": "content"
}
}
}
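The selectors in the scraper are XPath expressions evaluated against the page. If you want to sanity-check a selector before adding it to a scraper, one option (not part of the ContentMine toolchain, just a convenient shell-based check) is to fetch the page with curl and query it with xmllint:

# fetch the article page, then evaluate one of the selectors against it
curl -sL 'http://www.mdpi.com/1420-3049/19/2/2042/htm' -o article.html
xmllint --html --xpath "//meta[@name='citation_title']/@content" article.html 2>/dev/null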
Now run:

quickscrape --urllist urls.txt --scraper molecules_figures.json --output my_test
Now you are able to scrape data from any URL or list of URLs. quickscrape also lets you pass a directory of scraper definitions, so that different types of URLs can be scraped in a single command:

quickscrape --urllist urls.txt --scraperdir journal-scrapers/scrapers/ --output my_test
