Thursday, 18 September 2014

Content mining


Content mining is the mining, extraction and integration of useful data, information and knowledge from web page content. The heterogeneity and lack of structure that permeates much of the ever-expanding information on the World Wide Web, such as hypertext documents, makes automated discovery and organization difficult. Search and indexing tools for the Internet and the World Wide Web, such as Lycos, Alta Vista, WebCrawler, ALIWEB [6] and MetaCrawler, provide some comfort to users, but they do not generally provide structural information, nor do they categorize, filter, or interpret documents. In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent web agents, and to extend database and data mining techniques to provide a higher level of organization for the semi-structured data available on the web. The agent-based approach to web mining involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize web-based information.
Web content mining can be approached from two points of view: the information retrieval view and the database view. From the information retrieval view, most research represents unstructured text as a bag of words, based on statistics about single words in isolation, and takes the single words found in the training corpus as features. For semi-structured data, the work exploits the HTML structure inside the documents, and sometimes the hyperlink structure between documents, to represent them. From the database view, in order to manage and query information on the web more effectively, mining tries to infer the structure of a web site and transform it into a database.


Content mining tool

Quickscrape

quickscrape is designed to enable large-scale content mining. Scrapers are defined in separate JSON files that follow a defined structure (ScraperJSON). This has important benefits:
  • No programming required! Non-programmers can make scrapers using a text editor and a web browser with an element inspector (e.g. Chrome).
  • Large collections of scrapers can be maintained to retrieve similar sets of information from different pages. For example: newspapers or academic journals.
  • Any other software supporting the same format could use the same scraper definitions.
quickscrape is being developed to give the community early access to the technology that will drive ContentMine, such as ScraperJSON and the Node.js scraping library thresher.
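To give a flavour of the format before the full example later in this post, here is a minimal sketch of a scraper definition, written out as a shell heredoc; the file name, URL pattern and single element are purely illustrative:

# minimal, illustrative ScraperJSON definition (file name, pattern and element made up for this sketch)
cat > example_scraper.json <<'EOF'
{
  "url": "example.com",
  "elements": {
    "title": {
      "selector": "//meta[@name='citation_title']",
      "attribute": "content"
    }
  }
}
EOF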

 

Installation

The simplest way to install Node.js on OS X is to go to http://nodejs.org/download/ and download and run the Mac OS X installer.

Alternatively, if you use the excellent Homebrew package manager, simply run:
brew update
brew install node
Then you can install quickscrape:
sudo npm install --global --unsafe-perm quickscrape
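To check that everything installed correctly, you can ask Node and quickscrape to print their versions (the exact numbers will depend on the releases you installed):

node --version
quickscrape --version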
 
 
Run quickscrape --help from the command line to get help:
  Usage: quickscrape [options]

  Options:

    -h, --help               output usage information
    -V, --version            output the version number
    -u, --url <url>          URL to scrape
    -r, --urllist <path>     path to file with list of URLs to scrape (one per line)
    -s, --scraper <path>     path to scraper definition (in JSON format)
    -d, --scraperdir <path>  path to directory containing scraper definitions (in JSON format)
    -o, --output <path>      where to output results (directory will be created if it doesn't exist)
    -r, --ratelimit <int>    maximum number of scrapes per minute (default 3)
    -l, --loglevel <level>   amount of information to log (silent, verbose, info*, data, warn, error)


Extract data from a single URL with a predefined scraper


First, you'll want to grab some pre-cooked definitions:
git clone https://github.com/ContentMine/journal-scrapers.git
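The repository contains scraper definitions for a number of journals; you can list them to see what is available (the exact contents depend on the version of the repository you clone):

ls journal-scrapers/scrapers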
Before running, open
/usr/local/lib/node_modules/quickscrape/bin/quickscrape.js
and replace
var ScraperBox = thresher.scraperbox;
with
var ScraperBox = thresher.ScraperBox;
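If you prefer to make that edit from the command line, a sed one-liner should do it (shown with the BSD sed syntax used on OS X; on Linux, drop the empty quotes after -i):

sudo sed -i '' 's/thresher\.scraperbox/thresher.ScraperBox/' /usr/local/lib/node_modules/quickscrape/bin/quickscrape.js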

Now just run quickscrape:
quickscrape \
  --url https://peerj.com/articles/384 \
  --scraper journal-scrapers/scrapers/peerj.json \
  --output peerj-384

Check the output directory peerj-384: it contains a folder named after the URL, https_peerj.com_articles_384. Open that folder and list its contents:

$ ls
fulltext.xml  rendered.html  results.json
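results.json holds the data captured by the scraper; the exact fields depend on the scraper definition used. A quick way to inspect it is to pretty-print the JSON (python -m json.tool is just one convenient formatter; any JSON viewer works):

python -m json.tool peerj-384/https_peerj.com_articles_384/results.json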

Scraping a list of URLs

 Create a file urls.txt:

http://www.mdpi.com/1420-3049/19/2/2042/htm
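You can create it with any text editor, or straight from the shell:

cat > urls.txt <<'EOF'
http://www.mdpi.com/1420-3049/19/2/2042/htm
EOF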

We want to extract basic metadata, PDFs, and all figures with captions. We can make a simple ScraperJSON scraper to do that, and save it as molecules_figures.json:

{
  "url": "mdpi",
  "elements": {
    "dc.source": {
      "selector": "//meta[@name='dc.source']",
      "attribute": "content"
    },
    "figure_img": {
      "selector": "//div[contains(@id, 'fig')]/div/img",
      "attribute": "src",
      "download": true
    },
    "figure_caption": {
      "selector": "//div[contains(@class, 'html-fig_description')]"
 
    },
    "fulltext_pdf": {
      "selector": "//meta[@name='citation_pdf_url']",
      "attribute": "content",
      "download": true
    },
    "fulltext_html": {
      "selector": "//meta[@name='citation_fulltext_html_url']",
      "attribute": "content",
      "download": true
    },
    "title": {
      "selector": "//meta[@name='citation_title']",
      "attribute": "content"
    },
    "author": {
      "selector": "//meta[@name='citation_author']",
      "attribute": "content"
    },
    "date": {
      "selector": "//meta[@name='citation_date']",
      "attribute": "content"
    },
    "doi": {
      "selector": "//meta[@name='citation_doi']",
      "attribute": "content"
    },
    "volume": {
      "selector": "//meta[@name='citation_volume']",
      "attribute": "content"
    },
    "issue": {
      "selector": "//meta[@name='citation_issue']",
      "attribute": "content"
    },
    "firstpage": {
      "selector": "//meta[@name='citation_firstpage']",
      "attribute": "content"
    },
    "description": {
      "selector": "//meta[@name='description']",
      "attribute": "content"
  }
}
 }
Now run:
quickscrape --urllist urls.txt --scraper molecules_figures.json --output my_test

   

Now you are able to scrape data from any URL or list of URLs. quickscrape also lets you pass a whole directory of scraper definitions, so that URLs from different sites can be scraped in a single command, with the matching scraper chosen for each URL. For example:

quickscrape --urllist urls.txt --scraperdir journal-scrapers/scrapers/ --output my_test
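To see everything that was downloaded (metadata, PDFs, HTML and figures), you can list the output directory recursively; as in the PeerJ example above, each URL gets its own folder named after the URL:

ls -R my_test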

 
