\UseRawInputEncoding
\documentclass[a4paper]{article}
\usepackage{Sweave}
\usepackage[margin=2cm]{geometry}
\usepackage[round]{natbib}
\usepackage{url}
\usepackage{hyperref}
\usepackage{listings}
\let\code=\texttt
\newcommand{\acronym}[1]{\textsc{#1}}
\newcommand{\class}[1]{\mbox{\textsf{#1}}}
\newcommand{\pkg}[1]{{\normalfont\fontseries{b}\selectfont #1}}
\newcommand{\proglang}[1]{\textsf{#1}}
\newcommand{\fkt}[1]{\code{#1()}}
\newcommand{\todo}[1]{\begin{center}\code{#1}\end{center}}
\newcommand{\field}[1]{\code{\$#1}}
\sloppy
%% \VignetteIndexEntry{Introduction to the boilerpipeR Package}
\SweaveOpts{prefix.string=boilerpipeR}
\SweaveOpts{include=FALSE}

\begin{document}

<<echo=FALSE>>=
library(boilerpipeR)
data(content)
options(width = 60)
@

\title{Short Introduction to \pkg{boilerpipeR}}
\author{Mario Annau\\
\texttt{mario.annau@gmail.com}}
\maketitle

\begin{abstract}
This vignette gives a short introduction to \pkg{boilerpipeR}, a package which
interfaces the boilerpipe \proglang{Java} library by
\cite{kohlschuetter:webextract}. It implements robust heuristics to extract
the main content from \proglang{HTML} files, removing unwanted elements like
ads, banners, headers and footers.
\end{abstract}

\section{Getting Started}
\pkg{boilerpipeR} provides an \proglang{R} interface to the boilerpipe
\proglang{Java} library. It implements various robust heuristics to extract
the main content from arbitrary web sites. The more sophisticated of the
included algorithms are based on decision trees and have been trained on a
real-world data set retrieved through Google News
(\url{http://news.google.com}).

For a quick content extraction exercise, we first need to retrieve a web
page. After loading the packages for page extraction and retrieval

<<>>=
library(boilerpipeR)
library(RCurl)
@

we can retrieve the content of a web page using \pkg{RCurl}:

<<eval=FALSE>>=
url <- "https://quantivity.wordpress.com/2012/11/09/multi-asset-market-regimes/"
content <- getURL(url)
@

The code above retrieves a posting from a popular finance blog hosted on
WordPress (\url{www.wordpress.com}), currently one of the most popular
blogging engines on the Internet. An inspection of the retrieved content
string reveals a lot of typical \proglang{HTML} markup, including regions
like sidebars, headers, etc.\ (see also Figure~\ref{blogpicture}).

<<>>=
cat(substr(content, 1, 80))
@

\begin{figure}[t]
\centering
\includegraphics{figures/blogpicture}
\caption{Inspection of a typical WordPress blog page
(\href{https://quantivity.wordpress.com}{https://quantivity.wordpress.com}).
At the bottom we can see the \proglang{HTML} DOM tree as parsed by Firebug
(\href{http://getfirebug.com/}{http://getfirebug.com/}). Only the main
content (blue rectangle) is relevant for text mining purposes and should be
extracted.}
\label{blogpicture}
\end{figure}

Simply extracting the \proglang{HTML} \code{body} element and dropping all
markup would still include a lot of unwanted content, which adds noise for
text mining algorithms.
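To illustrate this, the following minimal sketch parses the page and flattens
its \code{body} element to plain text. It assumes the \pkg{XML} package is
installed; \pkg{XML} is not part of \pkg{boilerpipeR} and is used here purely
for illustration:

<<eval=FALSE>>=
library(XML)                                      # HTML parsing, for illustration only
doc <- htmlParse(content, asText = TRUE)          # parse the raw HTML string
bodytext <- xpathSApply(doc, "//body", xmlValue)  # drop all markup inside <body>
cat(substr(bodytext, 1, 120))
@

The resulting text typically still interleaves navigation links, sidebar
widgets and footer notes with the actual article.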
We can therefore use one of the default extractors from \pkg{boilerpipeR}:

<<>>=
extract <- DefaultExtractor(content)
cat(substr(extract, 1, 120))
@

\section{Implemented Extractors}
The list below describes all extractors currently implemented in
\pkg{boilerpipeR}:
\begin{description}
\item[ArticleExtractor]{A full-text extractor which is tuned towards news
articles.}
\item[ArticleSentencesExtractor]{A full-text extractor which is tuned towards
extracting sentences from news articles.}
\item[CanolaExtractor]{A full-text extractor trained on the
\href{http://krdwrd.org/}{krdwrd} Canola data set.}
\item[DefaultExtractor]{A quite generic full-text extractor.}
\item[KeepEverythingExtractor]{Marks everything as content.}
\item[LargestContentExtractor]{A full-text extractor which extracts the
largest text component of a page.}
\item[NumWordsRulesExtractor]{A quite generic full-text extractor based
solely on the number of words per block.}
\end{description}
\newpage
The following commands show how the above-mentioned extractors are used:

<<>>=
articleextract <- ArticleExtractor(content)
articlesentencesextract <- ArticleSentencesExtractor(content)
canolaextract <- CanolaExtractor(content)
defaultextract <- DefaultExtractor(content)
keepeverythingextract <- KeepEverythingExtractor(content)
largestcontentextract <- LargestContentExtractor(content)
numwordsrulesextract <- NumWordsRulesExtractor(content)
@

\section{Conclusion}
This vignette has given a quick introduction to \pkg{boilerpipeR}, a package
to extract the main content from \proglang{HTML} pages. Although
\fkt{DefaultExtractor} works well for most purposes and web pages, some page
templates may require specialized extraction algorithms or some time spent
fine-tuning existing ones. With the presented package, users now have a
convenient playground for experimenting with extraction algorithms from
within \proglang{R}. Although \pkg{boilerpipeR} interfaces \proglang{Java}
code, it has proven to be fast and memory efficient, even for larger
extraction tasks. Please refer to \cite{kohlschuetter:webextract} for a
detailed explanation of the implemented extractors and
\cite{textextractioncomparison} for a performance comparison of similar text
extraction algorithms.

\bibliographystyle{plainnat}
\bibliography{references}

\end{document}