\documentclass{chapman}
%%% copy Sweave.sty definitions
%%% keeps `sweave' from adding `\usepackage{Sweave}': DO NOT REMOVE
%\usepackage{Sweave}
\RequirePackage[T1]{fontenc}
\RequirePackage{graphicx,ae,fancyvrb}
\IfFileExists{upquote.sty}{\RequirePackage{upquote}}{}
\usepackage{relsize}
\DefineVerbatimEnvironment{Sinput}{Verbatim}{}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{fontfamily=courier,
fontshape=it,
fontsize=\relsize{-1}}
\DefineVerbatimEnvironment{Scode}{Verbatim}{}
\newenvironment{Schunk}{}{}
%%% environment for raw output
\newcommand{\SchunkRaw}{\renewenvironment{Schunk}{}{}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{fontfamily=courier,
fontshape=it,
fontsize=\small}
\rawSinput
}
%%% environment for labeled output
\newcommand{\nextcaption}{}
\newcommand{\SchunkLabel}{
\renewenvironment{Schunk}{\begin{figure}[ht] }{\caption{\nextcaption}
\end{figure} }
\DefineVerbatimEnvironment{Sinput}{Verbatim}{frame = topline}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{frame = bottomline,
samepage = true,
fontfamily=courier,
fontshape=it,
fontsize=\relsize{-1}}
}
%%% S code with line numbers
\DefineVerbatimEnvironment{Sinput}
{Verbatim}
{
%% numbers=left
}
\newcommand{\numberSinput}{
\DefineVerbatimEnvironment{Sinput}{Verbatim}{numbers=left}
}
\newcommand{\rawSinput}{
\DefineVerbatimEnvironment{Sinput}{Verbatim}{}
}
%%% R / System symbols
\newcommand{\R}{\textsf{R}}
\newcommand{\rR}{{R}}
\renewcommand{\S}{\textsf{S}}
\newcommand{\SPLUS}{\textsf{S-PLUS}}
\newcommand{\rSPLUS}{{S-PLUS}}
\newcommand{\SPSS}{\textsf{SPSS}}
\newcommand{\EXCEL}{\textsf{Excel}}
\newcommand{\ACCESS}{\textsf{Access}}
\newcommand{\SQL}{\textsf{SQL}}
%%\newcommand{\Rpackage}[1]{\hbox{\rm\textit{#1}}}
%%\newcommand{\Robject}[1]{\hbox{\rm\texttt{#1}}}
%%\newcommand{\Rclass}[1]{\hbox{\rm\textit{#1}}}
%%\newcommand{\Rcmd}[1]{\hbox{\rm\texttt{#1}}}
\newcommand{\Rpackage}[1]{\index{#1 package@{\fontseries{b}\selectfont #1} package} {\fontseries{b}\selectfont #1}}
\newcommand{\rpackage}[1]{{\fontseries{b}\selectfont #1}}
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rclass}[1]{\index{#1 class@\textit{#1} class}\textit{#1}}
\newcommand{\Rcmd}[1]{\index{#1 function@\texttt{#1} function}\texttt{#1}}
\newcommand{\Roperator}[1]{\texttt{#1}}
\newcommand{\Rarg}[1]{\texttt{#1}}
\newcommand{\Rlevel}[1]{\texttt{#1}}
%%% other symbols
\newcommand{\file}[1]{\hbox{\rm\texttt{#1}}}
%%\newcommand{\stress}[1]{\index{#1}\textit{#1}}
\newcommand{\stress}[1]{\textit{#1}}
\newcommand{\booktitle}[1]{\textit{#1}} %%'
%%% Math symbols
\usepackage{amstext}
\usepackage{amsmath}
\newcommand{\E}{\mathsf{E}}
\newcommand{\Var}{\mathsf{Var}}
\newcommand{\Cov}{\mathsf{Cov}}
\newcommand{\Cor}{\mathsf{Cor}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
\renewcommand{\a}{\mathbf{a}}
\newcommand{\W}{\mathbf{W}}
\newcommand{\C}{\mathbf{C}}
\renewcommand{\H}{\mathbf{H}}
\newcommand{\X}{\mathbf{X}}
\newcommand{\B}{\mathbf{B}}
\newcommand{\V}{\mathbf{V}}
\newcommand{\I}{\mathbf{I}}
\newcommand{\D}{\mathbf{D}}
\newcommand{\bS}{\mathbf{S}}
\newcommand{\N}{\mathcal{N}}
\renewcommand{\L}{L}
\renewcommand{\P}{\mathsf{P}}
\newcommand{\K}{\mathbf{K}}
\newcommand{\m}{\mathbf{m}}
\newcommand{\argmin}{\operatorname{argmin}\displaylimits}
\newcommand{\argmax}{\operatorname{argmax}\displaylimits}
\newcommand{\bx}{\mathbf{x}}
\newcommand{\bbeta}{\mathbf{\beta}}
%%% links
\usepackage{hyperref}
\hypersetup{%
pdftitle = {A Handbook of Statistical Analyses Using R (3rd Edition)},
pdfsubject = {Book},
pdfauthor = {Torsten Hothorn and Brian S. Everitt},
colorlinks = {black},
linkcolor = {black},
citecolor = {black},
urlcolor = {black},
hyperindex = {true},
linktocpage = {true},
}
%%% captions & tables
%% : conflics with figure definition in chapman.cls
%%\usepackage[format=hang,margin=10pt,labelfont=bf]{caption}
%%
\usepackage{longtable}
\usepackage[figuresright]{rotating}
%%% R symbol in chapter 1
\usepackage{wrapfig}
%%% Bibliography
\usepackage[round,comma]{natbib}
\renewcommand{\refname}{References \addcontentsline{toc}{chapter}{References}}
\citeindexfalse
%%% texi2dvi complains that \newblock is undefined, hm...
\def\newblock{\hskip .11em plus .33em minus .07em}
%%% Example sections
\newcounter{exercise}[chapter]
\setcounter{exercise}{0}
\newcommand{\exercise}{\stepcounter{exercise} \item{Ex.~\arabic{chapter}.\arabic{exercise} }}
%% URLs
\newcommand{\curl}[1]{\begin{center} \url{#1} \end{center}}
%%% for manual corrections
%\renewcommand{\baselinestretch}{2}
%%% plot sizes
\setkeys{Gin}{width=0.95\textwidth}
%%% color
\usepackage{color}
%%% hyphenations
\hyphenation{drop-out}
\hyphenation{mar-gi-nal}
%%% new bidirectional quotes need
\usepackage[utf8]{inputenc}
%\usepackage{setspace}
\definecolor{sidebox_todo}{rgb}{1,1,0.2}
\newcommand{\todo}[1]{
\hspace{0pt}%
\marginpar{%
\fcolorbox{black}{sidebox_todo}{%
\parbox{\marginparwidth} {
\raggedright\sffamily\footnotesize{TODO: #1}%
}
}%
}
}
\begin{document}
%% Title page
\title{A Handbook of Statistical Analyses Using \R{} --- 3rd Edition}
\author{Torsten Hothorn and Brian S. Everitt}
\maketitle
%%\VignetteIndexEntry{An Introduction to R}
%%\VignetteDepends{sandwich}
\setcounter{chapter}{0}
\SweaveOpts{prefix.string=figures/HSAUR,eps=FALSE,keep.source=TRUE}
<>=
rm(list = ls())
s <- search()[-1]
s <- s[-match(c("package:base", "package:stats", "package:graphics", "package:grDevices",
"package:utils", "package:datasets", "package:methods", "Autoloads"), s)]
if (length(s) > 0) sapply(s, detach, character.only = TRUE)
if (!file.exists("tables")) dir.create("tables")
if (!file.exists("figures")) dir.create("figures")
set.seed(290875)
options(prompt = "R> ", continue = "+ ",
width = 63, # digits = 4,
show.signif.stars = FALSE,
SweaveHooks = list(leftpar = function()
par(mai = par("mai") * c(1, 1.05, 1, 1)),
bigleftpar = function()
par(mai = par("mai") * c(1, 1.7, 1, 1))))
HSAURpkg <- require("HSAUR3")
if (!HSAURpkg) stop("cannot load package ", sQuote("HSAUR3"))
rm(HSAURpkg)
### hm, R-2.4.0 --vanilla seems to need this
a <- Sys.setlocale("LC_ALL", "C")
###
book <- TRUE
refs <- cbind(c("AItR", "DAGD", "SI", "CI", "ANOVA", "MLR", "GLM",
"DE", "RP", "GAM", "SA", "ALDI", "ALDII", "SIMC", "MA", "PCA",
"MDS", "CA"), 1:18)
ch <- function(x) {
ch <- refs[which(refs[,1] == x),]
if (book) {
return(paste("Chapter~\\\\ref{", ch[1], "}", sep = ""))
} else {
return(paste("Chapter~", ch[2], sep = ""))
}
}
if (file.exists("deparse.R"))
source("deparse.R")
setHook(packageEvent("lattice", "attach"), function(...) {
lattice.options(default.theme =
function()
standard.theme("pdf", color = FALSE))
})
@
\pagestyle{headings}
<>=
book <- FALSE
@
\chapter{An Introduction to \R{} \label{AItR}}
\setcounter{page}{1}
\section{What is \R{}?}
The \R{} system for statistical computing is an environment for
data analysis and graphics.
%% #Z
%% and an implementation of the \R{} programming language.
The root of \R{} is the \S{} language, developed by John
Chambers and colleagues
\citep{HSAUR:Becker+Chambers+Wilks:1988,HSAUR:Chambers+Hastie:1992,HSAUR:Chambers:1998}
at Bell Laboratories (formerly AT\&T, now owned by
Lucent Technologies) starting in the 1960s. The \S{} language was designed
and developed as a programming language for data analysis tasks but in fact
it is a full-featured programming language in its current implementations.
The development of the \R{} system for statistical computing is heavily
influenced by the open source idea: The base distribution of \R{}
\index{Base distribution}
and a large number of
user-contributed extensions are available under the
terms of the Free Software Foundation's GNU General %%'
Public License in source code form.
\index{GNU General Public License}
This licence has two major implications
for the data analyst working with \R. The complete source code is
available and thus the practitioner can investigate the details of the
implementation of a special method, make changes, and distribute
modifications to colleagues. As a side effect, the \R{} system for statistical computing
is available to everyone. All scientists, including, in particular, those
working in developing countries, now have access to state-of-the-art tools for
statistical data analysis without additional costs. With the help of the
\R{} system for statistical computing, research really becomes
reproducible when both the data and the results of all data analysis steps reported in a paper
are available to the readers through an \R{} transcript file. \R{} is
most widely used for teaching undergraduate and graduate statistics classes
at universities all over the world because students can freely use the
statistical computing tools.
The base distribution of \R{} is maintained by a small group of statisticians,
the \R{} Development Core Team. A huge amount of additional functionality is
implemented in add-on packages
\index{Add-on package}
authored and maintained by a large group of
volunteers. The main source of information about the \R{} system is the World
Wide Web with the official home page of the \R{} project being
\curl{http://www.R-project.org}
All resources are available from this page:
the \R{} system itself, a collection of add-on packages, manuals,
documentation, and more.
The intention of this chapter is to
give a rather informal introduction to basic concepts
and data manipulation techniques for the \R{} novice. Instead of a rigid
treatment of the technical background, the most common tasks are illustrated
by practical examples and it is our hope that this will enable
readers to get started without too many problems.
\section{Installing \R{}}
\index{Base system|(}
The \R{} system for statistical computing consists of two major parts: the
base system and a collection of user-contributed add-on packages. The \R{} language is
implemented in the base system.
Implementations of statistical and graphical procedures are separated from
the base system and are organized in the form of \stress{packages}. A package is
\index{Add-on package}
a collection of functions, examples, and documentation. The functionality of a
package is often focused on a special statistical methodology. Both the base system
and packages are distributed via the Comprehensive \R{} Archive
Network (CRAN) accessible under \curl{http://CRAN.R-project.org}
\index{Comprehensive R Archive Network (CRAN)@Comprehensive \R{} Archive Network (CRAN)}
\subsection{The Base System and the First Steps \label{AItR:Base}}
The base system is available in source form and in precompiled form for
various Unix systems, Windows platforms and Mac OS X.
For the data analyst, it
is sufficient to download the precompiled binary distribution and install it
locally. Windows users follow the link
\curl{http://CRAN.R-project.org/bin/windows/base/release.htm}
download the
corresponding file (currently named \file{\Sexpr{HSAUR3:::exename()}}), execute it
locally, and follow the instructions given by the installer.
\index{Base system|)}
\begin{wrapfigure}{lH}[0cm]{2cm}
\includegraphics[width=1.95cm]{graphics/Rlogo_bw}
\end{wrapfigure}
Depending on the operating system, \R{} can be started either by typing
`\texttt{R}' on the shell (Unix systems) or by clicking on the %'
\R{} symbol (as shown left) created by the installer (Windows).
\R{} comes without any frills and on start up shows simply
a short introductory message including the version number
and a prompt `\texttt{>}': %'
\index{Prompt}
<>=
HSAUR3:::Rwelcome()
@
<>=
options(prompt = "> ")
@
One can change the appearance of the prompt by
<>=
options(prompt = "R> ")
@
and we will use the prompt \Rarg{R>} for the display of the code examples
throughout this book. A \texttt{+} sign at the very beginning of a line
indicates a continuing command after a newline.
Essentially, the \R{} system evaluates
commands typed on the \R{} prompt and returns the results of the computations.
The end of a command is indicated by the return key. Virtually all
introductory texts on \R{} start with an example using \R{} as a pocket
calculator, and so do we:
<>=
x <- sqrt(25) + 2
@
This simple statement asks the \R{} interpreter to calculate $\sqrt{25}$ and
then to add $2$. The result of the operation is assigned to
an \R{} object
\index{Object}
with variable name \Robject{x}. The assignment operator
\Roperator{<-} binds the value of its right-hand side to a variable name on the
left-hand side. The value of the object \Robject{x} can be inspected
simply by typing
<>=
x
@
which, implicitly, calls the \Rcmd{print} method:
<>=
print(x)
@
\subsection{Packages}
\index{Add-on package|(}
The base distribution already comes with some high-priority add-on packages, namely
\begin{center}
<>=
colwidth <- 4
ip <- installed.packages(priority = "high")
pkgs <- unique(ip[,"Package"])
pkgs <- paste("\\Rpackage{", pkgs, "}", sep = "")
nrows <- ceiling(length(pkgs) / colwidth)
pkgs <- c(pkgs, rep("", colwidth * nrows - length(pkgs)))
cat(paste(c("\\begin{tabular}{", paste(rep("l", colwidth), collapse=""), "}"), collapse = ""),
"\n", file = "tables/rec.tex", append = FALSE)
for (i in 1:nrows) {
cat(paste(pkgs[(1:colwidth) + (i-1)*colwidth], collapse = " & "),
file = "tables/rec.tex", append = TRUE)
cat("\\\\ \n", file = "tables/rec.tex", append = TRUE)
}
cat("\\end{tabular}\n", file = "tables/rec.tex", append = TRUE)
rm(ip, nrows)
@
\input{tables/rec}
\end{center}
Some of the packages listed here
%% #Z
%% are maintained by members of the \R{} core development team and
implement standard statistical functionality, for example linear models,
classical tests, a huge collection of high-level plotting functions or tools
for survival analysis; many of these will be
described and used in later chapters. Others provide basic infrastructure,
for example for graphic systems, code analysis tools, graphical-user interfaces
or other utilities.
<>=
cp <- available.packages(contriburl = "http://CRAN.r-project.org/src/contrib")
ncp <- sum(!rownames(cp) %in% pkgs)
rm(cp, pkgs)
@
Packages not included in the base distribution can be installed directly
from the \R{} prompt. At the time of writing this chapter,
$\Sexpr{ncp}$ user-contributed packages covering almost all
fields of statistical methodology were available. Certain so-called `task views'
for special topics, such as statistics in the social sciences, environmetrics,
robust statistics, etc., describe important and helpful packages and
are available from
\curl{http://CRAN.R-project.org/web/views/}
<>=
rm(ncp, colwidth, i)
@
Given that an Internet connection is available, a package is installed
by supplying the name of the package to the function \Rcmd{install.packages}.
If, for example,
add-on functionality for robust estimation of covariance matrices via sandwich
estimators
\index{Sandwich estimator}
is required (for example in \Sexpr{ch("ALDII")}),
the \Rpackage{sandwich} package \citep{PKG:sandwich}
can be downloaded and installed via
<>=
install.packages("sandwich")
@
The package functionality is available after \stress{attaching} the package
by
<>=
library("sandwich")
@
A comprehensive list of available packages can be obtained from
\curl{http://CRAN.R-project.org/web/packages/}
Note that on
Windows operating systems, precompiled versions of packages are downloaded
and installed.
%%Currently, the service of overnight compilation of all packages on
%%CRAN for the Windows platform is kindly offered by Uwe Ligges from the
%%University of Dortmund, Germany.
In contrast, packages are compiled locally
before they are installed on Unix systems.
\index{Add-on package|)}
\section{Help and Documentation \label{AItR:HDN}}
\index{Help system|(}
Roughly, three different forms of documentation for the \R{} system for
statistical computing may be distinguished: online help that comes with the
base distribution or packages, electronic manuals, and publications work
in the form of books, etc.
The help system is a collection of manual pages describing each
user-visible function and data set that comes with \R{}.
A manual page is shown in a pager or Web browser when the name
of the function we would like to get help for is supplied to the \Rcmd{help}
function
<>=
help("mean")
@
or, for short,
\begin{Verbatim}
R> ?mean
\end{Verbatim}
Each manual page consists of a general description, the argument list of the
documented function with a
description of each single argument, information about the return value of
the function and, optionally, references, cross-links and, in most cases,
executable examples. The function \Rcmd{help.search} is helpful for searching within
manual pages. An overview on documented topics in an add-on package is given,
for example for the \Rpackage{sandwich} package, by
<>=
help(package = "sandwich")
@
Often a package comes along with an additional document describing the
package functionality and giving examples. Such a document is called a
\Rclass{vignette} \citep{HSAUR:Leisch2003,HSAUR:Gentleman2005}.
For example, the \Rpackage{sandwich} package vignette is opened using
<>=
vignette("sandwich", package = "sandwich")
@
More extensive documentation is available electronically from the collection
of manuals at \curl{http://CRAN.R-project.org/manuals.html}
For the
beginner, at least the first and the second document
of the following four manuals
\citep{HSAUR:AItR,HSAUR:RDIE,HSAUR:RIA,HSAUR:WRE} are mandatory:
\begin{description}
\item[An Introduction to R] A more formal introduction to data analysis
with \R{} than this chapter.
\item[R Data Import/Export] A very useful description of how to read and
write various external data formats.
\item[R Installation and Administration] Hints for installing \R{} on
special platforms.
\item[Writing \R{} Extensions] The authoritative source on how to write \R{}
programs and packages.
\end{description}
Both printed and online publications are available, the most important ones
are \booktitle{Modern Applied Statistics with \S{}}
\citep{HSAUR:VenablesRipley2002},
\booktitle{Introductory Statistics with \R{}} \citep{HSAUR:Dalgaard2002},
\booktitle{\R{} Graphics}
\citep{HSAUR:Murrell2005} and the \R{} Newsletter, freely available from
\curl{http://CRAN.R-project.org/doc/Rnews/}
In case the electronically available documentation and the answers to
frequently asked questions (FAQ), available from
\curl{http://CRAN.R-project.org/faqs.html}
\index{Frequently asked questions (FAQ)}
have been consulted but a problem or question remains unsolved, the
\texttt{r-help} email list is the right place to get answers to well-thought-out
questions. It is helpful to read the posting guide
\curl{http://www.R-project.org/posting-guide.html}
before starting to ask.
\index{Help system|)}
\section{Data Objects in \R{}}
\index{Forbes 2000 ranking|(}
The data handling and manipulation techniques explained in this chapter will
be illustrated by means of a data set of $2000$ world leading companies, the
Forbes 2000 list for the year 2004 collected by \booktitle{Forbes Magazine}.
This list is originally available from
\curl{http://www.forbes.com}
and, as an \R{} data object, it is part of the \Rpackage{HSAUR3}
package (\textit{Source}: From Forbes.com, New York, New York, 2004. With
permission.). In a first step, we make the data available for computations within
\R. The \Rcmd{data} function searches for data objects of the specified
name (\Robject{"Forbes2000"})
in the package specified via the \Rarg{package} argument and, if the
search was successful, attaches the data object to the global environment:
\index{Forbes2000 data@\Robject{Forbes2000} data}
<>=
data("Forbes2000", package = "HSAUR3")
ls()
@
<>=
x <- c("x", "Forbes2000")
print(x)
@
The output of the \Rcmd{ls} function lists the names of all objects
currently stored in the global environment, and, as the result of the
previous command, a variable named \Robject{Forbes2000} is available for
further manipulation. The variable \Robject{x} arises from the pocket
calculator example in Subsection~\ref{AItR:Base}.
As one can imagine, printing a list of $2000$
companies via
<>=
print(Forbes2000)
@
<>=
print(Forbes2000[1:3,])
cat("...\n")
@
will not be particularly helpful
in gathering some initial information about the data; it is more
useful to look at a description of their structure found by using
the following command
<>=
str(Forbes2000)
@
<>=
str(Forbes2000, vec.len = 2, strict.width = "cut", width = 60)
@
The output of the \Rcmd{str} function
tells us that \Robject{Forbes2000} is an object of class
\Rclass{data.frame}, the most
important data structure for handling tabular statistical data in \R.
As
expected, information about $2000$ observations, i.e., companies,
are stored in this object.
For each observation, the following eight variables are available:
\begin{description}
\item[\Robject{rank}] the ranking of the company,
\item[\Robject{name}] the name of the company,
\item[\Robject{country}] the country the company is situated in,
\item[\Robject{category}] a category describing the products the company
produces,
\item[\Robject{sales}] the amount of sales of the company in billion
US dollars,
\item[\Robject{profits}] the profit of the company in billion US dollars,
\item[\Robject{assets}] the assets of the company in billion US dollars,
\item[\Robject{marketvalue}] the market value of the company in billion US dollars.
\end{description}
A similar but more detailed description is available from the help page for the
\Robject{Forbes2000} object:
<>=
help("Forbes2000")
@
or
\begin{Verbatim}
R> ?Forbes2000
\end{Verbatim}
All information provided by \Rcmd{str} can be obtained by specialized
functions as well and we will now have a closer look at the most important
of these.
The \R{} language is an object-oriented programming language,
\index{Object-oriented programming language}
so every object
is an instance of a class. The name of the class of an object can be
determined by
<>=
class(Forbes2000)
@
Objects of class \Rclass{data.frame} represent data the traditional table-oriented
way. Each row is associated with one single observation and each
column corresponds to one variable. The dimensions of such a table can be
extracted using the \Rcmd{dim} function
<>=
dim(Forbes2000)
@
Alternatively, the numbers of rows and columns
can be found using
<>=
nrow(Forbes2000)
ncol(Forbes2000)
@
The results of both statements show that \Robject{Forbes2000} has
$\Sexpr{nrow(Forbes2000)}$ rows, i.e., observations, the companies in our case,
with eight variables describing the observations.
The variable names are accessible from
<>=
names(Forbes2000)
@
The values of single variables can be extracted from the
\Robject{Forbes2000} object by their names, for example the ranking of the
companies
<>=
class(Forbes2000[,"rank"])
@
is stored as an integer variable. Brackets \Robject{[]} always indicate
a subset
\index{Subset} of a
larger object, in our case a single variable extracted from the whole table.
Because \Rclass{data.frame}s have two dimensions, observations and
variables, the comma is required in order to specify that we want a subset
of the second dimension, i.e., the variables. The rankings for all
$\Sexpr{nrow(Forbes2000)}$ companies are represented in a \Rclass{vector}
structure the length of which is given by
<>=
length(Forbes2000[,"rank"])
@
A \Rclass{vector} is the elementary structure for data handling in \R{} and is
a set of simple elements, all being objects of the same class. For example,
a simple vector of the numbers one to three can be constructed by one of the
following commands
<>=
1:3
c(1,2,3)
seq(from = 1, to = 3, by = 1)
@
The unique names of all $\Sexpr{nrow(Forbes2000)}$ companies are stored in a
character vector
\index{character vector@\Rclass{character} vector}
<>=
class(Forbes2000[,"name"])
length(Forbes2000[,"name"])
@
and the first element of this vector is
<>=
Forbes2000[,"name"][1]
@
Because the companies are ranked, Citigroup is the world's largest %'
company according to the Forbes 2000 list. Further details on vectors and
subsetting are given in Section~\ref{AItR:BDM}.
Nominal measurements are represented by \Rclass{factor} variables in \R,
such as the category of the company's business segment %%'
<>=
class(Forbes2000[,"category"])
@
Objects of class \Rclass{factor} and \Rclass{character} basically differ in
the way their values are stored internally.
Each element of a vector of
class \Rclass{character} is stored as a \Rclass{character} variable whereas
an integer variable indicating the level of a \Rclass{factor} is saved for
\Rclass{factor} objects. In our case, there are
<>=
nlevels(Forbes2000[,"category"])
@
different levels, i.e., business categories, which can be extracted by
<>=
levels(Forbes2000[,"category"])
@
<>=
levels(Forbes2000[,"category"])[1:3]
cat("...\n")
@
As a simple summary statistic,
the frequencies of the levels of such a \Rclass{factor} variable can be
found from
<>=
table(Forbes2000[,"category"])
@
<>=
table(Forbes2000[,"category"])[1:3]
cat("...\n")
@
The sales, assets, profits, and market value variables are of type
\Robject{numeric}, the natural data type for continuous or discrete
measurements, for example
<>=
class(Forbes2000[,"sales"])
@
and simple summary statistics such as the mean, median, and range can be
found from
<>=
median(Forbes2000[,"sales"])
mean(Forbes2000[,"sales"])
range(Forbes2000[,"sales"])
@
The \Rcmd{summary} method can be applied to
a numeric vector to give a set of useful summary statistics, namely
the minimum, maximum, mean, median, and the $25\%$ and $75\%$ quartiles;
for example
<>=
summary(Forbes2000[,"sales"])
@
\section{Data Import and Export}
\index{Data import and export|(}
In the previous section, the data from the Forbes 2000 list of the world's largest %%'
companies were loaded into \R{} from the
\Rpackage{HSAUR3} package but we will now explore practically more relevant
ways to import data into the \R{} system. The most frequent data formats
the data analyst is confronted with are comma separated files,
\index{Comma separated files}
\EXCEL{}
spreadsheets,
\index{Excel spreadsheets@\EXCEL{} spreadsheets}
files in \SPSS{} format
\index{SPSS file format@\SPSS{} file format}
and a variety of \SQL{} data base engines.
\index{SQL data bases@\SQL{} data bases}
Querying data bases is a nontrivial task and requires additional knowledge
about querying languages, and we therefore refer to the \booktitle{\R{} Data
Import/Export} manual -- see Section~\ref{AItR:HDN}.
<>=
pkgpath <- system.file(package = "HSAUR2")
mywd <- getwd()
filep <- file.path(pkgpath, "rawdata")
setwd(filep)
@
We assume that a comma-separated file containing the Forbes 2000 list is
available as \file{Forbes2000.csv} (such a file is part of the
\Rpackage{HSAUR3} source package in directory \file{HSAUR3/inst/rawdata}).
When the fields are separated by commas and each row begins with a name
(a text format typically created by \EXCEL{}), we can read in the data as
follows using the \Rcmd{read.table} function
<>=
csvForbes2000 <- read.table("Forbes2000.csv",
header = TRUE, sep = ",", row.names = 1)
@
The argument \Rarg{header = TRUE} indicates that the entries in the first line of the text
file \Robject{"Forbes2000.csv"} should be interpreted as variable names.
Columns are separated by a comma (\Rcmd{sep = ","}), users of continental
versions of \EXCEL{} should take care of the character symbol coding for decimal points
(by default \Rcmd{dec = "."}). Finally, the first column should be
interpreted as row names but not as a variable (\Rarg{row.names = 1}).
Alternatively, the function \Rcmd{read.csv} can be used to read comma-separated files.
The function \Rcmd{read.table} by default guesses the class of each variable from the
specified file. In our case, character variables are stored as factors
<>=
class(csvForbes2000[,"name"])
@
which is only suboptimal since the names of the companies are unique.
However, we can supply the types for each variable to the \Rarg{colClasses}
argument
<>=
csvForbes2000 <- read.table("Forbes2000.csv",
header = TRUE, sep = ",", row.names = 1,
colClasses = c("character", "integer", "character",
"factor", "factor", "numeric", "numeric", "numeric",
"numeric"))
class(csvForbes2000[,"name"])
@
and check if this object is identical to our previous Forbes 2000 list
object
<>=
all.equal(csvForbes2000, Forbes2000)
@
The argument \Rarg{colClasses} expects a character vector of length equal to
the number of columns in the file. Such a vector can be supplied by the
\Rcmd{c} function that combines the objects given in the parameter list into
a \Rclass{vector}
<>=
classes <- c("character", "integer", "character", "factor",
"factor", "numeric", "numeric", "numeric", "numeric")
length(classes)
class(classes)
@
An \R{} interface to the open data base
connectivity (ODBC) standard
\index{Open data base connectivity standard (ODBC)}
is available in package \Rpackage{RODBC} and
its functionality can be used to access \EXCEL{} and \ACCESS{} files directly:
<>=
library("RODBC")
cnct <- odbcConnectExcel("Forbes2000.xls")
sqlQuery(cnct, "select * from \"Forbes2000\\$\"")
@
The function \Rcmd{odbcConnectExcel} opens a connection to the specified
\EXCEL{} or \ACCESS{} file which can be used to send \SQL{} queries to the
data base engine and retrieve the results of the query.
<>=
setwd(mywd)
@
Files in \SPSS{} format are read in a way similar to reading comma-separated files, using the
function \Rcmd{read.spss} from package \Rpackage{foreign} (which comes with
the base distribution).
Exporting data from \R{} is now rather straightforward. A comma-separated file
readable by \EXCEL{} can be constructed from a \Rclass{data.frame} object via
<>=
write.table(Forbes2000, file = "Forbes2000.csv", sep = ",",
col.names = NA)
@
The function \Rcmd{write.csv} is one alternative and the functionality implemented
in the \Rpackage{RODBC} package can be used to write data directly into \EXCEL{}
spreadsheets as well.
\index{Saving R objects@Saving \R{} objects}
Alternatively, when data should be saved for later processing in \R{} only,
\R{} objects of arbitrary kind can be stored into an external binary file via
<>=
save(Forbes2000, file = "Forbes2000.rda")
@
where the extension \file{.rda} is standard. We can get the file names of
all files with extension \file{.rda} from the working directory
<>=
list.files(pattern = "\\.rda")
@
and we can load the contents of the file into \R{} by
<>=
load("Forbes2000.rda")
@
\index{Data import and export|)}
\section{Basic Data Manipulation \label{AItR:BDM}}
\index{Data manipulation|(}
The examples shown in the previous section have illustrated the
importance of \Rclass{data.frame}s for storing and handling
tabular data in \R. Internally, a \Rclass{data.frame} is a \Rclass{list} of vectors
of a common length $n$, the number of rows of the table. Each of those vectors
represents the measurements of one variable and we have seen that we
can access such a variable by its name, for example the names of the
companies
<>=
companies <- Forbes2000[,"name"]
@
Of course, the \Robject{companies} vector is of class \Rclass{character} and
of length $\Sexpr{length(companies)}$. A subset
\index{Subset}
of the elements of the
vector \Robject{companies} can be extracted using the \Rcmd{[]} subset
operator. For example, the largest of the $2000$ companies listed in
the Forbes 2000 list is
<>=
companies[1]
@
and the top three companies can be extracted utilizing an integer vector of
the numbers one to three:
<>=
1:3
companies[1:3]
@
In contrast to indexing with positive integers, negative indexing returns
\index{negative indexing}
all elements that are \stress{not} part of the index vector given in brackets. For
example, all companies except those with numbers four to two thousand, i.e.,
the top three companies, are again
<>=
companies[-(4:2000)]
@
The complete information about the top three companies can be printed in a
similar way. Because \Rclass{data.frame}s have a concept of rows and
columns, we need to separate the subsets corresponding to rows and columns
by a comma. The statement
<>=
Forbes2000[1:3, c("name", "sales", "profits", "assets")]
@
extracts the variables \Robject{name}, \Robject{sales}, \Robject{profits}
and \Robject{assets}
for the three largest companies. Alternatively, a single variable can be
extracted from a \Rclass{data.frame} by
<>=
companies <- Forbes2000$name
@
which is equivalent to the previously shown statement
<>=
companies <- Forbes2000[,"name"]
@
We might be interested in extracting the largest companies with respect to
an alternative ordering. The three top-selling companies can be computed
along the following lines.
First, we need to compute the ordering of the companies' sales %%'
<>=
order_sales <- order(Forbes2000$sales)
@
which returns the indices of the
ordered elements of the numeric vector \Robject{sales}.
Consequently the three companies with the lowest sales are
<>=
companies[order_sales[1:3]]
@
The indices of the three top sellers are the elements $1998, 1999$ and $2000$
of the integer vector \Robject{order\_sales}
<>=
Forbes2000[order_sales[c(2000, 1999, 1998)],
c("name", "sales", "profits", "assets")]
@
Another way of selecting vector elements is the use of a logical vector
being \Robject{TRUE} when the corresponding element is to be selected and
\Robject{FALSE} otherwise. The companies with assets of more than $1000$
billion US dollars are
<>=
Forbes2000[Forbes2000$assets > 1000,
c("name", "sales", "profits", "assets")]
@
where the expression \Robject{Forbes2000\$assets > 1000} indicates a logical
vector of length $2000$ with
<>=
table(Forbes2000$assets > 1000)
@
elements being either \Robject{FALSE} or \Robject{TRUE}. In fact, for some of the
companies the measurement of the \Robject{profits} variable are missing. In
\R, missing values are treated by a special symbol, \Robject{NA}, indicating
\index{NA symbol@\Robject{NA} symbol}
that this measurement is not available.
\index{Missing values}
The observations with profit information missing can be obtained
via
<>=
na_profits <- is.na(Forbes2000$profits)
table(na_profits)
Forbes2000[na_profits,
c("name", "sales", "profits", "assets")]
@
where the function \Rcmd{is.na} returns a logical vector being
\Robject{TRUE} when the corresponding element of the supplied vector is
\Robject{NA}. A more comfortable approach is available when we want to
remove all observations with at least one missing value from a
\Rclass{data.frame} object. The function \Rcmd{complete.cases} takes a
\Rclass{data.frame} and returns a logical vector being \Robject{TRUE} when
the corresponding observation does not contain any missing value:
<>=
table(complete.cases(Forbes2000))
@
Subsetting \Rclass{data.frame}s driven by logical expressions may induce a
lot of typing which can be avoided. The \Rcmd{subset} function takes a
\Rclass{data.frame} as first argument and a logical expression as second
argument. For example, we can select a subset of the Forbes 2000 list
consisting of all companies situated in the United Kingdom by
<>=
UKcomp <- subset(Forbes2000, country == "United Kingdom")
dim(UKcomp)
@
i.e., $\Sexpr{nrow(UKcomp)}$ of the $2000$ companies are from the UK. Note
that it is not necessary to extract the variable \Robject{country} from the
\Rclass{data.frame} \Robject{Forbes2000} when formulating the logical
expression with \Rcmd{subset}.
\index{Data manipulation|)}
\section{Computing with Data}
\subsection{Simple Summary Statistics}
Two functions are helpful for getting an overview about \R{} objects:
\Rcmd{str} and \Rcmd{summary}, where \Rcmd{str} is more detailed about data
types and \Rcmd{summary} gives a collection of sensible summary statistics.
For example, applying the \Rcmd{summary} method to the \Robject{Forbes2000}
data set,
<>=
summary(Forbes2000)
@
results in the following output
<>=
summary(Forbes2000)
@
From this output we can immediately see
that most of the companies are situated in the US and
that most of the companies are working in the banking sector as well as that
negative profits, or losses, up to
$\Sexpr{abs(round(min(Forbes2000$profits, na.rm = TRUE)))}$ billion
US dollars occur.
Internally, \Rcmd{summary} is a so-called \stress{generic function}
\index{Generic function} with
methods for a multitude of classes, i.e., \Rcmd{summary} can be applied to
objects of different classes and will report sensible results. Here, we supply a
\Rclass{data.frame} object to \Rcmd{summary} where it is natural to apply
\Rcmd{summary} to each of the variables in this \Rclass{data.frame}. Because
a \Rclass{data.frame} is a \Rclass{list} with each variable being an element of
that \Rclass{list}, the same effect can be achieved by
<>=
lapply(Forbes2000, summary)
@
\index{apply family@\Rcmd{apply} family}
The members of the \Rcmd{apply} family help to solve recurring tasks for each
element of a \Rclass{data.frame}, \Rclass{matrix}, \Rclass{list} or for each
level of a \Rclass{factor}. It might be interesting to compare the profits in each of the
$\Sexpr{nlevels(Forbes2000$category)}$ categories. To do so, we first
compute the median profit for each category from
<>=
mprofits <- tapply(Forbes2000$profits,
Forbes2000$category, median, na.rm = TRUE)
@
a command that should be read as follows. For each level of the factor
\Robject{category}, determine the corresponding elements of the
numeric vector \Robject{profits} and supply them to the
\Rcmd{median} function with additional argument \Rarg{na.rm = TRUE}. The
latter one is necessary because \Robject{profits} contains
missing values which would lead to a non-sensible result of the
\Rcmd{median} function
<>=
median(Forbes2000$profits)
@
The three categories with highest median profit are computed from the vector
of sorted median profits
<>=
rev(sort(mprofits))[1:3]
@
where \Rcmd{rev} rearranges the vector of median profits
\Rcmd{sort}ed from smallest to largest.
Of course, we can replace the \Rcmd{median} function with \Rcmd{mean} or
whatever is appropriate in the call to \Rcmd{tapply}.
In our situation, \Rcmd{mean} is not a good
choice, because the distributions of profits or sales are naturally skewed.
Simple graphical tools for the inspection of the empirical
distributions are introduced later on and in \Sexpr{ch("DAGD")}.
\subsection{Customizing Analyses}
\index{Functions|(}
In the preceding sections we have done quite complex analyses
on our data using functions available from \R{}. However, the
real power of the system comes to light when writing our own functions
for our own analysis tasks. Although \R{} is a full-featured
programming language, writing small helper functions
for our daily work is not too complicated. We'll study two example
cases.
At first, we want to add a robust measure of variability to the
location measures computed in the previous subsection. In
addition to the median profit, computed via
<>=
median(Forbes2000$profits, na.rm = TRUE)
@
we want to compute the inter-quartile range, i.e., the difference
between the 3rd and 1st quartile. Although a quick
search in the manual pages (via \texttt{help("interquartile")})
brings function \Rcmd{IQR} to our attention, we will approach this
task without making use of this tool, but using function \Rcmd{quantile}
for computing sample quantiles only.
A function in \R{} is nothing but an object, and all objects are
created equal. Thus, we `just' have to assign a \Rclass{function}
object to a variable. A \Rclass{function} object consists
of an argument list, defining arguments and possibly default
values, and a body defining the computations. The body
starts and ends with braces. Of course, the
body is assumed to be valid \R{} code. In most cases we expect
a function to return an object, therefore, the body will
contain one or more \Rcmd{return} statements the arguments
of which define the return values.
Returning to our example, we'll name our function \Rcmd{iqr}.
The \Rcmd{iqr} function should operate on numeric vectors,
therefore it should have an argument \Robject{x}. This
numeric vector will be passed on to the \Rcmd{quantile}
function for computing the sample quartiles. The required difference
between the $3^\text{rd}$ and $1^\text{st}$ quartile
can then be computed using \Rcmd{diff}. The definition
of our function reads as follows
<>=
iqr <- function(x) {
q <- quantile(x, prob = c(0.25, 0.75), names = FALSE)
return(diff(q))
}
@
A simple test on simulated data from a standard normal distribution
shows that our first
function actually works, a comparison with the
\Rcmd{IQR} function shows that the result is correct:
<>=
xdata <- rnorm(100)
iqr(xdata)
IQR(xdata)
@
However, when the numeric vector contains missing values,
our function fails as the following example shows:
<>=
xdata[1] <- NA
iqr(xdata)
@
<>=
xdata[1] <- NA
cat(try(iqr(xdata)))
@
In order to make our little function more flexible it would
be helpful to add all arguments of \Rcmd{quantile} to the
argument list of \Rcmd{iqr}. The copy-and-paste approach
that first comes to mind is likely to lead to inconsistencies
and errors, for example when the argument list of \Rcmd{quantile}
changes. Instead, the dot argument, a wildcard for any argument,
is more appropriate and we redefine our function accordingly:
<>=
iqr <- function(x, ...) {
q <- quantile(x, prob = c(0.25, 0.75), names = FALSE,
...)
return(diff(q))
}
iqr(xdata, na.rm = TRUE)
IQR(xdata, na.rm = TRUE)
@
Now, we can assess the variability of the profits using
our new \Rcmd{iqr} tool:
<>=
iqr(Forbes2000$profits, na.rm = TRUE)
@
Since there is no difference between functions that have been
written by one of the \R{} developers and user-created functions,
we can compute the inter-quartile range of profits for
each of the business categories by using our \Rcmd{iqr} function
inside a \Rcmd{tapply} statement;
<>=
iqr_profits <- tapply(Forbes2000$profits,
Forbes2000$category, iqr, na.rm = TRUE)
@
and extract the categories with the smallest and greatest variability
<>=
levels(Forbes2000$category)[which.min(iqr_profits)]
levels(Forbes2000$category)[which.max(iqr_profits)]
@
We observe less variable profits in tourism enterprises compared
with profits in the pharmaceutical industry.
As other members of the \Rcmd{apply} family, \Rcmd{tapply} is
very helpful when the same task is to be done more than
one time. Moreover, its use is more convenient compared to
the usage of \Rcmd{for} loops. For the sake of completeness,
we will compute the category-wise inter-quartile range
of the profits using a \Rcmd{for} loop.
\index{Functions|)}
\index{Loops|(}
Like a \Rclass{function}, a \Rcmd{for} loop consists of a body, i.e.,
a chain of \R{} commands to be executed. In addition, we need a
set of values and a variable that iterates over this set.
Here, the set we are interested in is the business categories:
<>=
bcat <- Forbes2000$category
iqr_profits2 <- numeric(nlevels(bcat))
names(iqr_profits2) <- levels(bcat)
for (cat in levels(bcat)) {
catprofit <- subset(Forbes2000, category == cat)$profit
this_iqr <- iqr(catprofit, na.rm = TRUE)
iqr_profits2[levels(bcat) == cat] <- this_iqr
}
@
Compared to the usage of \Rcmd{tapply}, the above code
is rather complicated. At first, we have to set up a vector
for storing the results and assign the appropriate names
to it. Next, inside the body of the \Rcmd{for} loop, the
\Rcmd{iqr} function has to be called on the appropriate subset
of all companies of the current business category \Robject{cat}.
The corresponding inter-quartile range must then be assigned to
the correct vector element in the result vector. Luckily,
such complicated constructs will be used in only one of the
remaining chapters of the book and are almost always avoidable
in practical data analyses.
\index{Loops|)}
\subsection{Simple Graphics}
The degree of skewness of a distribution can be investigated by constructing
histograms using the
\Rcmd{hist} function. (More sophisticated alternatives
such as smooth density estimates will be considered in
\Sexpr{ch("DE")}.)
\begin{figure}
\begin{center}
<>=
layout(matrix(1:2, nrow = 2))
hist(Forbes2000$marketvalue)
hist(log(Forbes2000$marketvalue))
@
\caption{Histograms of the market value and the
logarithm of the market value for the companies contained in the Forbes 2000
list. \label{AItR:densplot}}
\end{center}
\end{figure}
For example, the code for producing
Figure~\ref{AItR:densplot} first divides the plot
region into two equally spaced rows (the \Rcmd{layout} function) and then
plots the histograms of the raw market
values in the upper part using the \Rcmd{hist} function.
The lower part of the figure depicts the histogram
for the log-transformed market values which appear to be more symmetric.
Bivariate relationships of two continuous variables are usually
depicted as scatterplots. In \R, regression relationships are specified by
so-called \stress{model formulae} which, in a simple bivariate case, may look
like
<>=
fm <- marketvalue ~ sales
class(fm)
@
with the dependent variable on the left-hand side and the independent
variable on the right-hand side. The tilde separates left- and right-hand
sides.
Such a model formula can be passed to a model function (for example to
the linear model function as explained in
\Sexpr{ch("MLR")}).
The \Rcmd{plot} generic function implements a \Rclass{formula} method as well. Because
the distributions of both market value and sales are skewed we choose to
depict their logarithms. A raw scatterplot of $2000$ data points
(Figure~\ref{AItR:scatter-raw}) is
rather uninformative due to areas with very high density. This problem can
be avoided by choosing a transparent color for the dots
as shown in Figure~\ref{AItR:scatter}.
\begin{figure}
\begin{center}
<>=
plot(log(marketvalue) ~ log(sales), data = Forbes2000,
pch = ".")
@
\caption{Raw scatterplot of the logarithms of market value and sales.
\label{AItR:scatter-raw}}
\end{center}
\end{figure}
\begin{figure}
\begin{center}
<>=
plot(log(marketvalue) ~ log(sales), data = Forbes2000,
col = rgb(0,0,0,0.1), pch = 16)
@
\caption{Scatterplot with transparent shading of points
of the logarithms of market value and sales.
\label{AItR:scatter}}
\end{center}
\end{figure}
If the independent variable is a factor, a boxplot representation is a
natural choice. For four selected countries, the distributions of the
logarithms of the market value may be visually compared in
Figure~\ref{AItR:box}. Prior to calling the \Rcmd{plot} function
on our data, we have to remove empty levels from the \Robject{country}
variable, because otherwise the $x$-axis would show all and not only
the selected countries. This task is most easily performed
by subsetting the corresponding factor with additional
argument \Rcmd{drop = TRUE}.
\index{Boxplot}
\begin{figure}
\begin{center}
<>=
tmp <- subset(Forbes2000,
country %in% c("United Kingdom", "Germany",
"India", "Turkey"))
tmp$country <- tmp$country[,drop = TRUE]
plot(log(marketvalue) ~ country, data = tmp,
ylab = "log(marketvalue)", varwidth = TRUE)
@
\caption{Boxplots of the logarithms of the market value for four selected
countries, the width of the boxes is proportional to the
square roots of the number of companies. \label{AItR:box}}
\end{center}
\end{figure}
Here, the width of the boxes are proportional to the square root of the
number of companies for each country and extremely large or small market values
are depicted by single points. More elaborate graphical methods
will be discussed in \Sexpr{ch("DAGD")}.
\index{Forbes 2000 ranking|)}
\section{Organizing an Analysis}
<>=
file.create("analysis.R")
@
Although it is possible to perform an analysis typing all commands directly
on the \R{} prompt it is much more comfortable to maintain a separate text
file collecting all steps necessary to perform a certain data analysis task.
Such an \R{} transcript file,
for example called \file{analysis.R} created with your favorite text editor, can be
sourced into \R{} using the \Rcmd{source} command
<>=
source("analysis.R", echo = TRUE)
@
When all steps of a data analysis, i.e.,
data preprocessing, transformations, simple summary statistics and plots,
model building and inference as well as reporting, are collected in such
an \R{} transcript file, the analysis can be reproduced at any time, maybe with
corrected or updated data as it frequently happens in our consulting practice.
<>=
file.remove("analysis.R")
@
\section{Summary of Findings}
Data manipulation precedes every statistical analysis and is often more
complex than the final model fitting and display. The \R{} language in
itself is very powerful and allows efficient data manipulation. For really
large data sets that do not fit into the random access memory of the
computer, we have to store the data elsewhere, for example in database
systems or flat files. Packages for accessing the data from these sources
are described in the `Large memory and out-of-memory data' section of the
`High-performance and parallel computing' task view.
\section{Final Comments}
Reading data into \R{} is possible in many different ways, including direct
connections to data base engines. Tabular data are handled by
\Rclass{data.frame}s in \R{}, and the usual data manipulation techniques
such as sorting, ordering or subsetting can be performed by simple \R{} statements.
An overview on data stored in a \Rclass{data.frame} is given mainly by two
functions: \Rcmd{summary} and \Rcmd{str}.
Simple graphics such as histograms and scatterplots
can be constructed by applying the appropriate \R{} functions
(\Rcmd{hist} and \Rcmd{plot}) and we shall give many more examples
of these functions
and those that produce more interesting graphics in later chapters.
\section*{Exercises}
\begin{description}
\exercise
Calculate the median profit for the companies in the US and the
median profit for the companies in the UK, France, and Germany.
\exercise
Find all German companies with negative profit.
\exercise
To which business category do most of the Bermuda island companies belong?
\exercise
For the $50$ companies in the Forbes data set with the highest profits,
plot sales against assets (or some suitable transformation of each
variable), labeling each point with the appropriate country name which may
need to be abbreviated (using \Rcmd{abbreviate}) to avoid making the
plot look too `messy'. %%'
\exercise
Find the average value of sales for the companies in each country in
the Forbes data set, and find the number of companies in each country with
profits above $5$ billion US dollars.
\exercise
List all the products made by companies in the UK.
\exercise
Plot sales against market value for companies in the UK and in
Germany using different plotting symbols for the two countries.
\exercise
For the ten companies in the UK with the greatest profits
construct a bar chart of profits labeled with the companies' name.
\exercise
How many of the $20$ companies with the greatest market value are from the
US and how many are from the UK?
\exercise
Construct a histogram of profits for all companies in Germany with assets
above three billion dollars; how many such companies are there? And which
product does the company with the greatest profit make?
\end{description}
\bibliographystyle{LaTeXBibTeX/refstyle}
\bibliography{LaTeXBibTeX/HSAUR}
\end{document}