Since mlrCPO is a package with some depth to it, it
comes with a few vignettes that each explain different aspects of its
operation. These are the current document (“First Steps”), offering a
short introduction and information on where to get started, “mlrCPO Core”, describing all the
functions and tools offered by mlrCPO that are independent
from specific CPOs, “CPOs Built
Into mlrCPO”, listing all CPOs included in the
mlrCPO package, and “Building Custom CPOs”, describing the
process of creating new CPOs that offer new
functionality.
All vignettes also have a “compact version” with the R output suppressed for readability. They are linked in the navigation section at the top.
All vignettes assume that mlrCPO (and therefore its
requirement mlr) is installed successfully and loaded using
library("mlrCPO"). Help with installation is provided on
the project’s GitHub
page.
“Composable Preprocessing Operators”, “CPO”, are an extension for the
mlr (“Machine Learning in
R”) project which present preprocessing operations in the form of R
objects. These CPO objects can be composed to form complex operations,
they can be applied to data sets, and can be attached to mlr
Learner objects to generate machine learning pipelines that
combine preprocessing and model fitting.
“Preprocessing”, as understood by mlrCPO, is any
manipulation of data used in a machine learning process to get it from
its form as found in the wild into a form more fitting for the machine
learning algorithm (“Learner”) used for model fitting. It
is important that the exact method of preprocessing is kept track of, to
be able to perform this method when the resulting model is used to make
predictions on new data. It is also important, when evaluating
preprocessing methods e.g. using resampling, that the
parameters of these methods are independent of the validation dataset
and only depend on the training data set.
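The two requirements above can be illustrated without any mlrCPO machinery. The following is a minimal base-R sketch, assuming a simple centering-and-scaling step; the variable names (train.rows, train.part, valid.part) are illustrative and not part of mlrCPO.

```r
# split the numeric iris columns into a training and a validation part
set.seed(1)
train.rows = sample(nrow(iris), 100)
train.part = iris[train.rows, 1:4]
valid.part = iris[-train.rows, 1:4]

# "train" the preprocessing: parameters are computed on training data only
center = colMeans(train.part)
spread = apply(train.part, 2, sd)

# re-apply the stored parameters to both parts; the validation data
# never influences the centering and scaling values
train.scaled = scale(train.part, center = center, scale = spread)
valid.scaled = scale(valid.part, center = center, scale = spread)
```

Keeping `center` and `spread` around is the manual equivalent of what a trained preprocessing object has to remember for prediction.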
mlrCPO tries to support the user in all these aspects of
preprocessing:
- Built-in CPOs that can perform many different operations. Operations that go beyond the provided toolset can be implemented in custom CPOs.
- “CPOTrained” objects that represent the preprocessing done on training data that should, in that way, be re-applied to new prediction data.
- CPOs attached to mlr “Learner” objects that represent the entire machine learning pipeline to be tuned and evaluated.
At the centre of mlrCPO are “CPO” objects.
To get a CPO object, it is necessary to call a CPO
Constructor. A CPO Constructor sets up the parameters of a
CPO and provides further options for its behaviour.
Internally, CPO Constructors are functions that have a common
interface and a friendly printer method.
cpoScale  # a cpo constructor
cpoAddCols
cpoScale(center = FALSE)  # create a CPO object that scales, but does not center, data
cpoAddCols(Sepal.Area = Sepal.Length * Sepal.Width)  # this would add a column
CPOs exist first to be applied to data. Every
CPO represents a certain data transformation, and this
transformation is performed when the CPO is applied. This
can be done using the applyCPO function,
or the %>>% operator.
CPOs can be applied to data.frame objects, and
to mlr “Task” objects.
iris.demo = iris[c(1, 2, 3, 51, 52, 102, 103), ]
tail(iris.demo %>>% cpoQuantileBinNumerics())  # bin the data in below & above median
A useful feature of CPOs is that they can be
concatenated to form new operations. Two CPOs can be
combined using the composeCPO function or,
as before, the %>>% operator. When
two CPOs are combined, the product is a new
CPO that can itself be composed or applied. The result of a
composition represents the operation of first applying the first
CPO and then the second CPO. Therefore,
data %>>% (cpo1 %>>% cpo2) is the same as
(data %>>% cpo1) %>>% cpo2.
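As a hedged sketch of the equivalences just described, the function forms and the operator form can be compared directly. The argument orders composeCPO(cpo1, cpo2) and applyCPO(cpo, data) are assumed from the package documentation; the names demo.data, cpo.a and cpo.b are illustrative.

```r
library("mlrCPO")

demo.data = iris[c(1, 2, 3, 51, 52, 102, 103), ]
cpo.a = cpoScale()
cpo.b = cpoPca()

# composition: function form and operator form
composed.fun = composeCPO(cpo.a, cpo.b)
composed.op = cpo.a %>>% cpo.b

# application: function form and operator form
res.fun = applyCPO(composed.fun, demo.data)
res.op = demo.data %>>% composed.op

# applying the two CPOs one after the other gives the same data
res.steps = (demo.data %>>% cpo.a) %>>% cpo.b

all.equal(res.fun, res.op, check.attributes = FALSE)
all.equal(res.op, res.steps, check.attributes = FALSE)
```

The comparison ignores attributes because the transformed data carries its retrafo information as an attribute.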
# first create three quantile bins, then as.numeric() all columns to
# get 1, 2 or 3 as the bin number
quantilenum = cpoQuantileBinNumerics(numsplits = 3) %>>% cpoAsNumeric()
iris.demo %>>% quantilenum
The last example shows that it is sometimes not a good idea to have a
CPO affect the whole dataset. Therefore, when a
CPO is created, it is possible to choose what columns the
CPO should affect. The CPO Constructor has a variety of
parameters, starting with affect., that can be used to
choose what columns the CPO operates on. To prevent
cpoAsNumeric from influencing the Species
column, we can thus do
quantilenum.restricted = cpoQuantileBinNumerics(numsplits = 3) %>>%
  cpoAsNumeric(affect.names = "Species", affect.invert = TRUE)
iris.demo %>>% quantilenum.restricted
A more convenient method in this case, however, is to use an
mlr “Task”, which keeps track of the target
column. “Feature Operation” CPOs (as all the ones shown) do
not influence the target column.
demo.task = makeClassifTask(data = iris.demo, target = "Species")
result = demo.task %>>% quantilenum
getTaskData(result)
When performing preprocessing, it is sometimes necessary to change a
small aspect of a long preprocessing pipeline. Instead of having to
re-construct the whole pipeline, mlrCPO offers the
possibility to change hyperparameters of a CPO.
This makes it easy, for example, to tune preprocessing in combination
with a machine learning algorithm.
Hyperparameters of CPOs can be manipulated in the same
way as they are manipulated for Learners in
mlr, using getParamSet (to
list the parameters), getHyperPars (to
list the parameter values), and
setHyperPars (to change these values). To
get the parameter set of a CPO, it is also possible to use
verbose printing using the ! (exclamation
mark) operator.
cpo = cpoScale()
cpo
getHyperPars(cpo)  # list of parameter names and values
getParamSet(cpo)  # more detailed view of parameters and their type / range
!cpo  # equivalent to print(cpo, verbose = TRUE)
CPOs use copy semantics, therefore
setHyperPars creates a copy of a CPO that has
the changed hyperparameters.
cpo2 = setHyperPars(cpo, scale.scale = FALSE)
cpo2
iris.demo %>>% cpo  # scales and centers
iris.demo %>>% cpo2  # only centers
When chaining many CPOs, it is possible for the many
hyperparameters to lead to very cluttered ParamSets, or
even for hyperparameter names to clash. mlrCPO has two
remedies for that.
First, any CPO also has an
id that is always prepended to the
hyperparameter names. It can be set during construction, using the
id parameter, or changed later using setCPOId.
The latter one only works on primitive, i.e. not compound,
CPOs. Set the id to NULL to use
the CPO’s hyperparameters without a prefix.
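A small sketch of how the id shows up in hyperparameter names, assuming that the default id of cpoScale is "scale" (as the scale.scale parameter used earlier suggests). The ids "first" and "second" are arbitrary illustrative values.

```r
library("mlrCPO")

# the default id is prepended to the parameter names
names(getHyperPars(cpoScale()))

# id chosen during construction
names(getHyperPars(cpoScale(id = "first")))

# id changed afterwards; this only works on primitive CPOs
renamed = setCPOId(cpoScale(), "second")
names(getHyperPars(renamed))

# no id: parameter names are used without a prefix
names(getHyperPars(cpoScale(id = NULL)))
```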
cpo = cpoScale(id = "a") %>>% cpoScale(id = "b")  # not very useful example
getHyperPars(cpo)
The second remedy against hyperparameter clashes is different
“exports” of hyperparameters: The hyperparameters that can be changed
using setHyperPars, i.e. that are exported by a
CPO, are a subset of the parameters of the
CPOConstructor. For each kind of CPO, there is
a standard set of parameters that are exported, but during construction,
it is possible to influence the parameters that actually get exported
via the export parameter. export can be one of
a set of standard export settings (among them “export.all”
and “export.none”) or a character vector of
the parameters to export.
cpo = cpoPca(export = c("center", "rank"))
getParamSet(cpo)
Manipulating data for preprocessing itself is relatively easy. A
challenge comes when one wants to integrate preprocessing into a
machine-learning pipeline: The same preprocessing steps that are
performed on the training data need to be performed on the new
prediction data. However, the transformation performed for prediction
often needs information from the training step. For example, if training
entails performing PCA, then for prediction, the data must not undergo
another PCA; instead, it needs to be rotated by the rotation
matrix found by the training PCA. The process of obtaining the
rotation matrix will be called “training” the CPO, and the
object that contains the trained information is called
CPOTrained. For preprocessing operations that operate only
on features of a task (as opposed to the target column), the
CPOTrained will always be applied to new incoming data, and
hence be of class CPORetrafo and called a
“retrafo” object. To obtain this retrafo object, one
can use retrafo(). Retrafo objects can be
applied to data just as CPOs can, by using the
%>>% operator.
transformed = iris.demo %>>% cpoPca(rank = 3)
transformed
ret = retrafo(transformed)
ret
To show that ret actually represents the exact same
preprocessing operation, we can feed the first line of
iris.demo back to it, to verify that the transformation is
the same.
iris.demo[1, ] %>>% ret
We obviously would not have gotten there by feeding the first line to
cpoPca directly:
iris.demo[1, ] %>>% cpoPca(rank = 3)
CPOTrained objects associated with an object are
automatically chained when another CPO is applied. To
prevent this from happening, it is necessary to “clear” the retrafos and
inverters associated with the object using
clearRI().
t2 = transformed %>>% cpoScale()
retrafo(t2)
t3 = clearRI(transformed) %>>% cpoScale()
retrafo(t3)
Note that clearRI has no influence on the
CPO operations themselves, and the resulting data is the
same:
all.equal(t2, t3, check.attributes = FALSE)
It is also possible to chain CPOTrained objects using
composeCPO() or %>>%. This can be useful
if the trafo chain loses access to the retrafo attribute
for some reason. In general, it is only recommended to compose
CPOTrained objects that were created in the same process
and in correct order, since they are usually closely associated with the
training data in a particular place within the preprocessing chain.
retrafo(transformed) %>>% retrafo(t3)  # is the same as retrafo(t2) above.
So far only CPOs were introduced that change the feature
columns of a Task (“Feature Operation
CPOs”, FOCPOs). There is another class of
CPOs, “Target Operation CPOs” or
TOCPOs, that can change a Task’s target
columns.
This comes at the cost of some complexity when performing prediction:
Since the training data that was ultimately fed into a
Learner had a transformed target column, the predictions
made by the resulting model will not be directly comparable to the
original target values. Consider cpoLogTrafoRegr, a
CPO that log-transforms the target variable of a regression
Task. The predictions made with a Learner on a
log-transformed target variable will be in log-space and need to be
exponentiated (or otherwise re-transformed). This inversion operation is
represented by an “inverter” object that is attached to
a transformation result similarly to a retrafo object, and can be
obtained using the inverter() function. It
is of class CPOInverter, a subclass of
CPOTrained.
iris.regr = makeRegrTask(data = iris.demo, target = "Petal.Width")
iris.logd = iris.regr %>>% cpoLogTrafoRegr()
getTaskData(iris.logd)  # log-transformed target 'Petal.Width'
inv = inverter(iris.logd)  # inverter object
inv
The inverter object is used by the invert() function
that inverts the prediction made by a model trained on the transformed
task, and re-transforms this prediction to fit the space of the original
target data. The inverter object caches the “truth” of the data being
inverted (iris.logd, in the example), so
invert can give information on the truth of the inverted
data.
logmodel = train("regr.lm", iris.logd)
pred = predict(logmodel, iris.logd)  # prediction on the task itself
pred
invert(inv, pred)
This procedure can also be done with new incoming data. In general,
more than just the cpoLogTrafoRegr operation could be done
on the iris.regr task in the example, so to perform the
complete preprocessing and inversion, one needs to use the
retrafo object as well. When applying the retrafo object, a new inverter
object is generated, which is specific to the exact new data that was
being retransformed:
newdata = makeRegrTask("newiris", iris[7:9, ], target = "Petal.Width",
  fixup.data = "no", check.data = FALSE)
# the retrafo does the same transformation(s) on newdata that were
# done on the training data of the model, iris.logd. In general, this
# could be more than just the target log transformation.
newdata.transformed = newdata %>>% retrafo(iris.logd)
getTaskData(newdata.transformed)
pred = predict(logmodel, newdata.transformed)
pred
# the inverter of the newly transformed data contains information specific
# to the newly transformed data. In the current case, that is just the
# new "truth" column for the new data.
inv.newdata = inverter(newdata.transformed)
invert(inv.newdata, pred)
The cpoLogTrafoRegr is a special case of TOCPO in that
its inversion operation is constant: It does not depend on the
new incoming data, so in theory it is not necessary to get a new
inverter object for every piece of data that is being transformed.
Therefore, it is possible to use the retrafo object for
inversion in this case. However, the “truth” column will not be
available in this case:
invert(retrafo(iris.logd), pred)
Whether a retrafo object is capable of performing inversion can be
checked with the getCPOTrainedCapability()
function. It returns a vector with named elements "retrafo"
and "invert", indicating whether a CPOTrained
is capable of performing retrafo or inversion. A 1
indicates that the object can perform the action and has an effect, a
0 indicates that the action would have no effect (but also
throws no error), and a -1 means that the object is not
capable of performing the action.
getCPOTrainedCapability(retrafo(iris.logd))  # can do both retrafo and inversion
getCPOTrainedCapability(inv)  # a pure inverter, can not be used for retrafo
As an example of a CPO that does not have a constant
inverter, consider cpoRegrResiduals, which fits a regression
model on training data and returns the residuals of this fit. When
performing prediction, the invert action is to add
predictions by the CPO’s model to the incoming predictions
made by a model trained on the residuals.
set.seed(123)  # for reproducibility
iris.resid = iris.regr %>>% cpoRegrResiduals("regr.lm")
getTaskData(iris.resid)
model.resid = train("regr.randomForest", iris.resid)
newdata.resid = newdata %>>% retrafo(iris.resid)
getTaskData(newdata.resid)  # Petal.Width are now the residuals of lm model predictions
pred = predict(model.resid, newdata.resid)
pred
# transforming this prediction back to compare
# it to the original 'Petal.Width'
inv.newdata = inverter(newdata.resid)
invert(inv.newdata, pred)
Besides FOCPOs and TOCPOs, there are also
“Retrafoless” CPOs (ROCPOs). These only
perform operations in the training part of a machine learning pipeline,
but in turn are the only CPOs that may change the number of
rows in a dataset. The goal of ROCPOs is to change the number of data
samples, but not to transform the data or target values themselves.
Examples of ROCPOs are cpoUndersample,
cpoSmote, and cpoSample.
sampled = iris %>>% cpoSample(size = 3)
sampled
There is no retrafo or inverter associated with the result. Instead, both of them are NULLCPO:
retrafo(sampled)
inverter(sampled)
Until now, the CPOs have been invoked explicitly to
manipulate data and get retrafo and inverter objects. It is good to be
aware of the data flows in a machine learning process involving
preprocessing, but mlrCPO makes it very easy to automate
this. It is possible to attach a CPO to a
Learner using attachCPO or
the %>>%-operator. When a CPO is
attached to a Learner, a CPOLearner is
created. The CPOLearner performs the preprocessing
operation dictated by the CPO before training the
underlying model, and stores and uses the retrafo and inverter objects
necessary during prediction. It is possible to attach compound
CPOs, and it is possible to attach further
CPOs to a CPOLearner to extend the
preprocessing pipeline. Exported hyperparameters of a CPO
are also present in a CPOLearner and can be changed using
setHyperPars, as usual with other Learner
objects.
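The function form of attaching and the hyperparameter handling can be sketched as follows. This is a minimal example under assumptions: the argument order attachCPO(cpo, learner) is taken from the package documentation, "classif.rpart" and mlr's built-in iris.task are chosen only for illustration, and the names scale.lrn, scale.lrn2 and scale.model are hypothetical.

```r
library("mlrCPO")

# attach a CPO to a Learner; attachCPO() is the function form of %>>%
scale.lrn = attachCPO(cpoScale(), makeLearner("classif.rpart"))

# the CPO's exported hyperparameters appear next to the Learner's own
names(getParamSet(scale.lrn)$pars)

# change a preprocessing hyperparameter on the CPOLearner (returns a copy)
scale.lrn2 = setHyperPars(scale.lrn, scale.center = FALSE)
getHyperPars(scale.lrn2)$scale.center

# train and predict as with any other Learner; the preprocessing
# is applied automatically in both steps
scale.model = train(scale.lrn2, iris.task)
predict(scale.model, iris.task)
```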
Recreating the pipeline from General
Inverters with a CPOLearner looks like the following.
Note the prediction pred made in the end is identical with
the one made above.
set.seed(123)  # for reproducibility
lrn = cpoRegrResiduals("regr.lm") %>>% makeLearner("regr.randomForest")
lrn
model = train(lrn, iris.regr)
pred = predict(model, newdata)
pred
It is possible to get the retrafo object from a model trained with a
CPOLearner using the retrafo() function. In
this example, it is identical with the retrafo(iris.resid)
gotten in the example in General
Inverters.
retrafo(model)
Since the hyperparameters of a CPO are present in a
CPOLearner, it is possible to tune hyperparameters of
preprocessing operations. It can be done using mlr’s
tuneParams() function and works
identically to tuning common Learner-parameters.
icalrn = cpoIca() %>>% makeLearner("classif.logreg")
getParamSet(icalrn)
ps = makeParamSet(
    makeIntegerParam("ica.n.comp", lower = 1, upper = 8),
    makeDiscreteParam("ica.alg.typ", values = c("parallel", "deflation")))
# shorter version using pSS:
# ps = pSS(ica.n.comp: integer[1, 8], ica.alg.typ: discrete[parallel, deflation])
tuneParams(icalrn, pid.task, cv5, par.set = ps,
  control = makeTuneControlGrid(),
  show.info = FALSE)
Besides the %>>% operator, there are a few related
operators which are short forms of operations that otherwise take more
typing.
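The following is a brief sketch of these shorthand operators in use, relying on the semantics stated in this section. The names demo.data, forward, backward, pipeline, ret.short and ret.long are illustrative.

```r
library("mlrCPO")

demo.data = iris[c(1, 2, 3, 51, 52, 102, 103), ]

# %<<% is %>>% with the operands swapped
forward = cpoScale() %>>% cpoPca()
backward = cpoPca() %<<% cpoScale()
all.equal(demo.data %>>% forward, demo.data %>>% backward,
  check.attributes = FALSE)

# %<>>% composes and assigns in one step
pipeline = cpoScale()
pipeline %<>>% cpoPca()  # same as pipeline = pipeline %>>% cpoPca()
length(as.list(pipeline))  # number of primitive CPOs in the pipeline

# %>|% applies the CPOs on its right and returns the retrafo
ret.short = demo.data %>|% cpoScale() %>>% cpoPca()
ret.long = retrafo(demo.data %>>% cpoScale() %>>% cpoPca())
all.equal(demo.data[1, ] %>>% ret.short, demo.data[1, ] %>>% ret.long,
  check.attributes = FALSE)
```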
- %<<% is similar to %>>% but works in the other direction. a %>>% b is the same as b %<<% a.
- %<>>% and %<<<% are the %>>% or %<<% operators, combined with assignment. a %<>>% b is the same as a = a %>>% b. These operators perform the operations on their right before they do the assignment, so it is not necessary to use parentheses when writing a = a %>>% b %>>% c as a %<>>% b %>>% c.
- %>|% and %|<% feed data into a CPO and get the retrafo(). data %>|% a is the same as retrafo(data %>>% a). The %>|% operator performs the operation on its right before getting the retrafo, so it is not necessary to use parentheses when writing retrafo(data %>>% a %>>% b) as data %>|% a %>>% b.
As described before, it is possible to compose
CPOs to create relatively complex preprocessing pipelines.
It is therefore necessary to have tools to inspect a CPO
pipeline or related objects.
The first line of attack when inspecting a CPO is always
the print function. print(x, verbose = TRUE)
will often print more information about a CPO than the
ordinary print function. A shorthand alias for this is the exclamation
point “!”. When verbosely printing a
CPOConstructor, the transformation functions are shown.
When verbosely printing a CPO, the constituent elements are
separately printed, each showing their parameter sets.
cpoAsNumeric  # plain print
!cpoAsNumeric  # verbose print
cpoScale() %>>% cpoIca()  # plain print
!cpoScale() %>>% cpoIca()  # verbose print
When working with compound CPOs, it is sometimes
necessary to manipulate a CPO inside a compound
CPO pipeline. For this purpose, the
as.list() generic is implemented for both
CPO and CPOTrained for splitting a pipeline
into a list of the primitive elements. The inverse is
pipeCPO(), which takes a list of
CPO or CPOTrained and concatenates them using
composeCPO().
as.list(cpoScale() %>>% cpoIca())
pipeCPO(list(cpoScale(), cpoIca()))
CPOTrained objects contain information about the retrafo
or inversion to be performed for a CPO. It is possible to
access this information using
getCPOTrainedState(). The “state” of a
CPOTrained object often contains a $data slot
with information about the expected input and output format
(“ShapeInfo”) of incoming data, a slot for each of its
hyperparameters, and a $control slot that is specific to
the CPO in question. The cpoPca state, for
example, contains the PCA rotation matrix and a vector for scaling and
centering. The contents of a state’s $control object are
described in a CPO’s help page.
repca = retrafo(iris.demo %>>% cpoPca())
state = getCPOTrainedState(repca)
state
It is even possible to change the “state” of a
CPOTrained and construct a new CPOTrained
using makeCPOTrainedFromState(). This is
fairly advanced usage and only recommended for users familiar with the
inner workings of the particular CPO. If we get familiar
with the cpoPca CPO using the
!-print (i.e. !cpoPca) to look at the retrafo
function, we notice that the control$center and
control$scale values are given to a call of
scale(). If we want to create a new CPOTrained
that does not perform centering or scaling before
applying the rotation matrix, we can change these values.
state$control$center = FALSE
state$control$scale = FALSE
nosc.repca = makeCPOTrainedFromState(cpoPca, state)
Comparing this to the original “repca” retrafo shows
that the result of applying repca has generally smaller
values because of the centering.
iris.demo %>>% repca
iris.demo %>>% nosc.repca
There is a large and growing variety of CPOs that
perform many different operations. It is advisable to browse through CPOs Built Into mlrCPO for an overview. To
get a list of all built-in CPOs, use
listCPO(). A few important or “meta”
CPOs that can be used to influence the behaviour of other
CPOs are described here.
The value associated with “no operation” is the NULLCPO
value. It is the neutral element of the %>>%
operations, and the value of retrafo() and
inverter() when there are otherwise no associated retrafo
or inverter values.
NULLCPO
all.equal(iris %>>% NULLCPO, iris)
cpoPca() %>>% NULLCPO
The multiplexer makes it possible to combine many CPOs into one, with
an extra selected.cpo parameter that chooses between them.
This makes it possible to tune over many different tuner configurations
at once.
cpm = cpoMultiplex(list(cpoIca, cpoPca(export = "export.all")))
!cpm
iris.demo %>>% setHyperPars(cpm, selected.cpo = "ica", ica.n.comp = 3)
iris.demo %>>% setHyperPars(cpm, selected.cpo = "pca", pca.rank = 3)
A simple CPO with one parameter which gets applied to the data as CPO. This is different from a multiplexer in that its parameter is free and can take any value that behaves like a CPO. On the downside, this does not expose the argument’s parameters to the outside.
cpa = cpoWrap()
!cpa
iris.demo %>>% setHyperPars(cpa, wrap.cpo = cpoScale())
iris.demo %>>% setHyperPars(cpa, wrap.cpo = cpoPca())
Attaching the cpo applicator to a learner gives this learner a “cpo” hyperparameter that can be set to any CPO.
getParamSet(cpoWrap() %>>% makeLearner("classif.logreg"))
cbind other CPOs as operation. The cbinder
makes it possible to build DAGs of CPOs that perform different
operations on data and paste the results next to each other. It is often
useful to combine cpoCbind with cpoSelect to
filter out columns that would otherwise be duplicated.
scale = cpoSelect(pattern = "Sepal", id = "first") %>>% cpoScale(id = "scale")
scale.pca = scale %>>% cpoPca()
cbinder = cpoCbind(scale, scale.pca, cpoSelect(pattern = "Petal", id = "second"))
cpoCbind recognises that "scale" happens
before "pca", but is also fed to the result directly. The
verbose print draws a (crude) ascii-art graph.
!cbinder
iris.demo %>>% cbinder
Even though CPOs are very flexible and can be combined
in many ways, it may be necessary to create completely custom
CPOs. Custom CPOs can be created using the
makeCPO() and related functions. “Building Custom CPOs” is a wide topic
which has its own vignette.
- CPOs are built using CPOConstructors by calling them like functions.
- CPOConstructors can be found by using listCPO() or consulting the relevant vignette.
- Verbose printing of CPOs and many related objects is available using the ! (exclamation mark) operator.
- CPOs export hyperparameters that are accessible using getParamSet() and getHyperPars(), and mutable using setHyperPars(). Which parameters are exported can be controlled using the export parameter during construction.
- CPOs can be composed (composeCPO()), applied to data (applyCPO()) and attached to Learners (attachCPO()) using special functions for each of these operations, or using the general %>>% operator.
- There are three kinds of CPO: FOCPO (Feature Operation CPOs), TOCPO (Target Operation CPOs) and ROCPO (Retrafoless CPOs). The first may only change feature columns, the second only target columns. While the last one may change both feature and target values and even the number of rows of a dataset, it does so with the understanding that new “prediction” data will not be transformed by it and is thus mainly useful for subsampling.
- Data transformed by a CPO has a retrafo-CPOTrained object associated with it that can be retrieved using retrafo() and used to transform new prediction data in a similar way as the original training data.
- CPOTrained objects can themselves be composed using composeCPO or %>>%, although it is only recommended to compose CPOTrained objects in the same order as they were created, and only if they were created in the same preprocessing pipeline.
- CPOTrained objects can be inspected using getCPOTrainedState(), and re-built with changed state using makeCPOTrainedFromState().
- TOCPOs additionally create inverter objects, which can be retrieved using inverter(). An inverter is also created during application of a retrafo CPOTrained.
- Retrafo CPOTrained are created during training and used on every prediction data set; inverter CPOTrained are created anew during each CPO and retrafo-CPOTrained application and are closely associated with the data that they were created with.
- CPOTrained objects associated with data are stored in their “attributes” and are automatically chained when more CPOs are applied. clearRI() is used to remove the associated CPOTrained objects and prevent this chaining.
- CPOs can be attached to Learners to get CPOLearners which automatically transform training and prediction data and perform prediction inversion.
- CPOLearners have the Learner’s and the CPO’s hyperparameters and can thus be manipulated using setHyperPars(), and can be tuned using tuneParams().
- Special CPOs are NULLCPO (the neutral element of %>>%), cpoMultiplex, cpoWrap, and cpoCbind.
- Custom CPOs can be created using makeCPO and similar functions. These are described in their own vignette.