Rule Based Learning Ensembles

RuleFit with  R

 (7/07/12)

Installation

Overview

Model building :  rulefit

Model manipulation :  getmodel,  rfrestore,  rfannotate,  rfmodinfo,  runstats

Cross-validation :  rfxval

Prediction:  rfpred

Variable importance:  varimp

Interaction effects :  interact,  twovarint,  threevarint,  intnull, rfnullinfo

Display rules :  rules

Partial dependence plots :  singleplot,  pairplot

Other  :  rfversion


Installation Instructions:

Open R

At the R command prompt enter:

   > platform = "PLATFORM"
   > rfhome = "RFHOME"
   > source("RFHOME/rulefit.r")
   > install.packages("akima", lib=rfhome)
   > library(akima, lib.loc=rfhome)

Here "PLATFORM" is one of the text strings "windows", "linux", or "mac", depending on the running operating system, and RFHOME is a text string giving the full path name (using forward slashes / ) of the directory where rulefit.r and rf_go.exe are stored. This will be the RuleFit home directory. (Examples: rfhome = "/R_RuleFit"; rfhome = "/home/jhf/R_RuleFit")

Notes: the computer must be connected to the internet to execute the install.packages command. Only the last command is needed every time R is entered. The others need only be entered the first time provided that on exit from R the "yes" option is selected at the "Save workspace image" prompt.


Overview:

RuleFit implements the model building and interpretational tools described in  Predictive Learning via Rule Ensembles (FP 2004). Some familiarity with this paper is recommended. The documentation refers to sections in the paper describing the various options in detail.

The R/RuleFit interface consists of the R procedures described below. The principal procedure is rulefit. It builds the RuleFit model given the input data and various procedure parameters. This model is stored in the RuleFit home directory RFHOME and invisibly returned as a RuleFit model object (list) to R. All other RuleFit procedures reference the current model and its input data as stored in the RuleFit home directory. Every time the procedure rulefit is invoked the resulting RuleFit model overwrites the previously stored model and its input data, thereby replacing it as the current model.

Previously constructed (and saved) models and their input data can be replaced in the RuleFit home directory for analysis at a later time using the procedure rfrestore, thereby overwriting the current model and its data. This replaced model and data then become the current ones in the RuleFit directory and all RuleFit procedures (other than rulefit ) will reference them until either a different previously constructed model and its data are placed in the directory, or the rulefit procedure is subsequently invoked. At any time, the current model in the RuleFit home directory can be obtained (and saved) as a named R object using the procedure getmodel. The properties of any RuleFit model object can be viewed using the procedure rfmodinfo.

With any RuleFit procedure input predictor variables can be referenced either by their  respective column numbers in the input data matrix or data frame, or by their corresponding character column variable names, if present. Character variable names can be associated with columns using the colnames feature in R or by providing them as part of an input data frame. If variable names are specified then all output will reference those names. If not, the column numbers will be used to reference the input variables.
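The two ways of referencing a column can be illustrated in base R. This is only a sketch with made-up data: the helper resolve_col is hypothetical and not part of RuleFit; it simply shows how a column number or a column name maps to the same column once names are attached with colnames.

```r
# Small numeric matrix with three predictor columns (made-up data).
x <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow = 3)

# Attach character variable names to the columns.
colnames(x) <- c("AGE", "INCOME", "TENURE")

# Hypothetical helper (not a RuleFit procedure): map a label, given
# either as a column number or as a column name, to a column index.
resolve_col <- function(x, label) {
  if (is.character(label)) {
    idx <- match(label, colnames(x))
    if (is.na(idx)) stop("unknown column name: ", label)
    idx
  } else {
    as.integer(label)
  }
}

by_number <- resolve_col(x, 2)         # reference by column number
by_name   <- resolve_col(x, "INCOME")  # reference by column name
```

Both calls select the second column, so x[, by_number] and x[, by_name] hold the same values.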


rulefit: build a RuleFit Model

Usage:

rfmod = rulefit (x, y, wt=rep(1,nrow(x)), cat.vars=NULL, not.used=NULL, xmiss=9.0e30, rfmode="regress", sparse=1, test.reps=round(min(20,max(0.0,5200/neff-2))), test.fract=0.2, mod.sel=2,  model.type="both",  tree.size=4, max.rules=2000, max.trms=500, costs=c(1,1), trim.qntl=0.025, samp.fract=min(1,(11*sqrt(neff)+1)/neff), inter.supp=3.0, memory.par=0.01, conv.thr=1.0e-3, quiet=F, tree.store=10000000, cat.store=1000000)

Required arguments:

x = input predictor data matrix or data frame. Rows are observations and columns are variables. Must be a numeric matrix or a data frame.

y = input response values. For classification (rfmode="class", see below) values must only be +1 or -1.
If y is a single valued scalar it is interpreted as a label (number or name) referencing a column of x. Otherwise it is a vector of length nrow(x) containing the  numeric response values.

Optional arguments:

wt = observation weights.
If wt is a single valued scalar it is interpreted as a label (number or name) referencing a column of x. Otherwise it is a vector of length nrow(x) containing the numeric observation weights.

cat.vars = vector of column labels (numbers or names) indicating categorical variables (factors). All variables not so indicated are assumed to be orderable numeric. If x is a data frame and cat.vars is missing, then components of type factor are treated as categorical variables. Ordered factors should be input as type numeric with appropriate numerical scores. If cat.vars is present it will override the data frame typing.

not.used = vector of column labels (numbers or names) indicating predictor variables not to be used in the model.

xmiss = predictor variable missing value flag. Must be numeric and larger than any non missing predictor variable value.  Predictor variable values greater than or equal to xmiss are regarded as missing. Predictor variable data values of NA are internally set to the value of xmiss and thereby regarded as missing.
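The missing value convention can be sketched in base R as follows. The data are made up, and the code only mimics the rule stated above; it is not taken from the RuleFit source.

```r
xmiss <- 9.0e30                      # default missing value flag
x <- c(1.5, NA, 3.2, 9.0e30, 0.7)    # made-up predictor values

# NA values are set to xmiss ...
x_coded <- ifelse(is.na(x), xmiss, x)

# ... and any value greater than or equal to xmiss counts as missing.
is_missing <- x_coded >= xmiss

# xmiss must be larger than every non-missing value.
stopifnot(all(x_coded[!is_missing] < xmiss))
```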

rfmode = regression /classification flag
rfmode="regress" => regression. The outcome or response variable is regarded as numerically valued and the model is used to predict the value of the response.
rfmode="class" => binary classification. The model produces a numeric score that predicts the log-odds of realizing a response value of  +1.
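Since classification requires response values of +1 or -1, outcomes stored in another coding need to be converted first. A minimal base R sketch with made-up data, assuming that 1 (or the level "yes") denotes the positive class:

```r
# Made-up binary outcome stored as 0/1.
y01  <- c(0, 1, 1, 0, 1)
y_pm <- ifelse(y01 == 1, 1, -1)       # +1 for the positive class, -1 otherwise

# Made-up binary outcome stored as a two-level factor.
yf     <- factor(c("no", "yes", "yes", "no", "yes"))
y_pm_f <- ifelse(yf == "yes", 1, -1)

# Check that only +1 and -1 remain.
stopifnot(all(y_pm %in% c(-1, 1)), all(y_pm_f %in% c(-1, 1)))
```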

sparse = model sparsity control. Larger values tend to produce sparser models with fewer terms; smaller values tend to produce denser models with more terms.
0 < sparse < 1 => elastic net regression to fit final model with the corresponding parameter alpha set to the value of sparse
sparse = 1=> lasso regression to fit final model
sparse = 2 => lasso to select variable entry order for forward stepwise regression
sparse = 3 => regression: forward stepwise to select variables and fit model. Binary classification: forward stagewise to select variables and fit model.

test.reps = number of cross-validation replications used for model selection.
test.reps = 0 => 1 - fold cross-validation; final model based on learning sample
test.reps > 0 => test.reps - fold cross-validation; final model based on whole training sample.
Note that the default value refers to the "effective"  number of training observations  (neff) given by neff = sum(wt)^2/sum(wt^2) for regression. For classification this is multiplied by 4*fpos*(1-fpos) where fpos is the fraction of positive labeled (y = 1) observations. (Use larger values of test.reps for smaller training samples.)
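The effective sample size and the resulting default for test.reps can be worked through in base R. This is a sketch with made-up weights and labels; the function names are hypothetical, and fpos is taken here as the unweighted fraction of positives, which is an assumption.

```r
# Effective number of observations for regression.
neff_regress <- function(wt) sum(wt)^2 / sum(wt^2)

# For classification the regression value is multiplied by
# 4 * fpos * (1 - fpos); fpos is computed unweighted here (assumption).
neff_class <- function(wt, y) {
  fpos <- mean(y == 1)
  neff_regress(wt) * 4 * fpos * (1 - fpos)
}

# Default number of replications, as written in the usage line.
default_test_reps <- function(neff) round(min(20, max(0.0, 5200 / neff - 2)))

wt <- rep(1, 400)                          # equal weights give neff = 400
y  <- rep(c(1, -1), times = c(100, 300))   # 25% positive labels

n_reg    <- neff_regress(wt)               # 400
n_cls    <- neff_class(wt, y)              # 400 * 4 * 0.25 * 0.75 = 300
reps_reg <- default_test_reps(n_reg)       # round(13 - 2) = 11
reps_cls <- default_test_reps(n_cls)       # round(17.33 - 2) = 15
```

With equal weights neff equals the number of observations; unequal weights or unbalanced classes reduce it, which in turn raises the default number of replications up to the cap of 20.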

test.fract = learning / test sample partition
test.fract = fraction of input observations used in test sample.

mod.sel = model selection criterion
mod.sel = 1 => regression: average absolute error loss. binary classification: correlation criterion similar to 1 - AUC
mod.sel = 2 => regression: average squared-error loss. binary classification: average squared-error loss on predicted probabilities
mod.sel = 3 => binary classification: misclassification risk

model.type = rule generation flag for numeric variables
model.type = "linear" => only linear model for orderable numeric variables (no rules). Generate rules only for unorderable categorical variables (factors), if any. See cat.vars
model.type = "rules" => use only generated rules to fit model (no linear variables). This choice makes the model invariant to strictly monotone transformations of the predictor variables.
model.type = "both"  => use both to fit model

tree.size = average number of terminal nodes in generated trees. FP 2004 (Sec. 3.3). tree.size = 2 produces an additive main effects model with no interactions allowed among the predictor variables. tree.size > 2 permits higher order interactions with the permitted interaction order increasing with the value of tree.size.

max.rules = approximate total number of rules generated for regression fitting. Note: with missing values, the actual number of rules generated may be considerably larger than max.rules.

max.trms = maximum number of terms selected for final model

costs = binary classification: misclassification costs (mod.sel = 3 only)
costs[1] = cost for class +1 error
costs[2] = cost for class -1 error

trim.qntl = linear variable conditioning factor. Ignored for model.type = "rules" . FP 2004 (Sec. 5)

samp.fract = fraction of randomly chosen training observations used to produce each tree. FP 2004 (Sec. 2). Note that the default value refers to the "effective" number of training observations  (neff) given by neff = sum(wt)^2/sum(wt^2) for regression. For classification this is multiplied by 4*fpos*(1-fpos) where fpos is the fraction of positive labeled (y = 1) observations.
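The default for samp.fract can be evaluated directly from the expression in the usage line. The sketch below uses a hypothetical helper name and a few made-up values of neff to show how the fraction shrinks as the effective sample size grows.

```r
# Default sampling fraction, as written in the usage line.
default_samp_fract <- function(neff) min(1, (11 * sqrt(neff) + 1) / neff)

neff_values <- c(100, 400, 10000)          # made-up effective sample sizes
fracts <- sapply(neff_values, default_samp_fract)

# neff = 100   -> (110 + 1) / 100    = 1.11, capped at 1
# neff = 400   -> (220 + 1) / 400    = 0.5525
# neff = 10000 -> (1100 + 1) / 10000 = 0.1101
```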

inter.supp = incentive factor for using fewer variables in tree based rules. FP 2004 (Sec. 8.2)

memory.par = scale multiplier (shrinkage factor or learning rate) applied to each new tree when sequentially induced. FP 2004 (Sec. 2)

conv.thr = convergence threshold for regression solutions. Iterations stop when the maximum standardized coefficient change from the previous iteration is less than conv.thr

quiet = T/F => do/don't suppress progress monitor as program executes

tree.store = size of internal tree storage. Decrease value in response to memory allocation error. Increase value for very large values of max.rules and/or tree.size, or in response to diagnostic message or erratic program behavior.

cat.store = size of internal categorical value storage. Decrease value in response to memory allocation error. Increase value for very large values of max.rules and/or tree.size in the presence of many categorical variables (factors) with many levels, or in response to diagnostic message or erratic program behavior.

Output:

rfmod = RuleFit model object representing the model placed in the RuleFit home directory. Can be replaced at a later time using rfrestore.

Printed output at the command line giving the cross-validated model selection criterion value with standard error (test.reps > 1), and number of terms in the resulting model.

Examples:

rulefit(x, y);  rfxyw = rulefit(x, y, w);
rfxycls = rulefit(x, 14, 33, cat.vars=c(2,4,5,7,9), not.used=c(1,3), rfmode="class", tree.size=2)
rfbosthouse=rulefit(bostdat,"MEDV", cat.vars="CHAS", sparse=3, model.type='linear')

Reference: Friedman, J. H. and Popescu, B. E. (2004). Predictive learning via rule ensembles.


getmodel: retrieve current model from RuleFit home directory

Usage:

rfmod = getmodel ()

Arguments: none

Output:

rfmod = RuleFit model object representing the model currently  stored in the RuleFit home directory. Can be replaced in the home directory at a later time using rfrestore.


rfrestore: replace (change) the current model in RuleFit home directory

Usage:

rfrestore (model, x=NULL, y=NULL, wt=rep(1,nrow(x)))

Required argument:

model = RuleFit model object output from rulefit or getmodel

Optional arguments:

x = input predictor data matrix or data frame used to construct model.

y = input response values used to construct model.
If y is a single valued scalar it  is interpreted as a label (number or name) referencing a column of x. Otherwise it is a vector of length nrow(x) containing the  numeric response values.

wt = observation weights used to construct model.
If wt is a single valued scalar it is interpreted as a label (number or name) referencing a column of x. Otherwise it is a vector of length nrow(x) containing the  numeric observation weights.

Output:  none

Examples:  rfrestore (rfmod);  rfrestore (rfbosthouse, bostdat, "MEDV")

Comment: each of the optional arguments need only be included if the corresponding quantities in the RuleFit home directory have changed since the model was created, that is, if rulefit or rfrestore was invoked with different data after the model was created.


rfannotate: replace RuleFit model object text description

Usage:

rfmod = rfannotate (rfmod, "text")

Required arguments:

rfmod = RuleFit model object output from rulefit or getmodel

text = character string

Output:

rfmod = same RuleFit model object as input model with new text description

Examples:  rfbosthouse = rfannotate(rfbosthouse, "This is a RuleFit model for Boston housing data")

Comment: the model description text is printed along with other model information at the command line in response to the command rfmodinfo. The original text for a  rulefit model object is the R command that gave rise to it.


rfmodinfo: view the properties of a RuleFit model object

Usage:

rfmodinfo (model)

Required argument:

model = RuleFit model object

Output: none

Examples:  rfmodinfo (rfbosthouse); rfmodinfo (getmodel ())

Comment: prints at the command line the model description text, the date and time the model was created, the fit summary, and all parameter values used to construct the model.


runstats: obtain fit statistics of a RuleFit model

Usage:

stats = runstats (model)

Required arguments: none

If the argument is missing the current model in the RuleFit home directory is used.

Optional argument:

model = supplied RuleFit model object

Output: list

stats$cri = cross-validated criterion value

stats$err = associated uncertainty estimate

stats$terms = number of terms in the model

Examples:  stats=runstats (rfbosthouse)

Comment: the output quantities are those printed at the command line by rulefit, rfmodinfo, and rfrestore.


rfxval: full cross-validation of RuleFit model

Usage:

xval = rfxval (nfold=10, quiet=F)

Optional arguments:

nfold = number of folds (>= 2)

quiet = T/F => do/don't suppress progress monitor as program executes

Output:  list

Regression:

xval$yp = cross-validated response y  predicted values for each of the training observations.

xval$aae = average-absolute prediction error

xval$rms = root-mean-squared prediction error

Classification:

xval$lo = cross-validated estimates of the log-odds that y = +1 for each of the training observations.

xval$omAUC = 1 - area under ROC curve

xval$errave = average error rate

xval$errpos = positive (y = +1) error rate

xval$errneg = negative (y = -1) error rate

Examples:  rfxval ();  xval= rfxval(20, T)

Comment: Uses current model in the RuleFit home directory. All errors are computed using the observation weights.
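The weighted error summaries can be sketched in base R. The formulas below are assumed definitions (weighted averages, with a log-odds threshold of zero for classification) and the data are made up; the exact computations inside rfxval are not given in this document.

```r
wt <- c(1, 1, 2, 1)                       # made-up observation weights

# Regression: made-up responses and cross-validated predictions.
y  <- c(3.0, 1.0, 4.0, 2.0)
yp <- c(2.5, 1.5, 3.0, 2.0)
aae <- sum(wt * abs(y - yp)) / sum(wt)        # weighted average absolute error
rms <- sqrt(sum(wt * (y - yp)^2) / sum(wt))   # weighted root-mean-squared error

# Classification: made-up labels and cross-validated log-odds.
ycls  <- c(1, 1, -1, -1)
lo    <- c(1.2, -0.3, 0.8, -2.0)
pred  <- ifelse(lo > 0, 1, -1)                # threshold at 0 (assumption)
wrong <- pred != ycls
pos   <- ycls == 1

errave <- sum(wt * wrong) / sum(wt)                  # overall weighted error rate
errpos <- sum(wt[pos] * wrong[pos]) / sum(wt[pos])   # error rate among y = +1
errneg <- sum(wt[!pos] * wrong[!pos]) / sum(wt[!pos])  # error rate among y = -1
```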


rfpred: predict using the RuleFit model

Usage:

yp = rfpred (xp)

Required argument:

xp = values of the  input variables for the observation(s) to be predicted. Must be a data frame if a data frame was used to construct the model in the RuleFit home directory. Otherwise it must be a numeric vector or matrix.

Output:

yp = vector of length nrow(xp) containing the output predictions for each of the observations.

Regression:  yp is used to predict the  response value(s).

Classification: yp is a numeric score representing the estimated log-odds of y = +1. The corresponding probability estimates can be computed as probs = 1.0/(1.0+exp(-yp)).
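The conversion from log-odds to probabilities can be checked with a few made-up scores. The base R function plogis computes the same logistic transformation and is shown only as an equivalent alternative.

```r
yp <- c(-2, 0, 1.5)                  # made-up log-odds scores

probs     <- 1.0 / (1.0 + exp(-yp))  # formula given above
probs_alt <- plogis(yp)              # logistic distribution function in base R

# A score of 0 corresponds to a probability of 0.5;
# positive scores favor y = +1, negative scores favor y = -1.
```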

Example:  yp = rfpred (xp)

Comment: Uses current model in the RuleFit home directory.


varimp: RuleFit model input variable importances

Usage:

vi = varimp (range=NULL, impord=T, x=NULL, wt=rep(1,nrow(x)), rth=0, plot=T, horiz=F, cex.names=1, col='grey', donames=T, las=2)

Optional arguments:

range = indices of the range of variables to be plotted. If there are 100 input variables, then range=1:20 would plot the importances of the first 20 variables, and range=81:100 would plot the importances of the last 20. The default plots the first 30 variables.

impord = flag specifying order of listing and plotting variable importances.
impord = TRUE  => list and display in order of  descending variable importance.
impord = FALSE => list and display in data matrix column order.

x = subset of observations over which importances are to be computed. Must be a data frame if a data frame was used to construct the model in the RuleFit home directory. Otherwise it must be a numeric vector or matrix. If missing then all training observations are used. FP 2004 (Sec. 7)

wt = weights for observations stored in x.
If wt is a single valued scalar it is interpreted as a label (number or name) referencing a column of x. Otherwise it is a vector of length nrow(x) containing the numeric observation weights.

rth = rule importance threshold. Variable importances are computed only using those rules whose importances are greater than rth * (largest rule importance)

plot = plotting flag.
plot = TRUE / FALSE
=> do/don't display barplot

horiz = horizontal plotting flag
horiz = TRUE / FALSE
=> do/don't display barplot horizontally

cex.names = expansion factor for variable names (bar labels)

col = color of  barplot

donames =  barplot variable label flag (horiz = F only)
donames = TRUE / FALSE => do/don't display variable labels on barplot

las = label orientation flag (horiz = F only)
las =1 => horizontal orientation of variable labels
las =2 => vertical orientation of variable labels

Output: list

vi$imp = vector of importances for all variables.

vi$ord = vector of data matrix column numbers corresponding to the elements of vi$imp. vi$imp[k] is the importance of variable (column number) vi$ord[k].

Examples:   varimp ();  varimp (31:40, impord = F, x=xhigh);  vi = varimp(plot = F)

Comment: Uses current model in the RuleFit home directory.
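The relationship between vi$imp and vi$ord can be illustrated with a mock list in the documented shape. The values and the column names below are made up; no RuleFit call is involved.

```r
# Mock varimp output (made-up values): imp[k] belongs to column ord[k].
vi <- list(imp = c(100, 62.5, 40), ord = c(3, 1, 2))
var_names <- c("CRIM", "NOX", "RM")   # hypothetical column names

# Table of importances labeled by variable name.
imp_table <- data.frame(variable   = var_names[vi$ord],
                        importance = vi$imp,
                        stringsAsFactors = FALSE)

# Column numbers of the (up to) five leading entries, for example
# as the vars argument of another procedure.
top <- vi$ord[seq_len(min(5, length(vi$ord)))]
```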


interact: overall strengths of interaction effects for selected variables

Usage:

int = interact (vars, null.mods=NULL, nval=100, plot=T, horiz=F, las=2, cex.names=1, col=c("red","yellow"), ymax=NULL)

Required argument:

vars = vector of variable identifiers (column names or numbers) specifying selected variables to be considered.

Optional arguments:

null.mods = RuleFit null-model object returned from  procedure intnull. FP 2004 (Sec. 8.3)

nval = number of evaluation points used for calculation (larger values provide higher accuracy with a diminishing return; computation grows as nval^2)

plot = plotting flag.
plot = TRUE / FALSE
=> do/don't display barplot

horiz = horizontal plotting flag
horiz = TRUE / FALSE
=> do/don't display barplot horizontally

las = label orientation flag
las =1 => horizontal orientation of variable labels
las =2 => vertical orientation of variable labels

cex.names = expansion factor for variable names (bar labels)

col = foreground and background barplot colors. If null.mods is missing then interaction strengths are plotted using col[2]. If null.mods is specified then the null standard deviations are plotted in col[1] and the difference between the interaction strengths and their expected null values are plotted in col[2]. Note that the col[1] bars are plotted over the col[2] bars, so that the absence of a col[2] bar indicates that the corresponding interaction strength is less than one standard deviation above its expected null value.

ymax = specified vertical scale upper limit for barplot. If missing then maximum plotted interaction strength value is used.

Output:

If null.mods is missing:

int = vector of interaction strengths: int[k] is the interaction strength of  input variable vars[k]

If null.mods is specified: (list)

int$int = vector of interaction strengths: int$int[k] is the interaction strength of input variable vars[k]

int$nullave = vector of expected null interaction strengths: int$nullave[k] is the expected null interaction strength of  variable vars[k]

int$nullstd = vector of null standard deviations: int$nullstd[k] is the standard deviation of the null interaction strength of  variable vars[k]

Examples:

interact (1:10);  interact (vi$ord[1:10], null.mods)
int = interact(c("RM", "NOX", "PTRATIO", "LSTAT"), null.bost, ymax=0.4)

Comment: Uses current model in the RuleFit home directory. See FP 2004 (Sec. 9) for illustrations.
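When null.mods is supplied, the returned list can be used to judge each interaction strength against its null distribution. The sketch below uses a mock list with made-up values; expressing the excess in null standard deviations is one possible summary, not something interact itself returns.

```r
# Mock interact output in the documented list shape (made-up values).
int <- list(int     = c(0.30, 0.08, 0.15),
            nullave = c(0.05, 0.06, 0.05),
            nullstd = c(0.02, 0.03, 0.12))

# Excess of each strength over its expected null value.
excess <- int$int - int$nullave

# The same excess expressed in null standard deviations.
z <- excess / int$nullstd

# Strengths more than one null standard deviation above the null mean,
# matching the condition under which a col[2] bar remains visible.
above_null <- excess > int$nullstd
```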


twovarint: two-variable interaction strengths of a target variable with selected other variables

Usage:

int2var = twovarint (tvar, vars, null.mods=NULL, nval=100, import=F, plot=T, horiz=F, las=2, cex.names=1, col=c("red","yellow"), ymax=NULL)

Required arguments:

tvar = variable identifier (column name or number) specifying the target variable.

vars = vector of variable identifiers (column names or numbers) specifying other selected variables. Must not contain tvar.

Optional arguments:

null.mods = RuleFit null-model object returned from  procedure intnull. FP 2004 (Sec. 8.3)

nval = number of evaluation points used for calculation (larger values provide higher accuracy with a diminishing return; computation grows as nval^2)

import = interaction importance flag
import= TRUE / FALSE => do/don't scale interaction strengths according to their importance to the model. FP 2004 (Sec. 8.1)

plot = plotting flag.
plot = TRUE / FALSE
=> do/don't display barplot

horiz = horizontal plotting flag
horiz = TRUE / FALSE
=> do/don't display barplot horizontally

las = label orientation flag
las =1 => horizontal orientation of variable labels
las =2 => vertical orientation of variable labels

cex.names = expansion factor for variable names (bar labels)

col = foreground and background barplot colors. If null.mods is missing then interaction strengths are plotted using col[2]. If null.mods is specified then the null standard deviations are plotted in col[1] and the difference between the interaction strengths and their expected null values are plotted in col[2]. Note that the col[1] bars are plotted over the col[2] bars, so that the absence of a col[2] bar indicates that the corresponding interaction strength is less than one standard deviation above its expected null value.

ymax = specified vertical scale upper limit for barplot. If missing then maximum plotted interaction strength value is used.

Output:

If null.mods is missing:

int2var = vector of interaction strengths: int2var[k] is the two-variable interaction strength of tvar with input variable vars[k]

If null.mods is specified: (list)

int2var$int = vector of interaction strengths: int2var$int[k] is the interaction strength of tvar with input variable vars[k]

int2var$nullave = vector of expected null interaction strengths: int2var$nullave[k] is the expected null interaction strength of tvar with variable vars[k]

int2var$nullstd = vector of null standard deviations: int2var$nullstd[k] is the standard deviation of the null interaction strength of tvar with variable vars[k]

Examples:

twovarint (6, c(1:5,7:13));  int2var = twovarint ("Var 1", c("Var 2", "Var 3"), null.mods)
int2var= twovarint ("PTRATIO", c("RM", "NOX", "LSTAT"), null.bost, ymax=0.3)

Comment: Uses current model in the RuleFit home directory. See FP 2004 (Sec. 9) for illustrations.


threevarint: three-variable interaction strengths of two target variables and selected other variables

Usage:

int3var = threevarint (tvar1, tvar2, vars, null.mods=NULL, nval=100, import=F, plot=T, horiz=F, las=2, cex.names=1, col=c("red","yellow"), ymax=NULL)

Required arguments:

tvar1 = variable identifier (column name or number) specifying the first target variable.

tvar2 = variable identifier (column name or number) specifying the second target variable. Must be different than tvar1.

vars = vector of variable identifiers (column names or numbers) specifying other selected variables. Must not contain tvar1 or tvar2.

Optional arguments:

null.mods = RuleFit null-model object returned from  procedure intnull. FP 2004 (Sec. 8.3)

nval = number of evaluation points used for calculation (larger values provide higher accuracy with a diminishing return; computation grows as nval^2)

import = interaction importance flag
import= TRUE / FALSE => do/don't scale interaction strengths according to their importance to the model. FP 2004 (Sec. 8.1)

plot = plotting flag.
plot = TRUE / FALSE
=> do/don't display barplot

horiz = horizontal plotting flag
horiz = TRUE / FALSE
=> do/don't display barplot horizontally

las = label orientation flag
las =1 => horizontal orientation of variable labels
las =2 => vertical orientation of variable labels

cex.names = expansion factor for variable names (bar labels)

col = foreground and background barplot colors. If null.mods is missing then interaction strengths are plotted using col[2]. If null.mods is specified then the null standard deviations are plotted in col[1] and the difference between the interaction strengths and their expected null values are plotted in col[2]. Note that the col[1] bars are plotted over the col[2] bars, so that the absence of a col[2] bar indicates that the corresponding interaction strength is less than one standard deviation above its expected null value.

ymax = specified vertical scale upper limit for barplot. If missing then maximum plotted interaction strength value is used.

Output:

If null.mods is missing:

int3var = vector of interaction strengths: int3var[k] is the three-variable interaction strength of tvar1, tvar2, and input variable vars[k]

If null.mods is specified: (list)

int3var$int = vector of interaction strengths: int3var$int[k] is the three-variable interaction strength of tvar1, tvar2, and input variable vars[k]

int3var$nullave = vector of expected null interaction strengths: int3var$nullave[k] is the expected null three-variable interaction strength of tvar1, tvar2, and variable vars[k]

int3var$nullstd = vector of null standard deviations: int3var$nullstd[k] is the standard deviation of the null three-variable interaction strength of tvar1, tvar2, and variable vars[k]

Examples:

threevarint (5,6, c(1:4,7:13));  int3var = threevarint ("Var 1",  "Var 2", c("Var 3", "Var 4"), null.mods)
int3var= threevarint ("RM", "PTRATIO", c("DIS", "NOX", "LSTAT"), null.bost, ymax=0.2)

Comment: Uses current model in the RuleFit home directory. See FP 2004 (Sec. 9) for illustrations.


intnull: compute bootstrapped null interaction models to calibrate interaction effects

Usage:

null.models = intnull (ntimes=10, null.mods=NULL, quiet=F)

Optional arguments:

ntimes = number of null models produced

null.mods = RuleFit null-model object previously produced by intnull. If missing, a new null-model object is created. If present, the new null-models will be added to those contained in the input null model object.

quiet = T/F => do/don't suppress progress monitor as program executes

Output:

null.models = RuleFit null-model object containing the generated  bootstrap null models. It can be used as input to interact, twovarint, and threevarint to calibrate interaction effects.

Examples: bost.null= intnull ();   bost.null= intnull(5, bost.null)

Comment: Uses current  RuleFit model in the RuleFit home directory. The produced null-model object can only be used as input to interact, twovarint, threevarint or intnull  when this RuleFit model and its input data are stored in the RuleFit home directory (see rfmodinfo and rfrestore).  See FP 2004 (Sec. 8.3).


rfnullinfo: view identifier of RuleFit null-model object

Usage:

rfnullinfo (null.models)

Required argument:

null.models = RuleFit null-model object previously produced by intnull.

Output: none

Example:  rfnullinfo (bost.null)

Comment: prints at the command line the number of bootstrapped null interaction models contained in null.models, and the date and time associated with the RuleFit model that was in the RuleFit home directory at the time the null-model object was created by intnull. The null-model object can only be used as input to interact, twovarint, threevarint, or intnull when this RuleFit model and its input data are stored in the RuleFit home directory (see rfmodinfo and rfrestore).


rules: print RuleFit rules in order of importance

Usage:

rules(beg=1, end=beg+9, x=NULL, wt=rep(1,nrow(x)))

Optional arguments:

beg = first rule to be printed

end = last rule to be printed

x = subset of observations over which importances are to be computed. Must be a data frame if a data frame was used to construct the model in the RuleFit home directory. Otherwise it must be a numeric vector or matrix. If missing then all training observations are used. FP 2004 (Sec. 6)

wt = weights for observations stored in x.
If wt is a single valued scalar it is interpreted as a label (number or name) referencing a column of x. Otherwise it is a vector of length nrow(x) containing the numeric observation weights.

Output: none

Examples: rules ();  rules (11);  rules (21, 25); rules (x=xhigh)

Comment: Uses current model in the RuleFit home directory. If a referenced variable is of type factor in a data frame used to construct the RuleFit model, then its values correspond to the codes underlying the factor levels, not the numeric representation of the labels. Otherwise they represent the actual values encoded in the input data interpreted as type numeric.
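The distinction between factor codes and the numeric reading of factor labels is a general property of R factors and can be seen with a small made-up example:

```r
f <- factor(c("10", "30", "20", "10"))   # labels that look numeric

lev     <- levels(f)                     # "10" "20" "30" (sorted labels)
codes   <- as.integer(f)                 # underlying codes: 1 3 2 1
labels  <- as.numeric(as.character(f))   # numeric reading of labels: 10 30 20 10

# Translate codes, as they would appear in printed rules, back to labels.
decoded <- lev[codes]
```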


singleplot: display single variable partial dependence plots

Usage:

singleplot (vars, qntl=0.025, nval=200, nav=500, catvals=NULL, samescale=F, horiz=F, las=2, cex.names=1, col="cyan", denqnt=0.1)

Required argument:

vars = vector of variable identifiers (column names or numbers) specifying selected variables to be plotted.

Optional arguments:

qntl = trimming factor for plotting numeric variables. Plots are shown for variable values in the range [quantile (qntl) - quantile(1-qntl)]. (Ignored for categorical variables (factors).)

nval = maximum number of abscissa evaluation points for numeric variables. (Ignored for categorical variables (factors).)

nav = maximum number of observations used for averaging calculations. (larger values provide higher accuracy with a diminishing return; computation grows linearly with nav)

catvals = vector of names for values (levels) of categorical variable (factor). (Ignored for numeric variables or length(vars) > 1)

samescale = plot vertical scaling flag .
samescale = TRUE / FALSE
=> do/don't require same vertical scale for all plots.

horiz = plot orientation flag for categorical variable barplots
horiz = T/F => do/don't plot bars horizontally

las = label orientation flag for categorical variable plots (horiz = F, only)
las = 1 => horizontal orientation of value (level) names stored in catvals (if present)
las = 2 => vertical orientation of value (level) names stored in catvals (if present)

cex.names = expansion factor for axis names (bar labels) for categorical variable barplots

col = color of barplot for categorical variables

denqnt = quantile for data density tick marks along upper plot boundary  for numeric variables ( < 1)
denqnt <= 0 => no data density tick marks displayed

Output: none

Examples: singleplot ("DIS");  singleplot (1:5); singleplot(4, catvals=levels(boston[[4]]))
singleplot(c("CRIM","NOX","RM","PTRATIO","LSTAT"), samescale=T)

Comment: Uses current model in the RuleFit home directory. Data density tick marks for tied quantiles are slightly jittered (one percent of plot range). If a categorical variable is of type factor in a data frame used to construct the RuleFit model, then its values correspond to the codes underlying the factor levels, not the numeric representation of the labels. Otherwise they represent the actual values encoded in the input data interpreted as type numeric. See FP 2004 (Sec. 8.1).
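The plotting range implied by qntl can be reproduced with the base R quantile function. This sketch uses made-up data and the default quantile type of R, which may differ in detail from the internal computation in singleplot.

```r
x    <- seq(1, 1000)      # made-up numeric predictor values
qntl <- 0.025             # default trimming factor

# Lower and upper ends of the plotted range.
plot_range <- quantile(x, probs = c(qntl, 1 - qntl))

# Observations falling inside the plotted range.
inside <- x >= plot_range[1] & x <= plot_range[2]
n_inside <- sum(inside)   # 950 of the 1000 values
```

With qntl = 0.025 the extreme 2.5% of values on each side fall outside the plot, so sparse tails do not dominate the horizontal scale.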


pairplot: display a two variable partial dependence plot

Usage:

pairplot (var1, var2, type="image", chgvars=F, qntl=0.025, nval=200, nav=500, vals1=NULL, vals2=NULL, theta=30, phi=15, col='cyan', horiz=F, las=2, cex.names=1)

Required arguments:

var1= variable identifier (column name or number) specifying one of the variables to be plotted.

var2= variable identifier (column name or number) specifying the other variable to be plotted. Must not be the same as var1.

Optional arguments:

type = flag for type of plot when both var1 and var2 are numeric
type = "image"
=> heat map plot
type = "persp" => perspective mesh plot
type = "contour" => contour plot

chgvars = flag for changing plotting relationship when both var1 and var2 are categorical (factors)
chgvars = FALSE => plot the partial dependence on the variable (factor) with the most values (levels), for each of the  respective values (levels) of the other variable (factor)
chgvars = TRUE => reverse this relationship

qntl = trimming factor for plotting numeric variables. Plots are shown for variable values in the range [quantile (qntl) - quantile(1-qntl)]. (Ignored for categorical variables (factors).)

nval = maximum number of evaluation points for numeric variables. (Ignored for categorical variables).

nav = maximum number of observations used for averaging calculations. (larger values provide higher accuracy with a diminishing return; computation grows linearly with nav)

vals1 = vector of names for values (levels) of var1 if it is categorical (factor). (Ignored if var1 is numeric)

vals2 = vector of names for values (levels) of var2 if it is categorical (factor). (Ignored if var2 is numeric)

theta, phi = angles defining the viewing direction for perspective mesh plot. theta gives the azimuthal direction and phi the colatitude. (Ignored unless both var1 and var2 are numeric and type = "persp")

col = color of barplots for two categorical variables (factors) or perspective mesh plot for two numeric variables.

horiz = plot orientation for categorical variable barplots
horiz = T/F => do/don't plot bars horizontally

las = label orientation flag for categorical variable plots (horiz = F, only)
las =1 => horizontal orientation of value (level) names stored in vals1 and/or vals2 (if present).
las =2 => vertical orientation of value (level) names stored in vals1 and/or vals2 (if present).

cex.names = expansion factor for axis names (bar labels)  for categorical variable barplots

Output: none

Examples: pairplot ("NOX", "RM");  pairplot (1,5, type="persp");
pairplot ("SEX", "DOMICILE", vals1=c("MALE", "FEMALE"), vals2=c("HOUSE", "CONDO", "TRAILER"))
pairplot(3, 14, vals2=levels(x[[14]])); pairplot(7, 14, vals1=levels(x[[7]]), vals2=levels(x[[14]]))

Comment: Uses current model in the RuleFit home directory. If a categorical variable is of type factor in a data frame used to construct the RuleFit model, then its values correspond to the codes underlying the factor levels, not the numeric representation of the labels. Otherwise they represent the actual values encoded in the input data interpreted as type numeric.


rfversion: print date and version number of current RuleFit installation

Usage:

rfversion ()

Arguments: none

Output: none

Example: rfversion ()


www@stat.stanford.edu