Title: | Efficiency Analysis Trees |
---|---|
Description: | Functions are provided to determine production frontiers and technical efficiency measures through non-parametric techniques based upon regression trees. The package includes code for estimating radial input, output, directional and additive measures, plotting graphical representations of the scores and the production frontiers by means of trees, and determining rankings of importance of input variables in the analysis. Additionally, an adaptation of Random Forest by a set of individual Efficiency Analysis Trees for estimating technical efficiency is also included. More details in: <doi:10.1016/j.eswa.2020.113783>. |
Authors: | Miriam Esteve [cre, aut] , Víctor España [aut] , Juan Aparicio [aut] , Xavier Barber [aut] |
Maintainer: | Miriam Esteve <[email protected]> |
License: | GPL-3 |
Version: | 0.1.2 |
Built: | 2024-10-27 06:29:33 UTC |
Source: | https://github.com/miriamesteve/eat |
This function gets the minimum alpha for each subtree evaluated during the pruning procedure of the Efficiency Analysis Trees technique.
alpha(tree)
tree |
A list containing the EAT nodes. |
Numeric value corresponding to the minimum alpha associated with a suitable node to be pruned.
Bootstrap aggregating for data.
bagging(data, x, y)
data |
Dataframe containing the variables in the model. |
x |
Column input indexes in data. |
y |
Column output indexes in data. |
List containing the training dataframe and a list with a binary response: 1 if an observation has been selected for training and 0 in any other case.
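The sampling step behind bagging can be sketched in a few lines of R. This is an illustrative sketch only; the function name bagging_sketch and its return format are hypothetical, not the package's internal code:

# Illustrative bootstrap-aggregating sampling step (hypothetical sketch,
# not the package's internal implementation).
bagging_sketch <- function(data) {
  n <- nrow(data)
  idx <- sample(seq_len(n), n, replace = TRUE)  # bootstrap draw
  list(
    training = data[idx, , drop = FALSE],       # bootstrap training set
    selected = as.integer(seq_len(n) %in% idx)  # 1 if drawn for training, 0 otherwise
  )
}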
This function generates a barplot with the importance of each predictor.
barplot_importance(m, threshold)
m |
Dataframe with the importance of each predictor. |
threshold |
Importance score value at which a horizontal line is drawn. |
Barplot representing each variable on the x-axis and its importance on the y-axis.
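As an illustration of the expected inputs, the following sketch draws a comparable plot with base graphics. The data frame m below is made up for illustration; the package builds its plot internally:

# Hypothetical importance data frame: one row per predictor.
m <- data.frame(Importance = c(100, 65, 40),
                row.names = c("x1", "x2", "x3"))

# Bar per variable, importance on the y-axis, dashed line at the threshold.
barplot(m$Importance, names.arg = rownames(m),
        ylab = "Importance", ylim = c(0, 100))
abline(h = 70, lty = 2)  # threshold line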
This function computes the root mean squared error (RMSE) for a set of Efficiency Analysis Trees models built with a grid of given hyperparameters.
bestEAT( training, test, x, y, numStop = 5, fold = 5, max.depth = NULL, max.leaves = NULL, na.rm = TRUE )
training |
Training dataframe. |
test |
Test dataframe. |
x |
Column input indexes in training. |
y |
Column output indexes in training. |
numStop |
Minimum number of observations in a node for a split to be attempted. |
fold |
Number of folds into which the dataset is divided to apply cross-validation during the pruning. |
max.depth |
Maximum depth of the tree. |
max.leaves |
Maximum number of leaf nodes. |
na.rm |
Logical. If TRUE, NA rows are omitted. |
A data.frame
with the sets of hyperparameters and the root mean squared error (RMSE) associated for each model.
data("PISAindex")
n <- nrow(PISAindex)               # Observations in the dataset
selected <- sample(1:n, n * 0.7)   # Training indexes
training <- PISAindex[selected, ]  # Training set
test <- PISAindex[- selected, ]    # Test set
bestEAT(training = training, test = test,
        x = 6:9, y = 3,
        numStop = c(3, 5, 7), fold = c(5, 7, 10))
This function computes the root mean squared error (RMSE) for a set of Random Forest + Efficiency Analysis Trees models built with a grid of given hyperparameters.
bestRFEAT( training, test, x, y, numStop = 5, m = 50, s_mtry = c("5", "BRM"), na.rm = TRUE )
training |
Training dataframe. |
test |
Test dataframe. |
x |
Column input indexes in training. |
y |
Column output indexes in training. |
numStop |
Minimum number of observations in a node for a split to be attempted. |
m |
Number of trees to be built. |
s_mtry |
Number of variables randomly sampled as candidates at each split (one or more options to be tested). |
na.rm |
Logical. If TRUE, NA rows are omitted. |
A data.frame
with the sets of hyperparameters and the root mean squared error (RMSE) associated for each model.
data("PISAindex")
n <- nrow(PISAindex)               # Observations in the dataset
selected <- sample(1:n, n * 0.7)   # Training indexes
training <- PISAindex[selected, ]  # Training set
test <- PISAindex[- selected, ]    # Test set
bestRFEAT(training = training, test = test,
          x = 6:9, y = 3,
          numStop = c(3, 5), m = c(20, 30), s_mtry = c("1", "BRM"))
Banker, Charnes and Cooper programming model with input orientation for a Convexified Efficiency Analysis Trees model.
CEAT_BCC_in(j, scores, x_k, y_k, atreeTk, ytreeTk, nX, nY, N_leaves)
j |
Number of DMUs. |
scores |
Matrix in which the efficiency scores are saved. |
x_k |
Matrix of inputs of the DMUs. |
y_k |
Matrix of outputs of the DMUs. |
atreeTk |
Matrix with the lower bounds (a) of the leaf nodes. |
ytreeTk |
Matrix with the output predictions of the leaf nodes. |
nX |
Number of inputs. |
nY |
Number of outputs. |
N_leaves |
Number of leaf nodes. |
A numerical vector with scores.
Banker, Charnes and Cooper programming model with output orientation for a Convexified Efficiency Analysis Trees model.
CEAT_BCC_out(j, scores, x_k, y_k, atreeTk, ytreeTk, nX, nY, N_leaves)
j |
Number of DMUs. |
scores |
Matrix in which the efficiency scores are saved. |
x_k |
Matrix of inputs of the DMUs. |
y_k |
Matrix of outputs of the DMUs. |
atreeTk |
Matrix with the lower bounds (a) of the leaf nodes. |
ytreeTk |
Matrix with the output predictions of the leaf nodes. |
nX |
Number of inputs. |
nY |
Number of outputs. |
N_leaves |
Number of leaf nodes. |
A numerical vector with efficiency scores.
Directional Distance Function for a Convexified Efficiency Analysis Trees model.
CEAT_DDF(j, scores, x_k, y_k, atreeTk, ytreeTk, nX, nY, N_leaves)
j |
Number of DMUs. |
scores |
Matrix in which the efficiency scores are saved. |
x_k |
Matrix of inputs of the DMUs. |
y_k |
Matrix of outputs of the DMUs. |
atreeTk |
Matrix with the lower bounds (a) of the leaf nodes. |
ytreeTk |
Matrix with the output predictions of the leaf nodes. |
nX |
Number of inputs. |
nY |
Number of outputs. |
N_leaves |
Number of leaf nodes. |
A numerical vector with scores.
Russell Model with input orientation for a Convexified Efficiency Analysis Trees model.
CEAT_RSL_in(j, scores, x_k, y_k, atreeTk, ytreeTk, nX, nY, N_leaves)
j |
Number of DMUs. |
scores |
Matrix in which the efficiency scores are saved. |
x_k |
Matrix of inputs of the DMUs. |
y_k |
Matrix of outputs of the DMUs. |
atreeTk |
Matrix with the lower bounds (a) of the leaf nodes. |
ytreeTk |
Matrix with the output predictions of the leaf nodes. |
nX |
Number of inputs. |
nY |
Number of outputs. |
N_leaves |
Number of leaf nodes. |
A numerical vector with scores.
Russell Model with output orientation for a Convexified Efficiency Analysis Trees model.
CEAT_RSL_out(j, scores, x_k, y_k, atreeTk, ytreeTk, nX, nY, N_leaves)
j |
Number of DMUs. |
scores |
Matrix in which the efficiency scores are saved. |
x_k |
Matrix of inputs of the DMUs. |
y_k |
Matrix of outputs of the DMUs. |
atreeTk |
Matrix with the lower bounds (a) of the leaf nodes. |
ytreeTk |
Matrix with the output predictions of the leaf nodes. |
nX |
Number of inputs. |
nY |
Number of outputs. |
N_leaves |
Number of leaf nodes. |
A numerical vector with scores.
Weighted Additive Model for a Convexified Efficiency Analysis Trees model.
CEAT_WAM(j, scores, x_k, y_k, atreeTk, ytreeTk, nX, nY, N_leaves, weights)
j |
Number of DMUs. |
scores |
Matrix in which the efficiency scores are saved. |
x_k |
Matrix of inputs of the DMUs. |
y_k |
Matrix of outputs of the DMUs. |
atreeTk |
Matrix with the lower bounds (a) of the leaf nodes. |
ytreeTk |
Matrix with the output predictions of the leaf nodes. |
nX |
Number of inputs. |
nY |
Number of outputs. |
N_leaves |
Number of leaf nodes. |
weights |
Character. Weights for the weighted additive model. |
A numerical vector with scores.
This function verifies whether a specific tree satisfies the Pareto-dominance properties.
checkEAT(tree)
tree |
A list containing the EAT nodes. |
Message indicating if the tree is acceptable or warning in case of breaking any Pareto-dominance relationship.
This function indicates whether one node dominates another or whether there is no Pareto-dominance relationship between them.
comparePareto(t1, t2)
t1 |
A first node. |
t2 |
A second node. |
-1 if t1 dominates t2, 1 if t2 dominates t1 and 0 if there are no Pareto-dominance relationships.
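The dominance rule can be sketched as follows, representing each node (hypothetically) by an input vector x and an output vector y: a node dominates another if it uses no more of every input while producing no less of every output. This sketch is illustrative, not the package's internal code:

# Hypothetical sketch of a Pareto-dominance comparison between two nodes,
# each given as list(x = inputs, y = outputs).
compare_pareto_sketch <- function(t1, t2) {
  if (all(t1$x <= t2$x) && all(t1$y >= t2$y)) {
    -1                                  # t1 dominates t2
  } else if (all(t2$x <= t1$x) && all(t2$y >= t1$y)) {
    1                                   # t2 dominates t1
  } else {
    0                                   # no Pareto-dominance relationship
  }
}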
This function creates a deep Efficiency Analysis Tree and a set of possible prunings by the weakest-link pruning procedure.
deepEAT(data, x, y, numStop = 5, max.depth = NULL, max.leaves = NULL)
data |
Dataframe containing the variables in the model. |
x |
Column input indexes in data. |
y |
Column output indexes in data. |
numStop |
Minimum number of observations in a node for a split to be attempted. |
max.depth |
Maximum depth of the tree. |
max.leaves |
Maximum number of leaf nodes. |
A list
containing each possible pruning for the deep tree and its associated alpha value.
This function estimates a stepped production frontier through regression trees.
EAT( data, x, y, numStop = 5, fold = 5, max.depth = NULL, max.leaves = NULL, na.rm = TRUE )
data |
Dataframe containing the variables in the model. |
x |
Column input indexes in data. |
y |
Column output indexes in data. |
numStop |
Minimum number of observations in a node for a split to be attempted. |
fold |
Number of folds into which the dataset is divided to apply cross-validation during the pruning. |
max.depth |
Maximum depth of the tree. |
max.leaves |
Maximum number of leaf nodes. |
na.rm |
Logical. If TRUE, NA rows are omitted. |
The EAT function generates a regression tree model based on CART (Breiman et al. 1984) under a new approach that guarantees obtaining a stepped production frontier that fulfills the property of free disposability. This frontier shares the aforementioned aspects with the FDH frontier (Deprins and Simar 1984) but enhances some of its disadvantages such as the overfitting problem or the underestimation of technical inefficiency. More details in Esteve et al. (2020).
An EAT
object containing:
data
df
: data frame containing the variables in the model.
x
: input indexes in data.
y
: output indexes in data.
input_names
: input variable names.
output_names
: output variable names.
row_names
: rownames in data.
control
fold
: fold hyperparameter value.
numStop
: numStop hyperparameter value.
max.leaves
: max.leaves hyperparameter value.
max.depth
: max.depth hyperparameter value.
na.rm
: na.rm hyperparameter value.
tree
: list structure containing the EAT nodes.
nodes_df
: data frame containing the following information for each node.
id
: node index.
SL
: left child node index.
SR
: right child node index.
N
: number of observations at the node.
Proportion
: proportion of observations at the node.
y
: the output predictions.
R
: the error at the node.
index
: observation indexes at the node.
model
nodes
: total number of nodes at the tree.
leaf_nodes
: number of leaf nodes at the tree.
a
: lower bound of the nodes.
y
: output predictions.
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984).
Classification and Regression Trees.
CRC Press.
Deprins D, Simar L (1984).
“Measuring labor efficiency in post offices.” In Marchand M, Pestieau P, Tulkens H (eds.), The Performance of Public Enterprises: Concepts and Measurements.
Esteve M, Aparicio J, Rabasa A, Rodriguez-Sala JJ (2020).
“Efficiency analysis trees: A new methodology for estimating production frontiers through decision trees.”
Expert Systems with Applications, 162, 113783.
# ====================== #
# Single output scenario #
# ====================== #
simulated <- Y1.sim(N = 50, nX = 3)
EAT(data = simulated, x = c(1, 2, 3), y = 4,
    numStop = 10, fold = 5, max.leaves = 6)

# ===================== #
# Multi output scenario #
# ===================== #
simulated <- X2Y2.sim(N = 50, border = 0.1)
EAT(data = simulated, x = c(1, 2), y = c(3, 4),
    numStop = 10, fold = 7, max.depth = 7)
Banker, Charnes and Cooper programming model with input orientation for an Efficiency Analysis Trees model.
EAT_BCC_in(j, scores, x_k, y_k, atreeTk, ytreeTk, nX, nY, N_leaves)
j |
Number of DMUs. |
scores |
Matrix in which the efficiency scores are saved. |
x_k |
Matrix of inputs of the DMUs. |
y_k |
Matrix of outputs of the DMUs. |
atreeTk |
Matrix with the lower bounds (a) of the leaf nodes. |
ytreeTk |
Matrix with the output predictions of the leaf nodes. |
nX |
Number of inputs. |
nY |
Number of outputs. |
N_leaves |
Number of leaf nodes. |
A numerical vector with efficiency scores.
Banker, Charnes and Cooper programming model with output orientation for an Efficiency Analysis Trees model.
EAT_BCC_out(j, scores, x_k, y_k, atreeTk, ytreeTk, nX, nY, N_leaves)
j |
Number of DMUs. |
scores |
Matrix in which the efficiency scores are saved. |
x_k |
Matrix of inputs of the DMUs. |
y_k |
Matrix of outputs of the DMUs. |
atreeTk |
Matrix with the lower bounds (a) of the leaf nodes. |
ytreeTk |
Matrix with the output predictions of the leaf nodes. |
nX |
Number of inputs. |
nY |
Number of outputs. |
N_leaves |
Number of leaf nodes. |
A numerical vector with efficiency scores.
Directional Distance Function for an Efficiency Analysis Trees model.
EAT_DDF(j, scores, x_k, y_k, atreeTk, ytreeTk, nX, nY, N_leaves)
j |
Number of DMUs. |
scores |
Matrix in which the efficiency scores are saved. |
x_k |
Matrix of inputs of the DMUs. |
y_k |
Matrix of outputs of the DMUs. |
atreeTk |
Matrix with the lower bounds (a) of the leaf nodes. |
ytreeTk |
Matrix with the output predictions of the leaf nodes. |
nX |
Number of inputs. |
nY |
Number of outputs. |
N_leaves |
Number of leaf nodes. |
A numerical vector with efficiency scores.
This function returns the frontier output levels for an Efficiency Analysis Trees model.
EAT_frontier_levels(object)
object |
An EAT object. |
A data.frame
with the frontier output levels at the leaf nodes of the Efficiency Analysis Trees model introduced.
simulated <- Y1.sim(N = 50, nX = 3)
EAT_model <- EAT(data = simulated, x = c(1, 2, 3), y = 4,
                 numStop = 10, fold = 5)
EAT_frontier_levels(EAT_model)
This function returns a descriptive summary statistics table for each output variable calculated from the leaf nodes observations of an Efficiency Analysis Trees model. Specifically, it computes the number of observations, the proportion of observations, the mean, the variance, the standard deviation, the minimum, the first quartile, the median, the third quartile, the maximum and the root mean squared error.
EAT_leaf_stats(object)
object |
An EAT object. |
A list
or a data.frame
(for 1 output scenario) with the following summary statistics:
N
: number of observations.
Proportion
: proportion of observations.
mean
: mean.
var
: variance.
sd
: standard deviation.
min
: minimum.
Q1
: first quartile.
median
: median.
Q3
: third quartile.
max
: maximum.
RMSE
: root mean squared error.
simulated <- Y1.sim(N = 50, nX = 3)
EAT_model <- EAT(data = simulated, x = c(1, 2, 3), y = 4,
                 numStop = 10, fold = 5)
EAT_leaf_stats(EAT_model)
This function saves information about the Efficiency Analysis Trees model.
EAT_object( data, x, y, rownames, numStop, fold, max.depth, max.leaves, na.rm, tree )
data |
Dataframe containing the variables in the model. |
x |
Column input indexes in data. |
y |
Column output indexes in data. |
rownames |
Rownames in data. |
numStop |
Minimum number of observations in a node for a split to be attempted. |
fold |
Number of folds into which the dataset is divided to apply cross-validation during the pruning. |
max.depth |
Maximum depth of the tree. |
max.leaves |
Maximum number of leaf nodes. |
na.rm |
Logical. If TRUE, NA rows are omitted. |
tree |
A list containing the EAT nodes. |
An EAT
object.
Russell Model with input orientation for an Efficiency Analysis Trees model.
EAT_RSL_in(j, scores, x_k, y_k, atreeTk, ytreeTk, nX, nY, N_leaves)
j |
Number of DMUs. |
scores |
Matrix in which the efficiency scores are saved. |
x_k |
Matrix of inputs of the DMUs. |
y_k |
Matrix of outputs of the DMUs. |
atreeTk |
Matrix with the lower bounds (a) of the leaf nodes. |
ytreeTk |
Matrix with the output predictions of the leaf nodes. |
nX |
Number of inputs. |
nY |
Number of outputs. |
N_leaves |
Number of leaf nodes. |
A numerical vector with efficiency scores.
Russell Model with output orientation for an Efficiency Analysis Trees model.
EAT_RSL_out(j, scores, x_k, y_k, atreeTk, ytreeTk, nX, nY, N_leaves)
j |
Number of DMUs. |
scores |
Matrix in which the efficiency scores are saved. |
x_k |
Matrix of inputs of the DMUs. |
y_k |
Matrix of outputs of the DMUs. |
atreeTk |
Matrix with the lower bounds (a) of the leaf nodes. |
ytreeTk |
Matrix with the output predictions of the leaf nodes. |
nX |
Number of inputs. |
nY |
Number of outputs. |
N_leaves |
Number of leaf nodes. |
A numerical vector with efficiency scores.
This function returns the number of leaf nodes for an Efficiency Analysis Trees model.
EAT_size(object)
object |
An EAT object. |
Number of leaf nodes of the Efficiency Analysis Trees model introduced.
simulated <- Y1.sim(N = 50, nX = 3)
EAT_model <- EAT(data = simulated, x = c(1, 2, 3), y = 4,
                 numStop = 10, fold = 5)
EAT_size(EAT_model)
Weighted Additive Model for an Efficiency Analysis Trees model.
EAT_WAM(j, scores, x_k, y_k, atreeTk, ytreeTk, nX, nY, N_leaves, weights)
j |
Number of DMUs. |
scores |
Matrix in which the efficiency scores are saved. |
x_k |
Matrix of inputs of the DMUs. |
y_k |
Matrix of outputs of the DMUs. |
atreeTk |
Matrix with the lower bounds (a) of the leaf nodes. |
ytreeTk |
Matrix with the output predictions of the leaf nodes. |
nX |
Number of inputs. |
nY |
Number of outputs. |
N_leaves |
Number of leaf nodes. |
weights |
Character. Weights for the weighted additive model. |
A numerical vector with efficiency scores.
This function computes the efficiency scores for each DMU through a Convexified Efficiency Analysis Trees model.
efficiencyCEAT( data, x, y, object, scores_model, digits = 3, DEA = TRUE, print.table = FALSE, na.rm = TRUE )
data |
Dataframe containing the variables in the model. |
x |
Column input indexes in data. |
y |
Column output indexes in data. |
object |
An EAT object. |
scores_model |
Mathematical programming model to calculate scores.
|
digits |
Decimal units for scores. |
DEA |
Logical. If TRUE, DEA scores are also computed. |
print.table |
Logical. If TRUE, a summary descriptive table of the efficiency scores is displayed. |
na.rm |
Logical. If TRUE, NA rows are omitted. |
A data.frame
with the efficiency scores computed through a Convexified Efficiency Analysis Trees model. Optionally, a summary descriptive table of the efficiency scores can be displayed.
simulated <- X2Y2.sim(N = 50, border = 0.2)
EAT_model <- EAT(data = simulated, x = c(1, 2), y = c(3, 4))
efficiencyCEAT(data = simulated, x = c(1, 2), y = c(3, 4),
               object = EAT_model, scores_model = "BCC.OUT",
               digits = 2, DEA = TRUE, print.table = TRUE, na.rm = TRUE)
Density plot for efficiency scores.
efficiencyDensity(df_scores, model = c("EAT", "FDH"))
df_scores |
Dataframe with the efficiency scores. |
model |
Model or models whose scores are plotted. |
Density plot for efficiency scores.
simulated <- X2Y2.sim(N = 50, border = 0.2)
EAT_model <- EAT(data = simulated, x = c(1, 2), y = c(3, 4))
scores <- efficiencyEAT(data = simulated, x = c(1, 2), y = c(3, 4),
                        object = EAT_model, scores_model = "BCC.OUT",
                        digits = 2, FDH = TRUE, na.rm = TRUE)
efficiencyDensity(df_scores = scores, model = c("EAT", "FDH"))
This function computes the efficiency scores for each DMU through an Efficiency Analysis Trees model.
efficiencyEAT( data, x, y, object, scores_model, digits = 3, FDH = TRUE, print.table = FALSE, na.rm = TRUE )
data |
Dataframe containing the variables in the model. |
x |
Column input indexes in data. |
y |
Column output indexes in data. |
object |
An EAT object. |
scores_model |
Mathematical programming model to calculate scores.
|
digits |
Decimal units for scores. |
FDH |
Logical. If TRUE, FDH scores are also computed. |
print.table |
Logical. If TRUE, a summary descriptive table of the efficiency scores is displayed. |
na.rm |
Logical. If TRUE, NA rows are omitted. |
A data.frame
with the efficiency scores computed through an Efficiency Analysis Trees model. Optionally, a summary descriptive table of the efficiency scores can be displayed.
simulated <- X2Y2.sim(N = 50, border = 0.2)
EAT_model <- EAT(data = simulated, x = c(1, 2), y = c(3, 4))
efficiencyEAT(data = simulated, x = c(1, 2), y = c(3, 4),
              object = EAT_model, scores_model = "BCC.OUT",
              digits = 2, FDH = TRUE, print.table = TRUE, na.rm = TRUE)
This function returns a jitter plot from ggplot2
. This graphic shows how DMUs are grouped into leaf nodes in a model built using the EAT
function. Each leaf node groups DMUs with the same level of resources. The dot and the black line represent, respectively, the mean value and the standard deviation of the scores of its node. Additionally, efficient DMU labels are always displayed based on the model entered in the scores_model
argument. Finally, the user can specify an upper bound upb
and a lower bound lwb
in order to show, in addition, the labels whose efficiency score lies between them.
efficiencyJitter(object, df_scores, scores_model, upb = NULL, lwb = NULL)
object |
An EAT object. |
df_scores |
Dataframe with the efficiency scores. |
scores_model |
Mathematical programming model to calculate scores.
|
upb |
Numeric. Upper bound for labeling. |
lwb |
Numeric. Lower bound for labeling. |
Jitter plot with DMUs and scores.
simulated <- X2Y2.sim(N = 50, border = 0.2)
EAT_model <- EAT(data = simulated, x = c(1, 2), y = c(3, 4))
EAT_scores <- efficiencyEAT(data = simulated, x = c(1, 2), y = c(3, 4),
                            object = EAT_model, scores_model = "BCC.OUT",
                            digits = 2, na.rm = TRUE)
efficiencyJitter(object = EAT_model, df_scores = EAT_scores,
                 scores_model = "BCC.OUT")
This function computes the efficiency scores for each DMU through a Random Forest + Efficiency Analysis Trees model and the Banker, Charnes and Cooper mathematical programming model with output orientation. The efficiency level is set at 1.
efficiencyRFEAT( data, x, y, object, digits = 3, FDH = TRUE, print.table = FALSE, na.rm = TRUE )
data |
Dataframe containing the variables in the model. |
x |
Column input indexes in data. |
y |
Column output indexes in data. |
object |
A RFEAT object. |
digits |
Decimal units for scores. |
FDH |
Logical. If TRUE, FDH scores are also computed. |
print.table |
Logical. If TRUE, a summary descriptive table of the efficiency scores is displayed. |
na.rm |
Logical. If TRUE, NA rows are omitted. |
A data.frame
with the efficiency scores computed through a Random Forest + Efficiency Analysis Trees model. Optionally, a summary descriptive table of the efficiency scores can be displayed.
simulated <- X2Y2.sim(N = 50, border = 0.2)
RFEAT_model <- RFEAT(data = simulated, x = c(1, 2), y = c(3, 4))
efficiencyRFEAT(data = simulated, x = c(1, 2), y = c(3, 4),
                object = RFEAT_model, digits = 2, FDH = TRUE, na.rm = TRUE)
This function gets the estimation of the response variable and updates Pareto-coordinates and the observation index for both new nodes.
estimEAT(data, leaves, t, xi, s, y)
data |
Data to be used. |
leaves |
List structure with leaf nodes or pending expansion nodes. |
t |
Node which is being split. |
xi |
Variable index that produces the split. |
s |
Value of xi variable that produces the split. |
y |
Column output indexes in data. |
Left and right children nodes.
This function displays a plot with the frontier estimated by Efficiency Analysis Trees in a scenario of one input and one output.
frontier( object, FDH = FALSE, observed.data = FALSE, observed.color = "black", pch = 19, size = 1, rwn = FALSE, max.overlaps = 10 )
object |
An EAT object. |
FDH |
Logical. If TRUE, the FDH frontier is also displayed. |
observed.data |
Logical. If TRUE, the observed DMUs are displayed. |
observed.color |
String. Color for observed DMUs. |
pch |
Integer. Point shape. |
size |
Integer. Point size. |
rwn |
Logical. If TRUE, rownames are displayed. |
max.overlaps |
Exclude text labels that overlap too many things. |
Plot with the estimated production frontier.
simulated <- Y1.sim(N = 50, nX = 1)
model <- EAT(data = simulated, x = 1, y = 2)
frontier <- frontier(object = model, FDH = TRUE,
                     observed.data = TRUE, rwn = TRUE)
plot(frontier)
This function splits the original data in two new data sets: a train set and a test set.
generateLv(data, fold)
data |
Data to be split into train and test subsets. |
fold |
Parts in which the original set is divided, to perform Cross-Validation. |
A list
structure with the train and the test set.
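A minimal sketch of such a split (illustrative only; generate_folds_sketch is a hypothetical name and the real function's return format may differ): shuffle the row indexes, cut them into fold roughly equal groups, and let each group serve once as the test set.

# Hypothetical sketch of a fold-based train/test split for cross-validation.
generate_folds_sketch <- function(data, fold) {
  n <- nrow(data)
  groups <- split(sample(seq_len(n)), rep(seq_len(fold), length.out = n))
  lapply(groups, function(test_idx) {
    list(train = data[-test_idx, , drop = FALSE],
         test  = data[test_idx, , drop = FALSE])
  })
}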
This function recalculates all the possible splits, with the exception of the one being used, and for each node and variable gets the best split based on their degree of importance.
imp_var_EAT(data, tree, x, y, digits)
data |
Data from EAT object. |
tree |
Tree from EAT object. |
x |
Column input indexes in data. |
y |
Column output indexes in data. |
digits |
Decimal units. |
A dataframe with the best split for each node and its variable importance.
Variable Importance through Random Forest + Efficiency Analysis Trees.
imp_var_RFEAT(object, digits = 2)
object |
A RFEAT object. |
digits |
Decimal units. |
Vector of input importance scores.
This function evaluates a node and checks if it fulfills the conditions to be a final node.
isFinalNode(obs, data, numStop)
obs |
Observation in the evaluated node. |
data |
Data with predictive variable. |
numStop |
Minimum number of observations in a node to be split. |
True if the node is a final node and false in any other case.
This function modifies the coordinates of the nodes in the plotEAT function to overcome overlapping.
layout(py)
py |
A party object. |
Dataframe with suitable modifications of the node layout.
This function evaluates the importance of each predictor by the notion of surrogate splits.
M_Breiman(object, digits)
object |
An EAT object. |
digits |
Decimal units. |
Dataframe with one column and the importance of each variable in rows.
This function calculates the Mean Square Error between the predicted value and the observations in a given node.
mse(data, t, y)
data |
Data to be used. |
t |
A given node. |
y |
Column output indexes in data. |
Mean Square Error at a node.
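For a single output, the computation reduces to the following sketch (argument names are illustrative, not the function's actual signature):

# Mean squared error between a node's observed outputs and its prediction.
mse_sketch <- function(y_obs, y_pred) {
  mean((y_obs - y_pred)^2)
}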
This function randomly selects the variables that are evaluated to divide a node and removes those that do not present variability.
mtry_inputSelection(data, x, t, mtry)
data |
Dataframe containing the variables in the model. |
x |
Column input indexes in data. |
t |
Node which is being split. |
mtry |
Number of inputs selected for a node to be split. |
Index of the variables by which the node is divided.
A dataset containing the PISA score in mathematics, reading and science and 13 variables related to the social index by country for 2018.
PISAindex
A data frame with 72 rows and 18 variables:
Country name
Country continent
PISA score in Science
PISA score in Reading
PISA score in Mathematics
Nutritional and Basic Medical Care
Water and Sanitation
Shelter
Personal Safety
Access to Basic Knowledge
Access to Information and Communication
Health and Wellness
Environmental Quality
Personal Rights
Personal Freedom and Choice
Inclusiveness
Access to Advanced Education
Gross Domestic Product per capita adjusted by purchasing power parity
https://www.socialprogress.org/
https://www.oecd.org/pisa/Combined_Executive_Summaries_PISA_2018.pdf
Plot a tree-structure for an Efficiency Analysis Trees model.
plotEAT(object)
object |
An EAT object. |
Plot object with the following elements for each node:
id: node index.
R: error at the node.
n(t): number of observations at the node.
an input name: splitting variable.
y: output prediction.
simulated <- X2Y2.sim(N = 50, border = 0.2)
EAT_model <- EAT(data = simulated, x = c(1, 2), y = c(3, 4))
plotEAT(EAT_model)
Plot a graph with the Out-of-Bag error for a forest consisting of m trees.
plotRFEAT(object)
object |
A RFEAT object. |
Line plot with the OOB error and the number of trees in the forest.
simulated <- Y1.sim(N = 150, nX = 6)
RFmodel <- RFEAT(data = simulated, x = 1:6, y = 7, numStop = 10,
                 m = 50, s_mtry = "BRM", na.rm = TRUE)
plotRFEAT(RFmodel)
This function finds the node where a register is located.
posIdNode(tree, idNode)
tree |
A list containing EAT nodes. |
idNode |
Id of a specific node. |
Position of the node or -1 if it is not found.
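Conceptually this is a linear search over the node list, sketched below with a hypothetical node layout in which each node is a list carrying an id field:

# Return the position of the node with the given id, or -1 if absent.
pos_id_sketch <- function(tree, idNode) {
  for (i in seq_along(tree)) {
    if (tree[[i]]$id == idNode) return(i)
  }
  -1
}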
This function predicts the expected output by an EAT
object.
## S3 method for class 'EAT' predict(object, newdata, x, ...)
object |
An EAT object. |
newdata |
Dataframe with the new observations to be predicted. |
x |
Inputs index. |
... |
further arguments passed to or from other methods. |
data.frame
with the original data and the predicted values.
simulated <- X2Y2.sim(N = 50, border = 0.2)
EAT_model <- EAT(data = simulated, x = c(1, 2), y = c(3, 4))
predict(object = EAT_model, newdata = simulated, x = c(1, 2))
This function predicts the expected output by a RFEAT
object.
## S3 method for class 'RFEAT' predict(object, newdata, x, ...)
object |
A RFEAT object. |
newdata |
Dataframe with the new observations to be predicted. |
x |
Inputs index. |
... |
further arguments passed to or from other methods. |
data.frame
with the original data and the predicted values.
simulated <- X2Y2.sim(N = 50, border = 0.2)
RFEAT_model <- RFEAT(data = simulated, x = c(1, 2), y = c(3, 4))
predict(object = RFEAT_model, newdata = simulated, x = c(1, 2))
This function predicts the expected output by a Free Disposal Hull model.
predictFDH(data, x, y)
data |
Dataframe or matrix containing the variables in the model. |
x |
Vector. Column input indexes in data. |
y |
Vector. Column output indexes in data. |
Data frame with the original data and the predicted values through a Free Disposal Hull model.
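For a single output, the FDH prediction rule can be sketched as: the frontier estimate for each unit is the largest observed output among units that use no more of every input (free disposability). The code below is an illustration of that rule, not the package's implementation:

# X: input matrix (one row per DMU); y: single-output vector.
predict_fdh_sketch <- function(X, y) {
  sapply(seq_len(nrow(X)), function(k) {
    # units whose inputs are all <= unit k's inputs
    dominated <- apply(X, 1, function(xi) all(xi <= X[k, ]))
    max(y[dominated])
  })
}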
This function predicts the expected value based on a set of inputs.
predictor(tree, register)
tree |
A list containing the EAT nodes. |
register |
Set of independent values. |
The expected value of the dependent variable based on the given register.
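The traversal can be sketched as follows, assuming a hypothetical node layout in which each node stores an id, a splitting variable xi, a threshold s, child ids SL and SR, a prediction y, and SL == -1 marks a leaf (the package's actual node fields may differ):

# Descend from the root to a leaf following the splits, then return
# the leaf prediction (hypothetical sketch).
predictor_sketch <- function(tree, register) {
  node <- tree[[1]]
  while (node$SL != -1) {                         # descend until a leaf
    next_id <- if (register[node$xi] < node$s) node$SL else node$SR
    pos <- which(sapply(tree, function(t) t$id) == next_id)
    node <- tree[[pos]]
  }
  node$y                                          # leaf prediction
}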
This function arranges the data in the required format and displays error messages.
preProcess( data, x, y, numStop = 5, fold = 5, max.depth = NULL, max.leaves = NULL, na.rm = TRUE )
data |
Dataframe containing the variables in the model. |
x |
Column input indexes in data. |
y |
Column output indexes in data. |
numStop |
Minimum number of observations in a node for a split to be attempted. |
fold |
Number of folds into which the dataset is divided to apply cross-validation during the pruning. |
max.depth |
Maximum depth of the tree. |
max.leaves |
Maximum number of leaf nodes. |
na.rm |
Logical. If TRUE, NA rows are omitted. |
It returns a data.frame
in the required format.
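Although preProcess is mostly invoked internally by the model-fitting functions, a minimal sketch of a direct call could look as follows (assuming the eat package is loaded):

```r
library(eat)

# Arrange simulated data in the required format; NA rows are dropped when na.rm = TRUE
simulated <- X2Y2.sim(N = 50, border = 0.2)
clean_data <- preProcess(data = simulated, x = c(1, 2), y = c(3, 4), na.rm = TRUE)
```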
This function builds an individual tree for Random Forest.
RandomEAT(data, x, y, numStop, s_mtry)
data |
Dataframe containing the training set. |
x |
Vector. Column input indexes in data. |
y |
Vector. Column output indexes in data. |
numStop |
Integer. Minimum number of observations in a node for a split to be attempted. |
s_mtry |
Number of variables randomly sampled as candidates at each split. The available options are: |
List of m trees in forest and the error that will be used in the ranking of the importance of the variables.
This function computes the variable importance through an Efficiency Analysis Trees model.
rankingEAT(object, barplot = TRUE, threshold = 70, digits = 2)
object |
An EAT object. |
barplot |
Logical. If TRUE, a barplot of the importance scores is displayed. |
threshold |
Importance score value in which a line is graphed. |
digits |
Decimal units. |
data.frame with the importance scores and a barplot representing the variable importance if barplot = TRUE.
simulated <- X2Y2.sim(N = 50, border = 0.2)
EAT_model <- EAT(data = simulated, x = c(1, 2), y = c(3, 4))
rankingEAT(object = EAT_model, barplot = TRUE, threshold = 70, digits = 2)
This function calculates variable importance through a Random Forest + Efficiency Analysis Trees model.
rankingRFEAT(object, barplot = TRUE, digits = 2)
object |
An RFEAT object. |
barplot |
Logical. If TRUE, a barplot of the importance scores is displayed. |
digits |
Decimal units. |
data.frame with the importance scores and a barplot representing the variable importance if barplot = TRUE.
simulated <- X2Y2.sim(N = 50, border = 0.2)
RFEAT_model <- RFEAT(data = simulated, x = c(1, 2), y = c(3, 4))
rankingRFEAT(object = RFEAT_model, barplot = TRUE, digits = 2)
This function computes the error of a branch as the sum of the errors of its child nodes.
RBranch(t, tree)
t |
Node to be pruned. |
tree |
A list structure containing the tree nodes. |
A list
containing (1) the sum of the errors of the child nodes of the pruned node and (2) the total number of leaf nodes that come from it.
RCV
RCV(N, Lv, y, alphaIprim, fold, TAiv)
N |
Number of rows in data. |
Lv |
Test set. |
y |
Column output indexes in data. |
alphaIprim |
Alpha obtained as the square root of the product of two consecutive alpha values in tree_alpha list. It is used to find the best pruning tree. |
fold |
Parts into which the original data is divided to perform Cross-Validation. |
TAiv |
List with each possible pruning for the deep tree generated with the train set and its associated alpha values. |
Set of best pruning and the associated error calculated with test sets.
This function predicts the expected value based on a set of inputs.
RF_predictor(forest, xn)
forest |
List containing the individual EAT models in the forest. |
xn |
Row indexes in data. |
Vector of predictions.
This function builds m
individual Efficiency Analysis Trees in a forest structure.
RFEAT(data, x, y, numStop = 5, m = 50, s_mtry = "BRM", na.rm = TRUE)
data |
Dataframe or matrix containing the variables in the model. |
x |
Column input indexes in data. |
y |
Column output indexes in data. |
numStop |
Minimum number of observations in a node for a split to be attempted. |
m |
Number of trees to be built. |
s_mtry |
Number of variables randomly sampled as candidates at each split. The available options are:
|
na.rm |
Logical. If TRUE, NA rows are omitted. |
An RFEAT object containing:

- data: a list with:
  - df: data frame containing the variables in the model.
  - x: input indexes in data.
  - y: output indexes in data.
  - input_names: input variable names.
  - output_names: output variable names.
  - row_names: rownames in data.
- control: a list with:
  - numStop: numStop hyperparameter value.
  - m: m hyperparameter value.
  - s_mtry: s_mtry hyperparameter value.
  - na.rm: na.rm hyperparameter value.
- forest: list structure containing the individual EAT models.
- error: Out-of-Bag error of the forest.
- OOB: list containing the Out-of-Bag set for each tree.
simulated <- X2Y2.sim(N = 50, border = 0.1)
RFmodel <- RFEAT(data = simulated, x = c(1, 2), y = c(3, 4),
                 numStop = 5, m = 50, s_mtry = "BRM", na.rm = TRUE)
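The components of the returned object can be inspected with the usual $ operator; a brief sketch:

```r
library(eat)

simulated <- X2Y2.sim(N = 50, border = 0.1)
RFmodel <- RFEAT(data = simulated, x = c(1, 2), y = c(3, 4),
                 numStop = 5, m = 50, s_mtry = "BRM", na.rm = TRUE)

RFmodel$error           # Out-of-Bag error of the forest
length(RFmodel$forest)  # number of individual EAT trees (m)
```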
This function saves information about the Random Forest for Efficiency Analysis Trees model.
RFEAT_object(data, x, y, rownames, numStop, m, s_mtry, na.rm, forest, error, OOB)
data |
Dataframe or matrix containing the variables in the model. |
x |
Column input indexes in data. |
y |
Column output indexes in data. |
rownames |
Rownames in data. |
numStop |
Minimum number of observations in a node for a split to be attempted. |
m |
Number of trees to be built. |
s_mtry |
Select number of inputs in each split.
|
na.rm |
Logical. If TRUE, NA rows are omitted. |
forest |
List containing the individual EAT models. |
error |
Error in Random Forest for Efficiency Analysis Trees. |
OOB |
List containing the Out-of-Bag set for each tree. |
An RFEAT object.
This function calculates the score for each pruning of tree_alpha_list.
scores(N, Lv_notLv, x, y, fold, numStop, Tk, tree_alpha_list)
N |
Number of rows in data. |
Lv_notLv |
List with train and test sets. |
x |
Column input indexes in data. |
y |
Column output indexes in data. |
fold |
Parts into which the original data set is divided to perform Cross-Validation. |
numStop |
Minimum number of observations on a node to be split. |
Tk |
Best pruned tree. |
tree_alpha_list |
List with all the possible pruning and its associated alpha. |
List with the best pruning for each fold, the pruning with the lowest score, and tree_alpha_list with its scores updated.
This function selects the number of inputs for a split in Random Forest.
select_mtry(s_mtry, t, nX, nY)
s_mtry |
Select number of inputs. It could be: |
t |
Node which is being split. |
nX |
Number of inputs in data. |
nY |
Number of outputs in data. |
Number of inputs selected according to the specified rule.
This function tries to find a new pruned tree with a shorter length and a score within the range generated by SE.
selectTk(Tk, tree_alpha_list, SE)
Tk |
Best pruned tree. |
tree_alpha_list |
List with all the possible pruning and its associated alpha and scores. |
SE |
Value used to obtain a range in which new prunings are found. |
The same best tree or a new suitable one.
Based on Validation tests over BestTivs, a new range of scores is obtained to find new pruned trees.
SERules(N, Lv, y, fold, Tk_score, BestTivs)
N |
Number of rows in data. |
Lv |
Test set. |
y |
Column output indexes in data. |
fold |
Parts into which the original data set is divided to perform Cross-Validation. |
Tk_score |
Best pruned tree score. |
BestTivs |
List of best pruned trees for each training set. |
Value used to obtain a range in which new prunings are found.
This function gets the variable and split value to be used in estimEAT, selects the best split and updates VarInfo, node indexes and leaves list.
split(data, tree, leaves, t, x, y, numStop)
data |
Data to be used. |
tree |
List structure with the tree nodes. |
leaves |
List with leaf nodes or pending expansion nodes. |
t |
Node which is being split. |
x |
Column input indexes in data. |
y |
Column output indexes in data. |
numStop |
Minimum number of observations in a node to be split. |
Leaves and tree lists updated with the new child nodes.
This function gets the variable and split value to be used in estimEAT, selects the best split, node indexes and leaf list.
split_forest(data, tree, leaves, t, x, y, numStop, arrayK)
data |
Data to be used. |
tree |
List structure with the tree nodes. |
leaves |
List with leaf nodes or pending expansion nodes. |
t |
Node which is being split. |
x |
Column input indexes in data. |
y |
Column output indexes in data. |
numStop |
Minimum number of observations on a node to be split. |
arrayK |
Column input indexes in data selected by s_mtry. |
Leaves and tree lists updated with the new child nodes.
This function generates a deep EAT and all its prunings for each train set.
treesForRCV(notLv, x, y, fold, numStop)
notLv |
Train set. |
x |
Column input indexes in data. |
y |
Column output indexes in data. |
fold |
Parts into which the original set is divided to perform Cross-Validation. |
numStop |
Minimum number of observations in a node to be split. |
List with each possible pruning for the deep tree generated with the train set and its associated alpha values.
This function is used to simulate the data in a scenario with 2 inputs and 2 outputs.
X2Y2.sim(N, border, noise = NULL)
N |
Sample size. |
border |
Percentage of DMUs in the frontier. |
noise |
Random noise. |
data.frame
with simulated data.
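A minimal sketch of generating data under this scenario (assuming the eat package is loaded):

```r
library(eat)

# 50 DMUs with 2 inputs and 2 outputs; 20% of them on the frontier
simulated <- X2Y2.sim(N = 50, border = 0.2)
head(simulated)
```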
This function is used to simulate the data in a single output scenario.
Y1.sim(N, nX)
N |
Sample size. |
nX |
Number of inputs. |
data.frame
with simulated data.
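A minimal sketch for the single-output scenario (assuming the eat package is loaded):

```r
library(eat)

# 100 DMUs in a single-output scenario with 3 inputs
simulated <- Y1.sim(N = 100, nX = 3)
head(simulated)
```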