This study examines garment factory productivity using daily team-level data. The central question is whether teams achieve the productivity targets set by management and how incentives, overtime, staffing, and workflow interact with this outcome. We clean and prepare the dataset, summarize key structural patterns, and then move to formal modeling. The goal is not only predictive accuracy, but also interpretability—so that findings can inform real-world management decisions.
We began by preparing a model-ready dataset that preserves operational meaning while correcting clear inconsistencies.
Definition of \(Y\).
We define achievement of the daily target as
\[ Y = \begin{cases} 1, & \text{if } \text{Actual Productivity} \geq \text{Target Productivity}, \\ 0, & \text{otherwise}. \end{cases} \]
Both actual and target productivity are rounded to two decimals before comparison. This avoids spurious cases caused by tiny rounding differences and mirrors how managers evaluate performance in practice.
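As a minimal sketch, the outcome can be constructed as below; the column names are assumptions about the raw data and may differ from the actual file.

```r
# Construct the binary target; column names are illustrative assumptions.
df$y <- as.integer(round(df$actual_productivity, 2) >=
                   round(df$targeted_productivity, 2))
```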
Date strings were trimmed with trimws() and parsed with as.POSIXct() to ensure consistent ordering and reproducibility. The misspelled department label "sweing" was corrected to "sewing". We retain department as a labeled factor for tables and plots, and also use a binary coding (sewing = 0, finishing = 1) when simpler interpretation is needed. After these adjustments, the dataset is ready for modeling.
Assumptions and risks.
The correction of 0.07 to 0.70 assumes a data entry error. Rounding worker counts sacrifices some detail on partial shifts but improves clarity. The wip_sewing variable (work-in-progress, set to zero for finishing lines) enforces a zero WIP effect outside sewing. All of these choices were checked against model fit and can be revisited in sensitivity analyses.
Departments. The two departments differ structurally in both task complexity and responsiveness to management incentives (Fig. A1). Sewing operates with higher task times (average SMV ≈ 23 minutes) and larger crews, while Finishing handles shorter tasks (average SMV ≈ 4 minutes) with smaller groups. Incentives play a central role in Sewing: once even modest bonuses are introduced, most sewing teams consistently achieve their productivity targets. Finishing, by contrast, records little incentive use, and performance remains below benchmark levels regardless of bonuses. This divergence reflects not only differences in task complexity but also the downstream position of Finishing, which depends on upstream sewing output.
Teams. Within departments, variation across teams is
pronounced (Figs. A2–A3).
- High performers: Teams 1, 3, 4, and 12 consistently
achieve targets. Teams 1 and 12 stand out for maintaining strong
performance without overtime, indicating efficient baseline
organization.
- Mid-level performers: Teams 2, 5, 9, 10, and 11
cluster in the middle. Their productivity often fluctuates with
incentives and overtime, with moderate overtime sometimes reducing
output before improvements appear at extreme levels.
- Underperformers: Teams 6, 7, and 8 lag behind across
all dimensions. Team 6, for example, operates with a smaller headcount
(≈35 workers versus the norm of 55–58), while Teams 7 and 8 show little
productivity response even with overtime or incentives applied.
Operational indicators reinforce these patterns. Work-in-progress varies widely across sewing teams, from about 800 to 1,600 units (Fig. A4), with Teams 1 and 10 carrying particularly heavy loads. Headcounts are uneven as well, amplifying performance gaps between teams.
Summary. Departmental differences highlight structural constraints (sewing is responsive to incentives, finishing less so), while team-level analysis reveals that sustained performance depends less on overtime and more on staffing balance and workflow organization.
This section explains how we identified a set of predictors for the binary outcome \(Y\) (achievement of the daily productivity target). We combined shrinkage methods (lasso, elastic net), subset selection (via L0Learn), and manual specification to balance statistical fit with managerial interpretability.
We began by applying lasso and elastic net regularization to the full design matrix. Both methods shrink coefficients toward zero, helping control multicollinearity while selecting the most relevant predictors. We obtained two sets of results depending on the choice of shrinkage parameter λ: one at the error-minimizing value and one at the more conservative 1-SE value.
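A minimal sketch of this step is shown below; it assumes a numeric design matrix X and a 0/1 outcome vector y already built from the cleaned data (object names are placeholders, not the exact code used in the analysis).

```r
library(glmnet)

set.seed(1)
# Lasso (alpha = 1) and elastic net (alpha = 0.5) with binomial loss
cv_lasso <- cv.glmnet(X, y, family = "binomial", alpha = 1)
cv_enet  <- cv.glmnet(X, y, family = "binomial", alpha = 0.5)

# Selected coefficients at the error-minimizing lambda and at the 1-SE lambda
coef(cv_lasso, s = "lambda.min")
coef(cv_lasso, s = "lambda.1se")
```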
To complement the shrinkage methods, we used the L0Learn package to perform best subset selection under logistic loss. Unlike lasso or elastic net, which rely on \(\ell_1\) or mixed penalties, L0Learn directly approximates the classic best subset problem by applying an \(\ell_0\) penalty, effectively limiting the number of active predictors. We ran its cross-validated fitting routine, which evaluates predictive performance across held-out folds while progressively varying the subset size.
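A sketch of this step, assuming the same X and y as above, might look like the following; L0Learn.cvfit() is the package's cross-validated fitting routine, and the fold count and maximum support size shown are arbitrary illustrative choices.

```r
library(L0Learn)

set.seed(1)
y_pm <- ifelse(y == 1, 1, -1)   # L0Learn's logistic loss expects labels coded -1/+1
cv_l0 <- L0Learn.cvfit(x = X, y = y_pm, loss = "Logistic", penalty = "L0",
                       nFolds = 5, maxSuppSize = 15)
plot(cv_l0)    # cross-validation error along the solution path
print(cv_l0)   # support sizes and lambda values of the fitted path
```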
The cross-validation curves showed that models with only a few active variables already achieved near-optimal predictive accuracy. Across the solution path, department and team 8 consistently entered early and remained stable, marking them as the most robust predictors. Additional variables (style changes, incentives, temporal dummies) appeared in larger subsets but with less stability. This ranking closely mirrored the results from lasso and elastic net, strengthening our confidence in the final variable selection.
These results suggest a clear baseline: department and team 8 form the core explanatory set across all methods when prioritizing robustness (1-SE). The additional predictors identified under the best λ are informative but less stable, and will be revisited in exploratory specifications to test whether they materially improve performance beyond the core model.
Starting from the predictors agreed on by the shrinkage methods, we built a series of logistic regression models to test how different predictor blocks behaved. In particular, we examined interaction terms such as department × workers and team × incentives, which highlighted that certain teams reacted differently to style shocks or incentive pay. Based on these steps, we defined a portfolio of final models, each serving a different purpose: best, best_optimized, interpretable, interpretable_log, and parsimonious specifications (see Tables 2–3).
Together these models represent the trade-off between predictive accuracy and interpretability. The interpretable specification will be used for managerial insights, while the best-fit versions are retained for statistical benchmarking. Comparative AIC/BIC values and likelihood ratio tests confirmed that while complex models offer marginally better fit, simpler models retain most explanatory power with clearer interpretation.
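For illustration, a specification in the spirit of the interpretable model could be written as below; the variable names and exact terms are assumptions, not the final formulas used in the report.

```r
# Hypothetical interpretable logit specification (names and terms illustrative)
fit_interp <- glm(y ~ department * no_of_workers + team * incentive + no_of_style_change,
                  family = binomial, data = model_df)
summary(fit_interp)
AIC(fit_interp); BIC(fit_interp)   # for comparison across the model portfolio
```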
This section evaluates the predictive performance of several classification methods for the daily productivity target variable \(Y\). To ensure comparability, all models were tested under the same repeated cross-validation scheme, with results summarized both numerically and graphically.
For each candidate model we implemented stratified \(k\)-fold cross-validation, repeated multiple times to reduce variance. The evaluation metric was the mean classification error rate, with model stability assessed across folds and repetitions. To accommodate different model families, two functions were developed: one for generalized linear and discriminant models (logit, LDA, RDA) and another tailored for k-nearest neighbors (kNN).
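A hypothetical call to these helpers (Functions F1 and F2 in the appendix) might look like the following; the formula list and data object names are placeholders.

```r
# Formula list for the logit/LDA/RDA helper; list names label the output columns
model_formulas <- list(interpretable = y ~ department + team + incentive,
                       parsimonious  = y ~ department + team)

cv_logit <- cv.iteration_fn(formulas = model_formulas, model_spec = "LOGIT",
                            data = model_df, s = 5, r = 50)
cv_knn   <- cv.iteration.knn_fn(dataframes = list(knn = knn_df), k = 5, s = 5, r = 20)

colMeans(cv_logit)   # mean CV error per specification
colMeans(cv_knn)
```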
Figure outputs (see Appendix B, Fig. B1) display the fold-wise cross-validation errors for each model family. Across repetitions, logistic regression and LDA performed best, with nearly identical mean error rates around 19.5% (accuracy ≈ 80%). RDA trailed slightly at 19.8% error, while kNN was clearly weaker at 26.2% error (accuracy ≈ 74%).
| Model | Accuracy | Error | Specification |
|---|---|---|---|
| LOGIT | 0.805 | 0.194 | best_optimized |
| LDA | 0.805 | 0.194 | best_optimized |
| RDA | 0.801 | 0.198 | parsimonious |
| kNN | 0.737 | 0.262 | best |
Table 1. Mean cross-validation error over 1000 Monte Carlo repetitions for LOGIT, LDA, RDA, and kNN models.
The comparison shows that parametric methods (LOGIT and LDA) are most effective for this dataset, achieving both strong predictive performance and interpretability. RDA offered no additional benefit over LDA, while kNN underperformed even with extensive preprocessing. Based on these results, subsequent analysis emphasizes the logistic regression framework, as it balances statistical efficiency with explanatory value for managerial recommendations.
In this section we translate the regression results into predicted probabilities under controlled scenarios. Using the interpretable model specification, we generated a baseline scenario and then varied one factor at a time to visualize its marginal effect. Full figures are provided in Appendix C (Fig. C1).
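A minimal sketch of the scenario construction, reusing the hypothetical fit_interp from above, is shown below; the baseline values are illustrative, not the ones used in the report.

```r
# Baseline scenario, then vary one factor (here: incentive) while holding others fixed
baseline <- data.frame(department = "sewing",
                       team = factor(1, levels = levels(model_df$team)),
                       no_of_workers = 57, incentive = 30, no_of_style_change = 0)
predict(fit_interp, newdata = baseline, type = "response")

incentive_grid <- transform(baseline[rep(1, 21), ], incentive = seq(0, 100, by = 5))
plot(incentive_grid$incentive,
     predict(fit_interp, newdata = incentive_grid, type = "response"),
     type = "l", xlab = "Incentive", ylab = "P(target achieved)")
```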
At the baseline configuration, the probability of achieving the daily target was estimated at ~0.92. Deviations from this baseline reveal important contrasts. Certain teams (notably Teams 6, 8, 10, and 11) are consistently below the benchmark, while others (Teams 1, 3, 4, 12) remain at or above it. Calendar effects are weaker: Sundays and Thursdays carry a slightly lower probability, suggesting modest fatigue, but the difference is small relative to team-level heterogeneity. Departments diverge, with Sewing consistently outperforming Finishing.
Continuous variables highlight operational thresholds. Incentives show a steep increase in the probability of success at low levels, with diminishing returns beyond ~50 units; after this point, additional incentives add little. Work-in-progress (WIP) displays a similar pattern: once ~2,000 units are in progress, additional WIP does not meaningfully change the likelihood of meeting targets.
Interactions reveal where risks are concentrated. Larger teams in the Finishing department gain disproportionately from additional workers, reflecting tighter staffing constraints. By contrast, several weaker teams respond poorly to incentives, with probabilities declining as bonuses rise (Team 2 and Team 10 are notable examples). Style changes amplify these effects: Team 8 responds well to style changes, while Team 10’s probability of success drops. Such asymmetric responses suggest that team-level dynamics drive much of the observed variation.
Conclusion.
The probability analysis confirms that Sewing outperforms Finishing,
small incentives are highly effective, and bottlenecks are largely
structural. Interactions underscore that underperforming teams require
tailored interventions because overtime or incentive policies are
insufficient. The detailed probability plots can be found in Appendix C (Fig. C1).
Cross–validation provides an unbiased estimate of model error under repeated partitioning of data, but it does not quantify the sampling variability of those estimates. To address this, we applied non-parametric bootstrap resampling to our key models. The idea is straightforward: instead of relying on a single training–test split or even repeated folds, we resample the dataset with replacement many times, re-fit the model, and record the error each time. This gives an empirical distribution of the error rate, from which confidence intervals can be constructed.
Formally, let \(\hat{\theta}\) denote the cross-validated error rate for a given model. Bootstrapping generates resampled datasets \(\{D^*_1,\dots,D^*_R\}\) of the same size as the original data. Each resample yields \(\hat{\theta}^*_r\), and the empirical distribution of \(\hat{\theta}^*_r\) approximates the sampling distribution of \(\hat{\theta}\). We then compute percentile confidence intervals to reflect the expected range of error rates if the study were replicated on new data from the same process.
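Operationally, this can be done with the boot package together with the helper in Appendix F (Function F3); the wrapper below adapts boot.fn to boot()'s (data, indices) calling convention, with the formula and data names as placeholders.

```r
library(boot)

set.seed(1)
boot_logit <- boot(data = model_df,
                   statistic = function(d, i) boot.fn(formula = y ~ department + team,
                                                      data = d, idx = i,
                                                      model_spec = "LOGIT", s = 5),
                   R = 1000)
mean(boot_logit$t)                        # bootstrap mean error
quantile(boot_logit$t, c(0.025, 0.975))   # 95% percentile interval
```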
The bootstrap distributions confirmed that error rates are stable across repeated samples. For the LOGIT models, the best_optimized specification produced an average error of ~0.190 with a narrow 95% percentile interval, closely aligned with its cross-validation estimate of 0.192. Similarly, the best_optimized LDA model yielded an error of ~0.190, with overlapping bootstrap and CV values. Parsimonious models showed wider intervals, reflecting higher variance when too few predictors are retained.
A compact comparison is shown in Appendix D (Fig. D1). The black dot marks the bootstrap mean, the thin line shows the 95% bootstrap percentile interval, and the open blue circle indicates the cross-validation mean. The fact that the blue markers lie well within the bootstrap intervals reassures us that cross-validation error estimates are not optimistic and that the models generalize consistently.
To complement the graph, Table 2 reports cross-validation error, bootstrap mean error, and 95% percentile intervals for each model. These results confirm that the bootstrap distribution centers closely around the cross-validation estimate, with only small variability across resamples.
| Model | CV_Error | Bootstrap_Mean | CI_Lower | CI_Upper |
|---|---|---|---|---|
| LOGIT – Best | 0.193 | 0.189 | 0.166 | 0.212 |
| LOGIT – Best Opt. | 0.194 | 0.190 | 0.167 | 0.213 |
| LOGIT – Interp. | 0.227 | 0.212 | 0.186 | 0.236 |
| LOGIT – Interp. Log | 0.210 | 0.201 | 0.179 | 0.225 |
| LOGIT – Parsim. | 0.241 | 0.241 | 0.229 | 0.256 |
| LDA – Best | 0.193 | 0.189 | 0.168 | 0.212 |
| LDA – Best Opt. | 0.194 | 0.190 | 0.167 | 0.215 |
| LDA – Interp. | 0.227 | 0.222 | 0.201 | 0.241 |
| LDA – Interp. Log | 0.210 | 0.205 | 0.182 | 0.229 |
| LDA – Parsim. | 0.241 | 0.241 | 0.229 | 0.254 |
Table 2. Cross-validation error and bootstrap confidence intervals for LOGIT and LDA models (1000 bootstrap replications).
Bootstrapping strengthens our results in three ways: it quantifies the sampling variability of the error estimates, it confirms that the cross-validation means are not optimistic, and it shows that the ranking of models is stable across resamples.
It is important to note, however, that we applied the ordinary bootstrap. This method resamples the same observations repeatedly, which means that some data points may appear multiple times in a given resample while others are omitted. This reduces the effective variability of the samples and can understate the true uncertainty of model error. For this reason, bootstrap results should not be taken as final evidence on their own. Our primary validation remains the cross-validation procedure repeated 1000 times, which provides a more reliable baseline. The fact that bootstrap means and CV results closely match in this study supports the robustness of our findings, but we explicitly acknowledge the drawback of the ordinary bootstrap approach.
Receiver Operating Characteristic (ROC) analysis provides a complementary perspective to error rates by focusing on the trade-off between sensitivity (true positive rate) and specificity (true negative rate). The Area Under the Curve (AUC) serves as a scalar summary: values closer to 1 indicate strong discriminatory power, while values near 0.5 suggest random guessing.
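For reference, the basic ROC/AUC computation with pROC looks like this (object names are placeholders; the full cross-validated version is Function F5 in the appendix).

```r
library(pROC)

# probs: predicted probabilities on held-out data; test_y: observed 0/1 outcomes
roc_obj <- roc(response = test_y, predictor = probs)
auc(roc_obj)                        # scalar AUC summary
plot(roc_obj, legacy.axes = TRUE)   # sensitivity vs. 1 - specificity
```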
Cross–validated ROC curves (Fig. E1) show that both Logistic Regression (LOGIT) and Linear Discriminant Analysis (LDA) models achieve consistently high AUC values across specifications. For LOGIT, the best and best_optimized models achieved mean AUCs of 0.856 and 0.858, respectively, while the interpretable variants were slightly lower (0.808 for interpretable and 0.832 for interpretable_log). The parsimonious model lagged behind, with an AUC of 0.682. LDA results closely mirrored this pattern, with best and best_optimized models yielding AUCs of 0.856 and 0.858, respectively, interpretable and interpretable_log models scored 0.776 and 0.828, and the parsimonious model saw no improvement, remaining at 0.682.
Bootstrap confidence intervals for AUC (Fig. E2)
further confirmed these findings. For LOGIT, the best_optimized
model had a mean bootstrap AUC of 0.840 with a 95% interval of [0.785,
0.888], overlapping almost perfectly with the cross–validation mean. The
parsimonious specification was substantially weaker, with a
mean AUC of 0.699 [0.629, 0.764].
Similarly, for LDA, the best_optimized model recorded a mean
bootstrap AUC of 0.828 [0.771, 0.882], while the parsimonious
model again fell to 0.700 [0.634, 0.763].
Beyond overall AUC values, bootstrap confidence bands for the ROC curves (Fig. E3) highlight how performance varies across the false positive rate spectrum. The best and best_optimized models maintain high sensitivity across most thresholds, while the parsimonious models display wider uncertainty bands and weaker separation from the diagonal, confirming lower classification scores.
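Bootstrap confidence bands of this kind can be obtained, for example, with pROC's ci.se(); the report's own bands may have been computed differently, so the sketch below is only illustrative (the specificity grid and boot.n are arbitrary choices).

```r
# Illustrative bootstrap confidence band for a single ROC curve (pROC)
band <- pROC::ci.se(roc_obj, specificities = seq(0, 1, by = 0.05), boot.n = 2000)
plot(roc_obj, legacy.axes = TRUE)
plot(band, type = "shape", col = rgb(0, 0, 1, alpha = 0.2))  # shaded 95% band
```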
A compact summary of AUC estimates is shown in Table 3.

| Model | CV_AUC | Bootstrap_Mean | CI_Lower | CI_Upper |
|---|---|---|---|---|
| LOGIT – Best | 0.856 | 0.838 | 0.778 | 0.886 |
| LOGIT – Optimized | 0.858 | 0.840 | 0.785 | 0.888 |
| LOGIT – Interp. | 0.808 | 0.805 | 0.744 | 0.863 |
| LOGIT – Interp._log | 0.832 | 0.828 | 0.770 | 0.879 |
| LOGIT – Parsim. | 0.682 | 0.699 | 0.629 | 0.764 |
| LDA – Best | 0.856 | 0.826 | 0.765 | 0.880 |
| LDA – Optimized | 0.858 | 0.828 | 0.771 | 0.882 |
| LDA – Interp. | 0.776 | 0.785 | 0.713 | 0.848 |
| LDA – Interp._log | 0.828 | 0.822 | 0.761 | 0.875 |
| LDA – Parsim. | 0.682 | 0.700 | 0.634 | 0.763 |
Table 3. Cross–validated AUC and bootstrap percentile confidence intervals for LOGIT and LDA models (2000 bootstrap replications).
The AUC results reinforce earlier findings from the error-rate analysis. The best and best_optimized models achieve the highest discrimination, while interpretable variants remain competitive and the parsimonious specification underperforms. Importantly, bootstrap confidence intervals closely overlap with cross-validation means, strengthening our confidence that the CV estimates are reliable and not optimistic.
At the same time, ROC bootstrap intervals highlight the cost of model simplicity: interpretable and parsimonious specifications carry visibly wider uncertainty bands, confirming greater variability in classification ability. This illustrates the trade-off between interpretability and performance, consistent with the patterns observed in error rates.
Finally, it is worth noting that the bootstrap approach employed here is the ordinary bootstrap, meaning that resamples can include repeated observations from the original dataset. This limits its quality compared to stratified or out-of-bag resampling schemes. For this reason, we first relied on extensive cross-validation (1000 iterations) to establish baseline estimates. The close agreement between bootstrap and cross-validation results gives us additional confidence, but this caveat is worth keeping in mind.
Figures of ROC curves, bootstrap AUC intervals, and confidence bands are provided in Appendix E (Figs. E1–E3).
This study combined rigorous statistical modeling with practical interpretation to evaluate the drivers of productivity in a garment factory. The binary outcome was analyzed through a multi-stage process: data cleaning, structural exploration, variable selection via shrinkage and subset methods, logistic regression modeling, probability estimation, and validation using bootstrap and ROC analysis. Each step was chosen to ensure that the resulting models are not only statistically robust but also meaningful for managerial decision-making.
From a methodological standpoint, shrinkage methods (lasso, elastic
net) and subset selection (L0Learn) highlighted
department and Team 8 as persistent
predictors, with style changes, incentives, and temporal dummies
providing additional explanatory power. Manual model specification
allowed us to enrich these core predictors with interpretable
interaction terms, balancing predictive performance with clarity.
Validation through repeated cross-validation and bootstrap confirmed that our error estimates are stable, while AUC/ROC analysis demonstrated that the best models reliably discriminate between successful and unsuccessful days, with mean AUC values around 0.83–0.84 for logit and 0.82–0.83 for LDA.
For management, several actionable insights emerge:
Department-level differences dominate
outcomes.
Sewing consistently outperforms Finishing in achieving targets, even after adjusting for incentives and workload. This indicates that structural differences in tasks and workflow, rather than incentives or workload alone, drive much of the gap. Management should recognize that expectations for the two departments must be set differently and resourcing adjusted accordingly.
Team-level heterogeneity is critical.
Teams 6, 7, 8, 9, 10, and 11 underperform relative to the baseline, and
their response to incentives is weaker or inconsistent. By contrast,
high-performing teams (1, 3, 4, 12) reliably exceed their targets
without requiring additional overtime. This suggests that managerial
attention should focus on training, supervision, or incentive redesign
for the weaker teams, rather than applying uniform policies across the
board.
Incentives are effective but non-linear.
The probability plots show that modest bonuses can quickly push Sewing
teams above the target threshold, but additional increments yield
diminishing returns. For weaker teams, incentives alone do not guarantee
improvement, suggesting that structural changes are needed.
Overtime is reactive, not proactive.
The models suggest that overtime is often used to compensate for
inefficiencies rather than to enhance productivity. Heavy reliance on
overtime may therefore mask underlying workflow issues. Management
should view overtime as a red flag rather than a solution, and
investigate the root causes of delays.
Operational variables matter selectively.
Style changes consistently reduce the probability of hitting targets,
but their impact varies by team. Teams 7, 8, and 10 show sharp drops
when style changes occur, indicating that some groups are less adaptable
to switching production lines. Targeted support during these transitions
could mitigate disruptions.
Overall, the modeling framework shows how statistical methods can directly support managerial decision-making. Shrinkage techniques helped us identify the most influential predictors with minimal noise, logistic regression provided interpretable relationships between predictors and success rates, and ROC analysis confirmed the models’ ability to reliably distinguish between high and low performance. Together, these steps ensure that our findings are not artifacts of overfitting, but reflect stable, reproducible patterns. For management, this translates into actionable guidance: incentives can be targeted where they are most effective, staffing adjustments can be based on evidence rather than intuition, and departmental differences can be managed with a clearer understanding of structural constraints. In short, the combination of rigorous modeling and robust validation strengthens confidence that recommended interventions will lead to measurable improvements in productivity.
Figure A1. Department-Level Summary of Productivity and Staffing.
Figure A2. Team-Level Incentive Summary.
Figure A3. Team-Level Overtime Summary.
Figure A4. Team-Level Operational Summary.
Figure B1. Cross-Validation Error Rates for Classification Models.
Figure E1. Cross-Validated ROC Curves.
Figure E2. Bootstrap AUC Estimates with 95% Percentile Intervals.
Figure E3. Bootstrap ROC Curves with 95% Confidence Bands.
Function F1. Estimating Classification Error via s-Fold, r-Repetition Cross-Validation.
cv.iteration_fn <- function(formulas = NULL, model_spec, data, s = 5, r = 50, rda_lambda = NA, rda_gamma = NA) {
  m <- length(formulas)
  cv.means <- matrix(data = NA, nrow = s, ncol = m, dimnames = list(NULL, names(formulas)))
  for (f in 1:m) {
    cv.error.rep <- matrix(data = NA, nrow = r, ncol = s)
    for (rep in 1:r) {
      # Stratified folds, redrawn at each repetition
      folds <- caret::createFolds(data$y, k = s, list = FALSE)
      cv.errors <- rep(0, s)
      for (j in 1:s) {
        train_subset <- data[folds != j, ]
        test_subset <- data[folds == j, ]
        if (model_spec == "LOGIT") {
          fit <- stats::glm(formula = formulas[[f]], family = "binomial", data = train_subset)
          probs <- predict(object = fit, newdata = test_subset, type = "response")
          pred <- ifelse(probs > 0.5, 1, 0)
        }
        if (model_spec == "LDA") {
          fit <- MASS::lda(formula = formulas[[f]], data = transform(train_subset, y = factor(y)))
          probs <- predict(object = fit, newdata = test_subset)
          pred <- as.integer(as.character(probs$class))
        }
        if (model_spec == "RDA") {
          fit <- klaR::rda(formula = formulas[[f]], data = transform(train_subset, y = factor(y)),
                           lambda = rda_lambda, gamma = rda_gamma)
          probs <- predict(object = fit, newdata = test_subset)
          pred <- as.integer(as.character(probs$class))
        }
        cv.errors[j] <- 1 - mean(pred == test_subset$y) # Level 1: error rate for held-out fold j
      }
      cv.error.rep[rep, ] <- cv.errors # Level 2: fold-wise errors for this repetition
    }
    cv.means[, f] <- colMeans(cv.error.rep) # Level 3: per-fold mean error across all r repetitions, for formula f
  }
  return(cv.means)
}
Function F2. Estimating kNN Classification Error via s-Fold, r-Repetition Cross-Validation.
cv.iteration.knn_fn <- function(dataframes, k = 1, s = 5, r = 20) {
  df <- length(dataframes)
  cv.means <- matrix(data = NA, nrow = s, ncol = df, dimnames = list(NULL, names(dataframes)))
  for (d in 1:df) {
    data <- dataframes[[d]]
    cv.error.rep <- matrix(NA, nrow = r, ncol = s)
    for (rep in 1:r) {
      folds <- caret::createFolds(data$y, k = s, list = FALSE)
      cv.errors <- numeric(s)
      for (j in 1:s) {
        # Split into training and held-out folds
        train_subset <- data[folds != j, ]
        test_subset <- data[folds == j, ]
        # Design matrices and class labels
        X.train <- as.matrix(train_subset[, names(train_subset) != "y"])
        X.test <- as.matrix(test_subset[, names(test_subset) != "y"])
        Y.train <- factor(train_subset$y)
        # Standardize using training-fold statistics only
        mu <- colMeans(X.train)
        std <- apply(X.train, 2, sd)
        std[std == 0] <- 1
        X.train <- scale(X.train, center = mu, scale = std)
        X.test <- scale(X.test, center = mu, scale = std)
        # Predictions
        pred <- class::knn(train = X.train, test = X.test, cl = Y.train, k = k)
        pred <- as.integer(as.character(pred))
        cv.errors[j] <- 1 - mean(pred == test_subset$y)
      }
      cv.error.rep[rep, ] <- cv.errors
    }
    cv.means[, d] <- colMeans(cv.error.rep)
  }
  return(cv.means)
}
Function F3. Bootstrap Estimate of Classification Error via s-Fold CV.
boot.fn <- function(formula, data, idx, model_spec, s = 5) {
  # formula = model formula to fit
  # data = chosen data frame
  # idx = resampled row indices supplied by boot()
  # model_spec = model family to fit ("LOGIT", "LDA", or "RDA")
  # s = number of CV folds within each bootstrap resample (default 5)
  data <- data[idx, ] # ordinary bootstrap resample (with replacement)
  folds <- caret::createFolds(y = data$y, k = s, list = FALSE)
  cv.errors <- numeric(s)
  for (j in 1:s) {
    train_subset <- data[folds != j, ]
    test_subset <- data[folds == j, ]
    if (model_spec == "LOGIT") {
      fit <- stats::glm(formula = formula, family = "binomial",
                        data = train_subset)
      probs <- predict(object = fit, newdata = test_subset, type = "response")
      pred <- ifelse(probs > 0.5, 1, 0)
    }
    if (model_spec == "LDA") {
      fit <- MASS::lda(formula = formula, data = transform(train_subset, y = factor(y)))
      probs <- predict(object = fit, newdata = test_subset)
      pred <- as.integer(as.character(probs$class))
    }
    if (model_spec == "RDA") {
      fit <- klaR::rda(formula = formula, data = transform(train_subset, y = factor(y)))
      probs <- predict(object = fit, newdata = test_subset)
      pred <- as.integer(as.character(probs$class))
    }
    cv.errors[j] <- 1 - mean(pred == test_subset$y)
  }
  mean(cv.errors)
}
Function F4. Bootstrap Estimate of k-NN Classification Error via s-Fold CV.
boot.fn_knn <- function(dataframe, idx, k = 1, s = 5) {
  data <- dataframe[idx, ] # ordinary bootstrap resample (with replacement)
  folds <- caret::createFolds(y = data$y, k = s, list = FALSE)
  cv.errors <- numeric(s)
  for (j in 1:s) {
    # Split into training and held-out folds
    train_subset <- data[folds != j, ]
    test_subset <- data[folds == j, ]
    # Design matrices and class labels
    X.train <- as.matrix(train_subset[, names(train_subset) != "y"])
    X.test <- as.matrix(test_subset[, names(test_subset) != "y"])
    Y.train <- factor(train_subset$y)
    # Standardize using training-fold statistics only
    mu <- colMeans(X.train)
    std <- apply(X.train, 2, sd)
    std[std == 0] <- 1
    X.train <- scale(X.train, center = mu, scale = std)
    X.test <- scale(X.test, center = mu, scale = std)
    # Predictions
    pred <- class::knn(train = X.train, test = X.test, cl = Y.train, k = k)
    pred <- as.integer(as.character(pred))
    cv.errors[j] <- 1 - mean(pred == test_subset$y)
  }
  mean(cv.errors)
}
Function F5. ROC AUC Estimation via s-Fold CV.
roc.fn <- function(formulas, data, model_spec, s = 5) {
# Stratified folds
folds <- caret::createFolds(y = data$y, k = s, list = F)
# Storage
results <- vector(mode = "list", length(formulas))
# Function
for (f in seq_along(formulas)) {
auc_values <- numeric(s)
all_roc <- vector(mode = "list", length = s)
for (j in 1:s) {
# Data
train_subset <- data[folds != j, ]
test_subset <- data[folds == j, ]
# Function
if (model_spec == "LOGIT") {
fit <- stats::glm(formula = formulas[[f]], family = "binomial", data = train_subset)
probs <- predict(fit, newdata = test_subset, type = "response")
}
if (model_spec == "LDA") {
fit <- MASS::lda(formula = formulas[[f]], data = transform(train_subset, y = factor(y)))
probs <- predict(fit, newdata = test_subset)$posterior[, 2]
}
if (model_spec == "RDA") {
fit <- klaR::rda(formula = formulas[[f]], data = transform(train_subset, y = factor(y)))
probs <- predict(fit, newdata = test_subset)$posterior[, 2]
}
roc_obj <- pROC::roc(test_subset$y, probs)
auc_values[j] <- pROC::auc(roc_obj)
all_roc[[j]] <- roc_obj
}
# Return list
results[[f]] <- list(
formula = deparse(formulas[[f]]),
auc_mean = mean(auc_values),
auc_sd = sd(auc_values),
rocs = all_roc
)
}
results
}