Question

TarJae

Asked: 2024-08-30 17:54:20 +0800 CST

Error in validate_column_names(): required columns are missing after applying a recipe in a Tidymodels workflow with XGBoost

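I am running into a problem using tidymodels with xgboost in a workflow. After applying a recipe that includes step_dummy() to convert categorical variables into dummy variables, I get the following error when I try to make predictions: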

Error in `validate_column_names()`:
! The following required columns are missing: 'A', 'B', 'C', 'D'.

Here is a simplified version of my code:

library(tidymodels)
library(xgboost)
library(dplyr)

set.seed(123)
datensatz <- tibble(
  outcome = rnorm(100, mean = 60, sd = 10),
  A = factor(sample(c("h", "i", "j"), 100, replace = TRUE)),
  B = factor(sample(c("e", "f", "g"), 100, replace = TRUE)),
  C = factor(sample(1:3, 100, replace = TRUE)),
  D = factor(sample(c("a", "b"), 100, replace = TRUE))
)

# splitting
data_split <- initial_split(datensatz, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)


# recipe
recipe_obj <- recipe(outcome ~ ., data = train_data) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%  
  step_zv(all_predictors()) %>%  
  step_normalize(all_numeric_predictors())  

prepared_recipe <- prep(recipe_obj)
test_data_prepared <- bake(prepared_recipe, new_data = test_data)

# XGBoost model specification
xgboost_spec <- boost_tree(
  trees = 1000,                    
  tree_depth = 6,                  
  min_n = 10,                      
  loss_reduction = 0.01,           
  sample_size = 0.8,               
  mtry = 0.8,                      
  learn_rate = 0.01                
) %>%
  set_mode("regression") %>%
  set_engine("xgboost", count = FALSE, colsample_bytree = 0.8)

# Workflow
workflow_obj <- workflow() %>%
  add_recipe(recipe_obj) %>%
  add_model(xgboost_spec)

# train the model
xgboost_fit <- fit(workflow_obj, data = train_data)

# model prediction on the prepared test data
predictions <- predict(xgboost_fit, new_data = test_data_prepared)

# results
predictions
# Error occurs here

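I suspect the problem is that step_dummy() removes the original categorical columns (A, B, C, D) and replaces them with dummy variables. However, the workflow still seems to need the original columns when making predictions.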

How can I fix this and make sure the prediction step correctly uses the dummy variables created by step_dummy()?

Additional information:

I'm using the `xgboost engine` within the `tidymodels` framework.
The error message suggests that the workflow expects the original categorical variables, but these are no longer present after applying `step_dummy()`.

1 Answer

EmilHvitfeldt · Answer 1 · 2024-08-31T03:55:09+08:00

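If you use a recipe inside a workflow, you do not need to prep() the recipe or bake() the test data set yourself: the workflow estimates the recipe when you call fit() and applies it to new_data when you call predict(). You can therefore delete the following lines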

prepared_recipe <- prep(recipe_obj)
test_data_prepared <- bake(prepared_recipe, new_data = test_data)

and predict with predict(xgboost_fit, new_data = test_data) instead of predict(xgboost_fit, new_data = test_data_prepared).

library(tidymodels)
library(xgboost)
library(dplyr)

set.seed(123)
datensatz <- tibble(
  outcome = rnorm(100, mean = 60, sd = 10),
  A = factor(sample(c("h", "i", "j"), 100, replace = TRUE)),
  B = factor(sample(c("e", "f", "g"), 100, replace = TRUE)),
  C = factor(sample(1:3, 100, replace = TRUE)),
  D = factor(sample(c("a", "b"), 100, replace = TRUE))
)

# splitting
data_split <- initial_split(datensatz, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)

# recipe
recipe_obj <- recipe(outcome ~ ., data = train_data) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%  
  step_zv(all_predictors()) %>%  
  step_normalize(all_numeric_predictors())  

# XGBoost model specification
xgboost_spec <- boost_tree(
  trees = 1000,                    
  tree_depth = 6,                  
  min_n = 10,                      
  loss_reduction = 0.01,           
  sample_size = 0.8,               
  mtry = 0.8,                      
  learn_rate = 0.01                
) %>%
  set_mode("regression") %>%
  set_engine("xgboost", count = FALSE, colsample_bytree = 0.8)

# Workflow
workflow_obj <- workflow() %>%
  add_recipe(recipe_obj) %>%
  add_model(xgboost_spec)

# train the model
xgboost_fit <- fit(workflow_obj, data = train_data)

# model prediction on the raw test data (the workflow applies the recipe itself)
predictions <- predict(xgboost_fit, new_data = test_data)

# results
predictions
#> # A tibble: 25 × 1
#>    .pred
#>    <dbl>
#>  1  62.9
#>  2  58.2
#>  3  57.8
#>  4  59.5
#>  5  60.0
#>  6  61.9
#>  7  58.2
#>  8  61.4
#>  9  60.7
#> 10  54.9
#> # ℹ 15 more rows

Created on 2024-08-30 with reprex v2.1.1
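If you still want to look at the dummy columns the workflow feeds to the model, one option is to pull the estimated recipe out of the fitted workflow and bake it purely for inspection. The following is a minimal sketch, not part of the original reprex: the toy data and every `toy_*` name are hypothetical, and `linear_reg()` stands in for the XGBoost specification only to keep the example light. It also shows that passing already-baked data to predict() fails the same column check as in the question.

```r
library(tidymodels)

set.seed(123)
# Hypothetical toy data with the same shape as the question's, only smaller
toy <- tibble(
  outcome = rnorm(60, mean = 60, sd = 10),
  A = factor(sample(c("h", "i", "j"), 60, replace = TRUE)),
  D = factor(sample(c("a", "b"), 60, replace = TRUE))
)
toy_split <- initial_split(toy, prop = 0.75)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

toy_recipe <- recipe(outcome ~ ., data = toy_train) %>%
  step_dummy(all_nominal_predictors())

# linear_reg() is an assumption made for brevity, not the model from the question
toy_fit <- workflow() %>%
  add_recipe(toy_recipe) %>%
  add_model(linear_reg()) %>%
  fit(data = toy_train)

# The fitted workflow stores the prepped recipe; bake it only to inspect the result
baked_test <- bake(extract_recipe(toy_fit), new_data = toy_test)
names(baked_test)

# predict() and augment() expect the raw data, original factor columns included
raw_preds <- predict(toy_fit, new_data = toy_test)
augment(toy_fit, new_data = toy_test)

# Passing baked data instead fails the column check; capture the message
baked_attempt <- tryCatch(
  predict(toy_fit, new_data = baked_test),
  error = function(e) conditionMessage(e)
)
```

The baked tibble is useful for debugging the preprocessing, but the object passed to predict() should remain the untouched test set.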
