Accepted
Simon
Asked: 2024-08-31 01:47:59 +0800 CST

Continue training a PyTorch model on new data


I am working on a text classification task and decided to use a PyTorch model for it. The process mainly involves the following steps:

  1. Load and process the text.
  2. Vectorize it with a TF-IDF vectorizer.
  3. Build a neural network, then save the TF-IDF vectorizer and the model so they can be used to predict on new data.

However, I need to classify new comments every day and correct any misclassifications.

Currently, my approach is to add the newly (and correctly) classified comments to the dataset and retrain the whole model. That process is time-consuming, and new comments can get lost in the validation split. Instead, I would like to build a new dataset from the newly classified texts and continue training on that data alone (the new comments are classified by hand, so every label is correct).

Using GPT and some code I found online, I put together the process I need; however, I am not sure whether it works as intended or whether I have made some silly mistake that should not happen.

So the main questions are:

  1. How can I check that the proposed approach actually solves the problem the way I expect? (A verification sketch follows the proposed code below.)
  2. What should I do when the vectorizer encounters new tokens? Can I just call .fit_transform() on the new data, or would that throw away the original vectorizer? (See the sketch right after this list.)
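
On question 2, a minimal sketch of the trade-off, assuming the tfidf_vectorizer.pkl produced by the training script below: transform() keeps the original feature space (unseen tokens are simply ignored), while fit_transform() on a fresh vectorizer relearns the vocabulary and breaks compatibility with the trained model.

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the vectorizer saved by the original training run.
old_vectorizer = joblib.load("tfidf_vectorizer.pkl")

new_texts = ["some freshly classified comments", "with a brandnewtoken inside"]

# transform() keeps the learned vocabulary: unseen tokens are ignored and the
# output stays aligned with the input layer of the trained model.
X_compatible = old_vectorizer.transform(new_texts)

# fit_transform() on a fresh vectorizer relearns the vocabulary from scratch:
# feature indices no longer line up with the trained model's weights.
refit = TfidfVectorizer(max_features=30000, ngram_range=(1, 2))
X_incompatible = refit.fit_transform(new_texts)

print(X_compatible.shape[1])    # same width as the trained model's input
print(X_incompatible.shape[1])  # tiny new vocabulary; incompatible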

Here is the full training process:

import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader, random_split
from sklearn.preprocessing import LabelEncoder
import polars as pl
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
import joblib

set1 = (
    pl
    .read_csv(
        "set1.txt",
        separator=";",
        has_header=False,
        new_columns=["text","label"]
    )
)

# The dataset is unbalanced, so force a rougher balance by downsampling the two largest classes

fear_df = set1.filter(pl.col("label") == "fear")
joy_df = set1.filter(pl.col("label") == "joy").sample(n=2500)
sadness_df = set1.filter(pl.col("label") == "sadness").sample(n=2500)
anger_df = set1.filter(pl.col("label") == "anger")

train_df = pl.concat([fear_df,joy_df,sadness_df,anger_df])

"""
The text its already clean, so im going to change the labels to numeric
and then split it on train, test ,val
"""

label_mapping = {
    "anger": 0,
    "fear": 1,
    "joy": 2,
    "sadness": 3
}

train_mapped = (
    train_df
    .with_columns(
        pl.col("label").replace_strict(label_mapping, default="other").cast(pl.Int16)
    )
)

train_set, pre_Test = train_test_split(train_mapped,
                                    test_size=0.4,
                                    random_state=42,
                                    stratify=train_mapped["label"])

test_set, val_set = train_test_split(pre_Test,
                                    test_size=0.5,
                                    random_state=42,
                                    stratify=pre_Test["label"]) 

# Vectorize text data using TF-IDF
vectorizer = TfidfVectorizer(max_features=30000, ngram_range=(1, 2))

X_train_tfidf = vectorizer.fit_transform(train_set['text']).toarray()
X_val_tfidf = vectorizer.transform(val_set['text']).toarray()
X_test_tfidf = vectorizer.transform(test_set['text']).toarray()

y_train = train_set['label']
y_val = val_set['label']
y_test = test_set['label']

class TextDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        return text, label
    
train_dataset = TextDataset(X_train_tfidf, y_train)
val_dataset = TextDataset(X_val_tfidf, y_val)
test_dataset = TextDataset(X_test_tfidf, y_test)

batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

class TextClassificationModel(nn.Module):
    def __init__(self, input_dim, num_classes):
        super(TextClassificationModel, self).__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.dropout1 = nn.Dropout(0.5)
        self.fc2 = nn.Linear(64, 32)
        self.dropout2 = nn.Dropout(0.5)
        self.fc3 = nn.Linear(32, num_classes)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout1(x)
        x = torch.relu(self.fc2(x))
        x = self.dropout2(x)
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax itself,
        # so an extra softmax here would double-apply it and flatten gradients.
        return self.fc3(x)
    
input_dim = X_train_tfidf.shape[1]
model = TextClassificationModel(input_dim, 4)

# Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adamax(model.parameters())

# Training loop
num_epochs = 17
best_val_acc = 0.0
best_model_path = "modelbest.pth"

for epoch in range(num_epochs):
    model.train()
    for texts, labels in train_loader:
        texts, labels = texts.float(), labels.long()
        outputs = model(texts)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Validation
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for texts, labels in val_loader:
            texts, labels = texts.float(), labels.long()
            outputs = model(texts)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    val_acc = correct / total
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), best_model_path)

    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}, Val Acc: {val_acc:.4f}')

# Load the best model
model.load_state_dict(torch.load(best_model_path))

# Test the model
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for texts, labels in test_loader:
        texts, labels = texts.float(), labels.long()
        outputs = model(texts)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
test_acc = correct / total
print(f'Test Acc: {test_acc:.3f}')


# Save the TF-IDF vectorizer
vectorizer_path = "tfidf_vectorizer.pkl"
joblib.dump(vectorizer, vectorizer_path)

# Save the PyTorch model
model_path = "text_classification_model.pth"
torch.save(model.state_dict(), model_path)
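
A possible addition (not in the original script): checkpoint the optimizer state together with the model. Adamax keeps running moment estimates, and saving them lets a later fine-tuning run resume smoothly instead of starting from fresh optimizer state; "checkpoint.pth" is an assumed filename.

torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    "checkpoint.pth",
)

# In the continuation script, restore both:
# checkpoint = torch.load("checkpoint.pth")
# model.load_state_dict(checkpoint["model_state_dict"])
# optimizer.load_state_dict(checkpoint["optimizer_state_dict"])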

Proposed code:

import torch
import joblib
import polars as pl
from sklearn.model_selection import train_test_split
from torch import nn
from torch.utils.data import Dataset, DataLoader

# Load the saved TF-IDF vectorizer
vectorizer_path = "tfidf_vectorizer.pkl"
vectorizer = joblib.load(vectorizer_path)

input_dim = len(vectorizer.get_feature_names_out())

class TextClassificationModel(nn.Module):
    def __init__(self, input_dim, num_classes):
        super(TextClassificationModel, self).__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.dropout1 = nn.Dropout(0.5)
        self.fc2 = nn.Linear(64, 32)
        self.dropout2 = nn.Dropout(0.5)
        self.fc3 = nn.Linear(32, num_classes)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout1(x)
        x = torch.relu(self.fc2(x))
        x = self.dropout2(x)
        # Return raw logits, matching the training script: CrossEntropyLoss
        # applies log-softmax internally.
        return self.fc3(x)
    
# Load the saved PyTorch model
model_path = "text_classification_model.pth"
model = TextClassificationModel(input_dim, 4)
model.load_state_dict(torch.load(model_path))

# Map labels to numeric values
label_mapping = {"anger": 0, "fear": 1, "joy": 2, "sadness": 3}
sentiments = ["fear","joy","sadness","anger"]

new_data = (
    pl
    .read_csv(
        "set2.txt",
        separator=";",
        has_header=False,
        new_columns=["text","label"]
    )
    .filter(pl.col("label").is_in(sentiments))
    .with_columns(
        pl.col("label").replace_strict(label_mapping, default="other").cast(pl.Int16)
    )
    
)
# Vectorize the new text data using the loaded TF-IDF vectorizer
X_new = vectorizer.transform(new_data['text']).toarray()
y_new = new_data['label']
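
# (Optional check, an addition to the original code): estimate how much of the
# new data falls outside the saved vocabulary before deciding whether the
# vectorizer needs re-fitting at all.
analyzer = vectorizer.build_analyzer()
vocab = set(vectorizer.vocabulary_)
tokens = [tok for text in new_data['text'] for tok in analyzer(text)]
oov_ratio = sum(tok not in vocab for tok in tokens) / max(len(tokens), 1)
print(f"OOV token ratio in new data: {oov_ratio:.2%}")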

class TextDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        return text, label

batch_size = 10
   
# Create DataLoader for the new training data
new_train_dataset = TextDataset(X_new, y_new)
new_train_loader = DataLoader(new_train_dataset, batch_size=batch_size, shuffle=True)

# Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adamax(model.parameters())

num_epochs = 5
new_best_model_path = "modelbest.pth"
for epoch in range(num_epochs):
    model.train()
    for texts, labels in new_train_loader:
        texts, labels = texts.float(), labels.long()
        outputs = model(texts)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Save once per epoch rather than once per batch, and log every epoch.
    torch.save(model.state_dict(), new_best_model_path)
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

# Save the PyTorch model
new_best_model_path = "new_moedl.pth"
torch.save(model.state_dict(), new_best_model_path)
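
On question 1, a minimal way to check that the approach behaves as expected (an addition, not part of the original scripts): evaluate the model on a fixed held-out set from the original data before and after the incremental run. Accuracy on the old test set should not collapse, and accuracy on the new comments should go up. This assumes the original test split was saved during training, e.g. with joblib.dump((X_test_tfidf, y_test), "test_split.pkl"), which the training script above does not yet do.

def accuracy(model, loader):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for texts, labels in loader:
            texts, labels = texts.float(), labels.long()
            predicted = model(texts).argmax(dim=1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return correct / total

# Measure before fine-tuning, fine-tune, then measure again:
# acc_old_before = accuracy(model, old_test_loader)
# ... run the training loop above ...
# acc_old_after = accuracy(model, old_test_loader)   # should stay close
# acc_new       = accuracy(model, new_train_loader)  # should be high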

The dataset can be found here.


1 Answer

  1. Best Answer
    divyang4481
    2024-08-31T10:50:05+08:00

    Use pre-trained word embeddings, e.g. via BertForSequenceClassification. These embeddings handle unseen tokens more gracefully because they map words to continuous vectors by meaning, which reduces the impact of words the model has never seen.
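
    As an illustrative aside, the WordPiece tokenizer shows why unseen words are less of a problem here: an out-of-vocabulary word is split into known subword pieces rather than dropped (assuming the stock bert-base-uncased vocabulary):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    # An unseen word becomes a sequence of known subwords, so the model still
    # receives a usable representation.
    print(tokenizer.tokenize("unfollowable"))
    # e.g. ['un', '##follow', '##able'] (the exact split depends on the vocab)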

    Model training with BERT

    import torch
    from torch import nn, optim
    from torch.utils.data import DataLoader, Dataset
    from transformers import BertTokenizer, BertModel, BertForSequenceClassification
    from transformers import Trainer, TrainingArguments
    from sklearn.model_selection import train_test_split
    import polars as pl
    
    # Load and prepare data
    set1 = pl.read_csv("set1.txt", separator=";", has_header=False, new_columns=["text", "label"])
    
    # Balance dataset
    fear_df = set1.filter(pl.col("label") == "fear")
    joy_df = set1.filter(pl.col("label") == "joy").sample(n=2500)
    sadness_df = set1.filter(pl.col("label") == "sadness").sample(n=2500)
    anger_df = set1.filter(pl.col("label") == "anger")
    train_df = pl.concat([fear_df, joy_df, sadness_df, anger_df])
    
    label_mapping = {"anger": 0, "fear": 1, "joy": 2, "sadness": 3}
    train_df = train_df.with_columns(pl.col("label").replace_strict(label_mapping, default="other").cast(pl.Int16))
    
    # Split dataset
    train_set, test_val_set = train_test_split(train_df, test_size=0.4, random_state=42, stratify=train_df["label"])
    test_set, val_set = train_test_split(test_val_set, test_size=0.5, random_state=42, stratify=test_val_set["label"])
    
    # Dataset class
    class TextDataset(Dataset):
        def __init__(self, texts, labels, tokenizer, max_length=128):
            self.texts = texts
            self.labels = labels
            self.tokenizer = tokenizer
            self.max_length = max_length
    
        def __len__(self):
            return len(self.texts)
    
        def __getitem__(self, idx):
            text = self.texts[idx]
            label = self.labels[idx]
            encoding = self.tokenizer.encode_plus(
                text,
                add_special_tokens=True,
                max_length=self.max_length,
                padding='max_length',
                truncation=True,
                return_tensors='pt'
            )
            return {
                'input_ids': encoding['input_ids'].flatten(),
                'attention_mask': encoding['attention_mask'].flatten(),
                'labels': torch.tensor(label, dtype=torch.long)
            }
    
    # Initialize tokenizer and datasets
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    train_dataset = TextDataset(train_set['text'], train_set['label'], tokenizer)
    val_dataset = TextDataset(val_set['text'], val_set['label'], tokenizer)
    test_dataset = TextDataset(test_set['text'], test_set['label'], tokenizer)
    
    # Initialize BERT model for classification
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=4)
    
    # Training arguments
    training_args = TrainingArguments(
        output_dir='./results',
        num_train_epochs=3,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        evaluation_strategy='epoch',
        save_strategy='epoch',
        logging_dir='./logs',
        learning_rate=2e-5,
        load_best_model_at_end=True
    )
    
    # Accuracy metric for Trainer.evaluate(); without it, evaluate() reports
    # only eval_loss and the eval_accuracy lookups below raise a KeyError
    import numpy as np

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        preds = np.argmax(logits, axis=-1)
        return {"accuracy": (preds == labels).mean()}

    # Define Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics
    )
    
    # Train model
    trainer.train()
    
    # Evaluate model
    results = trainer.evaluate(test_dataset)
    print(f"Test Accuracy: {results['eval_accuracy']:.4f}")
    
    # Save the model and tokenizer
    model.save_pretrained("saved_model")
    tokenizer.save_pretrained("saved_tokenizer")
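
    A quick way to sanity-check the saved artifacts (an illustrative addition; the inverse label order follows label_mapping above):

    inv_label_mapping = {0: "anger", 1: "fear", 2: "joy", 3: "sadness"}

    model.eval()
    inputs = tokenizer("i feel absolutely thrilled today",
                       return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    print(inv_label_mapping[logits.argmax(dim=-1).item()])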
    

    Incremental training with minimal effort

    # Load the saved model and tokenizer
    model = BertForSequenceClassification.from_pretrained("saved_model")
    tokenizer = BertTokenizer.from_pretrained("saved_tokenizer")
    
    # Load new data
    new_data = (
        pl.read_csv("set2.txt", separator=";", has_header=False, new_columns=["text", "label"])
        .filter(pl.col("label").is_in(["fear", "joy", "sadness", "anger"]))
        .with_columns(pl.col("label").replace_strict(label_mapping, default="other").cast(pl.Int16))
    )
    
    # Create new dataset
    new_dataset = TextDataset(new_data['text'], new_data['label'], tokenizer)
    
    # Update training arguments for incremental training
    new_training_args = TrainingArguments(
        output_dir='./results_incremental',
        num_train_epochs=2,  # Fewer epochs since it's incremental
        per_device_train_batch_size=16,
        evaluation_strategy='epoch',
        logging_dir='./logs_incremental',
        learning_rate=2e-5,
        load_best_model_at_end=True
    )
    
    # Define new trainer
    new_trainer = Trainer(
        model=model,
        args=new_training_args,
        train_dataset=new_dataset,
        eval_dataset=val_dataset  # Validate on previous validation set
    )
    
    # Train on new data
    new_trainer.train()
    
    # Evaluate after retraining
    new_results = new_trainer.evaluate(test_dataset)
    print(f"Test Accuracy After Incremental Training: {new_results['eval_accuracy']:.4f}")
    
    # Save the updated model
    model.save_pretrained("saved_model_incremental")
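
    One possible refinement (an addition beyond the answer above): fine-tuning on the new comments alone can slowly erode performance on the original distribution (catastrophic forgetting). A common mitigation is to mix a small replay sample of the old training data into each incremental run, e.g. with torch.utils.data.ConcatDataset. Here old_replay_set is an assumed TextDataset built from a few hundred examples sampled from set1:

    from torch.utils.data import ConcatDataset

    mixed_dataset = ConcatDataset([new_dataset, old_replay_set])

    new_trainer = Trainer(
        model=model,
        args=new_training_args,
        train_dataset=mixed_dataset,  # new data plus a replay sample of old data
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics
    )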
    
