Integrate with Hugging Face Transformers

Hugging Face Transformers provides general-purpose machine learning models for Natural Language Processing (NLP). It gives you easy access to pre-trained model weights and interoperability between PyTorch and TensorFlow.

Instrument Transformers with Comet to manage experiments, create dataset versions, and track hyperparameters for faster and easier reproducibility and collaboration.

Comet SDK     Minimum SDK version     Minimum transformers version
Python-SDK    3.31.5                  4.20.0

Start logging

Connect Comet to your existing Transformers Trainer code by configuring it through environment variables.

Add the following lines of code to your script or notebook:

import os

import comet_ml
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# 1. Enable logging of model checkpoints
os.environ["COMET_LOG_ASSETS"] = "True"

# 2. Define your model
model = AutoModelForSequenceClassification.from_pretrained(
   ...
)

# 3. Train your model
trainer = Trainer(
  model=model,
  args=training_args,
  train_dataset=train_dataset,
  eval_dataset=test_dataset,
  compute_metrics=compute_metrics,
)

trainer.train()
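
The snippet above passes `training_args` to the Trainer without showing how it is built. A minimal sketch of one way to prepare it, reusing values from the end-to-end example on this page: `build_training_kwargs` is a hypothetical helper (not part of Comet or Transformers), and the defaults are illustrative assumptions. The key setting is `report_to=["comet_ml"]`, which asks the Trainer to report to Comet.

```python
def build_training_kwargs(output_dir="./results", epochs=1, save_every=25):
    """Hypothetical helper: keyword arguments intended for TrainingArguments."""
    return {
        "output_dir": output_dir,        # where checkpoints are written
        "num_train_epochs": epochs,
        "save_strategy": "steps",        # save a checkpoint every `save_steps`
        "save_steps": save_every,
        "report_to": ["comet_ml"],       # send Trainer logs to Comet
    }


training_kwargs = build_training_kwargs()

# With transformers installed, the arguments would then be created like this:
# from transformers import TrainingArguments
# training_args = TrainingArguments(**training_kwargs)
```

Keeping the settings in a plain dictionary makes it easy to inspect or override them before the Trainer is created.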

Log automatically

By integrating with the Transformers Trainer object, Comet automatically logs the following items with no additional configuration:

  • Metrics (such as loss and accuracy)
  • Hyperparameters
  • Assets (such as checkpoints and log files)

End-to-end example

Get started with a basic example of using Comet with the Transformers Trainer.

You can check out the results of this example Transformers experiment for a preview of what's to come.

Install dependencies

python -m pip install "comet_ml>=3.44.0" datasets torch transformers scikit-learn accelerate

Run the example

import os

import comet_ml

# Enable logging of model checkpoints
os.environ["COMET_LOG_ASSETS"] = "True"

comet_ml.login(project_name="comet-example-transformers-trainer")

from datasets import load_dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

PRE_TRAINED_MODEL_NAME = "distilbert/distilroberta-base"

raw_datasets = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    PRE_TRAINED_MODEL_NAME, num_labels=2
)


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


def get_example(index):
    return eval_dataset[index]["text"]


def compute_metrics(pred):
    experiment = comet_ml.get_global_experiment()

    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro"
    )
    acc = accuracy_score(labels, preds)

    if experiment:
        epoch = int(experiment.curr_epoch) if experiment.curr_epoch is not None else 0
        experiment.set_epoch(epoch)
        experiment.log_confusion_matrix(
            y_true=labels,
            y_predicted=preds,
            file_name=f"confusion-matrix-epoch-{epoch}.json",
            labels=["negative", "positive"],
            index_to_example_function=get_example,
        )

    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(200))
eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(200))


training_args = TrainingArguments(
    seed=42,
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=1,
    do_train=True,
    do_eval=True,
    evaluation_strategy="steps",
    eval_steps=25,
    save_strategy="steps",
    save_total_limit=10,
    save_steps=25,
    per_device_train_batch_size=8,
    report_to=["comet_ml"],
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)
trainer.train()
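
To make the numbers returned by `compute_metrics` easier to interpret, here is a dependency-free sketch of what the argmax step and the macro-averaged scores compute. It is an illustration only, not how scikit-learn implements them; `argmax_rows`, `macro_scores`, and the sample logits and labels are made up for this example.

```python
def argmax_rows(logits):
    """Return the index of the largest score in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def macro_scores(labels, preds, num_classes=2):
    """Unweighted mean of per-class precision, recall and F1."""
    pairs = list(zip(labels, preds))
    per_class = []
    for cls in range(num_classes):
        tp = sum(1 for y, p in pairs if y == cls and p == cls)
        fp = sum(1 for y, p in pairs if y != cls and p == cls)
        fn = sum(1 for y, p in pairs if y == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append((precision, recall, f1))
    # Average each metric over the classes, giving every class equal weight.
    return tuple(sum(column) / num_classes for column in zip(*per_class))


logits = [[2.0, 0.1], [0.3, 1.2], [0.9, 0.4], [0.2, 0.8]]  # made-up model scores
labels = [0, 1, 1, 1]                                      # made-up ground truth

preds = argmax_rows(logits)  # [0, 1, 0, 1]
accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
precision, recall, f1 = macro_scores(labels, preds)
```

Because macro averaging gives each class equal weight, a weak score on the rarer class lowers the reported value more than it would under a sample-weighted average.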

Configure Comet for Transformers

You can control which Transformers items are logged automatically by setting the following environment variables:

export COMET_MODE=ONLINE # Set to OFFLINE to run an Offline Experiment or DISABLE to turn off logging
export COMET_LOG_ASSETS=True # Set to False to disable logging model checkpoints
export COMET_PROJECT_NAME=<your project name> # Configure your project name
export COMET_OFFLINE_DIRECTORY=<path to offline directory> # Folder to use for saving offline experiments when `COMET_MODE` is "OFFLINE"

For more information about using environment variables in Comet, see Configure Comet.
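
In a notebook, where shell `export` lines are less convenient, the same variables can be set from Python before the Trainer is created. The sketch below assumes the `COMET_LOG_ASSETS` spelling used in the code examples on this page; `apply_comet_settings` is a hypothetical helper, and the project name and directory are placeholders.

```python
import os


def apply_comet_settings(settings, env=None):
    """Hypothetical helper: copy settings into an environment mapping as strings."""
    env = os.environ if env is None else env
    for name, value in settings.items():
        env[name] = str(value)  # environment values must be strings
    return env


# A plain dict stands in for os.environ so this sketch has no side effects.
demo_env = {}
apply_comet_settings(
    {
        "COMET_MODE": "OFFLINE",
        "COMET_LOG_ASSETS": True,
        "COMET_PROJECT_NAME": "my-project",             # placeholder
        "COMET_OFFLINE_DIRECTORY": "./comet-offline",   # placeholder
    },
    env=demo_env,
)
```

Calling `apply_comet_settings(...)` without the `env` argument writes to `os.environ`, which is what a real script would do before building the Trainer.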

Jul. 9, 2024