YOLO Training Pipeline

From LogicalDOC Community Wiki

This guide describes how to prepare a dataset and train a custom YOLO model using annotations created with Label Studio.

Prerequisites

Before starting, ensure that:

  • Label Studio has been installed and configured
  • A dataset has been annotated and exported
  • Python 3.10 or later is available
  • Ultralytics YOLO is installed


Expected Training Directory Structure

Before starting the training process, the training directory should be organized as follows:

training/
├── dataset/
│   ├── images/
│   │   ├── train/
│   │   └── val/
│   └── labels/
│       ├── train/
│       └── val/
├── data.yaml
├── synset.txt
├── notes.json
├── train.py
├── train.sh
└── convert_pt_to_onnx.py

Where:

  • dataset/images/train contains the images used for training.
  • dataset/images/val contains the images used for validation.
  • dataset/labels/train contains the annotation files corresponding to the training images.
  • dataset/labels/val contains the annotation files corresponding to the validation images.
  • data.yaml defines the dataset configuration used by Ultralytics YOLO.
  • synset.txt contains the list of class names used during annotation.
  • notes.json is generated by Label Studio and contains additional project metadata.
  • train.py defines the YOLO training configuration and launches the training process.
  • train.sh (optional) automates dataset preparation, training, and model conversion.
  • convert_pt_to_onnx.py converts the trained PyTorch model (`best.pt`) into the ONNX format required by LogicalDOC.
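The empty directory skeleton shown above can be created with a few lines of Python. This is a minimal sketch, not part of the LogicalDOC tooling: the function name `create_training_skeleton` and the default root `training` are assumptions made for illustration.

```python
from pathlib import Path


def create_training_skeleton(root="training"):
    """Create the empty dataset directories under `root` and return them."""
    root = Path(root)
    created = []
    for kind in ("images", "labels"):
        for split in ("train", "val"):
            directory = root / "dataset" / kind / split
            # parents=True creates intermediate directories;
            # exist_ok=True makes the call safe to repeat.
            directory.mkdir(parents=True, exist_ok=True)
            created.append(directory)
    return created
```

Calling `create_training_skeleton()` from the parent directory creates `training/dataset/images/{train,val}` and `training/dataset/labels/{train,val}`. The configuration files and scripts still have to be added as described in the following sections.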


The following sections describe how to create this directory structure starting from the dataset exported by Label Studio.

Populate the Training Directory

Extract the Dataset

After annotating the documents in Label Studio (see Label Studio Guide), export the dataset and extract the downloaded archive to a local directory.

The extracted archive contains:

  • labels/ – YOLO annotation files (`.txt`)
  • classes.txt – class definitions
  • notes.json – Label Studio metadata
  • images/ – image files (may be empty depending on the selected export format)

Figure: Content of the dataset exported by Label Studio

Prepare the Dataset Structure

The training pipeline expects the following structure inside the training directory:

training/
├── dataset/
│   ├── images/
│   │   ├── train/
│   │   └── val/
│   └── labels/
│       ├── train/
│       └── val/
├── synset.txt
├── notes.json
└── data.yaml

The following steps describe how to transform the exported dataset into the expected structure.

Populate the Dataset

  1. Move all annotation files (.txt) from the exported labels/ directory to dataset/labels/train/.
  2. Copy the corresponding images into dataset/images/train/.
  3. Rename classes.txt to synset.txt and place it in the training directory.
  4. Create the empty directories:
     dataset/images/val/
     dataset/labels/val/
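The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the official procedure: the function name `populate_dataset`, its three parameters, and the list of image extensions are hypothetical, and the sketch copies classes.txt to synset.txt instead of renaming it so that the exported archive stays intact.

```python
import shutil
from pathlib import Path

# Assumed set of image extensions; adjust it to the files in your project.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}


def populate_dataset(export_dir, source_images_dir, training_dir):
    """Fill training_dir from a Label Studio export; return (labels moved, images copied)."""
    export_dir = Path(export_dir)
    source_images_dir = Path(source_images_dir)
    dataset = Path(training_dir) / "dataset"
    images_train = dataset / "images" / "train"
    labels_train = dataset / "labels" / "train"

    # Step 4: make sure all four directories exist (val/ stays empty).
    for directory in (images_train, labels_train,
                      dataset / "images" / "val", dataset / "labels" / "val"):
        directory.mkdir(parents=True, exist_ok=True)

    # Step 1: move the annotation files.
    moved = 0
    for label in sorted((export_dir / "labels").glob("*.txt")):
        shutil.move(str(label), str(labels_train / label.name))
        moved += 1

    # Step 2: copy only the images that have a matching annotation file.
    annotated = {p.stem for p in labels_train.glob("*.txt")}
    copied = 0
    for image in sorted(source_images_dir.iterdir()):
        if image.suffix.lower() in IMAGE_EXTENSIONS and image.stem in annotated:
            shutil.copy2(image, images_train / image.name)
            copied += 1

    # Step 3: classes.txt becomes synset.txt in the training directory.
    classes = export_dir / "classes.txt"
    if classes.exists():
        shutil.copy2(classes, Path(training_dir) / "synset.txt")

    return moved, copied
```

Matching is done on the filename without extension, which is the same pairing rule used by the training pipeline.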

IMPORTANT

Depending on the selected export format, Label Studio may export only the annotation (.txt) files. In this case, the corresponding images must be copied manually from the original image directory into images/train/.

IMPORTANT

Verify that every image has a corresponding annotation file with the same filename.

Example:

images/train/invoice001.jpg
labels/train/invoice001.txt
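The pairing check can be automated before training. The following is a minimal sketch; the function name `find_unpaired` is hypothetical, and it assumes that every regular file in the image directory is an image.

```python
from pathlib import Path


def find_unpaired(images_dir, labels_dir):
    """Return (images without a label file, label files without an image)."""
    # Index both directories by filename without extension.
    images = {p.stem: p.name for p in Path(images_dir).iterdir() if p.is_file()}
    labels = {p.stem: p.name for p in Path(labels_dir).glob("*.txt")}

    images_without_labels = sorted(images[stem] for stem in images.keys() - labels.keys())
    labels_without_images = sorted(labels[stem] for stem in labels.keys() - images.keys())
    return images_without_labels, labels_without_images
```

For example, `find_unpaired("dataset/images/train", "dataset/labels/train")` returns two lists of filenames. Both lists should be empty before the training is started.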

If the automated training script (train.sh) is used, the images/val/ and labels/val/ directories should initially remain empty.

During execution, the script automatically moves the configured percentage of images and annotation files from the training set into the validation set.
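For environments where a Bash script is not convenient, the same split can be sketched in Python. This is an illustrative equivalent, not the provided train.sh: the function name `split_validation`, the `rate` and `seed` parameters, and the default rate of 0.2 are assumptions.

```python
import random
import shutil
from pathlib import Path


def split_validation(dataset_dir, rate=0.2, seed=None):
    """Move a random fraction `rate` (0 to 1) of annotated images from train/ to val/."""
    dataset_dir = Path(dataset_dir)
    images_train = dataset_dir / "images" / "train"
    labels_train = dataset_dir / "labels" / "train"
    images_val = dataset_dir / "images" / "val"
    labels_val = dataset_dir / "labels" / "val"
    images_val.mkdir(parents=True, exist_ok=True)
    labels_val.mkdir(parents=True, exist_ok=True)

    # Only images with a matching annotation file are candidates.
    candidates = sorted(
        p for p in images_train.iterdir()
        if p.is_file() and (labels_train / f"{p.stem}.txt").exists()
    )
    take = int(len(candidates) * rate)  # truncated toward zero

    rng = random.Random(seed)  # pass a seed for a reproducible split
    moved = []
    for image in rng.sample(candidates, take):
        label = labels_train / f"{image.stem}.txt"
        shutil.move(str(image), str(images_val / image.name))
        shutil.move(str(label), str(labels_val / label.name))
        moved.append(image.name)
    return moved
```

Image and annotation file are always moved together, so the pairing between the two directories is preserved.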

Create the data.yaml File

YOLO uses a configuration file named `data.yaml` to locate the dataset and identify the available classes.

The file specifies:

  • Training image directory
  • Validation image directory
  • Number of classes
  • Class names

Example:

path: dataset

train: images/train
val: images/val

nc: 4

names:
  0: invoice_number
  1: date
  2: seller_name
  3: total

The class definitions must match those defined during annotation.
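One way to keep the class definitions aligned with the annotations is to generate data.yaml from synset.txt. The sketch below is an assumption-laden illustration: the function name `build_data_yaml` is hypothetical, it expects one class name per line in annotation order, and it does not quote class names that contain characters with special meaning in YAML (such as a colon).

```python
from pathlib import Path


def build_data_yaml(synset_path, dataset_path="dataset"):
    """Return the text of a data.yaml built from the class names in synset.txt."""
    text = Path(synset_path).read_text(encoding="utf-8")
    # One class name per line; the line position is the class index.
    names = [line.strip() for line in text.splitlines() if line.strip()]

    lines = [
        f"path: {dataset_path}",
        "",
        "train: images/train",
        "val: images/val",
        "",
        f"nc: {len(names)}",
        "",
        "names:",
    ]
    lines += [f"  {index}: {name}" for index, name in enumerate(names)]
    return "\n".join(lines) + "\n"
```

The result can be written with `Path("data.yaml").write_text(build_data_yaml("synset.txt"), encoding="utf-8")` from inside the training directory.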

Create the Python Training Script

Create a Python script (train.py) that uses the Ultralytics YOLO framework to train the model.

In this pipeline, training starts from a pretrained YOLO model, such as:

  • yolo11n.pt
  • yolo11s.pt
  • yolo11m.pt
  • yolo11l.pt

The pretrained model does not need to be downloaded manually. Ultralytics automatically downloads it when the training script is executed.

A minimal training script is shown below:

from ultralytics import YOLO

model = YOLO("yolo11n.pt")

model.train(
    data="data.yaml",
    epochs=100,
)

A more advanced script may explicitly configure additional training parameters such as the optimizer, learning rate, batch size, image size, and device selection.

The parameters most commonly customized are:

  • epochs – number of training epochs
  • batch – number of images processed per batch
  • imgsz – input image size
  • device – CPU or GPU used for training
  • project – output directory
  • name – training run name

For a complete list of supported parameters, refer to the official Ultralytics documentation:

https://docs.ultralytics.com/modes/train/#train-settings
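A train.py that sets the parameters listed above could look like the following sketch. The helper names `build_train_args` and `run_training` are hypothetical, and all values (16 images per batch, 640 pixel input, CPU device, the `runs` project directory and the `custom_model` run name) are illustrative placeholders to be tuned for your dataset and hardware.

```python
def build_train_args(**overrides):
    """Return the keyword arguments passed to model.train(); overrides win."""
    args = {
        "data": "data.yaml",     # dataset configuration file
        "epochs": 100,           # number of training epochs
        "batch": 16,             # images per batch (illustrative)
        "imgsz": 640,            # input image size in pixels (illustrative)
        "device": "cpu",         # use 0 for the first GPU
        "project": "runs",       # output directory (illustrative)
        "name": "custom_model",  # training run name (illustrative)
    }
    args.update(overrides)
    return args


def run_training(weights="yolo11n.pt", **overrides):
    """Load the pretrained weights and start the training."""
    # Imported here so that build_train_args() can be used without Ultralytics installed.
    from ultralytics import YOLO

    model = YOLO(weights)
    return model.train(**build_train_args(**overrides))
```

To use it as train.py, add a call such as `run_training()` or `run_training(epochs=50, device=0)` at the end of the file.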

Start the Training

Once the dataset has been prepared and the required configuration files have been created, the training process can be started.

During training, Ultralytics YOLO:

  • Loads the dataset.
  • Trains the model.
  • Evaluates the model using the validation dataset.
  • Saves the generated model weights and training metrics.

The training can be started in one of the following ways.

Manual Training

If the validation dataset has already been prepared manually, execute the training script directly:

python train.py

Automated Training

Alternatively, the entire workflow can be automated using the provided train.sh script.

The script performs the following operations:

  1. Removes images without corresponding annotation files.
  1. Moves the configured percentage of images and annotations from train/ to val/.
  1. Executes train.py.
  1. Converts the generated best.pt model to the ONNX format by invoking convert_pt_to_onnx.py.

To use the automated workflow, the following files must be present in the training directory:

train.py
convert_pt_to_onnx.py

An example implementation of convert_pt_to_onnx.py is available on the YOLO to ONNX Conversion page.

An example implementation of train.sh is shown below:


#!/bin/bash

IMAGES_TRAINING_DIR=dataset/images/train
IMAGES_VALIDATION_DIR=dataset/images/val
LABELS_TRAINING_DIR=dataset/labels/train
LABELS_VALIDATION_DIR=dataset/labels/val
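# Fraction (0 to 1) of the training images to move to validation, e.g. 0.2 for 20%.
# With 0.0 no files are moved and the validation set must be prepared manually.
VALIDATION_RATE=0.0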

TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
LOG=logs/training-$TIMESTAMP.log
mkdir -p logs    # create the log directory if it does not exist yet

# Remove .npy cache files possibly left over from a previous run
rm -f "$IMAGES_TRAINING_DIR"/*.npy
rm -f "$IMAGES_VALIDATION_DIR"/*.npy



echo "$TIMESTAMP: Looking for unlabeled images in: $IMAGES_TRAINING_DIR" >> "$LOG"
for file in "$IMAGES_TRAINING_DIR"/*.*; do
  [ -e "$file" ] || continue                 # skip the unexpanded pattern if the directory is empty
  BASE_FILENAME="$(basename "$file")"
  BASE_FILENAME="${BASE_FILENAME%.*}"        # strip the extension

  if [ ! -f "$LABELS_TRAINING_DIR/$BASE_FILENAME.txt" ]; then
    TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
    echo "$TIMESTAMP: Image $file is not annotated, deleting it" >> "$LOG"
    rm -f "$file"
  fi
done

TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
echo "$TIMESTAMP: Looking for unlabeled images in: $IMAGES_VALIDATION_DIR" >> "$LOG"
for file in "$IMAGES_VALIDATION_DIR"/*.*; do
  [ -e "$file" ] || continue                 # skip the unexpanded pattern if the directory is empty
  BASE_FILENAME="$(basename "$file")"
  BASE_FILENAME="${BASE_FILENAME%.*}"        # strip the extension

  if [ ! -f "$LABELS_VALIDATION_DIR/$BASE_FILENAME.txt" ]; then
    TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
    echo "$TIMESTAMP: Image $file is not annotated, deleting it" >> "$LOG"
    rm -f "$file"
  fi
done

echo "$TIMESTAMP: Moving a fraction of $VALIDATION_RATE of the training images to validation" >> "$LOG"
total=$(find "$IMAGES_TRAINING_DIR" -maxdepth 1 -type f | wc -l)
# Dividing by 1 makes bc truncate to an integer (its default scale is 0),
# so a result such as ".6" becomes "0" instead of an empty string.
take=$(echo "$total * $VALIDATION_RATE / 1" | bc)
find "$IMAGES_TRAINING_DIR" -maxdepth 1 -type f | shuf | head -n "$take" | while IFS= read -r file; do
  FILENAME="$(basename "$file")"
  BASE_FILENAME="${FILENAME%.*}"             # strip the extension

  TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
  echo "$TIMESTAMP: Relocating file: $FILENAME" >> "$LOG"
  mv "$file" "$IMAGES_VALIDATION_DIR"
  mv "$LABELS_TRAINING_DIR/$BASE_FILENAME.txt" "$LABELS_VALIDATION_DIR"
done

TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
echo "$TIMESTAMP: Launching the training" >> $LOG
python train.py >> $LOG  2>&1
TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
echo "$TIMESTAMP: Training completed" >> $LOG


TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
echo "$TIMESTAMP: Launching the conversion" >> $LOG
python convert_pt_to_onnx.py >> $LOG 2>&1
TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
echo "$TIMESTAMP: Conversion completed" >> $LOG

Training Output

At the end of the training process, Ultralytics creates a runs/ directory containing the training artifacts.

Typical outputs include:

  • Training logs
  • Validation metrics
  • Loss curves
  • Confusion matrix
  • Model weights

Most importantly, Ultralytics saves two checkpoint files in the weights/ subdirectory of the training run:

best.pt
last.pt

They serve different purposes.


best.pt is the model you will usually want to use.

After each epoch, the model is evaluated on the validation dataset, and Ultralytics computes a performance metric (by default, a fitness score derived from metrics such as mAP).

Whenever an epoch achieves a better validation score than every previous epoch, Ultralytics overwrites best.pt with the weights from that epoch.

By contrast, last.pt always contains the weights from the most recent training epoch, regardless of whether that epoch achieved the best validation results. It is mainly useful for resuming an interrupted training run.
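Because each training run gets its own subdirectory under runs/, it can be handy to locate the most recent best.pt programmatically, for example before running the ONNX conversion. The following is a minimal sketch; the function name `latest_checkpoint` is hypothetical, and it assumes that checkpoints are stored in a weights/ subdirectory of each run.

```python
from pathlib import Path


def latest_checkpoint(runs_dir="runs", filename="best.pt"):
    """Return the most recently modified weights/<filename> under runs_dir, or None."""
    # rglob searches every run directory below runs_dir, at any depth.
    candidates = list(Path(runs_dir).rglob(f"weights/{filename}"))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)
```

Passing `filename="last.pt"` returns the latest checkpoint of the most recent run instead, which is the file needed to resume an interrupted training.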