YOLO Training Pipeline

From LogicalDOC Community Wiki
Revision as of 07:41, 26 June 2026 by Giuseppe (talk | contribs)

This guide describes an example workflow for preparing a dataset annotated with Label Studio, training a custom YOLO model on it, and preparing the trained model for use with LogicalDOC.

Please be aware that this procedure is not covered by the standard support contract. LogicalDOC cannot provide assistance with issues related to dataset preparation, training failures, model quality, GPU configuration, or third-party tools such as Label Studio, Ultralytics YOLO, or ONNX Runtime.

If you require professional assistance, please contact sales@logicaldoc.com to request a quotation for consulting services.



Licensing notice: Ultralytics YOLO is distributed under its own licensing terms. Commercial use requires a commercial license from Ultralytics. Before using YOLO in a commercial environment, review the licensing information available on the official Ultralytics website: https://www.ultralytics.com/



Prerequisites

Before starting, ensure that:

  • Label Studio has been installed and configured
  • A dataset has been annotated and exported
  • Python 3.10 or later is available
  • Ultralytics YOLO is installed


Expected Training Directory Structure

Before starting the training process, the training directory should be organized as follows:

training/
├── dataset/
│   ├── images/
│   │   ├── train/
│   │   └── val/
│   └── labels/
│       ├── train/
│       └── val/
├── data.yaml
├── synset.txt
├── notes.json
├── train.py
├── train.sh
└── convert_pt_to_onnx.py

Where:

  • dataset/images/train contains the images used for training.
  • dataset/images/val contains the images used for validation.
  • dataset/labels/train contains the annotation files corresponding to the training images.
  • dataset/labels/val contains the annotation files corresponding to the validation images.
  • data.yaml defines the dataset configuration used by Ultralytics YOLO.
  • synset.txt contains the list of class names used during annotation.
  • notes.json is generated by Label Studio and contains additional project metadata.
  • train.py defines the YOLO training configuration and launches the training process.
  • train.sh (optional) automates dataset preparation, training, and model conversion.
  • convert_pt_to_onnx.py converts the trained PyTorch model (`best.pt`) into the ONNX format required by LogicalDOC.


The following sections describe how to create this directory structure starting from the dataset exported by Label Studio.

Populate the Training Directory

Extract the Dataset

After annotating the documents in Label Studio (see Label Studio Guide), export the dataset and extract the downloaded archive to a local directory.

The extracted archive contains:

  • labels/ – YOLO annotation files (`.txt`)
  • classes.txt – class definitions
  • notes.json – Label Studio metadata
  • images/ – image files (may be empty depending on the selected export format)

[Figure: Content of the dataset exported by Label Studio]
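
The exported annotation files can be sanity-checked before they are moved into place. The following is a minimal sketch, assuming the standard YOLO detection label format (one object per line: a class index followed by the normalized x-center, y-center, width, and height). The function name check_label_file is hypothetical and not part of Label Studio or Ultralytics.

```python
from pathlib import Path


def check_label_file(label_path, num_classes):
    """Return a list of problems found in one YOLO detection label file."""
    problems = []
    lines = Path(label_path).read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()
        if len(fields) != 5:
            problems.append(f"line {line_no}: expected 5 fields, found {len(fields)}")
            continue
        try:
            class_id = int(fields[0])
            coords = [float(value) for value in fields[1:]]
        except ValueError:
            problems.append(f"line {line_no}: non-numeric value")
            continue
        if not 0 <= class_id < num_classes:
            problems.append(f"line {line_no}: class index {class_id} out of range")
        if any(not 0.0 <= value <= 1.0 for value in coords):
            problems.append(f"line {line_no}: coordinates not normalized to [0, 1]")
    return problems
```

Calling check_label_file on every file in labels/, with num_classes set to the number of lines in classes.txt, should return empty lists for a clean export.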

Prepare the Dataset Structure

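The training pipeline expects the dataset/ directory to be organized as follows, with data.yaml, synset.txt, and notes.json placed next to it in the training directory (see Expected Training Directory Structure above):

dataset/
├── images/
│   ├── train/
│   └── val/
└── labels/
    ├── train/
    └── val/
data.yaml
synset.txt
notes.json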

The following steps describe how to transform the exported dataset into the expected structure.

Populate the Dataset

  • Create the directories images/train/, images/val/, labels/train/, and labels/val/.
  • Move all annotation files (.txt) from labels/ to labels/train/.
  • Copy the corresponding images into images/train/.
  • Rename classes.txt to synset.txt. The pipeline requires the class list to be named synset.txt.
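
The steps above can also be scripted. The following is a minimal sketch, assuming the archive was extracted to a directory passed as export_dir and that synset.txt and notes.json belong next to dataset/ as in the training directory layout shown earlier. The function name populate_dataset and the list of image extensions are assumptions made for illustration.

```python
import shutil
from pathlib import Path

# Assumption: extensions treated as images; adjust to the actual dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}


def populate_dataset(export_dir, training_dir):
    """Arrange a Label Studio YOLO export into the expected layout."""
    export_dir, training_dir = Path(export_dir), Path(training_dir)
    dataset = training_dir / "dataset"

    # Create images/train, images/val, labels/train, and labels/val.
    for kind in ("images", "labels"):
        for split in ("train", "val"):
            (dataset / kind / split).mkdir(parents=True, exist_ok=True)

    # Move the annotation files into labels/train/.
    for label in sorted((export_dir / "labels").glob("*.txt")):
        shutil.move(str(label), str(dataset / "labels" / "train" / label.name))

    # Copy the exported images, if the export contains any, into images/train/.
    images_dir = export_dir / "images"
    if images_dir.is_dir():
        for image in sorted(images_dir.iterdir()):
            if image.suffix.lower() in IMAGE_SUFFIXES:
                shutil.copy2(image, dataset / "images" / "train" / image.name)

    # classes.txt becomes synset.txt; notes.json is copied unchanged.
    shutil.copy2(export_dir / "classes.txt", training_dir / "synset.txt")
    if (export_dir / "notes.json").is_file():
        shutil.copy2(export_dir / "notes.json", training_dir / "notes.json")
```

If the export contains no images, the original images still have to be copied into images/train/ by hand, as noted below.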

IMPORTANT

Depending on the selected export format, Label Studio may export only the annotation (.txt) files. In this case, the corresponding images must be copied manually from the original image directory into images/train/.

IMPORTANT

Verify that every image has a corresponding annotation file with the same filename.

Example:

images/train/invoice001.jpg
labels/train/invoice001.txt
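
This check is easy to automate. The sketch below compares file names without their extensions and reports both images without an annotation file and annotation files without an image. The function name find_mismatches is hypothetical, and every file in the image directory is treated as an image.

```python
from pathlib import Path


def find_mismatches(images_dir, labels_dir):
    """Return (images without a label file, label files without an image)."""
    image_stems = {p.stem for p in Path(images_dir).iterdir() if p.is_file()}
    label_stems = {p.stem for p in Path(labels_dir).glob("*.txt")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)
```

For example, find_mismatches("dataset/images/train", "dataset/labels/train") should return two empty lists when the training set is consistent.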

If the automated training script (train.sh) is used, the images/val/ and labels/val/ directories should initially remain empty.

During execution, the script automatically moves the configured percentage of images and annotation files from the training set into the validation set.
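
When train.sh is not used, the same split can be performed by hand or with a short script. The following is a sketch of one possible approach, not the script's actual implementation. The function name split_validation, the default rate of 0.2, and the fixed random seed are assumptions chosen for illustration.

```python
import random
import shutil
from pathlib import Path


def split_validation(dataset_dir, rate=0.2, seed=0):
    """Move a random fraction of image/label pairs from train to val."""
    dataset_dir = Path(dataset_dir)
    images_train = dataset_dir / "images" / "train"
    labels_train = dataset_dir / "labels" / "train"
    images_val = dataset_dir / "images" / "val"
    labels_val = dataset_dir / "labels" / "val"
    images_val.mkdir(parents=True, exist_ok=True)
    labels_val.mkdir(parents=True, exist_ok=True)

    # Only images that have a matching annotation file are candidates.
    candidates = sorted(
        p for p in images_train.iterdir()
        if p.is_file() and (labels_train / f"{p.stem}.txt").is_file()
    )
    take = int(len(candidates) * rate)  # the fractional part is truncated
    moved = random.Random(seed).sample(candidates, take)
    for image in moved:
        label = labels_train / f"{image.stem}.txt"
        shutil.move(str(image), str(images_val / image.name))
        shutil.move(str(label), str(labels_val / label.name))
    return [image.name for image in moved]
```

A fixed seed makes the split reproducible; passing a different seed produces a different validation set.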

Create the data.yaml File

YOLO uses a configuration file named `data.yaml` to locate the dataset and identify the available classes.

The file specifies:

  • Training image directory
  • Validation image directory
  • Number of classes
  • Class names

Example:

path: dataset

train: images/train
val: images/val

nc: 4     # number of classes

names:
  0: invoice_number
  1: date
  2: seller_name
  3: total

The class definitions must match those defined during annotation.
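
One way to keep the class definitions in sync is to generate data.yaml from the class list instead of typing it by hand. The sketch below assumes synset.txt contains one class name per line, in annotation order, and that the names are plain identifiers that need no YAML quoting. The function names read_class_names and build_data_yaml are hypothetical.

```python
from pathlib import Path


def read_class_names(synset_path):
    """Read one class name per line, skipping blank lines."""
    text = Path(synset_path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_data_yaml(class_names, dataset_path="dataset"):
    """Render the data.yaml text for the given ordered class names."""
    lines = [
        f"path: {dataset_path}",
        "",
        "train: images/train",
        "val: images/val",
        "",
        f"nc: {len(class_names)}",
        "",
        "names:",
    ]
    # The index of each name must match the class index in the label files.
    lines += [f"  {index}: {name}" for index, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"
```

Writing the result with Path("data.yaml").write_text(build_data_yaml(read_class_names("synset.txt"))) would reproduce the example above for the same four classes.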

Create the Python Training Script

Create a Python script (train.py) that uses the Ultralytics YOLO framework to train the model.

Training always starts from a pretrained YOLO model, such as:

  • yolo11n.pt
  • yolo11s.pt
  • yolo11m.pt
  • yolo11l.pt

The pretrained model does not need to be downloaded manually. Ultralytics automatically downloads it when the training script is executed.

A minimal training script is shown below:

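from ultralytics import YOLO

# Load the pretrained model (downloaded automatically on first use)
model = YOLO("yolo11n.pt")

# Train on the dataset described by data.yaml
model.train(
    data="data.yaml",
    epochs=100,
)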

A more advanced script may explicitly configure additional training parameters such as the optimizer, learning rate, batch size, image size, and device selection.

The parameters most commonly customized are:

  • epochs – number of training epochs
  • batch – number of images processed per batch
  • imgsz – input image size
  • device – CPU or GPU used for training
  • project – output directory
  • name – training run name

For a complete list of supported parameters, refer to the official Ultralytics documentation:

https://docs.ultralytics.com/modes/train/#train-settings
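
As an illustration, the parameters listed above could be collected in one place and passed to model.train(). This is a sketch only: every value shown is an assumption chosen for the example, not a recommendation, and the run name logicaldoc_fields is hypothetical. The Ultralytics import is placed inside main() so that the settings can be inspected on a machine where Ultralytics is not installed.

```python
def build_train_args():
    """Collect the training settings in one place; all values are illustrative."""
    return {
        "data": "data.yaml",          # dataset configuration file
        "epochs": 100,                # number of training epochs
        "batch": 16,                  # images processed per batch
        "imgsz": 640,                 # input image size in pixels
        "device": "cpu",              # for example "cpu", or 0 for the first GPU
        "project": "runs/detect",     # output directory
        "name": "logicaldoc_fields",  # hypothetical training run name
    }


def main():
    # Imported here so that build_train_args() works without Ultralytics.
    from ultralytics import YOLO

    model = YOLO("yolo11n.pt")
    model.train(**build_train_args())
```

A train.py based on this sketch would end with a call to main().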

Start the Training

Once the dataset has been prepared and the required configuration files have been created, the training process can be started.

During training, Ultralytics YOLO:

  • Loads the dataset.
  • Trains the model.
  • Evaluates the model using the validation dataset.
  • Saves the generated model weights and training metrics.

The training can be started in one of the following ways.

Manual Training

If the validation dataset has already been prepared manually, execute the training script directly:

python train.py

Automated Training

To automate the dataset preparation, model training, and ONNX conversion, add the following scripts to the training directory:

  • train.sh
  • convert_pt_to_onnx.py

The train.sh script automates the following tasks:

  1. Removes images that do not have a corresponding annotation file.
  2. Moves the configured percentage of images and annotations from the training dataset to the validation dataset.
  3. Executes train.py to train the model.
  4. Executes convert_pt_to_onnx.py to convert the generated best.pt model into the ONNX format required by LogicalDOC.

An example implementation of convert_pt_to_onnx.py is available in the YOLO to ONNX Conversion page.
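
For orientation only, the general shape of such a conversion is sketched below using the Ultralytics export mode. This is not the implementation from the YOLO to ONNX Conversion page, and it does not set any export options that LogicalDOC may require. The helper find_latest_best and the assumption that the checkpoint is located under runs/ are hypothetical.

```python
from pathlib import Path


def find_latest_best(runs_dir="runs"):
    """Return the most recently modified best.pt under runs_dir, or None."""
    candidates = list(Path(runs_dir).rglob("best.pt"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def convert_to_onnx(weights_path):
    """Export a trained checkpoint to ONNX with the Ultralytics export mode."""
    # Imported here so that find_latest_best() works without Ultralytics.
    from ultralytics import YOLO

    model = YOLO(str(weights_path))
    return model.export(format="onnx")
```

Picking the newest best.pt avoids hard-coding the run directory name, which changes from one training run to the next.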

An example implementation of train.sh is shown below:

#!/bin/bash

IMAGES_TRAINING_DIR=dataset/images/train
IMAGES_VALIDATION_DIR=dataset/images/val
LABELS_TRAINING_DIR=dataset/labels/train
LABELS_VALIDATION_DIR=dataset/labels/val
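# Fraction (0.0 to 1.0) of the training images to move to the validation set;
# with 0.0 no files are moved and the validation set stays as it is
VALIDATION_RATE=0.2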

TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
mkdir -p logs                                # make sure the log directory exists
LOG=logs/training-$TIMESTAMP.log

# Remove .npy files possibly left over from a previous run
rm -f dataset/images/train/*.npy
rm -f dataset/images/val/*.npy

# ----------------------------------------------------------------------------- 
# Clean the dataset
# ----------------------------------------------------------------------------- 

echo "$TIMESTAMP: Looking for unlabeled images in: $IMAGES_TRAINING_DIR" >> "$LOG"
for file in "$IMAGES_TRAINING_DIR"/*.*; do
  [ -f "$file" ] || continue                 # nothing to do if the directory is empty
  BASE_FILENAME="$(basename "$file")"
  BASE_FILENAME="${BASE_FILENAME%.*}"        # strip the extension

  if [ ! -f "$LABELS_TRAINING_DIR/$BASE_FILENAME.txt" ]; then
    TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
    echo "$TIMESTAMP: Image $file has no annotation file, deleting it" >> "$LOG"
    rm -f -- "$file"
  fi
done

TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
echo "$TIMESTAMP: Looking for unlabeled images in: $IMAGES_VALIDATION_DIR" >> "$LOG"
for file in "$IMAGES_VALIDATION_DIR"/*.*; do
  [ -f "$file" ] || continue                 # nothing to do if the directory is empty
  BASE_FILENAME="$(basename "$file")"
  BASE_FILENAME="${BASE_FILENAME%.*}"        # strip the extension

  if [ ! -f "$LABELS_VALIDATION_DIR/$BASE_FILENAME.txt" ]; then
    TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
    echo "$TIMESTAMP: Image $file has no annotation file, deleting it" >> "$LOG"
    rm -f -- "$file"
  fi
done

# ----------------------------------------------------------------------------- 
# Prepare the validation dataset
# ----------------------------------------------------------------------------- 

TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
echo "$TIMESTAMP: Moving a fraction of $VALIDATION_RATE of the training images to validation" >> "$LOG"
total=$(ls -1 "$IMAGES_TRAINING_DIR" | wc -l)
# Number of images to move, truncated to an integer (also when the result is below 1)
take=$(awk -v t="$total" -v r="$VALIDATION_RATE" 'BEGIN { printf "%d", t * r }')
ls -1 "$IMAGES_TRAINING_DIR" | shuf | head -n "$take" | while IFS= read -r file; do
  BASE_FILENAME="${file%.*}"                 # strip the extension

  TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
  echo "$TIMESTAMP: Relocating file: $file" >> "$LOG"
  mv -- "$IMAGES_TRAINING_DIR/$file" "$IMAGES_VALIDATION_DIR/"
  mv -- "$LABELS_TRAINING_DIR/$BASE_FILENAME.txt" "$LABELS_VALIDATION_DIR/"
done

# ----------------------------------------------------------------------------- 
# Train the YOLO model
# ----------------------------------------------------------------------------- 

TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
echo "$TIMESTAMP: Launching the training" >> $LOG
python train.py >> $LOG  2>&1
TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
echo "$TIMESTAMP: Training completed" >> $LOG

# ----------------------------------------------------------------------------- 
# Convert the trained model to ONNX
# ----------------------------------------------------------------------------- 

TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
echo "$TIMESTAMP: Launching the conversion" >> $LOG
python convert_pt_to_onnx.py >> $LOG 2>&1
TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
echo "$TIMESTAMP: Conversion completed" >> $LOG

Training Output

At the end of the training process, Ultralytics creates a runs/ directory containing the training artifacts.

Typical outputs include:

  • Training logs
  • Validation metrics
  • Loss curves
  • Confusion matrix
  • Model weights

Among the generated files, the most important are the following model checkpoints:

best.pt
last.pt

These files serve different purposes.

The best.pt file contains the model that achieved the highest validation performance during training.

After each epoch, Ultralytics evaluates the model on the validation dataset and computes a fitness score based on metrics such as mAP. Whenever the model achieves a better validation score than any previous epoch, the best.pt file is updated.

The last.pt file always contains the model from the final training epoch, regardless of its validation performance.

It is primarily intended for resuming training, while best.pt is the recommended model for validation, conversion to ONNX, and deployment in LogicalDOC.
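
To see which epoch produced the best validation result, the metrics saved with the run can be inspected. The sketch below assumes that the run directory contains a results.csv file with an epoch column and a metrics/mAP50-95(B) column, as written by recent Ultralytics versions for detection models. The column names may differ between versions, and the fitness score used to select best.pt may combine several metrics, so the result is only an approximation. The function name best_epoch is hypothetical.

```python
import csv
from pathlib import Path


def best_epoch(results_csv, metric="metrics/mAP50-95(B)"):
    """Return (epoch, value) for the row with the highest value of metric."""
    with Path(results_csv).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Column names and values may be padded with spaces; strip both.
        rows = [{key.strip(): value.strip() for key, value in row.items()}
                for row in reader]
    best = max(rows, key=lambda row: float(row[metric]))
    return int(float(best["epoch"])), float(best[metric])
```

If the reported epoch is far from the final one, later epochs did not improve on it, which is one reason to deploy best.pt rather than last.pt.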