YOLO Training Pipeline
This guide describes how to prepare a dataset and train a custom YOLO model using annotations created with Label Studio.
Prerequisites
Before starting, ensure that:
- Label Studio has been installed and configured
- A dataset has been annotated and exported
- Python 3.10 or later is available
- Ultralytics YOLO is installed
Expected Training Directory Structure
Before starting the training process, the training directory should be organized as follows:
training/
├── dataset/
│   ├── images/
│   │   ├── train/
│   │   └── val/
│   └── labels/
│       ├── train/
│       └── val/
├── data.yaml
├── synset.txt
├── notes.json
├── train.py
├── train.sh
└── convert_pt_to_onnx.py
Where:
- dataset/images/train contains the images used for training.
- dataset/images/val contains the images used for validation.
- dataset/labels/train contains the annotation files corresponding to the training images.
- dataset/labels/val contains the annotation files corresponding to the validation images.
- data.yaml defines the dataset configuration used by Ultralytics YOLO.
- synset.txt contains the list of class names used during annotation.
- notes.json is generated by Label Studio and contains additional project metadata.
- train.py defines the YOLO training configuration and launches the training process.
- train.sh (optional) automates dataset preparation, training, and model conversion.
- convert_pt_to_onnx.py converts the trained PyTorch model (`best.pt`) into the ONNX format required by LogicalDOC.
The following sections describe how to create this directory structure starting from the dataset exported by Label Studio.
Populate the Training Directory
Extract the Dataset
After annotating the documents in Label Studio (see Label Studio Guide), export the dataset and extract the downloaded archive to a local directory.
The extracted archive contains:
- labels/ – YOLO annotation files (`.txt`)
- classes.txt – class definitions
- notes.json – Label Studio metadata
- images/ – image files (may be empty depending on the selected export format)

Prepare the Dataset Structure
The training pipeline expects the following dataset structure:
dataset/
├── images/
│   ├── train/
│   └── val/
├── labels/
│   ├── train/
│   └── val/
├── synset.txt
├── notes.json
└── data.yaml
The following steps describe how to transform the exported dataset into the expected structure.
Populate the Dataset
- Move all annotation files (.txt) from labels/ to labels/train/.
- Copy the corresponding images into images/train/.
- Rename classes.txt to synset.txt.
- Create the empty directories images/val/ and labels/val/.
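The steps above can also be scripted. The following is a minimal sketch, not part of the provided pipeline: the function name `populate_dataset`, its parameters, and the list of image extensions are assumptions made for illustration. It assumes the original images live in a directory separate from the dataset.

```python
import shutil
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed image formats


def populate_dataset(export_dir: Path, image_source: Path, dataset_dir: Path) -> int:
    """Arrange a Label Studio YOLO export into the expected dataset layout.

    Returns the number of annotation files moved into labels/train/.
    """
    images_train = dataset_dir / "images" / "train"
    labels_train = dataset_dir / "labels" / "train"
    # Create train/ and the (initially empty) val/ directories.
    for sub in ("images/train", "images/val", "labels/train", "labels/val"):
        (dataset_dir / sub).mkdir(parents=True, exist_ok=True)

    # Index the original images by base name so each label can find its image.
    images_by_stem = {
        p.stem: p for p in image_source.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    }

    moved = 0
    for label in sorted((export_dir / "labels").glob("*.txt")):
        shutil.move(str(label), str(labels_train / label.name))
        moved += 1
        image = images_by_stem.get(label.stem)
        if image is not None:
            shutil.copy2(image, images_train / image.name)

    # classes.txt is renamed to synset.txt; notes.json keeps its name.
    for src_name, dst_name in (("classes.txt", "synset.txt"), ("notes.json", "notes.json")):
        src = export_dir / src_name
        dst = dataset_dir / dst_name
        if src.exists() and src.resolve() != dst.resolve():
            shutil.move(str(src), str(dst))
    return moved
```

For example, `populate_dataset(Path("export"), Path("originals"), Path("dataset"))` would process an export extracted to a hypothetical `export/` directory, taking the images from a hypothetical `originals/` directory.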
IMPORTANT
Depending on the selected export format, Label Studio may export only the annotation (.txt) files. In this case, the corresponding images must be copied manually from the original image directory into images/train/.
IMPORTANT
Verify that every image has a corresponding annotation file with the same filename.
Example:
images/train/invoice001.jpg
labels/train/invoice001.txt
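The pairing between images and annotation files can be checked before training. The sketch below is illustrative only: `find_unpaired` and the set of image extensions are assumed names and values, not part of Label Studio or Ultralytics.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed image formats


def find_unpaired(images_dir: Path, labels_dir: Path) -> tuple[list[str], list[str]]:
    """Return (images without a label, labels without an image), by base name."""
    image_stems = {
        p.stem for p in images_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    }
    label_stems = {p.stem for p in labels_dir.glob("*.txt")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)
```

Calling `find_unpaired(Path("dataset/images/train"), Path("dataset/labels/train"))` returns two lists of base names; both are empty when every image has a matching annotation file and vice versa.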
If the automated training script (train.sh) is used, the images/val/ and labels/val/ directories should initially remain empty.
During execution, the script automatically moves the configured percentage of images and annotation files from the training set into the validation set.
Create the data.yaml File
YOLO uses a configuration file named `data.yaml` to locate the dataset and identify the available classes.
The file specifies:
- Training image directory
- Validation image directory
- Number of classes
- Class names
Example:
path: dataset
train: images/train
val: images/val
nc: 4
names:
  0: invoice_number
  1: date
  2: seller_name
  3: total
The class definitions must match those defined during annotation.
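To keep the class list in sync with the annotation export, data.yaml can be generated from synset.txt. This is a sketch under two assumptions: the line order in synset.txt matches the class indices used in the annotation files, and the class names are plain words that need no YAML quoting. The function name `build_data_yaml` is hypothetical.

```python
from pathlib import Path


def build_data_yaml(synset_path: Path, dataset_path: str = "dataset") -> str:
    """Build the data.yaml text from the class names listed in synset.txt."""
    text = synset_path.read_text(encoding="utf-8")
    names = [line.strip() for line in text.splitlines() if line.strip()]
    lines = [
        f"path: {dataset_path}",
        "train: images/train",
        "val: images/val",
        f"nc: {len(names)}",
        "names:",
    ]
    # One indented "index: name" entry per class, in file order.
    lines += [f"  {index}: {name}" for index, name in enumerate(names)]
    return "\n".join(lines) + "\n"
```

The returned string can then be written out, for example with `Path("data.yaml").write_text(build_data_yaml(Path("synset.txt")), encoding="utf-8")`.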
Create the Python Training Script
Create a Python script (train.py) that uses the Ultralytics YOLO framework to train the model.
Training always starts from a pretrained YOLO model, such as:
- yolo11n.pt
- yolo11s.pt
- yolo11m.pt
- yolo11l.pt
The pretrained model does not need to be downloaded manually. Ultralytics automatically downloads it when the training script is executed.
A minimal training script is shown below:
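from ultralytics import YOLO

# Load the pretrained weights (downloaded automatically on first use)
model = YOLO("yolo11n.pt")

# Train on the dataset described by data.yaml
model.train(
    data="data.yaml",
    epochs=100,
)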
A more advanced script may explicitly configure additional training parameters such as the optimizer, learning rate, batch size, image size, and device selection.
The parameters most commonly customized are:
- epochs – number of training epochs
- batch – number of images processed per batch
- imgsz – input image size
- device – CPU or GPU used for training
- project – output directory
- name – training run name
For a complete list of supported parameters, refer to the official Ultralytics documentation:
https://docs.ultralytics.com/modes/train/#train-settings
Start the Training
Once the dataset has been prepared and the required configuration files have been created, the training process can be started.
During training, Ultralytics YOLO:
- Loads the dataset.
- Trains the model.
- Evaluates the model using the validation dataset.
- Saves the generated model weights and training metrics.
The training can be started in one of the following ways.
Manual Training
If the validation dataset has already been prepared manually, execute the training script directly:
python train.py
Automated Training
Alternatively, the entire workflow can be automated using the provided train.sh script.
The script performs the following operations:
- Removes images without corresponding annotation files.
- Moves the configured percentage of images and annotations from train/ to val/.
- Executes train.py.
- Converts the generated best.pt model to the ONNX format by invoking convert_pt_to_onnx.py.
To use the automated workflow, the following files must be present in the training directory:
training/
├── dataset/
│   ├── images/
│   ├── labels/
│   ├── data.yaml
│   ├── synset.txt
│   └── notes.json
├── train.py
├── train.sh
└── convert_pt_to_onnx.py
An example implementation of convert_pt_to_onnx.py is available in the YOLO to ONNX Conversion page.
An example implementation of train.sh is shown below:
#!/bin/bash
IMAGES_TRAINING_DIR=dataset/images/train
IMAGES_VALIDATION_DIR=dataset/images/val
LABELS_TRAINING_DIR=dataset/labels/train
LABELS_VALIDATION_DIR=dataset/labels/val
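# Fraction (0.0-1.0) of the training images to move to the validation set; 0.0 moves none
VALIDATION_RATE=0.0
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
LOG=logs/training-$TIMESTAMP.log
# Make sure the log directory exists before the first write
mkdir -p logs
# Remove .npy files possibly left over from a previous run
rm -f dataset/images/train/*.npy
rm -f dataset/images/val/*.npy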
echo "$TIMESTAMP: Looking for unlabeled images in: $IMAGES_TRAINING_DIR" >> $LOG
for file in "$IMAGES_TRAINING_DIR"/*.*; do
  [ -e "$file" ] || continue # skip the literal pattern when the directory is empty
  BASE_FILENAME="$(basename "$file")"
  BASE_FILENAME="${BASE_FILENAME%.*}" # strip the extension
  if [ ! -f "$LABELS_TRAINING_DIR/$BASE_FILENAME.txt" ]; then
    TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
    echo "$TIMESTAMP: Image $file is not annotated, deleting it" >> $LOG
    rm -f "$file"
  fi
done
TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
echo "$TIMESTAMP: Looking for unlabeled images in: $IMAGES_VALIDATION_DIR" >> $LOG
for file in "$IMAGES_VALIDATION_DIR"/*.*; do
  [ -e "$file" ] || continue # skip the literal pattern when the directory is empty
  BASE_FILENAME="$(basename "$file")"
  BASE_FILENAME="${BASE_FILENAME%.*}" # strip the extension
  if [ ! -f "$LABELS_VALIDATION_DIR/$BASE_FILENAME.txt" ]; then
    TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
    echo "$TIMESTAMP: Image $file is not annotated, deleting it" >> $LOG
    rm -f "$file"
  fi
done
echo "$TIMESTAMP: Moving a $VALIDATION_RATE fraction of the training images to validation" >> $LOG
total=$(ls -1 $IMAGES_TRAINING_DIR | wc -l)
take=$(echo "$total * $VALIDATION_RATE" | bc | cut -d. -f1)
take=${take:-0} # bc prints results below 1 without a leading zero (e.g. ".6"), which leaves this empty
# Read one filename per line so that names containing spaces are handled correctly
ls -1 $IMAGES_TRAINING_DIR | shuf | head -n "$take" | while IFS= read -r file; do
  BASE_FILENAME="$(basename "$file")"
  BASE_FILENAME="${BASE_FILENAME%.*}" # strip the extension
  TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
  echo "$TIMESTAMP: Relocating file: $file" >> $LOG
  mv "$IMAGES_TRAINING_DIR/$file" "$IMAGES_VALIDATION_DIR"
  mv "$LABELS_TRAINING_DIR/$BASE_FILENAME.txt" "$LABELS_VALIDATION_DIR"
done
TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
echo "$TIMESTAMP: Launching the training" >> $LOG
python train.py >> $LOG 2>&1
TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
echo "$TIMESTAMP: Training completed" >> $LOG
TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
echo "$TIMESTAMP: Launching the conversion" >> $LOG
python convert_pt_to_onnx.py >> $LOG 2>&1
TIMESTAMP=$(date +"%Y%m%d_%H%M%S.%3N")
echo "$TIMESTAMP: Conversion completed" >> $LOG
Training Output
At the end of the training process, Ultralytics creates a runs/ directory containing the training artifacts.
Typical outputs include:
- Training logs
- Validation metrics
- Loss curves
- Confusion matrix
- Model weights
Most importantly, the framework typically saves two checkpoint files:
- best.pt
- last.pt
The two files serve different purposes.
best.pt is the model you will usually want to use.
During training, after each epoch, the model is evaluated on the validation dataset. Ultralytics monitors a performance metric (by default, a fitness score derived from metrics such as mAP).
Whenever the model achieves a better validation score than in any previous epoch, Ultralytics overwrites best.pt with the current weights.
In contrast, last.pt always contains the model after the final training epoch, regardless of whether it achieved the best validation results.
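The difference between the two checkpoints can be illustrated with a small simulation. This is a simplified model of the selection rule described above, not Ultralytics code; the function name `checkpoint_epochs` and the fitness values in the usage note are made up for illustration.

```python
def checkpoint_epochs(fitness_per_epoch: list[float]) -> tuple[int, int]:
    """Return the 0-based (best, last) epoch indices that best.pt and last.pt would hold."""
    if not fitness_per_epoch:
        raise ValueError("at least one epoch is required")
    best_epoch = 0
    for epoch, fitness in enumerate(fitness_per_epoch):
        # In this model, best.pt is overwritten only on a strict improvement.
        if fitness > fitness_per_epoch[best_epoch]:
            best_epoch = epoch
    last_epoch = len(fitness_per_epoch) - 1
    return best_epoch, last_epoch
```

With the hypothetical scores `[0.41, 0.58, 0.63, 0.60]`, the function returns `(2, 3)`: best.pt would hold the weights from the third epoch, while last.pt would hold those from the fourth, slightly worse, epoch.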