This week I mainly want to analyze and understand the training script I had AI write last week. Since I already have some Python foundation, reading and understanding this script shouldn't be too difficult for me (hopefully). I think this step is crucial for what comes next; after all, the core technical challenge of this project is training the model I want. PyTorch isn't as hard to use as I imagined, since in practice it's used as a Python package—though installing CUDA and all the other setup steps beforehand was still a lot of work.

train.py

Assignment

The whole training script can be divided into five layers, as shown in the figure below. Next, I will go through my understanding and what I learned at each step, one by one.

image.png

DATA

What the DATA step does is actually very simple. The data I downloaded from my web drawing program consists of JSON files and PNG screenshots. This step converts these two kinds of files into a format the model can recognize, and pairs them up.

Pairing

input → step_0000.png (current canvas state)
label → step_0001.json (the first point + color)

transforms.Resize((128, 128))
transforms.ToTensor()
transforms.Normalize(mean=[0.5], std=[0.5])

"#ff4444" → hex_to_rgb_normalized() → (1.0, 0.267, 0.267)

The image is converted to grayscale (.convert('L')), compressing RGB from 3 channels down to 1. The reason is that color information already lives in the label; for the canvas itself, grayscale is enough to represent shape, and it also simplifies the model input. transforms comes from torchvision, PyTorch's companion vision library. Using it, the image is resized to 128×128 and then converted to tensor format, which maps pixel values into a 0.0–1.0 range the model can read. The final Normalize(mean=[0.5], std=[0.5]) step then shifts that range to -1.0–1.0, centering the input around zero.

Why normalize to 0–1?

Because the model’s output layer uses Sigmoid, which by its nature can only output numbers between 0 and 1. When the labels and the output share the same range, MSELoss can compute a meaningful error between them.
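The range argument can be checked in a few lines. This is a toy sketch, not code from the training script; the raw values are arbitrary pre-activation numbers I made up:

```python
import torch
import torch.nn as nn

raw = torch.tensor([[-3.0, 0.0, 5.0]])   # arbitrary pre-activation outputs
out = torch.sigmoid(raw)                  # Sigmoid squashes everything into (0, 1)
label = torch.tensor([[0.0, 0.5, 1.0]])   # labels normalized to the same 0-1 range

assert out.min() > 0.0 and out.max() < 1.0
loss = nn.MSELoss()(out, label)           # error is well-defined because ranges match
```

If the labels lived in a different range (say raw pixel values 0–255), the Sigmoid output could never reach them and the loss would be dominated by that scale mismatch instead of the actual prediction error.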

Loading

DataLoader(dataset, batch_size=4, shuffle=True)

The Loading layer has only one job: DataLoader. It solves this problem: the Dataset can already serve data one sample at a time, but feeding the model one sample at a time is inefficient. DataLoader packages samples into batches and hands them to the model. Why do we need batches? Partly because of GPU memory: the larger the VRAM, the more data can be processed at once, so training is faster. At the same time, if gradients are updated after every single sample, the update direction is noisy; averaging over a batch smooths it out. Batch size is essentially a balance between update frequency and direction accuracy. My GPU has 16 GB of VRAM, so in theory I could increase the batch size, but I still don’t fully understand this part.
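The batching behavior is easy to see with stand-in data. This sketch uses random tensors shaped like my real data (a 1×128×128 grayscale canvas and a 5-value label), not the actual dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in: 10 samples shaped like the real data
images = torch.rand(10, 1, 128, 128)   # 1-channel 128x128 canvases
labels = torch.rand(10, 5)             # x, y, r, g, b

dataset = TensorDataset(images, labels)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

for batch_images, batch_labels in loader:
    print(batch_images.shape, batch_labels.shape)
# 10 samples with batch_size=4 -> batches of 4, 4, and a final partial batch of 2
```

shuffle=True reorders the samples each epoch but does not change the batch sizes; the model then sees a [4, 1, 128, 128] tensor per step instead of a single image.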