by Chum Mapa
Introduction
I used to work for a manufacturer where one part of the business was temperature control. The controllers we made came in a wide variety, but they essentially served the same purpose: modulate the heating element in a process to maintain a temperature set point entered by the user.
One popular and effective method of reaching and maintaining a temperature of a manufacturing process is PID control.
Proportional-integral-derivative (PID) controllers are the most commonly used feedback control mechanism in industrial applications due to their simplicity, functionality, and broad applicability. More than 90% of industrial controllers are PID-based, with applications ranging from self-driving cars and robotics to heat-treating processes.
The controller in our basic example adjusts its output to the heating element variably, based on the temperature of the system. The PID method of control allows you to quickly reach your set point (if tuned correctly) without the danger of overshooting, and it can maintain that temperature within a fraction of a degree if needed. With PID, the control is based on the individual tuning of three values:
- Proportional band: increases or decreases the output proportionally to the process temperature’s deviation from the set point.
- Integral time: eliminates undershoot and overshoot of the set point by adjusting the proportioning control based on the amount of deviation from the set point during steady state operation.
- Derivative time: eliminates undershoot and overshoot by adjusting the proportioning control based on the rate of rise or fall of the process temperature.
In more general terms, the control should be proportional to the current error (the difference between system output and desired output), the integral of the past error over time, and the derivative of the error, which represents future trend.
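The three terms above can be sketched as a minimal discrete-time PID loop. The gains, time step, and toy "oven" model below are illustrative assumptions of mine, not values from any real controller:

```python
def pid_step(error, state, kp, ki, kd, dt):
    """One PID update: returns (control output, updated state)."""
    integral = state["integral"] + error * dt         # accumulated past error (I)
    derivative = (error - state["prev_error"]) / dt   # rate of change of error (D)
    output = kp * error + ki * integral + kd * derivative
    return output, {"integral": integral, "prev_error": error}

# Toy example: drive a crude first-order "oven" toward a 100-degree set point.
setpoint, temp = 100.0, 20.0
state = {"integral": 0.0, "prev_error": 0.0}
for _ in range(500):
    error = setpoint - temp
    power, state = pid_step(error, state, kp=2.0, ki=0.5, kd=0.1, dt=0.1)
    temp += 0.1 * (power - 0.05 * (temp - 20.0))  # heat added minus heat lost
```

Note how the integral term is what holds the oven at the set point once the proportional error shrinks to zero, while the derivative term damps the approach.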
These help the unit automatically compensate for changes in a control system.
While learning about neural networks in machine learning, I noticed some similarities between deep learning and manufacturing process control. Optimizing a neural network typically involves an algorithm that updates the weights by considering their past and current gradients (gradient descent).
Gradient descent is one of the most common optimization methods employed in machine learning. The basic concept is that you choose a starting point, calculate the gradient of your loss function, and take a step in the opposite direction. You then recalculate the gradient at your new location, take another step, and repeat until convergence. The gradient is essentially a multidimensional slope, and it points in the direction of steepest ascent. How you calculate the gradient and what size step to take varies, and there is a wide range of gradient descent algorithms that approach this in different ways.
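The loop described above can be written in a few lines. The loss function, starting point, and learning rate here are illustrative choices, not from any particular training setup:

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step opposite the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)  # move against the slope
    return x

# Toy loss f(x) = (x - 3)^2 has its minimum at x = 3; its gradient is 2(x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

In a real network, `x` is a vector of thousands to billions of weights and the gradient comes from back-propagation, but the update rule is the same idea.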
These gradient descent algorithms, while effective, can still suffer from a potential overshoot problem, which hinders the convergence of network training. Could applying a PID control method to network training be an effective way to reduce the possibility of overshooting? This method would not only use past and current gradients, but also the change of gradients to update the network parameters.
This was the concept behind a study done by the Shenzhen Institute of Future Media Technology, in conjunction with students from Stanford, Hong Kong Polytechnic, and Tsinghua University.
Why studies like these are important:
The training of deep networks on large-scale datasets is usually computationally expensive, taking several days or even weeks. It is increasingly important to investigate how to accelerate the training speed of deep models without sacrificing the accuracy, which can save time and memory cost. The optimizer of neural networks is the key component of training, as it defines how the thousands/millions/billions of deep model parameters are updated.
How PID is implemented:
Deep network optimization shares a high degree of similarity with PID-based control. Both update the system/network based on the difference/loss between the actual output and the desired output. Also, the feedback in PID control corresponds to back-propagation in network optimization.
The major difference is that the PID controller computes its update using the system error, while deep network optimizers determine their updates based on the gradient. Adding a derivative (change of gradient) term to SGD-Momentum is how the PID optimizer quickly updates the weights in the network to minimize overshoot.
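A simplified sketch of that idea is below: a plain SGD-Momentum update extended with a smoothed change-of-gradient term. This is my own illustrative sketch in the spirit of the study's approach, not its exact algorithm, and the hyperparameter names (`lr`, `momentum`, `kd`) are assumptions:

```python
def pid_optimizer_step(param, grad, state, lr=0.01, momentum=0.9, kd=0.5):
    """One update using past (I), current (P), and changing (D) gradients."""
    # I-like term: exponentially decayed sum of past gradients (momentum buffer)
    state["v"] = momentum * state["v"] + grad
    # D-like term: smoothed difference between consecutive gradients
    state["d"] = momentum * state["d"] + (1 - momentum) * (grad - state["prev_grad"])
    state["prev_grad"] = grad
    # The derivative term damps the momentum buffer's tendency to overshoot
    return param - lr * (state["v"] + kd * state["d"])

# Toy run on f(w) = w^2 (gradient 2w), starting from w = 5.
state = {"v": 0.0, "d": 0.0, "prev_grad": 0.0}
w = 5.0
for _ in range(300):
    w = pid_optimizer_step(w, 2 * w, state)
```

Momentum alone tends to oscillate around the minimum before settling; the derivative term reacts to the gradient reversing sign and pulls the update back sooner.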
Conclusion:
The PID method proposed in the study referenced above substantially reduced the overshoot phenomenon of stochastic gradient descent, and achieved up to 50% acceleration on popular deep network architectures with competitive accuracy.