
FIELD | Computational Sciences |
---|---|

DATE | January 25 (Tue), 2022 |

TIME | 10:30-13:00 |

PLACE | 7323 |

SPEAKER | Lee, Hyun Gyu |

HOST | Lee, Hyun Gyu |

INSTITUTE | KIAS |

TITLE | [GS_C_DS] Group seminar / Lee, Hyun Gyu |

ABSTRACT | Stochastic gradient descent (SGD) is a widely used optimization method in the machine learning community, and its variants are still used extensively in deep neural network applications. Its ability to find minima of the cost function is closely connected to the size of the fluctuations that arise from the random choice of mini-batches (as opposed to full-batch training, which uses the entire training dataset). However, the theoretical understanding of these fluctuations is still developing, and much remains to be explored in the hope of improving the performance of deep learning models. In this seminar, we discuss two seemingly distant topics that involve SGD. 1) The first explores the striking similarity between the equation for learning in a deep (linear) neural network and the Langevin equation for particles subject to a dissipative force and an uncorrelated external random force. Assuming that the mini-batch noise in the neural network is also temporally uncorrelated, the authors derive a fluctuation-dissipation theorem (FDT)-like identity, which they test on several popular datasets. 2) The second discusses the phenomenon in which test set accuracy falls when a large batch size is used instead of small batches. Large batches are preferable to small batches in terms of computational performance because they take advantage of parallel computation. Here we explore the origin and properties of this "gap" between the test set accuracies of small and large batches, and see that the "noise scale" that controls the gap is a function of batch size and learning rate. Accordingly, one can achieve the effect of annealing (i.e., decaying the learning rate) in deep learning by increasing the batch size. References: Han et al., Fluctuation-dissipation-type theorem in stochastic linear learning (PRE, 2021). Keskar et al., On large-batch training for deep learning: Generalization gap and sharp minima (ICLR, 2017). Hoffer et al., Train longer, generalize better: closing the generalization gap in large batch training of neural networks (NIPS, 2017). Smith et al., Don't decay the learning rate, increase the batch size (ICLR, 2018). Smith & Le, A Bayesian perspective on generalization and stochastic gradient descent (ICLR, 2018). |

FILE | 152661643083881673_1.pdf |
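
As a rough illustration of the second topic in the abstract, the sketch below compares two training schedules by their SGD noise scale. It assumes the form g = lr * (N / B - 1) given in Smith & Le (approximately lr * N / B when B is much smaller than N). The function name `noise_scale`, the dataset size, and both schedules are hypothetical values chosen for illustration, not taken from the seminar.

```python
def noise_scale(learning_rate: float, batch_size: int, dataset_size: int) -> float:
    """SGD noise scale g = lr * (N / B - 1), assumed form from Smith & Le."""
    if not 0 < batch_size <= dataset_size:
        raise ValueError("batch_size must be in (0, dataset_size]")
    return learning_rate * (dataset_size / batch_size - 1.0)


N = 50_000  # hypothetical training-set size

# Schedule A: fixed batch size, learning rate decayed 5x per stage.
decay_lr = [(0.1, 128), (0.02, 128), (0.004, 128)]
# Schedule B: fixed learning rate, batch size increased 5x per stage.
grow_batch = [(0.1, 128), (0.1, 640), (0.1, 3200)]

for (lr_a, b_a), (lr_b, b_b) in zip(decay_lr, grow_batch):
    g_a = noise_scale(lr_a, b_a, N)
    g_b = noise_scale(lr_b, b_b, N)
    print(f"decay lr: g={g_a:.3f}   grow batch: g={g_b:.3f}")
```

Under this assumed formula the two schedules follow nearly the same noise-scale trajectory, which is the sense in which increasing the batch size can stand in for learning-rate annealing. The match is only approximate, because the "- 1" term matters more as B approaches N.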