This paper was accepted at the workshop "I Can't Believe It's Not Better: Understanding Deep Learning Through Empirical Falsification".
Continuous pseudo-labeling (PL) algorithms such as slimIPL have recently emerged as a powerful strategy for semi-supervised learning in speech recognition. In contrast with earlier strategies that alternated between training a model and generating pseudo-labels (PLs) with it, here PLs are generated end-to-end as training proceeds, improving training speed and the accuracy of the final model. PL shares a common theme with teacher-student models such as distillation, in that a teacher model generates targets that must be mimicked by the student model being trained. However, interestingly, PL strategies generally use hard-labels, whereas distillation uses the distribution over labels as the target to mimic. Inspired by distillation, we expect that specifying the whole distribution (aka soft-labels) over sequences as the target for unlabeled data, instead of a single best-path pseudo-labeled transcript (hard-labels), should improve PL performance and convergence. Surprisingly, we find that soft-label targets can lead to training divergence, with the model collapsing to a degenerate token distribution per frame. We hypothesize that the reason this does not happen with hard-labels is that the training loss on hard-labels imposes sequence-level consistency that keeps the model from collapsing to the degenerate solution. In this paper, we present several experiments that support this hypothesis, and experiment with several regularization approaches that can ameliorate the degenerate collapse when using soft-labels. These approaches can bring the accuracy of soft-labels closer to that of hard-labels, and while they are unable to outperform them yet, they serve as a useful framework for further improvements.
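To make the hard- vs soft-label distinction concrete, the sketch below (not the authors' code) contrasts the two kinds of pseudo-label targets for a CTC-style ASR model. The tensor names, the blank-token id, and the use of a per-frame KL divergence for the soft-label variant are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: hard-label vs soft-label pseudo-label targets.
# `teacher_logits` is assumed to be the frame-level output (T, vocab) of the
# teacher copy of the model used in continuous PL; blank id is assumed to be 0.
import torch
import torch.nn.functional as F


def hard_pseudo_labels(teacher_logits: torch.Tensor) -> torch.Tensor:
    """Greedy best-path transcript: argmax per frame, collapse repeats,
    drop blanks. The resulting token sequence is used as an ordinary
    (hard) CTC target for the student."""
    frame_ids = teacher_logits.argmax(dim=-1)        # (T,)
    collapsed = torch.unique_consecutive(frame_ids)  # merge repeated frames
    return collapsed[collapsed != 0]                 # remove blank tokens


def soft_label_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    temperature: float = 1.0) -> torch.Tensor:
    """Distillation-style target: match the teacher's full per-frame
    distribution with a KL divergence. This is the soft-label variant
    that the paper observes can collapse to a degenerate per-frame
    token distribution."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_logp = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean")
```

The key contrast: the hard-label path first commits to one transcript, so the student's CTC loss is constrained at the sequence level, whereas the soft-label path only matches per-frame distributions, which (per the paper's hypothesis) removes the sequence-level consistency that prevents collapse.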