Diffusion models have recently shown great promise for generative modeling, outperforming GANs on perceptual quality and autoregressive models at density estimation. A remaining downside is their slow sampling time: generating high quality samples takes many hundreds or thousands of model evaluations. Here we make two contributions to help eliminate this downside: First, we present new parameterizations of diffusion models that provide increased stability when using few sampling steps. Second, we present a method to distill a trained deterministic diffusion sampler, using many steps, into a new diffusion model that takes half as many sampling steps. We then keep progressively applying this distillation procedure to our model, halving the number of required sampling steps each time. On standard image generation benchmarks like CIFAR-10, ImageNet, and LSUN, we start out with state-of-the-art samplers taking as many as 8192 steps, and are able to distill down to models taking as few as 4 steps without losing much perceptual quality; achieving, for example, a FID of 3.0 on CIFAR-10 in 4 steps. Finally, we show that the full progressive distillation procedure does not take more time than it takes to train the original model, thus representing an efficient solution for generative modeling using diffusion at both train and test time.
The paper makes two main contributions:
- New parameterizations of diffusion models that remain stable when sampling with only a few steps.
- A knowledge-distillation procedure that repeatedly converts a sampler requiring many iterations into one requiring far fewer.
Diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) are an emerging class of generative models that has recently delivered impressive results on many standard generative modeling benchmarks. These models have achieved ImageNet generation results outperforming BigGAN-deep and VQ-VAE-2 in terms of FID score and classification accuracy score (Ho et al., 2021; Dhariwal & Nichol, 2021), and they have achieved likelihoods outperforming autoregressive image models (Kingma et al., 2021; Song et al., 2021b). They have also succeeded in image super-resolution (Saharia et al., 2021; Li et al., 2021) and image inpainting (Song et al., 2021c), and there have been promising results in shape generation (Cai et al., 2020), graph generation (Niu et al., 2020), and text generation (Hoogeboom et al., 2021; Austin et al., 2021).
A major barrier remains to practical adoption of diffusion models: sampling speed. While sampling can be accomplished in relatively few steps in strongly conditioned settings, such as text-to-speech (Chen et al., 2021) and image super-resolution (Saharia et al., 2021), or when guiding the sampler using an auxiliary classifier (Dhariwal & Nichol, 2021), the situation is substantially different in settings in which there is less conditioning information available. Examples of such settings are unconditional and standard class-conditional image generation, which currently require hundreds or thousands of steps using network evaluations that are not amenable to the caching optimizations of other types of generative models (Ramachandran et al., 2017).
In this paper, we reduce the sampling time of diffusion models by orders of magnitude in unconditional and class-conditional image generation, which represent the setting in which diffusion models have been slowest in previous work. We present a procedure to distill the behavior of an N-step DDIM sampler (Song et al., 2021a) for a pretrained diffusion model into a new model with N/2 steps, with little degradation in sample quality. In what we call progressive distillation, we repeat this distillation procedure to produce models that generate in as few as 4 steps, still maintaining sample quality competitive with state-of-the-art models using thousands of steps.
Figure 1: A visualization of two iterations of our proposed progressive distillation algorithm. A sampler f(z;η), mapping random noise ε to samples x in 4 deterministic steps, is distilled into a new sampler f(z;θ) taking only a single step. The original sampler is derived by approximately integrating the probability flow ODE for a learned diffusion model, and distillation can thus be understood as learning to integrate in fewer steps, or amortizing this integration into the new sampler.
To make diffusion models more efficient at sampling time, we propose progressive distillation: an algorithm that iteratively halves the number of required sampling steps by distilling a slow teacher diffusion model into a faster student model. Our implementation of progressive distillation stays very close to the implementation for training the original diffusion model, as described by e.g. Ho et al. (2020). Algorithm 1 and Algorithm 2 present diffusion model training and progressive distillation side-by-side, with the relative changes in progressive distillation highlighted in green.
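To make the iteration structure concrete, the following is a minimal Python sketch of the outer distillation loop (my own illustration, not the paper's code): the student starts as a copy of the teacher, is trained so that one of its DDIM steps matches two teacher DDIM steps, and then becomes the teacher for the next round with half as many steps. The names `train_student_one_round` and `num_halvings` are placeholders for the actual training routine and schedule.

```python
import copy

def progressive_distillation(teacher, teacher_steps, num_halvings, train_student_one_round):
    """Outer loop of progressive distillation (sketch).

    teacher:        a diffusion model trained in the standard way
    teacher_steps:  number of DDIM steps the current teacher is sampled with
    num_halvings:   how many halvings to perform (e.g. 11 to go from 8192 to 4 steps)
    train_student_one_round(student, teacher, N): hypothetical routine that trains
        the student so that a single student DDIM step matches two teacher DDIM
        steps, with N being the number of student sampling steps
    """
    N = teacher_steps
    for _ in range(num_halvings):
        student = copy.deepcopy(teacher)          # same parameters, same model definition
        N = N // 2                                # student takes half as many steps
        train_student_one_round(student, teacher, N)
        teacher = student                         # the student becomes the next teacher
    return teacher, N
```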
We start the progressive distillation procedure with a teacher diffusion model that is obtained by training in the standard way. At every iteration of progressive distillation, we then initialize the student model with a copy of the teacher, using both the same parameters and same model definition. Like in standard training, we then sample data from the training set and add noise to it, before forming the training loss by applying the student denoising model to this noisy data z_t. The main difference in progressive distillation is in how we set the target for the denoising model: instead of the original data x, we have the student model denoise towards a target x̃ that makes a single student DDIM step match 2 teacher DDIM steps. We calculate this target value by running 2 DDIM sampling steps using the teacher, starting from z_t and ending at z_{t-1/N}, with N being the number of student sampling steps. By inverting a single step of DDIM, we then calculate the value the student model would need to predict in order to move from z_t to z_{t-1/N} in a single step, as we show in detail in Appendix G. The resulting target value x̃(z_t) is fully determined given the teacher model and starting point z_t, which allows the student model to make a sharp prediction when evaluated at z_t. In contrast, the original data point x is not fully determined given z_t, since multiple different data points x can produce the same noisy data z_t: this means that the original denoising model is predicting a weighted average of possible x values, which produces a blurry prediction. By making sharper predictions, the student model can make faster progress during sampling.
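The target x̃(z_t) has a closed form. Below is a small framework-agnostic Python sketch of this computation, under the assumption that the model is parameterized to predict the clean data x (the paper also discusses other parameterizations); `teacher(z, t)` and `schedule(t)` are placeholders for the teacher's x-prediction and for the noise schedule returning the signal scale α_t and noise scale σ_t.

```python
def ddim_step(x_pred, z_t, a_t, s_t, a_s, s_s):
    """One deterministic DDIM step from time t to an earlier time s,
    given the model's prediction x_pred of the clean data at (z_t, t)."""
    eps_pred = (z_t - a_t * x_pred) / s_t      # noise implied by the x-prediction
    return a_s * x_pred + s_s * eps_pred       # z_s

def distillation_target(teacher, z_t, t, N, schedule):
    """Target x~(z_t): the value a single student DDIM step from t to t - 1/N
    must predict in order to land where two teacher DDIM steps
    (t -> t - 0.5/N -> t - 1/N) land. N is the number of student steps."""
    t_mid, t_next = t - 0.5 / N, t - 1.0 / N
    a_t, s_t = schedule(t)
    a_mid, s_mid = schedule(t_mid)
    a_next, s_next = schedule(t_next)

    # two DDIM sampling steps with the teacher
    z_mid = ddim_step(teacher(z_t, t), z_t, a_t, s_t, a_mid, s_mid)
    z_next = ddim_step(teacher(z_mid, t_mid), z_mid, a_mid, s_mid, a_next, s_next)

    # invert a single student DDIM step,
    #   z_next = a_next * x~ + (s_next / s_t) * (z_t - a_t * x~),
    # and solve for x~:
    return (z_next - (s_next / s_t) * z_t) / (a_next - (s_next / s_t) * a_t)
```

With this target, one student step from z_t lands exactly where two teacher steps would, which is what allows the step count to be halved without (ideally) changing the sampler's output.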
Unlike our procedure for training the original model, we always run progressive distillation in discrete time: we sample this discrete time such that the highest time index corresponds to a signal-to-noise ratio of zero, i.e. α_1 = 0, which exactly matches the distribution of input noise z_1 ∼ N(0, I) that is used at test time. We found this to work slightly better than starting from a non-zero signal-to-noise ratio as used by e.g. Ho et al. (2020), both for training the original model as well as when performing progressive distillation.
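As an illustration of what "the highest time index corresponds to a signal-to-noise ratio of zero" means in practice, here is a short sketch assuming the cosine schedule α_t = cos(½πt), σ_t = sin(½πt), which is one choice consistent with α_1 = 0; the exact schedule is a modeling choice and is not fixed by the paragraph above.

```python
import math
import random

def cosine_schedule(t):
    """Signal scale alpha_t and noise scale sigma_t for t in (0, 1].
    At t = 1, alpha_1 = cos(pi/2) = 0, so z_1 ~ N(0, I) matches the
    pure-noise input used at test time."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)

def sample_discrete_time(N):
    """Sample one of N discrete time points i/N, i = 1..N, so that the
    highest index corresponds to a signal-to-noise ratio of zero."""
    i = random.randint(1, N)
    return i / N
```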