Customizing SVI objectives and training loops
Pyro provides support for various optimization-based approaches to Bayesian inference, with Trace_ELBO serving as the basic implementation of SVI (stochastic variational inference). See the docs for more information on the various SVI implementations and SVI tutorials I, II, and III for background on SVI.
In this tutorial we show how advanced users can modify and/or augment the variational objectives (alternatively: loss functions) and the training step implementation provided by Pyro to support special use cases.

- Basic SVI Usage
- A Lower-Level Pattern
- Example: Custom Regularizer
- Example: Scaling the Loss
- Example: Beta VAE
- Example: Mixing Optimizers
- Example: Custom ELBO
- Example: KL Annealing

Basic SVI Usage
We first review the basic usage pattern of SVI objects in Pyro. We assume that the user has defined a model and a guide. The user then creates an optimizer and an SVI object:

    optimizer = pyro.optim.Adam({"lr": 0.001, "betas": (0.90, 0.999)})
    svi = pyro.infer.SVI(model, guide, optimizer, loss=pyro.infer.Trace_ELBO())

Gradient steps can then be taken with a call to svi.step(...). The arguments to step() are then passed to model and guide.

A Lower-Level Pattern
The nice thing about the above pattern is that it allows Pyro to take care of various details for us, for example:

- pyro.optim.Adam dynamically creates a new torch.optim.Adam optimizer whenever a new parameter is encountered
- SVI.step() zeros gradients between gradient steps
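
To make these two conveniences concrete, here is a minimal pure-Python sketch. It is only an illustration under stated assumptions, not Pyro's actual implementation: the class name LazySGD and its attributes are hypothetical, and plain floats stand in for tensors.

```python
class LazySGD:
    """Toy optimizer (hypothetical) that sets up per-parameter state lazily."""

    def __init__(self, lr):
        self.lr = lr
        self.seen = {}  # parameter name -> number of updates applied so far

    def step(self, params, grads):
        """Apply one SGD update in place, then zero the gradients."""
        for name, grad in grads.items():
            if name not in self.seen:
                # first time this parameter is encountered: create its state
                self.seen[name] = 0
            params[name] -= self.lr * grad
            self.seen[name] += 1
            grads[name] = 0.0  # zero the gradient before the next step


params = {"loc": 1.0}
grads = {"loc": 0.5}
opt = LazySGD(lr=0.1)
opt.step(params, grads)

# a new parameter shows up later, much as when pyro.param is first called
params["scale"] = 2.0
grads["scale"] = 1.0
opt.step(params, grads)
print(params, opt.seen)
```

In the low-level pattern both responsibilities move to the user: the parameters must be collected before the optimizer is built, and zero_grad() must be called explicitly.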
If we want more control, we can directly manipulate the differentiable loss method of the various ELBO classes. For example, this optimization loop:

    svi = pyro.infer.SVI(model, guide, optimizer, loss=pyro.infer.Trace_ELBO())
    for i in range(n_iter):
        loss = svi.step(X_train, y_train)

is equivalent to this low-level pattern:

    loss_fn = lambda model, guide: pyro.infer.Trace_ELBO().differentiable_loss(model, guide, X_train, y_train)
    # run the loss once so that all parameters are created and can be captured
    with pyro.poutine.trace(param_only=True) as param_capture:
        loss = loss_fn(model, guide)
    params = set(site["value"].unconstrained()
                 for site in param_capture.trace.nodes.values())
    optimizer = torch.optim.Adam(params, lr=0.001, betas=(0.90, 0.999))
    for i in range(n_iter):
        # compute loss
        loss = loss_fn(model, guide)
        loss.backward()
        # take a step and zero the parameter gradients
        optimizer.step()
        optimizer.zero_grad()

Example: Custom Regularizer
Suppose we want to add a custom regularization term to the SVI loss. Using the above usage pattern, this is easy to do. First we define our regularizer:

    def my_custom_L2_regularizer(my_parameters):
        reg_loss = 0.0
        for param in my_parameters:
            reg_loss = reg_loss + param.pow(2.0).sum()
        return reg_loss

Then the only change we need to make is:

    - loss = loss_fn(model, guide)
    + loss = loss_fn(model, guide) + my_custom_L2_regularizer(my_parameters)

Example: Clipping Gradients
For some models the loss gradient can explode during training, leading to overflow and NaN values. One way to protect against this is with gradient clipping. The optimizers in pyro.optim take an optional dictionary of clip_args which allows clipping either the gradient norm or the gradient value to fall within the given limit.
To change the basic example above:

    - optimizer = pyro.optim.Adam({"lr": 0.001, "betas": (0.90, 0.999)})
    + optimizer = pyro.optim.Adam({"lr": 0.001, "betas": (0.90, 0.999)}, {"clip_norm": 10.0})

Further variants of gradient clipping can also be implemented manually by modifying the low-level pattern described above.
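
As a rough illustration of what the two kinds of clipping do to a gradient, the sketch below works on plain Python lists. The helper names clip_by_norm and clip_by_value are hypothetical, and this is not how Pyro or PyTorch implement clipping internally; it only shows the arithmetic.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale the whole gradient vector so its L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already within the limit: leave unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads]


def clip_by_value(grads, limit):
    """Clamp each gradient entry independently to [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]


exploding = [30.0, 40.0]  # L2 norm is 50
print(clip_by_norm(exploding, 10.0))
print(clip_by_value(exploding, 10.0))
```

Norm clipping preserves the direction of the gradient, while value clipping can change it. In the low-level pattern, either kind of clipping would be applied between loss.backward() and optimizer.step().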
Example: Scaling the Loss
Depending on the optimization algorithm, the scale of the loss may or may not matter. Suppose we want to scale our loss function by the number of datapoints before we differentiate it. This is easily done:

    - loss = loss_fn(model, guide)
    + loss = loss_fn(model, guide) / N_data

Note that in the case of SVI, where each term in the loss function is a log probability from the model or guide, this same effect can be achieved using poutine.scale. For example we can use the poutine.scale decorator to scale both the model and guide:

    @poutine.scale(scale=1.0/N_data)
    def model(...):
        pass

    @poutine.scale(scale=1.0/N_data)
    def guide(...):
        pass

Example: Beta VAE
We can also use poutine.scale to construct non-standard ELBO variational objectives in which, for example, the KL divergence is scaled differently relative to the expected log likelihood. In particular for the Beta VAE the KL divergence is scaled by a factor beta:

    def model(data, beta=0.5):
        z_loc, z_scale = ...
        # scale the log density of the latent z by beta
        with pyro.poutine.scale(scale=beta):
            z = pyro.sample("z", dist.Normal(z_loc, z_scale))
        pyro.sample("obs", dist.Bernoulli(...), obs=data)

    def guide(data, beta=0.5):
        with pyro.poutine.scale(scale=beta):
            z_loc, z_scale = ...
            z = pyro.sample("z", dist.Normal(z_loc, z_scale))

With this choice of model and guide the log densities corresponding to the latent variable z that enter into constructing the variational objective via

    svi = pyro.infer.SVI(model, guide, optimizer, loss=pyro.infer.Trace_ELBO())

will be scaled by a factor of beta, resulting in a KL divergence that is likewise scaled by beta.
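
The effect of beta on the objective can be checked by hand in a simple case. The sketch below assumes a Normal(loc, scale) guide and a standard Normal prior, for which the KL divergence has a closed form. The function names and the numeric value used for the expected log likelihood are made up for illustration.

```python
import math


def kl_normal_standard(loc, scale):
    """Closed-form KL( Normal(loc, scale) || Normal(0, 1) )."""
    return 0.5 * (scale ** 2 + loc ** 2 - 1.0) - math.log(scale)


def beta_vae_loss(expected_log_lik, kl, beta):
    """Negative ELBO with the KL term weighted by beta."""
    return -(expected_log_lik - beta * kl)


kl = kl_normal_standard(loc=1.0, scale=0.5)
expected_log_lik = -3.0  # placeholder value, not computed from any data

print(beta_vae_loss(expected_log_lik, kl, beta=1.0))  # standard ELBO loss
print(beta_vae_loss(expected_log_lik, kl, beta=0.5))  # KL counts half as much
```

With beta=1.0 this reduces to the usual negative ELBO. Values below 1 weaken the pull of the guide towards the prior, and values above 1 strengthen it.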
Example: Mixing Optimizers
The various optimizers in pyro.optim allow the user to specify optimization settings (e.g. learning rates) on a per-parameter basis. But what if we want to use different optimization algorithms for different parameters? We can do this using Pyro’s MultiOptimizer (see below), but we can also achieve the same thing if we directly manipulate differentiable_loss:

    # torch optimizers take their settings as keyword arguments
    adam = torch.optim.Adam(adam_parameters, lr=0.001, betas=(0.90, 0.999))
    sgd = torch.optim.SGD(sgd_parameters, lr=0.0001)
    loss_fn = pyro.infer.Trace_ELBO().differentiable_loss
    # compute loss
    loss = loss_fn(model, guide)
    loss.backward()
    # take a step and zero the parameter gradients
    adam.step()
    sgd.step()
    adam.zero_grad()
    sgd.zero_grad()

For completeness, we also show how we can do the same thing using MultiOptimizer, which allows us to combine multiple Pyro optimizers. Note that since MultiOptimizer uses torch.autograd.grad under the hood (instead of torch.Tensor.backward()), it has a slightly different interface; in particular the step() method also takes parameters as inputs.

    from pyro.optim.multi import MixedMultiOptimizer

    def model():
        pyro.param("a", ...)
        pyro.param("b", ...)
        ...

    adam = pyro.optim.Adam({"lr": 0.1})
    sgd = pyro.optim.SGD({"lr": 0.01})
    optim = MixedMultiOptimizer([(["a"], adam), (["b"], sgd)])
    # elbo is not defined in the original snippet; Trace_ELBO is assumed here
    elbo = pyro.infer.Trace_ELBO()
    with pyro.poutine.trace(param_only=True) as param_capture:
        loss = elbo.differentiable_loss(model, guide)
    params = {"a": pyro.param("a"), "b": pyro.param("b")}
    optim.step(loss, params)

Example: Custom ELBO
In the previous three examples we bypassed creating an SVI object and directly manipulated the differentiable loss function provided by an ELBO implementation. Another thing we can do is create custom ELBO implementations and pass those into the SVI machinery. For example, a simplified version of a Trace_ELBO loss function might look as follows:

    # note that simple_elbo takes a model, a guide, and their respective arguments as inputs
    def simple_elbo(model, guide, *args, **kwargs):
        # run the guide and trace its execution
        guide_trace = poutine.trace(guide).get_trace(*args, **kwargs)
        # run the model and replay it against the samples from the guide
        model_trace = poutine.trace(
            poutine.replay(model, trace=guide_trace)).get_trace(*args, **kwargs)
        # construct the elbo loss function
        return -1*(model_trace.log_prob_sum() - guide_trace.log_prob_sum())

    svi = SVI(model, guide, optim, loss=simple_elbo)

Note that this is basically what the elbo implementation in "mini-pyro" looks like.
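
The quantity that simple_elbo returns can be reproduced without Pyro for a toy problem. The sketch below assumes a model z ~ Normal(0, 1), x ~ Normal(z, 1) and a Normal(q_loc, q_scale) guide; all function names are hypothetical. For this model the exact posterior is Normal(x/2, sqrt(1/2)), and when the guide equals it, every sample gives exactly the negative log evidence.

```python
import math
import random


def normal_log_prob(x, loc, scale):
    """Log density of Normal(loc, scale) evaluated at x."""
    return (-0.5 * math.log(2 * math.pi) - math.log(scale)
            - 0.5 * ((x - loc) / scale) ** 2)


def elbo_loss_estimate(x, q_loc, q_scale, num_samples, rng):
    """Monte Carlo estimate of -(E_q[log p(x, z)] - E_q[log q(z)])."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(q_loc, q_scale)  # "run the guide"
        # "replay the model" against the guide's sample z
        log_p = normal_log_prob(z, 0.0, 1.0) + normal_log_prob(x, z, 1.0)
        log_q = normal_log_prob(z, q_loc, q_scale)
        total += log_p - log_q
    return -total / num_samples


x = 1.0
rng = random.Random(0)
exact = elbo_loss_estimate(x, x / 2, math.sqrt(0.5), 100, rng)
rough = elbo_loss_estimate(x, 0.0, 1.0, 2000, rng)
neg_log_evidence = -normal_log_prob(x, 0.0, math.sqrt(2.0))
print(exact, rough, neg_log_evidence)
```

A guide that differs from the posterior yields a larger loss on average, with the gap equal to the KL divergence from the guide to the posterior.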
Example: KL Annealing
In the Deep Markov Model Tutorial the ELBO variational objective is modified during training. In particular the various KL-divergence terms between latent random variables are scaled downward (i.e. annealed) relative to the log probabilities of the observed data. In the tutorial this is accomplished using poutine.scale. We can accomplish the same thing by defining a custom loss function. This latter option is not a very elegant pattern but we include it anyway to show the flexibility we have at our disposal.

    def simple_elbo_kl_annealing(model, guide, *args, **kwargs):
        # get the annealing factor and latents to anneal from the keyword
        # arguments passed to the model and guide
        annealing_factor = kwargs.pop("annealing_factor", 1.0)
        latents_to_anneal = kwargs.pop("latents_to_anneal", [])
        # run the guide and replay the model against the guide
        guide_trace = poutine.trace(guide).get_trace(*args, **kwargs)
        model_trace = poutine.trace(
            poutine.replay(model, trace=guide_trace)).get_trace(*args, **kwargs)

        elbo = 0.0
        # loop through all the sample sites in the model and guide trace and
        # construct the loss; note that we scale all the log probabilities of
        # sample sites in latents_to_anneal by the factor annealing_factor
        for site in model_trace.nodes.values():
            if site["type"] == "sample":
                factor = annealing_factor if site["name"] in latents_to_anneal else 1.0
                elbo = elbo + factor * site["fn"].log_prob(site["value"]).sum()
        for site in guide_trace.nodes.values():
            if site["type"] == "sample":
                factor = annealing_factor if site["name"] in latents_to_anneal else 1.0
                elbo = elbo - factor * site["fn"].log_prob(site["value"]).sum()
        return -elbo

    svi = SVI(model, guide, optim, loss=simple_elbo_kl_annealing)
    svi.step(other_args, annealing_factor=0.2, latents_to_anneal=["my_latent"])
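
In practice the annealing factor changes over the course of training rather than staying fixed at 0.2. One possible schedule, shown here only as an assumption, is a linear warm-up from a small starting value to 1.0. The function name linear_annealing_factor and its parameters are hypothetical and do not come from Pyro.

```python
def linear_annealing_factor(epoch, annealing_epochs, min_factor=0.2):
    """Linearly increase the KL weight from min_factor to 1.0.

    After annealing_epochs epochs the factor stays at 1.0, which
    recovers the unmodified ELBO.
    """
    if annealing_epochs <= 0 or epoch >= annealing_epochs:
        return 1.0
    return min_factor + (1.0 - min_factor) * epoch / annealing_epochs


schedule = [linear_annealing_factor(epoch, annealing_epochs=10)
            for epoch in range(12)]
print(schedule)
```

Each value would then be passed as annealing_factor in the corresponding call to svi.step(...), so that the KL terms of the selected latents are down-weighted early in training and carry full weight later.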