1. Designing a linear layer by hand

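Let the input be $\bold{x}$ and the parameters be $\bold{w}$ and $\bold{b}$. The layer computes:
$\bold{y}=\bold{w}\odot\bold{x}+\bold{b}$
where $\odot$ denotes the Hadamard (element-wise) product of matrices.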

Let $f$ denote the computation of all subsequent layers, so that:
$\mathrm{model}(\bold{x})=f(\bold{y})$
Here, $\mathrm{model}$ denotes the full computation of the model.

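During backpropagation, we are given the derivative of the loss with respect to $\bold{y}$, namely $\frac{\partial{L}}{\partial{\bold{y}}}$. The derivatives of the loss with respect to the parameters $\bold{w}$ and $\bold{b}$ are then:

$\frac{\partial{L}}{\partial{\bold{w}}}=\frac{\partial{L}}{\partial{\bold{y}}}\frac{\partial{\bold{y}}}{\partial{\bold{w}}}=\frac{\partial{L}}{\partial{\bold{y}}}\odot\bold{x}$

$\frac{\partial{L}}{\partial{\bold{b}}}=\frac{\partial{L}}{\partial{\bold{y}}}\frac{\partial{\bold{y}}}{\partial{\bold{b}}}=\frac{\partial{L}}{\partial{\bold{y}}}\odot\bold{1}$

Because every operation is element-wise, each entry of $\bold{y}$ depends only on the matching entries of $\bold{w}$, $\bold{x}$, and $\bold{b}$, which is why the chain rule reduces to a Hadamard product.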

We now have both the forward and the backward formulas for this linear layer.
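As a sanity check on the formulas above, the following is a minimal sketch in plain Python that compares the analytic gradients with central finite differences. The loss $L=\sum y_{ij}^2$ and the concrete values of `w`, `x`, and `b` are assumptions chosen only for illustration; they do not come from the original text.

```python
# Hypothetical scalar loss for illustration: L = sum(y_ij ** 2), so dL/dy = 2 * y.
w = [[1.0, 2.0], [3.0, 4.0]]
x = [[0.5, -1.0], [2.0, 0.25]]
b = [[4.0, 3.0], [2.0, 1.0]]


def forward(w, x, b):
    # y = w ⊙ x + b, computed entry by entry
    return [[w[i][j] * x[i][j] + b[i][j] for j in range(2)] for i in range(2)]


def loss(w, x, b):
    return sum(v ** 2 for row in forward(w, x, b) for v in row)


# Analytic gradients from the derivation: dL/dw = dL/dy ⊙ x, dL/db = dL/dy ⊙ 1
y = forward(w, x, b)
grad_y = [[2.0 * v for v in row] for row in y]
grad_w = [[grad_y[i][j] * x[i][j] for j in range(2)] for i in range(2)]
grad_b = [[grad_y[i][j] * 1.0 for j in range(2)] for i in range(2)]


def numeric_grad(param, eps=1e-6):
    # Central differences; `param` is perturbed in place and then restored.
    grads = [[0.0] * 2 for _ in range(2)]
    for i in range(2):
        for j in range(2):
            old = param[i][j]
            param[i][j] = old + eps
            plus = loss(w, x, b)
            param[i][j] = old - eps
            minus = loss(w, x, b)
            param[i][j] = old
            grads[i][j] = (plus - minus) / (2 * eps)
    return grads


num_w = numeric_grad(w)
num_b = numeric_grad(b)
max_err = max(
    abs(a - n)
    for analytic, numeric in ((grad_w, num_w), (grad_b, num_b))
    for row_a, row_n in zip(analytic, numeric)
    for a, n in zip(row_a, row_n)
)
print("largest analytic/numeric difference:", max_err)
```

If the derivation is right, the printed difference should be tiny, limited only by floating-point rounding.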

2. Implementing it with torch.autograd.Function

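torch.autograd.Function is the base class for operations with a user-defined derivative. To define a custom operation, subclass it and implement two static methods, forward() and backward(). For the operation above, the code is: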

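import torch
from torch.autograd import Function


class MultiplyAdd(Function):
    """Element-wise y = w * x + b with a hand-written backward pass."""

    @staticmethod
    def forward(ctx, w, x, b):
        # Save x: it is needed to compute dL/dw in backward().
        ctx.save_for_backward(x)
        output = w * x + b
        return output

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        grad_w = grad_output * x  # dL/dw = dL/dy ⊙ x
        grad_b = grad_output * 1  # dL/db = dL/dy ⊙ 1
        # One return value per forward() input (w, x, b); x gets no gradient here.
        return grad_w, None, grad_b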

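Here, ctx is a context object that links the forward and backward passes; ctx.save_for_backward stores the tensors that backpropagation will need. grad_output is the gradient of the loss with respect to this layer's output, i.e. $\frac{\partial{L}}{\partial{\bold{y}}}$.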
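One possible extension, sketched here under assumptions rather than taken from the original post, is to also return a gradient for `x` (namely $\frac{\partial{L}}{\partial{\bold{y}}}\odot\bold{w}$) and to skip gradients that are not needed via `ctx.needs_input_grad`. The class name `MultiplyAddFull` is hypothetical. `torch.autograd.gradcheck` then compares the hand-written backward pass against numerical differentiation; it expects double-precision inputs.

```python
import torch
from torch.autograd import Function, gradcheck


class MultiplyAddFull(Function):  # hypothetical name, for illustration only
    @staticmethod
    def forward(ctx, w, x, b):
        # Both w and x are needed in backward(): x for dL/dw, w for dL/dx.
        ctx.save_for_backward(w, x)
        return w * x + b

    @staticmethod
    def backward(ctx, grad_output):
        w, x = ctx.saved_tensors
        grad_w = grad_x = grad_b = None
        # needs_input_grad has one boolean per forward() input (w, x, b).
        if ctx.needs_input_grad[0]:
            grad_w = grad_output * x
        if ctx.needs_input_grad[1]:
            grad_x = grad_output * w
        if ctx.needs_input_grad[2]:
            grad_b = grad_output
        return grad_w, grad_x, grad_b


# gradcheck compares analytic and numerical Jacobians; use float64 for accuracy.
w = torch.randn(2, 2, dtype=torch.double, requires_grad=True)
x = torch.randn(2, 2, dtype=torch.double, requires_grad=True)
b = torch.randn(2, 2, dtype=torch.double, requires_grad=True)
ok = gradcheck(MultiplyAddFull.apply, (w, x, b), eps=1e-6, atol=1e-4)
print("gradcheck passed:", ok)
```

Returning `None` for an input, as the original `MultiplyAdd` does for `x`, is fine as long as that input never requires a gradient.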

Define w, x, and b:

w = torch.tensor([[1.,2],[3,4]], requires_grad=True)
x = torch.rand(2, 2)
b = torch.tensor([[4.,3],[2,1]], requires_grad=True)
w,x,b

(tensor([[1., 2.],
         [3., 4.]], requires_grad=True),
 tensor([[0.0534, 0.8366],
         [0.9568, 0.1293]]),
 tensor([[4., 3.],
         [2., 1.]], requires_grad=True))

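When creating the parameter tensors, passing requires_grad=True tells autograd to track operations on them so that their gradients can be computed (the default is False).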
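The effect of the flag can be seen in a small sketch; the tensors `a` and `p` below are made up for illustration. Only the tensor created with `requires_grad=True` receives a `.grad` after `backward()`.

```python
import torch

a = torch.tensor([1., 2.])                      # requires_grad defaults to False
p = torch.tensor([1., 2.], requires_grad=True)  # tracked by autograd

y = (a * p).sum()  # y = a[0] * p[0] + a[1] * p[1]
y.backward()

print(a.requires_grad, p.requires_grad)
print(a.grad)  # no gradient is stored for an untracked tensor
print(p.grad)  # dy/dp equals a
```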

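# Call apply() on the class itself; Function subclasses are not meant to be instantiated.
out = MultiplyAdd.apply(w, x, b)
# retain_graph=True keeps the saved tensors so the graph can still be inspected afterwards.
out.backward(torch.ones(2, 2), retain_graph=True)
w.grad, b.grad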

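For the forward pass, call apply() rather than forward() directly: apply() is what records the operation in the autograd graph. See the PyTorch documentation for details.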

(tensor([[0.0534, 0.8366],
         [0.9568, 0.1293]]),
 tensor([[1., 1.],
         [1., 1.]]))

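The grad_fn attribute gives access to the computation graph that produced the current tensor. grad_fn.next_functions holds the graph nodes of the preceding step, one per input. Three entries are stored here, but since no gradient is computed for x, its entry is None. The other two entries are AccumulateGrad nodes that hold w and b.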

print(out.grad_fn)
print(out.grad_fn.next_functions)
print(out.grad_fn.next_functions[0][0].variable)
print(out.grad_fn.next_functions[2][0].variable)

<torch.autograd.function.MultiplyAddBackward object at 0x0000014F2AFB52E0>
((<AccumulateGrad object at 0x0000014F2AF82040>, 0), (None, 0), (<AccumulateGrad object at 0x0000014F2A53F250>, 0))
tensor([[1., 2.],
        [3., 4.]], requires_grad=True)
tensor([[4., 3.],
        [2., 1.]], requires_grad=True)

3. Visualizing with graphviz

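graphviz is a widely used graph-drawing toolkit. For installation, refer to one of the tutorials available online, and remember to add the Graphviz executables to your PATH environment variable, since the Python package calls them to render the graph.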

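import torch
from graphviz import Digraph

# Default attributes shared by every node in the graph.
node_attr = dict(style='filled',
                 shape='box',
                 align='left',
                 fontsize='12',
                 ranksep='0.1',
                 height='0.2')
dot = Digraph(node_attr=node_attr, graph_attr=dict(size="12,12"))
seen = set()  # nodes that have already been added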

def size_to_str(size):
    return '(' + (', ').join(['%d' % v for v in size]) + ')'

def add_nodes(var):
    # Recursively add var and everything it depends on to the graph.
    if var not in seen:
        if torch.is_tensor(var):
            # note: this used to show .saved_tensors in pytorch0.2, but stopped
            # working as it was moved to ATen and Variable-Tensor merged
            dot.node(str(id(var)), size_to_str(var.size()), fillcolor='yellow')
        elif hasattr(var, 'variable'):
            # AccumulateGrad node: label it with the shape of its leaf tensor.
            u = var.variable
            node_name = size_to_str(u.size())
            dot.node(str(id(var)), node_name, fillcolor='lightblue')
        else:
            # Any other backward node: label it with its class name.
            dot.node(str(id(var)), str(type(var).__name__))

        # Mark every kind of node as visited, not only the last case.
        seen.add(var)
        if hasattr(var, 'next_functions'):
            for u in var.next_functions:
                if u[0] is not None:
                    dot.edge(str(id(u[0])), str(id(var)))
                    add_nodes(u[0])
        if hasattr(var, 'saved_tensors'):
            for t in var.saved_tensors:
                dot.edge(str(id(t)), str(id(var)))
                add_nodes(t)
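
# Add a node for the final output, then walk the graph backwards from it.
dot.node('Output', 'out\n' + size_to_str(out.size()))
dot.edge(str(id(out.grad_fn)), 'Output')
add_nodes(out.grad_fn)
dot.render('graph', view=False)  # render to a file without opening a viewer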
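If the graphviz package is not available, the structure of the graph can still be exported as DOT text and rendered elsewhere. The following is a minimal sketch under that assumption; the helper `to_dot` is hypothetical, builds the text by hand, and uses the built-in expression `w * x + b` instead of the custom Function.

```python
import torch


def to_dot(root):
    """Return DOT source for the autograd graph reachable from `root` (a grad_fn node)."""
    lines = ["digraph G {"]
    seen = set()
    stack = [root]
    while stack:
        fn = stack.pop()
        if fn in seen:
            continue
        seen.add(fn)
        # One DOT node per backward node, labelled with its class name.
        lines.append(f'  n{id(fn)} [label="{type(fn).__name__}"];')
        for child, _index in fn.next_functions:
            if child is not None:
                # Edges point in the direction of the forward computation.
                lines.append(f"  n{id(child)} -> n{id(fn)};")
                stack.append(child)
    lines.append("}")
    return "\n".join(lines)


w = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)
x = torch.rand(2, 2)
b = torch.tensor([[4., 3.], [2., 1.]], requires_grad=True)

result = w * x + b
dot_source = to_dot(result.grad_fn)
print(dot_source)
```

The printed text can be pasted into any DOT viewer or passed to the `dot` command-line tool.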