Style Transfer: An Application of AI in Painting

Neural Networks · Published 2018-09-20 19:11:26


Part 1:

Paper

Figure 1

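Style transfer is a technique in which AI combines the content of one image with the style of another to create a new piece of artwork. As Figure 1 shows, a street-scene photograph is combined with Van Gogh's The Starry Night, Munch's The Scream, and Turner's The Shipwreck of the Minotaur to produce paintings in the corresponding styles.

Take The Starry Night as an example. The content of panel c stays close to the original photograph, but the sky now also shows the moon and stars from The Starry Night, the brushwork inherits Van Gogh's thick strokes, and the overall palette matches the painting. The algorithm clearly treats the two inputs differently: for the street photo, which supplies the subject matter, it concentrates on preserving the content; for The Starry Night, which supplies the style, it discards the content and keeps only the painting style.

The model comes from A Neural Algorithm of Artistic Style, one of the earliest papers on artistic style transfer with neural networks, and it is still regarded as a very effective algorithm. This article explains how to implement it.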

Figure 2

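Figure 2 shows the core idea of the model: extract content representations and style representations from the input images, then use them to generate a new image whose content and style are each close to, but not identical to, those of the corresponding input image.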

Figure 3

As sketched in Figure 3, the inputs to the model are:

content image
style image
output image

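The loss functions measure the difference between the output image and the content and style images, and the gradients obtained by differentiating these losses are used to correct the pixels of the output image. As noted earlier, the model treats content and style differently, so two loss functions are needed: a content loss and a style loss.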
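The idea of correcting pixels with gradients can be sketched in NumPy. This is a toy setting, not the model itself: it assumes the "network" is the identity, so the loss is simply the pixel-wise MSE between the output and a target image. The names target, output, and lr are illustrative and do not come from the paper or the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.uniform(0, 1, size=(8, 8, 3))   # stands in for the image we want to match
output = rng.uniform(0, 1, size=(8, 8, 3))   # the image being optimized, starts as noise

def mse(a, b):
    return float(np.mean((a - b) ** 2))

initial_loss = mse(output, target)
lr = 50.0  # large step size because the MSE gradient is divided by the pixel count
for _ in range(100):
    grad = 2.0 * (output - target) / output.size   # d(MSE)/d(output)
    output = output - lr * grad                     # update the pixels, not any weights
final_loss = mse(output, target)
```

In the real model the gradient is obtained by backpropagating through the VGG network, but the update still acts on the image rather than on the network.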

content loss

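$$L_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left(F^l_{ij} - P^l_{ij}\right)^2$$

As the formula (written in the paper's notation) shows, the content loss is a squared-error loss, that is, MSE up to a constant factor. Here $\vec{p}$ is the content image, $\vec{x}$ is the output image, and $P^l_{ij}$ and $F^l_{ij}$ are the activations of $\vec{p}$ and $\vec{x}$ at position $j$ of the $i$-th feature map in layer $l$ of the deep convolutional network.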
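A minimal NumPy sketch of the content loss, assuming the form used in the paper, one half of the summed squared difference between the output image's activations F and the content image's activations P in a single layer. The array sizes and the constant offset are made up for illustration.

```python
import numpy as np

def content_loss(F, P):
    """Squared-error content loss between two feature arrays of shape (N_l, M_l)."""
    return 0.5 * float(np.sum((F - P) ** 2))

rng = np.random.default_rng(0)
P = rng.standard_normal((16, 25))   # content-image activations: 16 feature maps, 25 positions
F = P + 0.1                         # output-image activations, offset by 0.1 everywhere

loss = content_loss(F, P)           # 0.5 * 16 * 25 * 0.1**2
grad = F - P                        # derivative of the loss with respect to F
```

The gradient with respect to the activations is simply F - P, which is what gets backpropagated to the pixels of the output image.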

Figure 4

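As Figure 4 shows, each activation in a CNN layer corresponds to a local region of the whole image, and the output image is the result of combining the content image's local activations from each layer. This is why the content of the output image is highly similar to that of the content image without being identical. For more background on convolutional neural networks, see Visualizing and Understanding Convolutional Networks.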

style loss

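$$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}, \qquad E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left(G^l_{ij} - A^l_{ij}\right)^2, \qquad L_{style}(\vec{a}, \vec{x}) = \sum_l w_l E_l$$

The symbols in the formula, following the paper's notation, are:

$\vec{a}$, the original style image.
$\vec{x}$, the generated image.
$G^l$ and $A^l$, the Gram matrices of the generated image and of the style image in layer $l$.
$N_l$, the number of filters in layer $l$.
$M_l$, the feature map size (height times width) in layer $l$.
$F^l_i$, the vectorized $i$-th feature map of layer $l$.
$k$, the index of an element within that vector.
$w_l$, the weight given to layer $l$.

From the formula, the style loss is the MSE (up to a constant factor) between the Gram matrices of the original style image and of the generated image in the network.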

https://en.wikipedia.org/wiki/Gramian_matrix

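The link above is Wikipedia's explanation of the Gram matrix. Here I want to explain more intuitively why the Gram matrix is used in the style representation and what the reasoning behind it is.

From the analysis of Figure 1, we know that we want to discard the actual content of the style image, such as the moon, stars, houses, and trees in The Starry Night, and extract only painting techniques such as brushwork, light and shadow, and color. This requires destroying the spatial information in the feature map matrix by flattening each matrix into a vector.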

Figure 5

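The matrix in Figure 5 is the 5x5x16 activation matrix of layer l of the network. C1 and C2 are the results of flattening the feature maps of the first two channels: two vectors of length 25.

After flattening, the vectors no longer carry the original spatial information; only the style information remains. The dot product of C1 and C2 is a single value, so what does that value represent? Suppose C1 indicates that the painter's strokes are thick and C2 indicates that the strokes are short.

C1@C2 = G₁₂, where G₁₂ represents the correlation between thick strokes and short strokes in the painting. The more strongly the two vectors are correlated, the larger the magnitude of their dot product; with weak correlation the terms of the dot product tend to cancel each other and the result stays close to zero. So a large G₁₂ indicates that the painter likes to paint with short, thick strokes (one of Van Gogh's characteristics), while a small value indicates that the painter does not.

C1@C1 = G₁₁, where G₁₁ represents the coarseness of the strokes, that is, how active the "thick stroke" feature is in the painting. As with C1@C2, the larger G₁₁ is, the coarser the strokes and the more pronounced the feature. This is similar to a max pooling layer in a CNN, where a larger grid cell value indicates that the feature is more strongly present.

Taking the dot products of all combinations (C1, C1), (C1, C2), (C1, C3), ..., (C16, C16) gives the pattern shown in Figure 6. It is a 16x16 matrix of inner products, which is exactly the Gram matrix required by the style loss.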
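The construction described above can be sketched in NumPy. The 5x5x16 activation array here is random, a hypothetical stand-in for a real layer's output, and the names acts, C, and G are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.uniform(0, 1, size=(5, 5, 16))   # hypothetical activations: height, width, channels

# Flatten each channel's 5x5 feature map into a length-25 vector; rows are C1..C16.
C = acts.reshape(25, 16).T                   # shape (16, 25)

# All pairwise dot products at once: G[i, j] = Ci . Cj
G = C @ C.T                                  # shape (16, 16)

g12 = float(C[0] @ C[1])                     # the single entry discussed as G12
g11 = float(C[0] @ C[0])                     # squared length of C1, the entry G11
```

G is symmetric because Ci . Cj equals Cj . Ci, and its diagonal holds the squared length of each flattened feature map.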

Figure 6

total loss

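$$L_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha L_{content}(\vec{p}, \vec{x}) + \beta L_{style}(\vec{a}, \vec{x})$$

The total loss is the weighted sum of the content loss and the style loss. The ratio between the weights $\alpha$ and $\beta$ controls how strongly the style is transferred; in practice one of the two weights is fixed at 1 and only the other is tuned, within [0, 1].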
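To see how the ratio of the two weights shifts the result, here is a deliberately simplified toy, assuming a single scalar "pixel" x pulled toward a content target c and a style target s by quadratic penalties. Writing the weights as alpha (content) and beta (style), as in the paper, the minimizer of alpha*(x - c)**2 + beta*(x - s)**2 has a closed form. The real losses are not quadratics in the pixels, so this only illustrates the trade-off.

```python
def best_value(content_target, style_target, alpha, beta):
    """Minimizer of alpha*(x - c)**2 + beta*(x - s)**2 for a single scalar x."""
    return (alpha * content_target + beta * style_target) / (alpha + beta)

c, s = 0.0, 1.0   # toy "content" and "style" targets for one pixel
mostly_style = best_value(c, s, alpha=0.1, beta=1.0)     # close to the style target
balanced = best_value(c, s, alpha=1.0, beta=1.0)         # halfway between the two
mostly_content = best_value(c, s, alpha=1.0, beta=0.1)   # close to the content target
```

Only the ratio alpha/beta matters for where the minimum lies, which is why fixing one weight and tuning the other is enough.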

This completes the analysis of the model in A Neural Algorithm of Artistic Style. Next comes Part 2, the implementation.

Part 2:

Notebook

Input images

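In Part 2, we use a photo of a bald eagle and another of Van Gogh's starry-night paintings as source material, and use style transfer to create an oil painting of the bald eagle as "painted" by Van Gogh.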

Dataset

The input images have different shapes, so we first need to resize them and generate an output image of the same size.

img.shape, style_img.shape

((710, 1024, 3), (960, 1344, 3))

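import math
import cv2

def scale_match(src, targ):
    # Resize targ so that it covers src in both dimensions, then crop to src's size
    h, w, _ = src.shape
    sh, sw, _ = targ.shape
    rat = max(h / sh, w / sw)  # the larger ratio guarantees full coverage
    # cv2.resize expects (width, height); ceil avoids ending up one pixel short
    res = cv2.resize(targ, (math.ceil(sw * rat), math.ceil(sh * rat)))
    return res[:h, :w]

style = scale_match(img, style_img)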

img.shape, style.shape
((710, 1024, 3), (710, 1024, 3))

In general, the style image tends to have a higher resolution than the content image, so the style image is usually resized to match the shape of the content image.
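The arithmetic behind the resize can be checked without OpenCV. The sketch below uses a hypothetical helper, resized_shape, that computes only the size the style image is scaled to before cropping, using the two shapes printed above. It rounds up with math.ceil so the result can never be a pixel short of the content image; that rounding choice is an assumption of this sketch.

```python
import math

def resized_shape(src_hw, targ_hw):
    """Size that targ must be scaled to so it covers src in both dimensions."""
    h, w = src_hw
    sh, sw = targ_hw
    rat = max(h / sh, w / sw)          # the larger ratio guarantees full coverage
    return math.ceil(sh * rat), math.ceil(sw * rat)

# Content image is 710 x 1024, style image is 960 x 1344 (height x width).
new_h, new_w = resized_shape((710, 1024), (960, 1344))
```

Here the width ratio (1024/1344) is the larger one, so the style image is scaled until its width matches and its height overshoots; cropping to the first 710 rows then gives the shape of the content image.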

import numpy as np
import scipy.ndimage
import matplotlib.pyplot as plt
output_img = np.random.uniform(0, 1, size=img.shape).astype(np.float32)
output_img = scipy.ndimage.median_filter(output_img, [8,8,1])  # 8x8 median per channel
plt.imshow(output_img);

output image

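output_img is the target image we want to generate. Training is the process of using gradients to repeatedly adjust output_img so that it becomes more similar to the input images. The noise image is passed through a median filter because real images are smooth, unlike the independent, uniformly distributed random values created by np.random.uniform(). Without the filter the starting point does not look like an image at all, only a collection of random numbers, and in practice it is hard to get useful gradients from it. The median filter acts like median pooling and smooths the image.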
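The smoothing effect can be illustrated with NumPy alone. This sketch is not scipy's median_filter: it takes the median over every full 8x8 window of a single-channel noise image, so the result is smaller than the input, whereas scipy pads the borders to keep the shape. The roughness measure (mean absolute difference between horizontal neighbours) is an ad hoc choice for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
noise = rng.uniform(0, 1, size=(64, 64)).astype(np.float32)

# Median over every 8x8 window (valid windows only, so the result is 57x57).
windows = sliding_window_view(noise, (8, 8))
smooth = np.median(windows, axis=(-2, -1))

def roughness(a):
    """Mean absolute difference between horizontally adjacent pixels."""
    return float(np.abs(np.diff(a, axis=1)).mean())

rough_noise = roughness(noise)
rough_smooth = roughness(smooth)
```

Neighbouring windows share most of their pixels, so neighbouring medians are close to each other, which is what makes the filtered noise look more like a smooth image.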

trn_tfms,val_tfms = tfms_from_model(vgg16, sz)
img_tfm = val_tfms(img)
img_tfm.shape

(3, 710, 710)

output_img = val_tfms(output_img)/2
output_img_v= V(output_img[None], requires_grad=True)
output_img_v.shape

torch.Size([1, 3, 710, 710])

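To be fed to the network, the input images must be converted from rank 3 to rank 4, [batch_size, num_channel, height, width]. Indexing with None adds a batch dimension of size 1, and the height and width are also set to the same length. val_tfms is the transform without data augmentation; the reason for using it is explained in the Model section.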
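The layout change on its own can be sketched in NumPy. The transform in the notebook also resizes and normalizes the image; this example assumes none of that and only shows how an image in height, width, channel order becomes a rank-4 batch.

```python
import numpy as np

img_hwc = np.zeros((710, 1024, 3), dtype=np.float32)   # height, width, channels

img_chw = img_hwc.transpose(2, 0, 1)    # channels first: (3, 710, 1024)
batch = img_chw[None]                   # None inserts a new leading axis of size 1

same_batch = np.expand_dims(img_chw, axis=0)   # equivalent, more explicit spelling
```

Indexing with None adds the axis without copying the data, so it is a cheap way to turn one image into a batch of one.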

Model

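The artistic style model is built on the VGG network. Unlike most other projects, style transfer does not train the weights of the network; instead it uses gradients to correct the pixels of the output image. The paper uses a pretrained VGG-19; here I use a pretrained vgg16 and disable weight updates to avoid unnecessary computation and memory use.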

m_vgg = to_gpu(vgg16(True)).eval()
set_trainable(m_vgg, False)

As the content and style loss formulas show, and unlike in most other CNN applications, we need to extract the activations of intermediate layers. In PyTorch this can be done with a forward hook.

Forward Hook

A PyTorch nn.Module is callable; calling it runs its forward method, which, as the name suggests, performs the forward pass of the network. For example:

class Xnet(nn.Module):
    def __init__(self, nin, nf):
        ......

    def forward(self, x):
        ......

xnet = Xnet()
xnet(dataset)

Xnet inherits from nn.Module, so xnet(dataset) calls Xnet.forward to run the forward pass. If a forward hook has been registered on Xnet, it is triggered right after Xnet.forward finishes.
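A small runnable illustration of this behaviour, assuming a made-up three-layer network rather than VGG. register_forward_hook and the handle's remove method are PyTorch APIs; the network, the list captured, and the function save_output are hypothetical names for this sketch.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))
captured = []

def save_output(module, inputs, output):
    # PyTorch calls this right after module.forward(...) returns.
    captured.append(output.detach())

handle = net[1].register_forward_hook(save_output)   # hook the ReLU layer
y = net(torch.ones(1, 4))        # the hook fires once during this call
handle.remove()                  # unregister the hook
net(torch.zeros(1, 4))           # no hook any more, nothing is appended
```

The hook sees the intermediate activation even though the caller only receives the final output y, which is exactly what is needed to read activations out of the middle of a pretrained network.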

For style transfer, we need the activations just before the feature map grid size changes, so we register a forward hook on the layer immediately before each max pooling or strided convolution (stride == 2) layer.

class SaveFeatures():
    features=None
    def __init__(self, m): self.hook = m.register_forward_hook(self.hook_fn)
    def hook_fn(self, module, input, output): self.features = output
    def close(self): self.hook.remove()

block_ends = [i-1 for i,o in enumerate(children(m_vgg))
              if isinstance(o,nn.MaxPool2d)]
block_ends

[5, 12, 22, 32, 42]

SaveFeatures registers a forward hook and stores the hooked layer's output; block_ends holds the indices of the layers just before the feature map grid size changes.
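The indices can be reproduced without loading a model. The sketch below assumes that each convolution in the VGG16 configuration is followed by a batch norm layer and a ReLU, a layout that is consistent with the indices printed above but is not stated in the text. Layers are represented by plain strings.

```python
# VGG16 convolutional configuration: numbers are conv output channels, "M" is max pooling.
cfg = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
       512, 512, 512, "M", 512, 512, 512, "M"]

layers = []
for item in cfg:
    if item == "M":
        layers.append("maxpool")
    else:
        layers += ["conv", "batchnorm", "relu"]   # assumed layout of each conv unit

# Index of the layer just before each max pooling layer.
block_ends = [i - 1 for i, kind in enumerate(layers) if kind == "maxpool"]
```

Under this assumption every entry of block_ends points at the last ReLU of a block, the final activation at that resolution.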

Training

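from functools import partial

def get_opt():
    # Fresh smoothed-noise image plus an L-BFGS optimizer that updates its pixels
    output_img = np.random.uniform(0, 1, size=img.shape).astype(np.float32)
    output_img = scipy.ndimage.median_filter(output_img, [8,8,1])
    output_img_v = V(val_tfms(output_img/2)[None], requires_grad=True)
    return output_img_v, optim.LBFGS([output_img_v])

def step(loss_fn):
    # Closure for L-BFGS: recompute the loss and its gradient w.r.t. the image
    global n_iter
    optimizer.zero_grad()
    loss = loss_fn(output_img_v)
    loss.backward()
    n_iter+=1
    # {n_iter} is interpolated here; the logged runs shown later printed the literal text "n_iter"
    if n_iter%show_iter==0: print(f'Iteration: {n_iter}, loss: {loss.data[0]}')
    return loss

n_iter=0
max_iter = 1000
show_iter = 100
output_img_v, optimizer = get_opt()
# actn_loss stands for whichever loss function is being optimized, e.g. content_loss below
while n_iter <= max_iter: optimizer.step(partial(step,actn_loss))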

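As we know, a neural network is trained by an optimizer that, in each iteration, uses the gradient of the loss function to find the direction in which to adjust the network's parameters, and adjusts them to lower the loss. In other words, lowering the loss is the process of fitting the model.

Looking back at Figure 3, training for style transfer means optimizing the content and style losses to adjust the pixels of the output image until it matches both the content image and the style image.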

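The optimizer used in this project is L-BFGS, which Image Style Transfer Using Convolutional Neural Networks reports as working best for image synthesis.

"BFGS" in L-BFGS abbreviates the names of the four inventors of the algorithm (Broyden–Fletcher–Goldfarb–Shanno), and "L" stands for limited memory. Unlike SGD and Adam, L-BFGS is rarely a good fit for training deep neural networks. Besides the gradient of the loss, it also uses second-order (Hessian) information, which it approximates from a history of recent gradients and parameter updates rather than computing the Hessian directly. This costs more computation per step and more memory to keep that history, which is one reason it is not used as widely as SGD and Adam.

If the gradient of the loss gives the direction in which to adjust the parameters, the curvature information describes how fast the gradient changes and is used to choose the step size. L-BFGS can therefore adjust the parameters more precisely than SGD with momentum, but at a higher computational cost, so for deep networks with millions of parameters it is usually not a good choice. For style transfer, where the network parameters are not adjusted at all, L-BFGS can play to its strengths, which is why this otherwise unpopular optimizer is used in this project.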
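A minimal sketch of how torch.optim.LBFGS is driven, assuming a made-up quadratic objective in place of the style transfer loss. L-BFGS may evaluate the loss several times within one step, so step() takes a closure that recomputes the loss and its gradients. The tensor names and the number of outer steps are illustrative.

```python
import torch

target = torch.tensor([1.0, -2.0, 3.0])
x = torch.zeros(3, requires_grad=True)      # the quantity being optimized
optimizer = torch.optim.LBFGS([x], lr=1.0)

def closure():
    # Called by the optimizer whenever it needs a fresh loss and gradient.
    optimizer.zero_grad()
    loss = ((x - target) ** 2).sum()
    loss.backward()
    return loss

for _ in range(5):
    optimizer.step(closure)

final_loss = float(((x.detach() - target) ** 2).sum())
```

The step function in the notebook plays the role of closure here, and partial binds the loss function to it so that the optimizer can call it without arguments.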

Content Reconstruction

block_ends[3]

32

sf = SaveFeatures(children(m_vgg)[block_ends[3]])

def content_loss(x):
    m_vgg(x)
    out = V(sf.features)
    return F.mse_loss(out, targ_v)*1000

output_img_v, optimizer = get_opt()

m_vgg(VV(img_tfm[None]))
targ_v = V(sf.features.clone())

n_iter=0
while n_iter <= max_iter: optimizer.step(partial(step, content_loss))

Iteration: n_iter, loss: 0.14002405107021332
Iteration: n_iter, loss: 0.05928822606801987
Iteration: n_iter, loss: 0.037577468901872635
Iteration: n_iter, loss: 0.027887802571058273
Iteration: n_iter, loss: 0.02253057062625885
Iteration: n_iter, loss: 0.01918598636984825
Iteration: n_iter, loss: 0.016832195222377777
Iteration: n_iter, loss: 0.015042142942547798
Iteration: n_iter, loss: 0.013666849583387375
Iteration: n_iter, loss: 0.01256621815264225

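Recalling the content loss from Part 1, here we compute the MSE between the layer-32 activations of the content image and of the output image. block_ends[3] (32) was chosen over block_ends[2] or block_ends[4] empirically, based on the final results. The mse_loss result is multiplied by 1000 because the raw loss is very small, and scaling it up helps the optimization. In the end we get an image of a bald eagle that is not quite the bald eagle in the original photo, which is exactly what we want.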

Figure 7

Style Reconstruction

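sfs = [SaveFeatures(children(m_vgg)[idx]) for idx in block_ends]  # register hooks before the forward pass
m_vgg(VV(img_tfm[None]))
targ_vs = [V(o.features.clone()) for o in sfs]
# Assumed step, missing from the original listing: targ_styles is used later but never defined
style_tfm = val_tfms(style)
m_vgg(VV(style_tfm[None]))
targ_styles = [V(o.features.clone()) for o in sfs]
[o.shape for o in targ_vs]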

[torch.Size([1, 64, 710, 710]),
 torch.Size([1, 128, 355, 355]),
 torch.Size([1, 256, 177, 177]),
 torch.Size([1, 512, 88, 88]),
 torch.Size([1, 512, 44, 44])]

def gram(input):
    # One row per feature map; dividing by numel and scaling by 1e6 keeps the values in a workable range
    b,c,h,w = input.size()
    x = input.view(b*c, -1)
    return torch.mm(x, x.t())/input.numel()*1e6

def gram_mse_loss(input, target): return F.mse_loss(gram(input), gram(target))

def style_loss(x):
    m_vgg(x)
    outs = [V(o.features) for o in sfs]
    losses = [gram_mse_loss(o, s) for o,s in zip(outs, targ_styles)]
    return sum(losses)

output_img_v, optimizer = get_opt()  # assumed: restart from a fresh noise image
n_iter=0
while n_iter <= max_iter: optimizer.step(partial(step,style_loss))

Iteration: n_iter, loss: 52.1091423034668
Iteration: n_iter, loss: 4.63181209564209
Iteration: n_iter, loss: 0.9747222661972046
Iteration: n_iter, loss: 0.4136861264705658
Iteration: n_iter, loss: 0.2491530179977417
Iteration: n_iter, loss: 0.1806013584136963
Iteration: n_iter, loss: 0.14466366171836853
Iteration: n_iter, loss: 0.12279225140810013
Iteration: n_iter, loss: 0.10791991651058197
Iteration: n_iter, loss: 0.09749597311019897

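As with content matching, the style loss is obtained by summing, over all layers in block_ends, the MSE between the Gram matrices of the activations. The Gram matrix is the flattened feature vectors multiplied by their transpose and divided by (b * c * h * w); input.numel() returns exactly b * c * h * w, and the factor 1e6 again serves to scale the Gram matrix values. Figure 8 shows the style features extracted from the style image.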
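A NumPy counterpart of gram, scaled the same way as in the listing above, makes it easy to check that the Gram matrix ignores where things are in the image. The sketch uses small random activations and shuffles their spatial positions; the array sizes and names are chosen for illustration only.

```python
import numpy as np

def gram(x):
    """Gram matrix of a (b, c, h, w) activation array, divided by its size and scaled by 1e6."""
    b, c, h, w = x.shape
    flat = x.reshape(b * c, -1)              # one row per feature map
    return flat @ flat.T / x.size * 1e6

def mse(a, b):
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
feats = rng.uniform(0, 1, size=(1, 4, 6, 6))

# Apply the same random shuffle of the 36 spatial positions to every feature map.
perm = rng.permutation(36)
shuffled = feats.reshape(1, 4, 36)[:, :, perm].reshape(1, 4, 6, 6)

style_diff = mse(gram(shuffled), gram(feats))   # Gram matrices agree
content_diff = mse(shuffled, feats)             # the activations themselves do not
```

A Gram-based loss cannot tell the shuffled activations from the original ones, while a direct MSE on the activations can. This is the sense in which the style loss keeps texture statistics and drops the spatial layout.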

Figure 8

Style Transfer

def comb_loss(x):
    m_vgg(x)
    outs = [V(o.features) for o in sfs]
    losses = [gram_mse_loss(o, s) for o,s in zip(outs, targ_styles)]
    cnt_loss = F.mse_loss(outs[3], targ_vs[3])*1e+6
    style_loss = sum(losses)
    return cnt_loss + style_loss

output_img_v, optimizer = get_opt()  # assumed: restart from a fresh noise image
n_iter=0
while n_iter <= max_iter: optimizer.step(partial(step,comb_loss))

Figure 9

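Figure 9 is the final result of the style transfer. Rendered as an oil painting that picks up the interwoven blues and yellows of the style painting 《夜港》, the bald eagle differs from the one in the original photo. Judging by the result, A Neural Algorithm of Artistic Style is quite impressive.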

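Summary

Content reconstruction + style reconstruction = style transfer. Both reconstructions are carried out with a deep neural network, and the optimizer adjusts the pixels of the output image according to the content loss and the style loss. Style reconstruction flattens each feature map to discard the spatial information of the style image (the content of the original picture); the Gram matrix is built from the dot products of these flattened feature maps.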