Neural style transfer – tribute to an original idea

Since machine learning has become extremely popular in recent years, new ideas are formed every day and eventually turned into research. Out of curiosity, I looked at the number of machine learning papers submitted to arXiv: 4012 in 2017, compared with 2622 in 2016. Not all of these papers get published, but ML and DL continue to grow in popularity.

The flip side is that some ideas might get overlooked. The purpose of this post is to pay “tribute” to one that really stands out, and that I would claim is one of the most original ideas in recent years: Neural Style Transfer.


Description of NST

Leon A. Gatys, Alexander S. Ecker and Matthias Bethge wrote the original 2015 paper, which introduced a technique to separate style from content in an image and project that style onto another image while preserving the latter's semantic content. The idea exploits a pre-trained CNN, but uses it to produce a style-transformed image ( X_G ) instead of a prediction vector ( \hat y ).

A style image ( X_S ) and a content image ( X_C ) are chosen and combined into a new generated image ( X_G ), where each of these is essentially a (ColorChannels, Height, Width)-dimensional array. This is done by replacing the cost function of the pre-trained convolutional neural network (VGG-net in the original 2015 NST paper) in order to exploit the learnt features and apply transfer learning to the task.



Let's start with the VGG-net architecture: this CNN comes in a 16-layer or a 19-layer version and is trained on the ImageNet data with 1000 different classes. The final layer of the original model is thus a 1000-way softmax ( \hat y ). This particular model has achieved a top-1 error rate of 24.4% on ImageNet's 1000-class classification task.

Fig. 1. VGG-net, illustration stolen from:

In the case of neural style transfer, the final softmax is ignored, and the activations of a chosen layer, computed for X_G, are used as the output instead. The cost function thus ignores the softmax and instead minimizes the distance between the activations of X_G and, respectively, the style (Gram) matrix of X_S and the content activations of X_C.


The overall cost function:

J(X_G) = \alpha J_{content}(X_C,X_G) + \beta J_{style}(X_S,X_G)

The \alpha and \beta coefficients weight the two underlying cost functions: the content cost J_{content}(X_C,X_G) and the style cost J_{style}(X_S,X_G). The balance between the two determines how much the style dominates in the transformed image.
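To get a feel for how \alpha and \beta shift the balance, here is a minimal sketch of the weighted sum. The cost values and the coefficient pairs are made up for illustration, and `style_share` is a hypothetical helper that is not part of the paper.

```python
def total_cost(j_content, j_style, alpha, beta):
    """J = alpha * J_content + beta * J_style."""
    return alpha * j_content + beta * j_style

def style_share(j_content, j_style, alpha, beta):
    """Fraction of the total cost contributed by the style term (hypothetical helper)."""
    return beta * j_style / total_cost(j_content, j_style, alpha, beta)

# Made-up cost values, only to show the effect of the coefficients.
j_content, j_style = 2.0, 0.5
for alpha, beta in [(10.0, 1.0), (10.0, 40.0), (1.0, 1000.0)]:
    j = total_cost(j_content, j_style, alpha, beta)
    share = style_share(j_content, j_style, alpha, beta)
    print(f"alpha={alpha:g}, beta={beta:g}: J={j:g}, style share={share:.2f}")
```

The larger \beta is relative to \alpha, the larger the share of the total cost that comes from the style term, and the more the optimization is pushed towards matching the style.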


The content cost:

The content cost is simply the squared distance between the activations of the generated image a^{(G)} and the content image a^{(C)}: J_{content}(X_C,X_G) = \theta \sum (a^{(C)} - a^{(G)})^2, where a denotes the activations of the chosen output layer and \theta is a scaling constant. a^{(C)} and a^{(G)} are found through forward propagation of X_C and X_G, respectively. In this case X_G is initialized as white noise.
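The content cost formula can be sketched in a few lines of NumPy. This is only an illustration: the activations are random stand-ins for what a real forward pass through the CNN would produce, and setting \theta to one over the number of activations is my assumption, since the formula above leaves the constant open.

```python
import numpy as np

def content_cost(a_c, a_g):
    """theta * sum((a_C - a_G)^2) for activations of shape (Chan, Height, Width)."""
    theta = 1.0 / a_c.size          # assumed scaling constant
    return theta * np.sum((a_c - a_g) ** 2)

# Random stand-ins for the layer activations of X_C and X_G.
rng = np.random.default_rng(0)
a_c = rng.normal(size=(8, 5, 5))
a_g = rng.normal(size=(8, 5, 5))

print(content_cost(a_c, a_g))
print(content_cost(a_c, a_c))       # identical activations give 0.0
```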


The style cost:

The style cost is similar to the content cost in that the loss is a squared distance, here between the style matrices of the style image and of the generated image: J_{style}^{[l]}(X_S,X_G) = \theta \sum _{i=1}^{Chan}\sum_{j=1}^{Chan}(S^{(S)}_{ij} - S^{(G)}_{ij})^2. The difference is that each entry of a style (Gram) matrix S is the dot product of the activations of two channels in a layer: S_{ij} = v_{i}^T v_{j}, where v_i is the flattened activation map of channel i. According to the authors, the best results are obtained by using multiple layers l (5 in the paper) in order to capture both pixel-level and object-level covariation, which represents the style. The style cost is thus a weighted sum of the per-layer differences between the style matrices of X_G and X_S. The final style cost can then be written as: J_{style}(X_S,X_G) = \sum_{l} \lambda^{[l]} J^{[l]}_{style}(X_S,X_G).
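As a rough sketch, the style matrix and the layer-weighted style cost could look like this in NumPy. The activations are again random stand-ins for real CNN outputs, and the layer shapes, the equal \lambda weights and the choice of \theta are assumptions made for the example, not values taken from the paper.

```python
import numpy as np

def gram_matrix(a):
    """Style matrix with entries S_ij = v_i^T v_j, v_i being channel i flattened."""
    v = a.reshape(a.shape[0], -1)          # (Chan, Height * Width)
    return v @ v.T                         # (Chan, Chan)

def layer_style_cost(a_s, a_g):
    """theta * sum_ij (S^(S)_ij - S^(G)_ij)^2 for a single layer."""
    chan, height, width = a_g.shape
    theta = 1.0 / (2 * chan * height * width) ** 2   # assumed normalisation
    return theta * np.sum((gram_matrix(a_s) - gram_matrix(a_g)) ** 2)

def style_cost(acts_s, acts_g, lambdas):
    """Weighted sum of the per-layer style costs."""
    return sum(lam * layer_style_cost(a_s, a_g)
               for lam, a_s, a_g in zip(lambdas, acts_s, acts_g))

# Made-up activations for five layers of decreasing spatial size.
rng = np.random.default_rng(0)
shapes = [(4, 16, 16), (8, 8, 8), (16, 4, 4), (32, 2, 2), (32, 1, 1)]
acts_s = [rng.normal(size=s) for s in shapes]
acts_g = [rng.normal(size=s) for s in shapes]
lambdas = [0.2] * 5                        # equal layer weights (assumption)

print(style_cost(acts_s, acts_g, lambdas))
print(style_cost(acts_s, acts_s, lambdas)) # same activations give 0.0
```

Note that the style matrix sums over all spatial positions, so it does not depend on where in the image a feature occurs. That is what lets it capture style rather than content.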



Fig. 2. Convolutional neural network, stolen from: Gatys et al., 2015.

As can be seen in the illustration above (taken from the original paper), the different style matrices (layers) yield different levels of style features: in style reconstruction exhibit (a), the features are essentially small dots, and as the results are propagated forward (b, c, d), the dots transform into larger areas with different colors (e).


The training process:

The optimization step is repeated n times, so that X_G gets closer and closer to both the style matrices and the content image, according to the balance in the overall cost. In my experience with VGG-16 and 800×600 images, the most visually pleasing results appear after 500 iterations. Given the size of the network and of the inputs, using a GPU is strongly recommended.
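The loop itself can be illustrated with a toy version that runs in NumPy without a GPU. It is a sketch under strong assumptions: the raw pixels stand in for the CNN activations (no VGG involved), plain gradient descent with hand-written gradients replaces whatever optimizer a real implementation would use, and the learning rate, \alpha, \beta and the scaling of the two costs are arbitrary choices. The point is only that the image X_G is what gets updated, starting from white noise.

```python
import numpy as np

def gram(a):
    # a has shape (channels, positions); normalise by the number of positions
    return a @ a.T / a.shape[1]

def cost_and_grad(x_g, a_c, s_s, alpha, beta):
    """Return the total cost and its gradient with respect to x_g."""
    a_g = x_g.reshape(x_g.shape[0], -1)    # toy "activations": the raw pixels
    diff_c = a_g - a_c
    diff_s = gram(a_g) - s_s
    j_content = np.mean(diff_c ** 2)
    j_style = np.sum(diff_s ** 2)
    grad_content = 2.0 * diff_c / diff_c.size
    grad_style = 4.0 * diff_s @ a_g / a_g.shape[1]
    grad = alpha * grad_content + beta * grad_style
    return alpha * j_content + beta * j_style, grad.reshape(x_g.shape)

def run_style_transfer(x_c, x_s, n_iter=200, lr=0.5, alpha=1.0, beta=1.0, seed=0):
    rng = np.random.default_rng(seed)
    a_c = x_c.reshape(x_c.shape[0], -1)
    s_s = gram(x_s.reshape(x_s.shape[0], -1))
    x_g = rng.uniform(size=x_c.shape)      # X_G starts as white noise
    history = []
    for _ in range(n_iter):
        cost, grad = cost_and_grad(x_g, a_c, s_s, alpha, beta)
        history.append(cost)
        x_g = x_g - lr * grad              # only the image is updated
    return x_g, history

# Tiny random "images" of shape (ColorChannels, Height, Width).
rng = np.random.default_rng(42)
x_c = rng.uniform(size=(3, 4, 4))
x_s = rng.uniform(size=(3, 4, 4))
x_g, history = run_style_transfer(x_c, x_s)
print(f"cost at start: {history[0]:.4f}, after {len(history)} steps: {history[-1]:.4f}")
```

In a real setup the activations come from a forward pass through the pre-trained network, and the gradients with respect to X_G are obtained by backpropagation rather than by hand.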

Resulting image ( X_G ) for 100 iterations:


Total cost for 100 iterations:


Implications and directions of research

Yongcheng J., Yezhou Y., Zunlei F., Jingwen Y., Yizhou Y., and Mingli S. (2018) posted a paper on arXiv summarizing the progress of NST as a field, as well as the current problems and promising directions for solving them. As they show, multiple directions of research have arisen since the paper by Gatys et al. in 2015. On a meta-level, however, they seem to be divided into two main categories: online and offline optimization.

The online methods resemble the original 2015 method described above, whereas the offline approach divides the task into a training process and a subsequent forward pass based on the function learnt during training. The latter approach allows faster computation and essentially real-time processing of images, so that e.g. video can be transformed in real time. Generative adversarial networks are also used for this in some cases.

A recent paper by E. Grinstein, N. Q. K. Duong, A. Ozerov and P. Perez also tests the NST technique on audio data, using style matrices computed from spectrograms of the sound signal. However, the results are not as successful as for images, presumably because “style” and “content” are not as easy to separate for sound. You can take a listen here.


Why is this idea original?

  • First of all, it's a very creative solution to a complex problem: Can a model separate style from content in an image? And can the separated style be projected onto a new image? The answers seem to be yes! And the process is simpler than one might have expected. The real value lies in learning from the style (Gram) matrix.
  • Second, the idea was (according to the authors) inspired by biological systems. This was translated into a mathematical expression that happens to fit well in the framework of a CNN, which meant that the learnt features of VGG or any other CNN could be exploited through transfer learning.
  • Third, it stands out, and is not ‘just another’ optimisation like another InsertNameHere-Net (although those are of course needed). It has actually created a new area of research (which Yongcheng et al. argue will create even more areas), and created a lot of interest outside academia.
  • Fourth, it is quite entertaining! And easy to set up for anyone who wants to experiment with their own pictures.


If you are eager to get your hands dirty with this technique, there are multiple GitHub repositories like this one or this one. If you want to learn more about the subject, I suggest you read the paper or take Course 4 of Andrew Ng’s Deep Learning Specialization on Coursera.


Next blog post:

Experiment: How well will a drawing-style generalize to a video sequence?

