Generating Stylized Portraits with Deep Learning Models

I thought it would be nice to experiment with the latest trend and create a stylized portrait for this website. However, I did not want to use paid services, as I wanted to learn the process of training models and generating images myself.

I was familiar with Stable Diffusion at the time but did not fully understand how it works. Stable Diffusion is a text-to-image model, similar to DALL-E. I can’t pretend to understand the mathematical formulas in the architecture, but I can attempt to understand the concept with my limited knowledge of neural networks.

The Stable Diffusion model is trained on pairs of images and captions. During training, noise is gradually added to the image through a series of mathematical functions until only noise remains, and the model learns to predict that noise given the caption. To generate an image, the process is reversed: starting from a seed image of pure noise, the model looks for patterns it associates with the keywords in the prompt and removes a little of the predicted noise at each step. By repeating this process of finding patterns and tweaking pixels, it gradually forms a full image.
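To make this a little more concrete, here is a minimal sketch of the forward (noising) direction of a DDPM-style diffusion model. The schedule values and variable names are illustrative rather than Stable Diffusion's exact configuration, and the real model works on compressed latents rather than raw pixels.

```python
# Sketch of the forward (noising) direction of a DDPM-style diffusion model.
# Schedule values are typical defaults, not Stable Diffusion's exact setup.
import torch

T = 1000                                   # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Noise a clean image x0 up to step t (the training direction)."""
    eps = torch.randn_like(x0)
    return alpha_bars[t].sqrt() * x0 + (1.0 - alpha_bars[t]).sqrt() * eps

# During training the network sees (noisy image, caption, t) and learns to
# predict eps. Generation runs the process in reverse: start from pure noise
# and repeatedly subtract the predicted noise, guided by the prompt.
```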

With a little searching, I stumbled across the Stable Diffusion WebUI. Essentially, it provides a graphical user interface for Stable Diffusion and reduces the learning curve for anyone who wishes to use it. Following the steps of a tutorial got me most of the way there. As with most Python applications, installation is never without obstacles, so I’ve compiled a list of troubleshooting links below that could help.

The Stable Diffusion WebUI is straightforward to use. I experimented with models found on Civitai and, with little effort, could create some nice images. Specialised models are created using targeted data sets; for example, I created these with an architectural model. These models have been the subject of debate, as their training data sets make use of artists’ copyrighted work.
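The same kind of generation can also be scripted outside the WebUI. Below is a rough sketch using the Hugging Face diffusers library; the checkpoint filename, prompt, and settings are placeholders for whatever model you download from Civitai.

```python
# Rough scripted equivalent of WebUI txt2img using Hugging Face diffusers.
# "architecture_model.safetensors" is a placeholder for a Civitai checkpoint.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_single_file(
    "architecture_model.safetensors",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="modern tropical house, lush garden, golden hour, photorealistic",
    negative_prompt="blurry, low quality",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("architecture.png")
```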

One of the primary reasons for exploring this tool was to create a profile picture for my site. I experimented with the inpaint function, which essentially superimposes another image and attempts to blend the two. I then ran the result several times through Stable Diffusion’s img2img tool to smooth the blend. This was the end result.
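For illustration, a scripted version of those repeated low-strength img2img passes might look something like this. It is a sketch using diffusers with placeholder paths, prompt, and parameters, not the exact WebUI settings I used.

```python
# Sketch of repeated low-strength img2img passes with diffusers.
# File names, prompt, and parameters are placeholders.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = Image.open("superimposed_portrait.png").convert("RGB")

# Each low-strength pass changes the image only slightly, gradually
# blending the pasted face into the surrounding style.
for _ in range(4):
    image = pipe(
        prompt="stylized portrait of a man, clean brush strokes",
        image=image,
        strength=0.3,          # low strength preserves the composition
        guidance_scale=7.0,
    ).images[0]

image.save("blended_portrait.png")
```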

I was unsatisfied with this approach; it felt like a simple trick that could be achieved with Photoshop. I explored further and learnt about LoRA (Low-Rank Adaptation). LoRA was initially proposed as a faster method to fine-tune models. In the conventional approach, fine-tuning requires re-training with back-propagation to modify the original weights. LoRA takes a different approach: it freezes the original weights and trains a much smaller set of additional weights that are added on when running the model. I’ve included links in the reference section that explain this further. The main advantage of LoRA is the reduced computational load, which meant I could run the training relatively quickly on my GTX 1060 6GB.
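Conceptually, LoRA freezes the original weight matrix and learns a small low-rank update that is added to its output. A minimal sketch of the idea (not the actual implementation used by the training scripts) could look like this:

```python
# Minimal sketch of the LoRA idea: the original weight W is frozen and a
# low-rank update (B @ A) is learned and added to its output.
# Shapes, names, and defaults here are illustrative only.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # original weights stay frozen
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x  +  scale * B(A x): only A and B receive gradients during fine-tuning.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```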

To train a LoRA model, I used Kohya’s GUI, which provides a graphical front end for Kohya’s Stable Diffusion training scripts. Following this tutorial and this tutorial, I selected a base model from Civitai and compiled a library of my portraits. I was then able to produce a safetensors file containing my LoRA weights for use in Stable Diffusion. Rather than apply my LoRA model in the prompt, I used After Detailer, which allows me to target the face when applying it. The end result turned out pretty well, and it’s clear that this was not simply superimposing my face.
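For completeness, the resulting safetensors file can also be applied in a script. This is a sketch using diffusers' LoRA loading support with placeholder filenames; it is not the After Detailer workflow described above, which restricts the LoRA to the detected face region.

```python
# Sketch of applying a trained LoRA safetensors file with diffusers.
# File names are placeholders; in my actual workflow the LoRA was applied
# through After Detailer in the WebUI so it only affects the detected face.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_single_file(
    "civitai_base_model.safetensors", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("my_portrait_lora.safetensors")

image = pipe(
    prompt="stylized portrait of a man, detailed face, soft lighting",
    num_inference_steps=30,
    cross_attention_kwargs={"scale": 0.8},   # LoRA strength
).images[0]
image.save("lora_portrait.png")
```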

I could further improve the model by including side portraits of myself, as I noticed that output images with side profiles did not reproduce my facial features well.

This project also gave me quite a good insight into how Stable Diffusion works and the workflows required to perfect images, and I think that with some work the results produced by Lensa could be matched. At this stage, I think this tool gives aspiring creatives who may not have the artistic talent (like myself) a way to express themselves. I don’t think it is necessarily a threat in the way deepfakes turned out to be: while deepfakes can be made easily and spread quickly, they are often debunked at the same speed.

It is arguable whether this technology aids ideation, as the outputs are based purely on the training data it is provided with, i.e. work that was created before, by someone else. I noticed that the models I used tended to produce very similar results after a while, purely due to the nature of the original datasets. I think that over-reliance on such tools could be detrimental to human creativity.

That being said, I think there is value in learning this creative tool, and I would like to work with video next using Warp Fusion.
