
Hitchhiker's Guide to Text-To-Image Generation

[Blog in presentation format](https://hardippatel.com/presentation/sd101)

* * *

## Intro

- Full-time: Backend-heavy full-stack developer
- Spare time: Work on my [Accountability](https://hardippatel.com) site
- Hobbies:
  - Snooker (very recent)
  - Box Cricket
  - Try new hobbies
- Currently Reading
  - [Make by Pieter Levels](https://readmake.com)

* * *

## Why this topic? ...even though you're a GenAI "noob"

- Provide a beginner's perspective
- Wanted to help lower the **barrier to entry**

* * *

## Inspiration ...for getting into it

- Want to create a dynamically updating hero picture for my [Accountability](https://hardippatel.com) site
- [Pieter Levels](https://twitter.com/levelsio) (Check [Photo AI](https://photoai.com/))
- [Sayak Paul](https://huggingface.co/sayakpaul)
- [Overpowered](https://www.youtube.com/watch?v=IlIhykPDesE)
- [Abhishek Thakur](https://www.linkedin.com/in/abhi1thakur)

* * *

## Journey Overview

- Tried **Midjourney** on Discord very early on
- Tested **Automatic1111** after watching Overpowered
- Reached saturation with the UI, so wanted to try code
  - So hopped on to [Google Colab](https://colab.research.google.com)
- Tried **ComfyUI** for this talk and it is ***quite awesome*** to say the least

* * *

## What is Stable Diffusion?

- Text-to-image model, a combination of...
  - Language model, to transform text into a latent representation
  - Generative image model, which produces an image conditioned on that representation
- Based on Diffusion (Probabilistic) Models
  - A class of latent variable generative models

* * *

## UI Tools for No-Code

- Automatic1111
- ComfyUI
- Invoke AI
- DiffusionBee

* * *

## [Automatic1111](https://github.com/AUTOMATIC1111/stable-diffusion-webui)

[Installation Link](https://github.com/AUTOMATIC1111/stable-diffusion-webui?tab=readme-ov-file#installation-and-running)

- Widely used
- Good extension support
- Most compatible
- But unstable...

* * *

## ComfyUI

[Installation Link](https://github.com/comfyanonymous/ComfyUI?tab=readme-ov-file#installing)  
[Tutorial/Guide](https://comfyanonymous.github.io/ComfyUI_tutorial_vn/)

- Getting slack lately
- Intuitive UI
- Very stable

* * *

## Terminologies

- PyTorch
  - Deep learning framework based on Torch
- Base Model
  - Foundational model upon which specific model variants are built
  - For example, v1.5, v2, XL 0.9, XL 1.0
- Checkpoint (Model)
  - Pretrained weights
  - Reflects the types of images the model was trained on
  - For example, Juggernaut XL, Anything v3.0, epicRealism, etc.

#

- Guidance Scale (CFG)
  - Controls how closely the image generation process **follows the text prompt**
- **LoRA** (**Lo**w-**R**ank **A**daptation)
  - Adds specific styles or characters while maintaining manageable file sizes
- **PEFT** (**P**arameter-**E**fficient **F**ine-**T**uning)
  - Adapts pre-trained models (such as pre-trained language models, PLMs) by training a small number of extra parameters while keeping the original parameters frozen
  - Used to create LoRAs

#

- Weights
  - Numerical values associated with the connections between neurons in a neural network architecture
  - [Visualize](https://hackernoon.imgix.net/hn-images/1*_RLj3E4Lt8cZzlwtmcbqlA.png)
- Prompt
  - Text-based instruction

#

- Text encoder
  - Transformer language model
  - Encodes the tokenized prompt into embeddings that are fed into the U-Net
- U-Net
  - Takes encoded text (plain text processed into a format it can understand) and a noisy array of numbers as inputs, and predicts the noise to remove
- VAE
  - Encodes and decodes images to and from a smaller latent space
- [Visualize](https://miro.medium.com/v2/resize:fit:1156/format:webp/1*ka4ci_UymoxuH4LAjiA6iw.png)

#

- Pipeline
  - Bundles all the necessary components for running a diffusion model in inference
  - Provides flexibility
- Seed
- Fine-Tuning
  - Further train a model that was trained on a wide dataset using a narrow dataset

* * *

## Code demo

- [Inference Code](https://gist.github.com/knightkill/a0f207c69068479686057a1293d2cfa0)
- [Model Fine-Tuning code](https://gist.github.com/knightkill/a0f207c69068479686057a1293d2cfa0)
  - Prepare images for training using [Birme](https://www.birme.net/)
  - [Don't use a token that is already trained](https://github.com/2kpr/dreambooth-tokens/blob/main/all_single_tokens_to_4_characters.txt)
- [Inference code with trained model](https://gist.github.com/knightkill/d6a7db77ea6a6fcf2bad5eb14dcf0b7f)

* * *

## Further capabilities of Stable Diffusion

- Inpainting
  - Restore or repair parts of an image
- Outpainting
  - Extend the canvas of the image
- Image To Image
  - New image generated from an input image and a text prompt
  - New image will follow the composition and color of the input image
- Depth To Image
  - Uses the depth of the input image to guide the composition of the new image

* * *

# THAT'S ALL FOLKS!

* * *

## Credits

- [Towards Data Science](https://towardsdatascience.com)
- [Hugging Face](https://huggingface.co/docs)
- [Google Colab](https://colab.google)
- [BIRME](https://www.birme.net/)
- [Automatic1111](https://github.com/AUTOMATIC1111)
- [ComfyUI](https://github.com/comfyanonymous/ComfyUI)
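
* * *

Two entries from the Terminologies slides, Guidance Scale (CFG) and LoRA, are easier to grasp with numbers. The sketch below is only an illustration in plain Python, not code from the linked gists: `apply_guidance` and `lora_param_counts` are hypothetical helper names, the noise values are made up, and the 768 x 768 layer with rank 4 is an assumed example size. It uses the usual classifier-free guidance formula (unconditional prediction plus scale times the difference to the prompt-conditioned prediction) and counts how many values LoRA trains for one weight matrix compared with full fine-tuning.

```python
from typing import List, Tuple


def apply_guidance(uncond: List[float], cond: List[float], scale: float) -> List[float]:
    """Classifier-free guidance on toy noise predictions.

    scale = 0 ignores the prompt, scale = 1 returns the prompt-conditioned
    prediction, and larger values push further in the prompt's direction.
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def lora_param_counts(d_out: int, d_in: int, rank: int) -> Tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates the whole d_out x d_in matrix. LoRA keeps it
    frozen and trains two thin matrices: B (d_out x rank) and A (rank x d_in).
    """
    full = d_out * d_in
    lora = d_out * rank + rank * d_in
    return full, lora


if __name__ == "__main__":
    # Made-up numbers standing in for the U-Net's two noise predictions.
    uncond = [0.0, 1.0, 2.0]  # prediction without the prompt
    cond = [1.0, 1.0, 0.0]    # prediction with the prompt
    for scale in (0.0, 1.0, 7.5):
        print(scale, apply_guidance(uncond, cond, scale))

    # Assumed layer size and rank, chosen only for illustration.
    full, lora = lora_param_counts(d_out=768, d_in=768, rank=4)
    print(f"full: {full}, lora: {lora}, ratio: {full // lora}x")
```

In a real pipeline these operations run on tensors inside the denoising loop rather than on Python lists, but the arithmetic is the same idea: a higher guidance scale makes the result follow the prompt more strictly, and a low rank is what keeps LoRA files small.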