Possible LLM+GAN+Stable Diff Image Generator?

Cryso Agori

V.I.P. Member
@Ral

Moving this here 'cause I don't wanna spam the AI thread.

I will when I get home, thanks for showing this to me :mshad

Off- but also on-topic: I've been wondering whether it's possible to make a similar multi-modal image generator with SD.

Stable Diffusion can generate images as good as, if not better than, DALL-E 3 (especially with SDXL), but it requires a ton of add-ons to do so. That's mainly because SD is built to generate images first and understand text second, so it can't parse prompts or full sentences as well as Bing Image Creator, which uses ChatGPT as its prompt interpreter and therefore understands text way better.

I was wondering if it's possible to take a free Llama model, train it on English and on understanding prompts so it can parse sentences (and therefore multiple subjects) way better, and then connect it to SD.

The problem is how to pass the info from the Llama to SD without losing any information (and thus losing the multi-modality). I'm not a coder (yet), so I have no clue how it could be done, but as far as I can tell the idea is sound.
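If all the Llama hands to SD is a rewritten text prompt, the glue itself is simple; the hard part is exactly this loss, since everything gets squeezed through one text channel. A minimal sketch of that text-only handoff, assuming Hugging Face transformers and diffusers; the model names and the instruction template are placeholders, not a tested recipe (the Llama weights are gated, so swap in whatever chat model you have):

```python
# Minimal sketch of the text-only handoff: LLM rewrites the prompt, SDXL draws it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from diffusers import StableDiffusionXLPipeline

# "Prompter AI": expand a casual request into a detailed image prompt.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
llm = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map="auto"
)
request = "a knight and a dragon having tea in a garden"
inputs = tok(f"Rewrite as a detailed image prompt: {request}", return_tensors="pt").to(llm.device)
out_ids = llm.generate(**inputs, max_new_tokens=80)
# Keep only the newly generated tokens, not the echoed instruction.
detailed = tok.decode(out_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Hand the expanded prompt to SDXL.
sd = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
sd(prompt=detailed).images[0].save("out.png")
```

The obvious catch is that only text crosses the boundary, so anything the LLM "understood" but didn't write down is lost.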

In fact, Fooocus is basically a weaker version of this.
Thinking on this more, said Llama model would probably also need to be able to understand images too.

GPT-4 can actually understand and caption images pretty accurately, which is probably part of why DALL-E 3 generates so well. So the model would need image recognition capabilities too, which would also enable better img2img.

Extending this: if this second "Prompter AI" can understand both text and images, it could caption images as well, and you could probably use those captions to train it further, similar to CLIP or BLIP.
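The captioning half is already usable off the shelf; a small sketch using the stock BLIP captioning checkpoint through transformers:

```python
# Caption an image with BLIP, e.g. the output of the SDXL sketch above.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("out.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
caption_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(caption_ids[0], skip_special_tokens=True))
```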

Another thing is perhaps adding a GAN to it. For example, StyleGAN-Human is, going by the research, way better at creating humans than diffusion models are.


Mainly, hands actually look like hands, unlike with diffusion models.

So I'm wondering if you can connect it all together in a pipeline like: Prompter AI (understands the prompt) -> Diffusion (generates a broad conceptual image from the prompt) -> GAN (draws over it with precision, based on the prompt and the SD image).

This way, if you want a realistic human, it can do it with good hands.
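A general "GAN precision pass" doesn't exist off the shelf, but GAN-based face restorers are the closest real thing that gets bolted onto SD this way today. A sketch using GFPGAN (faces only, so just the shape of the idea); it assumes the gfpgan package and a downloaded GFPGANv1.4.pth checkpoint:

```python
# GAN stage stand-in: a GFPGAN precision pass over SD's draft image.
import cv2
from gfpgan import GFPGANer

restorer = GFPGANer(model_path="GFPGANv1.4.pth", upscale=1)
draft = cv2.imread("out.png")  # SD's broad conceptual image (BGR, as cv2 loads it)
_, _, refined = restorer.enhance(draft, paste_back=True)  # repaint faces, keep the rest
cv2.imwrite("refined.png", refined)
```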

However, the problem is that GANs, unlike diffusion models, are very specific. Diffusion models can take a concept and apply it to a lot of things (like John Wick in a cartoon style), while a GAN can only generate the one thing it was trained on: a GAN trained on realistic faces can't make a cartoon face.

So maybe a pipeline where we swap the GAN and Diffusion stages would work:

Prompter AI (understands the prompt) -> GAN (draws a precise image from the prompt) -> Diffusion (draws over the precise image with stylized art, based on the prompt and the GAN image).
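This reversed order maps pretty directly onto img2img: use the GAN's precise output as the init image and let SD restyle it. A sketch with diffusers, where gan_image.png stands in for whatever the GAN stage produced:

```python
# GAN output -> SD img2img: keep the precise structure, repaint the style.
import torch
from PIL import Image
from diffusers import StableDiffusionXLImg2ImgPipeline

pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

init = Image.open("gan_image.png").convert("RGB").resize((1024, 1024))
styled = pipe(
    prompt="the same person, cartoon style",
    image=init,
    strength=0.6,  # 0 = keep the GAN image as-is, 1 = ignore it entirely
).images[0]
styled.save("styled.png")
```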

However, there are base GANs capable of generating multiple concepts, so perhaps something like StyleGAN3 could work together with the diffusion model.

The only problem is hardware, and how much RAM it would take to run what's basically three DNNs at the same time lol.
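Back-of-the-envelope numbers, assuming everything is loaded at once in fp16 (2 bytes per parameter; the parameter counts are rough public figures, and this ignores activations and other overhead):

```python
# Rough fp16 memory footprint: parameters * 2 bytes, converted to GiB.
def fp16_gib(params_billion: float) -> float:
    return params_billion * 1e9 * 2 / 1024**3

for name, b in [("7B LLM", 7.0), ("SDXL (UNet + text encoders)", 3.5), ("GigaGAN-scale GAN", 1.0)]:
    print(f"{name}: ~{fp16_gib(b):.1f} GiB")
# ~13 + ~6.5 + ~1.9, so 21+ GiB before overhead; even a 24 GB card is tight.
```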

Found it: https://mingukkang.github.io/GigaGAN/ ("GigaGAN: Large-scale GAN for Text-to-Image Synthesis")

I knew I'd heard of a GAN capable of generating multiple concepts before, so it may actually be possible.

LLM + Stable Diffusion XL + finetuned models/LoRAs + ControlNet + MultiDiffusion + GigaGAN + finetuned GAN models = kino

Even better would be packing it all into a Photoshop/Blender/Daz3D-esque 2D+3D image and model creator/editor.

You could have the AI not only generate an image, but generate layers along with the image, or make the layers yourself and have the AI generate a consistent image across them.

That way you could have:

Background Layer
Foreground Layer
Character Layer
Text Bubble Layer

So if you don't like a generation, instead of regenerating the entire image, you could just regenerate the layer you don't like, or edit it yourself through Photoshop-esque tools.
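Regenerating a single layer is basically inpainting with that layer's mask. A sketch using diffusers' inpainting pipeline; the filenames and mask are placeholders that a real layered editor would track for you:

```python
# Repaint only the masked region (white pixels) of the composite image.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

composite = Image.open("composite.png").convert("RGB").resize((512, 512))
mask = Image.open("character_layer_mask.png").convert("L").resize((512, 512))

fixed = pipe(
    prompt="same scene, but the character is smiling",
    image=composite,
    mask_image=mask,
).images[0]
fixed.save("composite_v2.png")
```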

With 3D, you could make or perhaps even generate 3D models, or combine them with OpenPose for extreme control.
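The OpenPose part already exists via ControlNet: render a skeleton from the 3D rig and use it to condition the generation. A sketch with diffusers; the checkpoint names are the usual SD 1.5 + OpenPose pairing and may need swapping for whatever mirror is current:

```python
# Pose-conditioned generation: a skeleton image drives the composition.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

pose = Image.open("pose_from_3d_rig.png").convert("RGB")  # placeholder skeleton render
pipe(prompt="a knight in full armor, dramatic lighting", image=pose).images[0].save("posed.png")
```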

Nigh-endless possibilities, but it probably takes way too much space and RAM lol. I wouldn't even know where to start with something like this.

Nice ideas, but the problems are obvious: learning how to code it all, finding an LLM with image recognition, getting the hardware to train it all, and figuring out whether it can even be run locally or whether it would take too much RAM/storage.

At least I know Amazon has a free tier on AWS.
 

Ral

[SUBTRACTED]
Administrator
Pronouns: He/Him
@Ral

Moving this here 'cause I don't wanna spam the AI thread.

Nice ideas, but the problems are obvious: learning how to code it all, finding an LLM with image recognition, getting the hardware to train it all, and figuring out whether it can even be run locally or whether it would take too much RAM/storage.

At least I know Amazon has a free tier on AWS.
Give Oracle Cloud a try. You can sign up for a free account that includes free resources, such as 24 GB of RAM and 4 vCPUs, offered under the "Always Free" services. I eventually upgraded to PAYG (Pay As You Go) because it gives you priority over free-account users, for obvious reasons :maybe However, it's important to note that these are ARM64 CPUs, so I'm not sure whether you can run any LLMs on the "Always Free" servers you spin up with them. Even with PAYG I haven't had to pay a single dime; I just make sure to stay within the free limits so I'm not charged for anything beyond them.
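For what it's worth, llama.cpp-family runtimes do run on ARM64 CPUs, so a small quantized model should fit in an Always Free box's 24 GB of RAM. A sketch with the llama-cpp-python bindings; the GGUF filename is a placeholder for whatever quantized model you download:

```python
# CPU-only inference with a quantized GGUF model via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048, n_threads=4)
out = llm("Rewrite as a detailed image prompt: a knight and a dragon having tea",
          max_tokens=80)
print(out["choices"][0]["text"])
```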
 