Prompting with different models
This is not an in-depth technical article. It is meant to make prompting with different models easier to understand, including why behavior differs.
Prompting effectiveness depends on:
- Text encoder used in the model
- Dataset used to train the model
- Prompt and negative prompt structure
Note
Different models have different prompting best practices. When in doubt, refer to model-specific notes on the Civitai website.
For example, standard SD15/SDXL models are prompted very differently from Pony-derived models and from SD35 or Flux.1 models. A big reason is which text encoder is used and how it was trained.
Prompt Parser
SDNext supports multiple prompt parsers:
- Native: Default
- A1111: Compatible with the A1111 implementation
- Compel: Use compel-style attention syntax, e.g. dog++ or dog--
- xHinker: Alternative engine compatible with T5 text-encoder
Native Prompt Parser
Prompt Attention
- (x): emphasis. Multiplies the attention to x by 1.1, equivalent to (x:1.1)
- [x]: de-emphasis. Divides the attention to x by 1.1, approximately (x:0.91)
- (x:number): Multiplies the attention to x by number, which can be higher or lower than 1
Note
Multiple attention modifiers are multiplied, not added:
((a dog:1.5) with a (bone:1.5):1.5) is the same as (a dog:2.25) (with a:1.5) (bone:2.25)
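The multiplication rule can be checked with a small parser. This is a minimal sketch, not SD.Next's actual parser: `TOKEN` and `parse_attention` are hypothetical names, and the code assumes only the (x), [x], and (x:number) forms plus backslash-escaped brackets.

```python
import re

# Hypothetical tokenizer for the attention syntax described above.
TOKEN = re.compile(
    r"\\[()\[\]]"                               # escaped bracket, e.g. \(
    r"|\(|\["                                    # opening brackets
    r"|:\s*([+-]?(?:\d+\.?\d*|\.\d+))\s*\)"      # explicit weight, e.g. :1.5)
    r"|\)|\]"                                    # closing brackets
    r"|[^\\()\[\]:]+"                            # plain text
    r"|:|\\"                                     # stray colon or backslash
)


def parse_attention(prompt, step=1.1):
    """Return (text, weight) pairs; nested modifiers multiply."""
    chunks = []       # [text, weight] pairs in prompt order
    round_open = []   # chunk index where each unclosed "(" started
    square_open = []  # chunk index where each unclosed "[" started

    def scale(start, factor):
        # Apply a modifier to every chunk added since the bracket opened.
        for chunk in chunks[start:]:
            chunk[1] *= factor

    for m in TOKEN.finditer(prompt):
        tok, weight = m.group(0), m.group(1)
        if len(tok) == 2 and tok[0] == "\\":
            chunks.append([tok[1], 1.0])       # escaped bracket is plain text
        elif tok == "(":
            round_open.append(len(chunks))
        elif tok == "[":
            square_open.append(len(chunks))
        elif weight is not None and round_open:
            scale(round_open.pop(), float(weight))
        elif tok == ")" and round_open:
            scale(round_open.pop(), step)
        elif tok == "]" and square_open:
            scale(square_open.pop(), 1 / step)
        else:
            chunks.append([tok, 1.0])

    # Merge neighbours that ended up with the same weight.
    merged = []
    for text, w in chunks:
        if merged and merged[-1][1] == w:
            merged[-1][0] += text
        else:
            merged.append([text, w])
    return [(t.strip(), round(w, 4)) for t, w in merged if t.strip()]


print(parse_attention("((a dog:1.5) with a (bone:1.5):1.5)"))
# [('a dog', 2.25), ('with a', 1.5), ('bone', 2.25)]
```

The outer 1.5 is applied to everything inside the outer parentheses, so parts that already carry 1.5 end up at 1.5 × 1.5 = 2.25.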
Other Prompt Syntax
\(x\): Escapes parentheses so they are treated as plain text
Prompt Scheduling
[x:y:number]: Uses x until number steps have finished, then switches to y
Note
number can be an int (exact step count) or a float (fraction of total steps, for example 0.5 for halfway).
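The switch point can be sketched as follows. `active_prompt` is a hypothetical helper, not an SD.Next function. It assumes zero-based step indices and that a float is rounded to the nearest whole step; the real parser may round differently.

```python
def active_prompt(first, second, number, step, total_steps):
    """Return the prompt text in effect at a given zero-based step."""
    if isinstance(number, int):
        switch_at = number                       # exact step count
    else:
        switch_at = round(number * total_steps)  # share of total steps (assumed rounding)
    switch_at = min(max(switch_at, 0), total_steps)
    # The first text is used until `switch_at` steps have finished.
    return first if step < switch_at else second


# With 20 steps, an int of 10 and a float of 0.5 switch at the same point.
print(active_prompt("cat", "dog", 10, 9, 20))    # cat
print(active_prompt("cat", "dog", 0.5, 10, 20))  # dog
```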
Components
Text encoder
- SD15 started with: CLIP-ViT/L
- SDXL adds a second encoder: OpenCLIP-ViT/G
- SD3 adds a third (optional) encoder: T5 Version 1.1
Other model families follow a similar path by replacing simpler early encoders with more advanced ones. For example: PixArt-Σ and Tencent HunyuanDiT
In StabilityAI models, text encoder outputs are concatenated, so the oldest CLIP-ViT/L encoder still has major impact.
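As a rough illustration of concatenation in the SDXL case, the sketch below joins two per-token embeddings along the feature axis. The helper name `concat_features` is hypothetical and the zero-filled lists stand in for real encoder outputs. The sizes used (77 tokens, 768 features for CLIP-ViT/L, 1280 for OpenCLIP-ViT/G) are the commonly cited SDXL values and are stated here as assumptions.

```python
def concat_features(a, b):
    """Concatenate two [tokens][features] embeddings along the feature axis."""
    if len(a) != len(b):
        raise ValueError("both encoders must produce the same number of tokens")
    return [row_a + row_b for row_a, row_b in zip(a, b)]


TOKENS = 77  # assumed context length
clip_l = [[0.0] * 768 for _ in range(TOKENS)]       # placeholder CLIP-ViT/L output
openclip_g = [[0.0] * 1280 for _ in range(TOKENS)]  # placeholder OpenCLIP-ViT/G output

context = concat_features(clip_l, openclip_g)
print(len(context), len(context[0]))  # 77 2048
```

Every token in the combined context carries the CLIP-ViT/L features alongside the newer encoder's features, which is one way to see why the older encoder still matters.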
Dataset
Note
Nearly all modern models are trained on a subset of the laion-5b dataset.
Later fine-tuning introduces additional data (for example, Pony uses a heavily tagged dataset).
Prompting tips
Prompt Engineering
Know your model: different models were trained on different datasets, so some terms work better on some models than others.
Main groups
- Mediums: best placed at the start of a prompt, after the artist
Examples: painting, photograph, drawing, sketch
- Flavors: best left as separate tokens at the end of the prompt
Examples: ray tracing, fine art, black and white, pixiv, artstation
- Movements: best added as keywords
Examples: pop art, photorealism
- Artists: best placed at the start of a prompt
Examples: greg rutkowski, artgerm, dc comics, picasso
Modifiers
- Feel: best near the end
Examples: beautiful, sharp focus, 4k, hdr, high detailed, canon 5d
- Composition: best at front, but only use if results don't fit
Examples: 1man, 1woman
Negative Prompt
- Any keyword can be specified in a negative prompt as well
Examples: watermark
Hints
- Use either artists or movements
Using both may result in one overpowering the other, or in unexpected outcomes.
- Select a medium that fits the artist
It helps the model to know which medium to use when styling.
- Add action after subject
Examples: portrait, standing, sitting
- Moving things to the front of the prompt may increase their emphasis
Example: cartoon drawing of a woman as pixar vs pixar drawing of a woman
- Use both subject and scene keywords
Example: woman on a beach
Example
(composition) (artist) (medium) (subject) (action) (scene) (movement) (flavor) (feel)
1woman greg rutkowski painting of a woman happy front portrait on a beach as photorealism, sharp focus, artstation
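The template order can be captured in a small helper that assembles a prompt from named parts. This is only a sketch: `ORDER` and `build_prompt` are hypothetical names, the slot order is copied from the template line above, and joining with single spaces is an assumption.

```python
# Slot order taken from the template line above.
ORDER = ("composition", "artist", "medium", "subject", "action",
         "scene", "movement", "flavor", "feel")


def build_prompt(**parts):
    """Join the supplied parts in template order, skipping empty slots."""
    unknown = set(parts) - set(ORDER)
    if unknown:
        raise ValueError(f"unknown prompt parts: {sorted(unknown)}")
    return " ".join(parts[name] for name in ORDER if parts.get(name))


print(build_prompt(subject="a woman", medium="painting of", artist="greg rutkowski"))
# greg rutkowski painting of a woman
```

Keeping the parts separate makes it easy to swap the artist or medium while leaving the rest of the prompt unchanged.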
Negative prompt
When you add a negative prompt, it is appended to the prompt with negative weights. This makes the model steer away from those terms, but it still has to represent them in context first.
So by adding negative prompt, you're:
- Limiting model freedom
A common result is lower variation (for example, faces becoming more similar).
- Making the model more prone to hallucinations
If you steer away from something that was not present, it may introduce the opposite effect.
Negative prompts are useful for steering away from specific concepts, but they should not be treated as a general-purpose long list.
Note
Negative prompting availability depends on the model
Negative prompting requires models with both positive and negative guidance. For example, SD-XL will first perform positive guidance and then modify it with negative guidance for each generation step. However, if negative guidance is disabled or the model does not support it, negative prompts will not take effect.
Examples of models where negative guidance can be disabled by setting guidance scale to 0 or 1: SD 1.5, SD-XL
Examples of models where negative guidance is not available by default: FLUX.1, Flex, OmniGen
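The effect of guidance scale on negative prompts can be sketched with the usual classifier-free guidance blend. This is an illustration under assumptions, not SD.Next code: `guided_noise` is a hypothetical name, the lists stand in for model predictions, and skipping the negative pass when the scale is 1 or lower follows a convention common in diffusion pipelines.

```python
def guided_noise(cond, uncond, scale):
    """Blend the prompt prediction with the negative-prompt prediction.

    cond:   prediction for the (positive) prompt
    uncond: prediction for the negative prompt
    """
    if scale <= 1:
        # Assumed convention: guidance is skipped, so the negative prompt is unused.
        return list(cond)
    # Push the result away from the negative prediction and towards the prompt.
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


positive = [1.0, 2.0]
negative = [0.5, 0.5]
print(guided_noise(positive, negative, 1.0))  # [1.0, 2.0]
print(guided_noise(positive, negative, 3.0))  # [2.0, 5.0]
```

At a scale of 1 or lower the output does not depend on `negative` at all, which matches the note that negative prompts stop taking effect.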
Prompt attention
Adding extra attention to parts of a prompt is fine, but use it sparingly and keep overall balance in mind.
If your prompt has 10 words and you increase attention on 5 of them, average prompt weight becomes too high. A common result is overbaked images.
Prompt balance does not need to be perfect, but having many weighted words in one prompt is usually a red flag.
Also note that very strong attention modifiers such as (((xyz))) or (xyz:1.5) can make the model lose the overall prompt intent.
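The arithmetic behind these warnings is simple enough to check directly. In the sketch below, `average_weight` and `nested_weight` are hypothetical helpers, and the 1.1 factor per parenthesis level is the native parser value described earlier.

```python
def average_weight(weights):
    """Mean attention weight across all words in a prompt."""
    return sum(weights) / len(weights)


def nested_weight(depth, step=1.1):
    """Effective weight of a word wrapped in `depth` levels of parentheses."""
    return step ** depth


# 10 words, 5 of them raised to 1.5: the average rises from 1.0 to 1.25.
print(average_weight([1.0] * 5 + [1.5] * 5))  # 1.25

# (((xyz))) multiplies 1.1 three times, roughly (xyz:1.33).
print(round(nested_weight(3), 3))  # 1.331
```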
Model Specific Tips
SD15
Dataset:
- Used a small laion-5b subset with preexisting captions
- No major effort was put into processing or captioning beyond what was already in the subset
Result:
- Model that has a basic understanding of the world, but effective use often requires "hunting" for the right keywords
- Models are easily fine-tuned because a nothing-forced-nothing-removed approach was used during training
- How to prompt? Old-school prompting, which puts heavy emphasis on keywords and attention modifiers
SD21
Dataset:
- Trained on the same dataset as SD15 and then fine-tuned on an extended dataset with larger resolution
- It was also censored in the final parts of training, which introduced heavy bias
Result:
- It's almost like the model was "lobotomized".
E.g. for the concept of "topless", it is not just missing sample data. It was effectively "burned out" of the model. So adding concepts that the model does not understand takes massive effort, close to retraining. Fail.
SDXL
Dataset:
- Used a larger subset with extended captions and diverse resolutions
- Instead of censoring in the final stages, they simply pruned the dataset used for training.
Result:
- Model that knows-what-it-knows and the rest can be added with fine-tuning.
- E.g. It knows what "topless" means, just doesn't have enough examples to develop it fully.
- How to prompt? Extended captions and a second text encoder mean that the model can be prompted in a more natural way, and extended use of keywords and attention modifiers should be avoided.
- Extra note: PonyXL was extensively trained on a heavily tagged dataset without natural language captions, and as such it needs to be prompted differently: using tags and keywords instead of natural language.
SD3
Dataset:
- Used an even larger subset with even more diverse resolutions, but the dataset itself was processed differently:
- Censored not only by removing images from the dataset, but also by modifying them, censoring parts of the image
- Captioned extensively using an LLM. Unfortunately, the new captions were not added ON-TOP of the existing captions to augment them; they REPLACED them, so keywords already existing in the dataset are not trained for at all.
Result:
- Model that thinks-it-knows-everything and now it is up to you to prove it wrong.
E.g., it knows what a topless person looks like, and it is "certain" that nipples should be presented as blank.
Which means it would take a massive effort to retrain what it learned the wrong way.
I hope to be proven wrong, but this looks like a fail.
- How to prompt? Long LLM-generated captions mean this model usually works better with descriptive natural language. Reduce reliance on style tags, keywords, and heavy attention modifiers. Because LLM captions often do not include classic style tags, replace them with concrete visual descriptions.