Caption

The Caption tab includes tools for image interrogation and captioning.
It is split into two main sections: VLM and CLiP.

Captioning and interrogation can run on a single image, a list of uploaded images, or a folder of images.
If you save results to file, they are written next to each source image with a .txt extension.
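The save-to-file behavior above can be sketched in a few lines. This is a minimal illustration, not SD.Next's actual implementation: `caption_image` is a hypothetical stand-in for the VLM or CLiP backend, and only the file-naming convention (same path, `.txt` extension) comes from the text above.

```python
# Sketch of "save results to file": each caption is written next to its
# source image with a .txt extension (image.jpg -> image.txt).
# caption_image is a placeholder for the real captioning backend.
from pathlib import Path

def caption_image(image_path: Path) -> str:
    # A real implementation would run a VLM or CLiP model here.
    return "a placeholder caption"

def caption_folder(folder: str) -> list[Path]:
    written = []
    for image_path in sorted(Path(folder).glob("*.jpg")):
        txt_path = image_path.with_suffix(".txt")
        txt_path.write_text(caption_image(image_path), encoding="utf-8")
        written.append(txt_path)
    return written
```

Running `caption_folder` on a folder of `.jpg` images produces one sidecar `.txt` file per image, which is the layout commonly expected by training tools.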

VLM Caption

Uses Vision Language Models (VLMs) to generate image captions. A VLM is a large language model (LLM) with a vision component for image analysis.
You can use a predefined prompt such as MORE DETAILED CAPTION or enter a custom prompt.
For example:
- describe the background of the image
- does the image contain a person?

SD.Next supports many VLM models, including Florence, MoonDream, Gemma, Qwen, and JoyCaption.
VLM models are auto-downloaded on first use.

CLiP Interrogate

CLiP (Contrastive Language-Image Pre-Training) is a neural network trained on image-text pairs.
Different CLiP models are commonly used as text encoders in diffusion models such as SD15, SD-XL, and SD3.5.

SD.Next supports 50+ CLiP models.
CLiP models are auto-downloaded on first use.
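The contrastive matching behind CLiP interrogation can be illustrated with a toy example. This is a conceptual sketch only: the embeddings below are made-up vectors, whereas a real setup would obtain them from a pretrained CLiP image encoder and text encoder sharing one embedding space.

```python
# Conceptual sketch of CLiP-style interrogation: the image and each
# candidate text are embedded into a shared space, and candidates are
# ranked by cosine similarity. Embeddings here are toy values.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank_labels(image_emb, label_embs):
    # Sort candidate labels by similarity to the image embedding.
    scores = {label: cosine(image_emb, emb) for label, emb in label_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)

image_emb = [0.9, 0.1, 0.2]          # pretend output of an image encoder
label_embs = {
    "photo": [0.8, 0.2, 0.1],        # pretend text-encoder outputs
    "painting": [0.1, 0.9, 0.3],
}
```

Here `rank_labels(image_emb, label_embs)` ranks "photo" above "painting" because its embedding points in nearly the same direction as the image embedding, which is the core idea behind both Interrogate and the category predictions in Analyze.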

In addition to Interrogate, the CLiP tab includes Analyze, which predicts image categories such as medium, artist, movement, trending, and flavor.