Caption
The Caption tab includes functions for image interrogation and captioning.
It is separated into two main sections: VLM and CLIP.
Captioning/interrogation can be performed on a single image, on a list of uploaded images, or on a folder containing images.
Captioning results can be saved to a file, in which case they will be saved next to the original image file, using the same filename with a .txt
extension.
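The sidecar-file convention above can be sketched in a few lines. This is an illustrative example, not SD.Next's actual implementation: `caption_image` is a hypothetical stand-in for the real model call, and the set of recognized image extensions is an assumption.

```python
from pathlib import Path

def caption_image(image_path: Path) -> str:
    # Hypothetical placeholder for the actual VLM/CLIP captioning call.
    return f"caption for {image_path.name}"

def caption_folder(folder: Path) -> list[Path]:
    """Caption every image in a folder and save each result as a
    .txt sidecar file next to the original image."""
    written = []
    for image_path in sorted(folder.iterdir()):
        # Assumed set of image extensions; the real list may differ.
        if image_path.suffix.lower() not in {".png", ".jpg", ".jpeg", ".webp"}:
            continue
        # photo.png -> photo.txt, saved next to the original image
        sidecar = image_path.with_suffix(".txt")
        sidecar.write_text(caption_image(image_path), encoding="utf-8")
        written.append(sidecar)
    return written
```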
VLM Caption
Uses VLM (Vision Language Model) models to generate captions for images. A VLM is an LLM (large language model) with an additional vision component that allows it to analyze input images.
You can use a predefined prompt such as MORE DETAILED CAPTION or enter your own prompt to generate captions.
For example:
- describe the background of the image
- does the image contain a person?
SD.Next supports many VLM models, such as Florence, MoonDream, Gemma, Qwen, JoyCaption, etc.
The selected VLM model will be auto-downloaded on first use.
CLIP Interrogate
CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of image and text pairs.
Different CLIP models are commonly used for the text-encoding task during image generation in many popular models such as SD15/SD-XL/SD3.5/etc.
SD.Next supports more than 50 CLIP models.
The selected CLIP model will be auto-downloaded on first use.
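Conceptually, CLIP interrogation works by embedding the image and a set of candidate texts into the same vector space, then ranking the texts by similarity to the image. The toy sketch below illustrates only that ranking step, using made-up embedding vectors instead of a real CLIP network; the function names and two-dimensional vectors are purely illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def interrogate(image_embedding, candidates):
    """Rank candidate texts by embedding similarity to the image, highest first.

    candidates is a list of (text, embedding) pairs; a real CLIP model would
    produce these embeddings from its text encoder.
    """
    scored = [(text, cosine_similarity(image_embedding, emb))
              for text, emb in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy example: the image embedding is closest to the "dog" text embedding.
image = [0.9, 0.1]
candidates = [("a photo of a dog", [1.0, 0.0]),
              ("a photo of a car", [0.0, 1.0])]
ranking = interrogate(image, candidates)
```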
In addition to Interrogate, the CLIP tab also includes an Analyze feature which predicts image categories such as: medium, artist, movement, trending, and flavor.