FramePack
Implementation of lllyasviel's FramePack for Tencent HunyuanVideo I2V
With some major differences and improvements:
- T2V, I2V & FLF2V modes support
- Bi-directional and Forward-only (F1) model variants
- Resolution and frame-rate scaling: output video at any resolution and aspect ratio
- Prompt enhancer: use LLM to enhance your short prompts
- Complex actions: modify the prompt for each section of the video
- Video encode: multiple video codecs, raw export, frame export, frame interpolation
- LoRA: support for LoRA models
- Offloading and quantization: support for offloading and quantization
- API: support for use via HTTP REST API calls
- Custom model: support for custom model loading
Screenshot:
Example:
Note
Currently implemented as an SD.Next extension, but it will be fully integrated into the main codebase in the future
The reason is to avoid breaking changes while upstream changes are still being made to the codebase
Important
Video support requires ffmpeg
to be installed and available in the PATH
Install
Extension repository URL: https://github.com/vladmandic/sd-extension-framepack
Via UI
Refresh the extension list and install the extension by selecting it and pressing Install
Or enter the repository URL in SD.Next -> Extensions -> Manual Install and select Install
The extension will appear as a top-level tab after server restart
Via CLI
Clone the repository into the SD.Next /extensions folder
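For example, from inside the extensions folder:
git clone https://github.com/vladmandic/sd-extension-framepack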
Modes
Supports 3 modes of operation:
T2V: text-to-video
- Uses only text prompt and generates video from it
- Automatically triggered if no init image is provided
I2V: image-to-video
- Uses image as init frame and text prompt to generate video
- Automatically triggered if init image is provided and there is no end frame
- Init strength controls the strength of the image, similar to img2img denoising strength
- Vision strength controls the strength of the vision model that controls video generation
FLF2V: first-last-frame-to-video
- Uses images as init frame and end frame together with a text prompt to generate video
- Automatically triggered if init image and end frame are provided
- Init strength controls the strength of the image, similar to img2img denoising strength
- End strength controls the strength of the end frame compared to init frame
- Vision strength controls the strength of the vision model that controls video generation
- Ratio of init and end strengths can be used to skew video towards init or end frame
Variants
Both FramePack variants are based on the HunyuanVideo model, but take a different approach to video generation:
- Bi-directional model: default
- Runs generation in reverse order and assembles video
- Forward-only model: F1
- Runs generation in forward order and assembles video
Resolution Scaling
The video model is trained at 640p resolution, but can be used to generate video at different resolutions
However, the resolution must match one of the exactly supported aspect ratios
Given any input image, the model will first find the closest aspect ratio and then scale it to the desired resolution
As a result, the output resolution will use the supported aspect ratio closest to the desired resolution
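A minimal sketch of this bucketing step (the real list of supported aspect ratios and the scaling rules are defined in the extension/model code; the ratios below are only illustrative assumptions):

```python
# Illustrative sketch only: the actual supported aspect ratios and scaling rules
# live in the extension/model code; the ratios listed here are assumptions
def nearest_aspect_ratio(width: int, height: int,
                         supported=((1, 1), (16, 9), (9, 16), (4, 3), (3, 4))):
    ratio = width / height
    # pick the supported ratio closest to the input image's ratio
    return min(supported, key=lambda wh: abs(wh[0] / wh[1] - ratio))

print(nearest_aspect_ratio(1920, 1080))  # -> (16, 9): output is snapped to this ratio
```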
Note
VRAM usage is directly proportional to resolution, so if you have low VRAM, use a lower resolution
Frame rate can be set to any value and is used to calculate the number of frames to be generated as well as the playback speed of the encoded video
Prompt Enhancer
Uses a VLM to enhance your short prompts: for example, you can enter just dancing or jumping as a prompt and it will be expanded into a longer prompt
The VLM first analyzes the input image (if provided) and then generates a longer prompt based on the short prompt, incorporating both the input image and the short prompt
Complex Actions
When changing the duration or FPS parameters, the model will print the number of sections that will be generated
Each video section can have its own prompt suffix, which can be used to change the prompt over time
Prompt suffix is a string that will be added to the end of the prompt
Each line of section prompts will be used as a separate prompt suffix
For example, if you have 3 sections and 3 lines in the prompt suffix, each section will use a different line of the prompt suffix
Example:
- main prompt: astronaut on the moon
- section prompts:
- line-1: walking
- line-2: jumping
Note that the number of lines in the section prompts does not have to match the number of sections
If there are fewer lines than sections, the lines are stretched across sections to cover the duration of the entire video
For example, a video with 4 sections and 2 lines of section prompts will use the first line for the first 2 sections and the second line for the last 2 sections
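A minimal sketch of how such a mapping could work (the actual interpolation logic lives in the extension; section_prompt below is a hypothetical helper):

```python
# Minimal sketch of mapping section prompt lines onto sections when there are fewer
# lines than sections; the actual interpolation logic is implemented in the extension
def section_prompt(base: str, lines: list[str], section: int, total_sections: int) -> str:
    if not lines:
        return base
    # spread the available lines evenly across all sections
    idx = min(section * len(lines) // total_sections, len(lines) - 1)
    return f"{base} {lines[idx]}"

# 4 sections with 2 lines: sections 0-1 use "walking", sections 2-3 use "jumping"
for s in range(4):
    print(section_prompt("astronaut on the moon", ["walking", "jumping"], s, 4))
```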
Use of section prompts is optional, but can be used to create more complex videos
Section prompts are compatible with the Prompt Enhancer; in that case, each combined prompt (base prompt plus per-section prompt) will be enhanced separately
Video Encode
Video location is set in settings -> image paths -> video
Video is encoded using selected codec and codec options
The default codec is libx264; to see the codecs available on your system, use refresh
By default, individual image files are not created, but this can be enabled in the video settings
Tip
Hardware-accelerated codecs (e.g. hevc_nvenc
) will be at the top of the list
Use hardware-accelerated codecs whenever possible
Warning
Video encoding can be very memory intensive depending on codec and number of frames
Advanced Video Options
Any specified video options will be sent to ffmpeg
as-is
For example, the default crf:16 specifies the quality of the video vs the compression rate; lower is better
For details, see https://trac.ffmpeg.org/wiki#Encoding
Interpolation
Video can optionally have additional interpolated frames added using RIFE interpolation method
For example, if you render a 10sec 30fps video with 0 interpolated frames, that's 300 frames that need to be generated
But if you set 3 interpolated frames, the video fps and duration do not change; only 100 frames need to be generated and an additional 200 interpolated frames are added in between the generated frames
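A small sketch of the frame-count arithmetic implied by the example above, assuming the interpolation setting acts as a multiplier on the generated frames:

```python
# Sketch of the frame-count arithmetic from the example above (10s at 30fps), assuming
# the interpolation setting acts as a multiplier: with a setting of 3, only every 3rd
# frame is generated by the model and RIFE fills in the rest
def frame_counts(duration_sec: float, fps: int, interpolation: int = 0):
    total = int(duration_sec * fps)   # frames in the final encoded video
    factor = max(interpolation, 1)
    generated = total // factor       # frames produced by the model
    interpolated = total - generated  # frames added by RIFE
    return total, generated, interpolated

print(frame_counts(10, 30, 0))  # (300, 300, 0)
print(frame_counts(10, 30, 3))  # (300, 100, 200)
```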
CLI Video Encode
Video encoding can be skipped by setting the codec to none
In that case, you may want to save raw video frames as a safetensors file and use the command-line utility to encode the video later
python encode-video.py
Allows you to:
- Export frames from the safetensors file as individual images
  These can be used for further processing or to manually create a video using ffmpeg from-image-sequence
- Encode frames from the safetensors file into video using cv2
- Encode frames from the safetensors file into video using torchvision/ffmpeg
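As an illustration of what the frame-export step might look like in Python (the tensor key and layout below are assumptions; see encode-video.py for the actual format written by the extension):

```python
# Illustrative sketch of exporting frames from a safetensors file as individual images;
# the tensor key ("frames") and layout (N x H x W x C, uint8) are assumptions
import os
from safetensors.torch import load_file
from PIL import Image

def export_frames(path: str, out_dir: str = "frames") -> None:
    os.makedirs(out_dir, exist_ok=True)
    frames = load_file(path)["frames"]  # assumed key name
    for i, frame in enumerate(frames):
        # save each frame as a zero-padded, numbered PNG for later encoding
        Image.fromarray(frame.numpy()).save(os.path.join(out_dir, f"{i:05d}.png"))

export_frames("video.safetensors")
```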
LoRA
Limited support for any HunyuanVideo LoRAs
Effects will be limited unless LoRA is trained on FramePack itself
Uses standard syntax: <lora:filename:weight>
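For example (the filename is a placeholder for your own LoRA file): astronaut dancing on the moon <lora:my-hunyuan-lora:0.8>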
Note
There is no networks panel available in FramePack, so you have to add the LoRA to the prompt manually
Offloading and Quantization
The implementation replaces lllyasviel's offloading with SD.Next Balanced offloading
Balanced offload will use more resources, but unless you have a low-end GPU, it should also be much faster, especially when used together with quantization
Adds support for on-the-fly quantization of the LLM and DiT/Video modules
Only available when using native offloading; configure as usual in settings -> quantization
Adds support for post-load quantization such as NNCF
Tip
It's recommended to enable quantization for the TE, Video and LLM modules
See docs for more details on offloading and quantization
API
Extension supports API calls: /sdapi/v1/framepack
The only required params are the base64-encoded init-image and the prompt; all other parameters are optional
Once the video has been generated, you can download it using the /file={path-to-file} endpoint
For example, see create-video.py
python create-video.py --help
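A rough sketch of calling the endpoint with the Python requests library (the payload field names below are assumptions; see create-video.py for the actual parameter names and defaults):

```python
# Rough sketch of calling the FramePack endpoint with the requests library; the payload
# field names ("image", "prompt") are assumptions, not the confirmed API schema
import base64
import requests

server = "http://127.0.0.1:7860"
with open("init.png", "rb") as f:
    init_image = base64.b64encode(f.read()).decode()

res = requests.post(f"{server}/sdapi/v1/framepack", json={
    "image": init_image,  # assumed field name for the base64-encoded init image
    "prompt": "astronaut dancing on the moon",
})
res.raise_for_status()
# the response should reference the generated video, which can then be downloaded
# via the /file={path-to-file} endpoint
print(res.json())
```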
Custom Model
You can get the current recipe to see which modules would be loaded and change them if desired
For example, changing the original llama to a different one can be done with:
text_encoder: Kijai/llava-llama-3-8b-text-encoder-tokenizer/
tokenizer: Kijai/llava-llama-3-8b-text-encoder-tokenizer/