FramePack
Implementation of lllyasviel's FramePack for Tencent HunyuanVideo I2V
With some major differences and improvements:
- T2V, I2V & FLF2V modes support
- Bi-directional and Forward-only (F1) model variants
- Resolution and frame-rate scaling: output video at any resolution and aspect ratio
- Prompt enhancer: use LLM to enhance your short prompts
- Complex actions: modify the prompt for each section of the video
- Video encode: multiple video codecs, raw export, frame export, frame interpolation
- LoRA: support for LoRA models
- Offloading and quantization: support for offloading and quantization
- API: support for use via HTTP REST API calls
- Custom model: support for custom model loading
Screenshot:
Example:
Note
Currently implemented as an SD.Next extension, but it will be fully integrated into the main codebase in the future
The reason is to avoid breaking changes while upstream changes are still being made to the codebase
Important
Video support requires ffmpeg
to be installed and available in the PATH
Install
Extension repository URL: https://github.com/vladmandic/sd-extension-framepack
Via UI
Refresh the extension list and install the extension by selecting it and pressing Install
Or enter the repository URL in SD.Next -> Extensions -> Manual Install and select Install
The extension will appear as a top-level tab after server restart
Via CLI
Clone the repository into the SD.Next /extensions folder
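For example, from inside the extensions folder:
git clone https://github.com/vladmandic/sd-extension-framepack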
Modes
Supports 3 modes of operation:
T2V: text-to-video
- Uses only text prompt and generates video from it
- Automatically triggered if no init image is provided
I2V: image-to-video
- Uses image as init frame and text prompt to generate video
- Automatically triggered if init image is provided and there is no end frame
- Init strength controls the strength of the image, similar to img2img denoising strength
- Vision strength controls the strength of the vision model that controls video generation
FLF2V: first-last-frame-to-video
- Uses images as init frame and end frame together with a text prompt to generate video
- Automatically triggered if init image and end frame are provided
- Init strength controls the strength of the image, similar to img2img denoising strength
- End strength controls the strength of the end frame compared to init frame
- Vision strength controls the strength of the vision model that controls video generation
- Ratio of init and end strengths can be used to skew video towards init or end frame
Variants
Both FramePack variants are based on the HunyuanVideo model, but take a different approach to video generation:
- Bi-directional model: default
- Runs generation in reverse order and assembles video
- Forward-only model: F1
- Runs generation in forward order and assembles video
Resolution Scaling
The video model is trained at 640p resolution, but can be used to generate video at different resolutions
However, the resolution must match one of the exactly supported aspect ratios
Given any input image, the model will first find the closest aspect ratio and then scale it to the desired resolution
As a result, the output resolution will use the supported aspect ratio closest to the desired resolution
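A minimal sketch of this bucketing step (the real list of supported aspect ratios and the scaling rules are defined in the extension/model code; the ratios below are only illustrative assumptions):

```python
# Illustrative sketch only: the actual supported aspect ratios and scaling rules
# live in the extension/model code; the ratios listed here are assumptions
def nearest_aspect_ratio(width: int, height: int,
                         supported=((1, 1), (16, 9), (9, 16), (4, 3), (3, 4))):
    ratio = width / height
    # pick the supported ratio closest to the input image's ratio
    return min(supported, key=lambda wh: abs(wh[0] / wh[1] - ratio))

print(nearest_aspect_ratio(1920, 1080))  # -> (16, 9): output is snapped to this ratio
```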
Note
VRAM usage is directly proportional to resolution, so if you have low VRAM, use a lower resolution
Frame rate can be set to any value and is used to calculate the number of frames to be generated as well as the playback speed of the encoded video
Prompt Enhancer
Uses a VLM to enhance your short prompts: for example, you can enter just dancing or jumping as a prompt and it will be expanded into a longer prompt
The VLM first analyzes the input image (if provided) and then generates a longer prompt based on the short prompt, incorporating both the input image and the short prompt
Complex Actions
When changing the duration or FPS parameters, the model will print the number of sections that will be generated
Each video section can have its own prompt suffix, which can be used to change the prompt over time
Prompt suffix is a string that will be added to the end of the prompt
Each line of section prompts will be used as a separate prompt suffix
For example, if you have 3 sections and 3 lines in the prompt suffix, each section will use a different line of the prompt suffix
Example:
- main prompt: astronaut on the moon
- section prompts:
- line-1: walking
- line-2: jumping
Note that the number of lines in the section prompts does not have to match the number of sections
If there are fewer lines than sections, the lines are stretched across sections to cover the duration of the entire video
For example, a video with 4 sections and 2 lines of section prompts will use the first line for the first 2 sections and the second line for the last 2 sections
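A minimal sketch of how such a mapping could work (the actual interpolation logic lives in the extension; section_prompt below is a hypothetical helper):

```python
# Minimal sketch of mapping section prompt lines onto sections when there are fewer
# lines than sections; the actual interpolation logic is implemented in the extension
def section_prompt(base: str, lines: list[str], section: int, total_sections: int) -> str:
    if not lines:
        return base
    # spread the available lines evenly across all sections
    idx = min(section * len(lines) // total_sections, len(lines) - 1)
    return f"{base} {lines[idx]}"

# 4 sections with 2 lines: sections 0-1 use "walking", sections 2-3 use "jumping"
for s in range(4):
    print(section_prompt("astronaut on the moon", ["walking", "jumping"], s, 4))
```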
Use of section prompts is optional, but can be used to create more complex videos
Section prompts are compatible with the Prompt Enhancer; in that case, each combined prompt (base prompt plus per-section prompt) will be enhanced separately
Video Encode
Video location is set in settings -> image paths -> video
Video is encoded using selected codec and codec options
The default codec is libx264; to see the codecs available on your system, use refresh
By default, individual image files are not created, but this can be enabled in the video settings
Tip
Hardware-accelerated codecs (e.g. hevc_nvenc
) will be at the top of the list
Use hardware-accelerated codecs whenever possible
Warning
Video encoding can be very memory intensive depending on codec and number of frames
Advanced Video Options
Any specified video options will be sent to ffmpeg
as-is
For example, the default crf:16 specifies the quality of the video vs the compression rate; lower is better
For details, see https://trac.ffmpeg.org/wiki#Encoding
Interpolation
Video can optionally have additional interpolated frames added using RIFE interpolation method
For example, if you render a 10sec 30fps video with 0 interpolated frames, that's 300 frames that need to be generated
But if you set 3 interpolated frames, the video fps and duration do not change; only 100 frames need to be generated and an additional 200 interpolated frames are added in between the generated frames
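A small sketch of the frame-count arithmetic implied by the example above, assuming the interpolation setting acts as a multiplier on the generated frames:

```python
# Sketch of the frame-count arithmetic from the example above (10s at 30fps), assuming
# the interpolation setting acts as a multiplier: with a setting of 3, only every 3rd
# frame is generated by the model and RIFE fills in the rest
def frame_counts(duration_sec: float, fps: int, interpolation: int = 0):
    total = int(duration_sec * fps)   # frames in the final encoded video
    factor = max(interpolation, 1)
    generated = total // factor       # frames produced by the model
    interpolated = total - generated  # frames added by RIFE
    return total, generated, interpolated

print(frame_counts(10, 30, 0))  # (300, 300, 0)
print(frame_counts(10, 30, 3))  # (300, 100, 200)
```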
CLI Video Encode
Video encoding can be skipped by setting the codec to none
In that case, you may want to save raw video frames as a safetensors file and use the command-line utility to encode the video later
python encode-video.py
Allows you to:
- Export frames from the safetensors file as individual images
  These can be used for further processing or to manually create a video using ffmpeg from-image-sequence
- Encode frames from the safetensors file into video using cv2
- Encode frames from the safetensors file into video using torchvision/ffmpeg
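As an illustration of what the frame-export step might look like in Python (the tensor key and layout below are assumptions; see encode-video.py for the actual format written by the extension):

```python
# Illustrative sketch of exporting frames from a safetensors file as individual images;
# the tensor key ("frames") and layout (N x H x W x C, uint8) are assumptions
import os
from safetensors.torch import load_file
from PIL import Image

def export_frames(path: str, out_dir: str = "frames") -> None:
    os.makedirs(out_dir, exist_ok=True)
    frames = load_file(path)["frames"]  # assumed key name
    for i, frame in enumerate(frames):
        # save each frame as a zero-padded, numbered PNG for later encoding
        Image.fromarray(frame.numpy()).save(os.path.join(out_dir, f"{i:05d}.png"))

export_frames("video.safetensors")
```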
LoRA
Limited support for any HunyuanVideo LoRAs
Effects will be limited unless LoRA is trained on FramePack itself
Uses standard syntax: <lora:filename:weight>
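For example (the filename is a placeholder for your own LoRA file): astronaut dancing on the moon <lora:my-hunyuan-lora:0.8>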
Note
There is no networks panel available in FramePack, so you have to add the LoRA to the prompt manually
Offloading and Quantization
The implementation replaces lllyasviel's offloading with SD.Next Balanced offloading
Balanced offload will use more resources, but unless you have a low-end GPU, it should also be much faster, especially when used together with quantization
Adds support for on-the-fly quantization of the LLM and DiT/Video modules
Only available when using native offloading; configure as usual in settings -> quantization
Adds support for post-load quantization such as NNCF
Tip
It's recommended to enable quantization for the TE, Video and LLM modules
See docs for more details on offloading and quantization
API
Extension supports API calls: /sdapi/v1/framepack
The only required params are the base64-encoded init-image and the prompt; all other parameters are optional
Once the video has been generated, you can download it using the /file={path-to-file} endpoint
For example, see create-video.py
python create-video.py --help
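A rough sketch of calling the endpoint with the Python requests library (the payload field names below are assumptions; see create-video.py for the actual parameter names and defaults):

```python
# Rough sketch of calling the FramePack endpoint with the requests library; the payload
# field names ("image", "prompt") are assumptions, not the confirmed API schema
import base64
import requests

server = "http://127.0.0.1:7860"
with open("init.png", "rb") as f:
    init_image = base64.b64encode(f.read()).decode()

res = requests.post(f"{server}/sdapi/v1/framepack", json={
    "image": init_image,  # assumed field name for the base64-encoded init image
    "prompt": "astronaut dancing on the moon",
})
res.raise_for_status()
# the response should reference the generated video, which can then be downloaded
# via the /file={path-to-file} endpoint
print(res.json())
```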
Custom Model
You can get the current recipe to see which modules would be loaded and change them if desired
For example, changing the original llama to a different one can be done with:
text_encoder: Kijai/llava-llama-3-8b-text-encoder-tokenizer/
tokenizer: Kijai/llava-llama-3-8b-text-encoder-tokenizer/