ZLUDA
ZLUDA is a CUDA wrapper that lets you run CUDA applications on GPUs that are normally unsupported, such as AMD GPUs on Windows.
Warning
ZLUDA support is unofficial and limited at this time.
- For unofficial instructions on how to manually build ROCm libraries, see ROCm Custom Build section
- For unofficial instructions on how to install ROCm for older GPUs such as Polaris and Vega, see ROCm for Polaris and Vega post
Installing ZLUDA for AMD GPUs in Windows
Note
This guide assumes you have Git and Python installed,
and are comfortable using the command prompt, navigating Windows Explorer, renaming files and folders, and working with zip files.
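As a quick sanity check of the prerequisites above, the following minimal sketch reports which tools cannot be found on PATH. The function name `missing_tools` is made up for this illustration, and the check only confirms that an executable is discoverable, not that its version is suitable.

```python
import shutil


def missing_tools(tools=("git", "python")):
    """Return the names of the tools that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("Git and Python were both found on PATH.")
```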
Important
If you have an integrated AMD GPU (iGPU), you may need to disable it, or use the HIP_VISIBLE_DEVICES environment variable.
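A minimal sketch of the environment-variable approach: it builds an environment in which HIP_VISIBLE_DEVICES lists only the device indices you want exposed. The helper name `env_with_visible_devices` is hypothetical, and which index belongs to the discrete GPU depends on your system, so treat the index used here as an assumption to verify.

```python
import os


def env_with_visible_devices(device_indices, base_env=None):
    """Copy an environment and restrict HIP to the given device indices."""
    env = dict(os.environ if base_env is None else base_env)
    # HIP_VISIBLE_DEVICES takes a comma-separated list of device indices.
    env["HIP_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_indices)
    return env


if __name__ == "__main__":
    # Assumption: the discrete GPU is device 1 and the iGPU is device 0.
    env = env_with_visible_devices([1])
    print("HIP_VISIBLE_DEVICES =", env["HIP_VISIBLE_DEVICES"])
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when launching webui.bat, so the setting applies only to that process.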
Install Visual C++ Runtime
Note
Most people already have this, since it ships with many games, but there's no harm in trying to install it.
Grab the latest version of Visual C++ Runtime from https://aka.ms/vs/17/release/vc_redist.x64.exe (this is a direct download link) and then run it.
If you get the options to Repair or Uninstall, then you already have it installed and can click Close. Otherwise, install it.
Install ZLUDA
ZLUDA is now installed automatically, and added to PATH, when you start webui.bat with --use-zluda.
Install HIP SDK
Install HIP SDK 6.2 from https://www.amd.com/en/developer/resources/rocm-hub/hip-sdk.html
So long as your regular AMD GPU driver is up to date, you don't need to install the PRO driver HIP SDK suggests.
Replace HIP SDK library files for unsupported GPU architectures
Go to https://rocm.docs.amd.com/projects/install-on-windows/en/develop/reference/system-requirements.html and find your GPU model.
If your GPU model has a ✅ in both columns then skip to Install SD.Next.
If your GPU model has an ❌ in the HIP SDK column, or if your GPU isn't listed, follow the instructions below:
- Open Windows Explorer and copy and paste C:\Program Files\AMD\ROCm\6.2\bin\rocblas into the location bar. (This assumes you've installed the HIP SDK in the default location and Windows is located on C:.)
- Make a copy of the library folder, for backup purposes.
- Download one of the unofficial rocBLAS libraries listed below, and unzip it into the original library folder, overwriting any files there.
gfx1010: RX 5700, RX 5700 XT
gfx1012: RX 5500, RX 5500 XT
gfx1031: RX 6700, RX 6700 XT, RX 6750 XT
gfx1032: RX 6600, RX 6600 XT, RX 6650 XT
gfx1103: Radeon 780M
gfx803: RX 570, RX 580
More...
- Open the zip file.
- Drag and drop the library folder from the zip file into %HIP_PATH%bin\rocblas (the folder you opened in the first step).
- Reboot your PC.
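The backup-and-replace steps above can also be scripted. This is a minimal sketch, assuming the downloaded zip contains a top-level library folder; the function name `replace_rocblas_library` and the backup folder name are made up for illustration. Writing under Program Files normally requires an elevated prompt.

```python
import shutil
import zipfile
from pathlib import Path


def replace_rocblas_library(rocblas_dir, zip_path, backup_name="library_backup"):
    """Back up <rocblas_dir>/library, then extract the zip over it."""
    rocblas_dir = Path(rocblas_dir)
    library = rocblas_dir / "library"
    backup = rocblas_dir / backup_name
    # Keep the first backup; never overwrite it with already-patched files.
    if library.is_dir() and not backup.exists():
        shutil.copytree(library, backup)
    # Assumption: the archive holds a top-level "library" folder, so
    # extracting into rocblas_dir overwrites files inside library/.
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(rocblas_dir)
    return backup
```

For the default install location, `rocblas_dir` would be C:\Program Files\AMD\ROCm\6.2\bin\rocblas.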
If your GPU model is not in the HIP SDK column or is not available in the above list, follow the instructions in the ROCm Support guide to build your own rocBLAS libraries.
Warning
Building your own libraries is not for the faint of heart
Install SD.Next
Using Windows Explorer, navigate to a place you'd like to install SD.Next. This should be a folder which your user account has read/write/execute access to. Installing SD.Next in a directory which requires admin permissions may cause it to not launch properly.
Note: Refrain from installing SD.Next into the Program Files, Users, or Windows folders, the OneDrive folder, the Desktop, or a folder whose name begins with a period (e.g. .sdnext).
An SSD is the best place, since it speeds up model loading.
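The location rules above can be expressed as a small check. This is a heuristic sketch only: the function name `install_location_problems` is hypothetical, it matches folder names exactly (so a renamed OneDrive folder would be missed), and it does not test actual permissions.

```python
from pathlib import PureWindowsPath

# Folder names the guide advises against, compared case-insensitively.
DISCOURAGED_FOLDERS = {"program files", "users", "windows", "onedrive", "desktop"}


def install_location_problems(path):
    """Return a list of reasons why `path` is a poor place for SD.Next."""
    parsed = PureWindowsPath(path)
    problems = []
    for part in parsed.parts:
        if part == parsed.anchor:
            continue  # skip the drive root such as "C:\"
        if part.lower() in DISCOURAGED_FOLDERS:
            problems.append(f"path passes through '{part}'")
        if part.startswith("."):
            problems.append(f"folder '{part}' begins with a period")
    return problems


if __name__ == "__main__":
    for candidate in (r"D:\AI\sdnext", r"C:\Users\me\Desktop\sdnext"):
        print(candidate, "->", install_location_problems(candidate) or "looks fine")
```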
In the Location Bar, type cmd, then hit [Enter]. This will open a Command Prompt window at that location.
Copy and paste the following commands into the Command Prompt window, one at a time:
git clone https://github.com/vladmandic/sdnext
cd sdnext
.\webui.bat --use-zluda --debug --autolaunch
Compilation and First Generation
Now, try to generate something. The first run should take a fair while to compile (10-15 minutes, or even longer; some reports state over an hour), but this compilation should only need to be done once.
Note: The text "Compilation is in progress. Please wait..." will appear repeatedly; just be patient. Eventually your image will start generating.
Subsequent generations will be significantly quicker.
Upgrading ZLUDA
If you have problems with ZLUDA after updating SD.Next, upgrading ZLUDA may help.
- Remove the .zluda folder.
- Launch WebUI. The installer will download and install a newer ZLUDA.
※ You may have to wait a while for compilation, as with the first generation.
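Removing the folder can be scripted as follows. This is a minimal sketch that assumes the .zluda folder sits directly inside the SD.Next folder; the guide does not state its location, so adjust the path if yours differs. The function name is made up for this example.

```python
import shutil
from pathlib import Path


def remove_zluda_folder(sdnext_dir):
    """Delete <sdnext_dir>/.zluda if present; return True if it was removed."""
    target = Path(sdnext_dir) / ".zluda"
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```

The next launch with --use-zluda then downloads a fresh copy.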
Experimental features
cuDNN
Speed-up: ★★★☆☆
VRAM: ★★★★☆
Stability: ★★★☆☆
Compatible with: Navi cards
MIOpen, the equivalent of cuDNN for AMD GPUs, hasn't been released on Windows yet.
However, you can enable it with a custom build of MIOpen.
This section describes how to enable cuDNN.
- Install HIP SDK 6.2. If you already have an older HIP SDK, uninstall it before installing 6.2.
- Download and install the HIP SDK extension from here. (Unzip it and paste the folders over path/to/AMD/ROCm/6.2.)
- Remove the .zluda folder if it exists.
- Launch WebUI with the command line arguments --use-zluda --use-nightly.
The first generation will take a long time because MIOpen has to find the optimal solution and cache it.
If you get driver crashes, restart webui and try again.
cuBLASLt
Speed-up: ★☆☆☆☆
VRAM: ★☆☆☆☆
Stability: ★★☆☆☆
Compatible with: gfx1100, or CDNA accelerators
hipBLASLt, the equivalent of cuBLASLt for AMD GPUs, hasn't been released on Windows yet.
However, there are unofficial builds available.
This section describes how to enable cuBLASLt.
- Install HIP SDK 6.2. If you already have an older HIP SDK, uninstall it before installing 6.2.
- Download and install the HIP SDK extension from here. (Unzip it and paste the folders over path/to/AMD/ROCm/6.2.)
- Remove the .zluda folder if it exists.
- Launch WebUI with the command line arguments --use-zluda --use-nightly.
triton
Speed-up: ★★★★★
VRAM: ★★★★☆
Stability: ★★★★☆
Compatible with: Navi cards
- Prepare a Python 3.11 (or 3.12) environment.
- Download a triton wheel that matches your Python version from here. (cp312 is Python 3.12, cp311 is Python 3.11, and cp310 is Python 3.10.)
- Open a PowerShell window in the SD.Next folder and install it via pip:
venv\scripts\python -m pip install --upgrade setuptools
venv\scripts\python -m pip install --upgrade path/to/downloaded/triton.whl
Important
You will need Developer PowerShell for Visual Studio (or the Developer Command Prompt) to compile kernels with triton.
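To pick the right wheel, you can derive the cpXY tag from the interpreter that will run SD.Next. This is a minimal sketch; the helper names are made up, and the substring match on the file name is a simplification rather than a full wheel-tag parser.

```python
import sys


def wheel_python_tag(version_info=None):
    """Return the CPython wheel tag, e.g. 'cp311' for Python 3.11."""
    major, minor = (version_info or sys.version_info)[:2]
    return f"cp{major}{minor}"


def wheel_matches(filename, version_info=None):
    """Check whether a wheel file name carries the matching cpXY tag."""
    return f"-{wheel_python_tag(version_info)}-" in filename


if __name__ == "__main__":
    print("Look for a wheel tagged", wheel_python_tag())
```

Run it with venv\scripts\python so the tag reflects the virtual environment rather than a system-wide Python.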
Flash Attention 2
Using triton, you can enable Flash Attention 2.
- Go to Settings.
- Set attention method to Scaled Dot-product.
- Enable Triton Flash attention.
- Restart WebUI.
torch.compile
Using triton, you can enable torch.compile.
- Go to Settings.
- Enable compilation.
- Set compilation method to inductor or cuda-graph.
※ torch.compile is currently not compatible with Flash Attention 2 on ZLUDA.
Comparison (DirectML)
Feature | DirectML | ZLUDA |
---|---|---|
Speed | Slower | Faster |
VRAM Usage | More | Less |
VRAM GC | ❌ | ✅ |
Training | * | ✅ |
Flash Attention | ❌ | ✅ |
FFT | ✅ | ⚠️ |
DNN | ❓ | ✅ |
RTC | ❓ | ✅ |
Source Code | Closed-source | Open-source |
❓: unknown
⚠️: partially supported
*: known to be possible, but uses too much VRAM to train Stable Diffusion models, LoRAs, etc.
Compatibility
DTYPE | Supported |
---|---|
FP64 | ✅ |
FP32 | ✅ |
FP16 | ✅ |
BF16 | ✅ |
LONG | ✅ |
INT8 | ✅ |
UINT8 | ✅* |
INT4 | ❓ |
FP8 | ⚠️ |
BF8 | ⚠️ |
*: Not tested.
Building rocBLAS for unsupported architectures
This guide explains how to build rocBLAS, based on the official ROCm documentation.
If you have an AMD GPU without official ROCm HIP SDK support, or an integrated AMD GPU (iGPU), and want it to be supported by the HIP SDK on Windows, you can follow the guide below to build your own rocBLAS.
If you do not need to build ROCmLibs or already have the library, please skip this.
Make sure you have the following software available on your PC. Otherwise, you may fail to build the ROCmLibs:
1. Visual Studio 2022
2. Python
3. Strawberry Perl
4. CMake
5. Git
6. HIP SDK (mentioned in the first step)
7. rocBLAS and Tensile sources (download Tensile 4.38.0 for ROCm 5.7.0, the latest on Windows)
Edit line 41 of rdeps.py in rocBLAS. The old repo references an outdated vcpkg, which will cause the build to fail. Update vcpkg by entering the following line in the terminal:
git clone -b 2024.02.14 https://github.com/microsoft/vcpkg
Download Tensile 4.38.0 from the release page.
Download Tensile-fix-fallback-arch-build.patch, and place it in the Tensile folder. In this example, the path is C:\ROCm\Tensile-rocm-5.7.0.
Enter the following line in a terminal opened in Tensile-rocm-5.7.0:
git apply Tensile-fix-fallback-arch-build.patch
If your vcpkg version was built later than April 2023, please replace the CMakeLists.txt in Tensile/tree/develop/Tensile/Source/lib/CMakeLists.txt with this CMakeLists.txt, and put it in the same folder. (For more information, please see the ROCm Official Guide.)
In C:\ROCm\rocBLAS-rocm-5.7.0, run:
python rdeps.py
If you encounter any errors, search online for a fix or try again. On Linux, use install.sh -d instead.
Once done, run:
python rmake.py -a "gfx906;gfx1012" --lazy-library-loading --no-merge-architectures -t "C:\ROCm\Tensile-rocm-5.7.0"
Change gfx906;gfx1012 to your GPU's LLVM target. If you want to build for multiple targets at once, separate them with ;.
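The rmake.py invocation above can be assembled programmatically, which avoids quoting mistakes around the semicolon-separated target list. This is a minimal sketch; the function name `rmake_command` is hypothetical, and the flags are exactly the ones shown in the command above.

```python
def rmake_command(architectures, tensile_path):
    """Build the argument list for rmake.py from a list of LLVM targets."""
    if not architectures:
        raise ValueError("at least one architecture is required")
    return [
        "python", "rmake.py",
        "-a", ";".join(architectures),  # targets are separated by ';'
        "--lazy-library-loading",
        "--no-merge-architectures",
        "-t", tensile_path,
    ]


if __name__ == "__main__":
    args = rmake_command(["gfx906", "gfx1012"], r"C:\ROCm\Tensile-rocm-5.7.0")
    print(" ".join(args))
```

Passing the list to `subprocess.run(args, cwd=...)` with the rocBLAS source folder as the working directory keeps the semicolon from being interpreted by the shell.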
Upon successful compilation, rocblas.dll will be generated. In this example, the file path is C:\ROCm\rocBLAS-rocm-5.7.0\build\release\staging\rocblas.dll. In addition, some Tensile data files will be produced in C:\ROCm\rocBLAS-rocm-5.7.0\build\release\Tensile\library.
To compile HIP SDK programs that use hipBLAS/rocBLAS, you need to replace the rocblas.dll file in the SDK with the one that you have just built. Place rocblas.dll into C:\Program Files\AMD\ROCm\5.7\bin and the Tensile data files into C:\Program Files\AMD\ROCm\5.7\bin\rocblas\library.
Your programs should run smooth as silk on the designated graphics card now.
ROCm Custom Build
This guide will walk you through building rocBLAS using the official ROCm documentation.
This guide is for users with AMD GPUs lacking official ROCm/HIP SDK support, or those wanting to enable HIP SDK 5.7 and 6.1.2 support on Windows for integrated AMD GPUs (iGPUs).
If you already have the libraries, you can skip this section!
Prerequisites: Ensure the following software is installed on your PC. python, git, and the HIP SDK are essential. The script rdeps.py will automatically download any missing dependencies when you run it.
- Visual Studio 2022: (Download from https://visualstudio.microsoft.com/)
- Python: (Download from https://www.python.org/)
- Strawberry Perl: (Download from https://strawberryperl.com/)
- CMake: (Download from https://cmake.org/download/)
- Git: (Download from https://git-scm.com/)
- HIP SDK: (Download from https://www.amd.com/en/developer/resources/rocm-hub/hip-sdk.html)
Downloading the Source Code:
- rocBLAS: Download the latest version (https://github.com/ROCm/rocBLAS).
  - ROCm 5.7.0: Download rocBLAS 3.1.0 for ROCm 5.7.0
  - ROCm 6.1.2: Download rocBLAS 4.1.2 for ROCm 6.1.2
- Tensile: Download the appropriate version (https://github.com/ROCm/Tensile).
  - ROCm 5.7.0: Download Tensile 4.38.0 for ROCm 5.7.0
  - ROCm 6.1.2: Download Tensile 4.40.0 for ROCm 6.1.2
Patching Tensile for ROCm (For Advanced Users, Optional)
These steps are necessary only for specific ROCm configurations and may not be required in all cases. If optimized logic is already available for your GPU architecture, you may skip these steps. This applies especially when building libraries for xnack- features.
Determine Your ROCm Version:
- ROCm 5.7.0: Follow the instructions for "For hip 5.7.0" below.
- ROCm 6.1.2: Follow the instructions for "For hip 6.1.2" below.
Patches for Tensile:
For hip 5.7.0:
- Download Tensile-fix-fallback-arch-build.patch.
- Place the patch file in your Tensile folder (e.g., C:\ROCM\Tensile-rocm-5.7.0).
- Open a terminal within the Tensile folder.
- Apply the patch:
git apply Tensile-fix-fallback-arch-build.patch
- If nothing appears after applying, the patch was applied successfully. Otherwise, you may need to manually add the patch content to TensileCreateLibrary.py. You may also skip these steps if you have optimized logic available.
For hip 6.1.2:
- Place the patch file in your Tensile folder (e.g., C:\ROCM\Tensile-rocm-6.1.2).
- Open a terminal within the Tensile folder.
- Apply the patch:
git apply Tensile-fix-fallback-arch-build-hip-6.1.2.patch
- If nothing appears after applying, the patch was applied successfully. Otherwise, you may need to manually add the patch content to TensileCreateLibrary.py.
(Skip this step for ROCm 6.1.2.)
Note: Edit line 41 of rdeps.py in rocBLAS. The old repo references an outdated vcpkg, which will cause the build to fail. Update vcpkg by replacing it with the following line:
git clone -b 2024.02.14 https://github.com/microsoft/vcpkg
- vcpkg Version: If your vcpkg version was built after April 2023, replace the CMakeLists.txt in Tensile/tree/develop/Tensile/Source/lib/CMakeLists.txt with this version and place it in the same folder (e.g., rocm).
- For more information, see the official ROCm guide.
Build with rdeps and rmake:
- Navigate to the rocm/rocBLAS directory in your terminal.
- Run python rdeps.py. This script will configure your environment and download the necessary packages. (On Linux, use install.sh -d instead. If you encounter any errors, search online for a fix or try again.) Once it is done, move on to the next step.
- After rdeps.py completes, run the following (adjust paths and architectures as needed):
python rmake.py -a "gfx1101;gfx1103" --lazy-library-loading --no-merge-architectures -t "C:\rocm\Tensile-rocm-5.7.0"
Important:
- Replace "gfx1101;gfx1103" with the correct GPU or APU architecture names for your system. Separate them with ";" if you are building for more than one architecture.
- Make sure you read the "Editing Tensile/Common.py" note below before you build.
- For ROCm 6.1.2, change the path to C:\rocm\Tensile-rocm-6.1.2.
. - The specific commands and patch files may vary depending on your setup and ROCm version.
After successfully building rocBLAS from source, you need to replace the default rocblas.dll with your compiled version for your HIP programs to use it. Here's how:
- Locate your Compiled Files:
  - rocblas.dll: Located in C:\ROCM\rocBLAS-rocm-5.7.0\build\release\staging\ (or a similar path based on your build location).
  - Tensile data files: Found within C:\ROCM\rocBLAS-rocm-5.7.0\build\release\Tensile\library\ (adjust the path if needed).
- Replace the Default rocBLAS:
  - Copy rocblas.dll to C:\Program Files\AMD\ROCm\5.7\bin. This is where the HIP SDK looks for it by default. (Make sure to back up the original rocblas.dll.)
- Place Tensile Data Files:
  - Navigate to C:\Program Files\AMD\ROCm\5.7\bin\rocblas\
  - Replace the library folder with the new build (back up the original library folder by renaming it, e.g., to bklibrary). This is where you should place all the Tensile data files from your build directory.
- Test Your HIP Program:
  - Now, when you run your HIP program, it should use your newly compiled rocblas.dll and its associated Tensile data files.
Important Notes:
* For ROCm 6.1.2, change the path to C:\Program Files\AMD\ROCm\6.1\bin\.
* Always double-check the paths to ensure they match your installation configuration.
* Make sure the ROCm version in the bin directory matches the version of rocBLAS you built.
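To keep the version-dependent paths straight, the destinations can be derived from the ROCm version string. This is a minimal sketch assuming the default install root used throughout this guide; the function name `sdk_targets` is made up for illustration.

```python
from pathlib import PureWindowsPath


def sdk_targets(rocm_version, sdk_root=r"C:\Program Files\AMD\ROCm"):
    """Return where rocblas.dll and the Tensile library folder belong.

    The HIP SDK folder is named after major.minor only, so a version
    such as "6.1.2" maps to the "6.1" folder.
    """
    major_minor = ".".join(rocm_version.split(".")[:2])
    bin_dir = PureWindowsPath(sdk_root) / major_minor / "bin"
    return {
        "rocblas_dll": bin_dir / "rocblas.dll",
        "tensile_library": bin_dir / "rocblas" / "library",
    }


if __name__ == "__main__":
    for version in ("5.7.0", "6.1.2"):
        for name, path in sdk_targets(version).items():
            print(version, name, path)
```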
Note: Editing Tensile/Common.py
This file contains general parameters used by the Tensile library. To ensure compatibility with your GPU, you need to update two specific settings: globalParameters["SupportedISA"] and "CACHED_ASM_CAPS". Update both with your GPU's ISA and capability information. To do so, choose a similar GPU architecture (e.g., RDNA2 for gfx1031 and gfx1032), then copy its entry and add the copy below it with your GPU's ISA number, keeping the other available GPU data.
For HIP SDK 6.1.2, the CACHED_ASM_CAPS information has moved to Tensile/AsmCaps.py. Also edit architectureMap (lines 299 to 310) to add your architecture information and map it to the correct logic files. However, some optimized logic does not exist in the official release, so you need to create it. Otherwise, the build falls back to an unoptimized rocBLAS and library.
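The ISA triple for a target can be derived from its gfx name. This sketch assumes the usual LLVM naming scheme, in which the last two characters are hexadecimal minor and stepping digits and everything before them is the decimal major version. That rule is an assumption stated here, not something taken from Tensile's sources, so compare the result with the existing entries in Common.py before relying on it.

```python
def gfx_to_isa(gfx_name):
    """Convert a name such as 'gfx1031' into an ISA triple like (10, 3, 1)."""
    name = gfx_name.lower()
    if not name.startswith("gfx") or len(name) < 6:
        raise ValueError(f"unexpected target name: {gfx_name!r}")
    digits = name[3:]
    major = int(digits[:-2])        # decimal, one or two digits
    minor = int(digits[-2], 16)     # assumed hexadecimal
    stepping = int(digits[-1], 16)  # assumed hexadecimal, e.g. 'c' -> 12
    return (major, minor, stepping)


if __name__ == "__main__":
    for target in ("gfx1030", "gfx1031", "gfx906"):
        print(target, gfx_to_isa(target))
```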
Here's a step-by-step guide:
- Choose Your Architecture:
  - Select an existing architecture folder within rocBLAS\library\src\blas3\Tensile\Logic\asm_full (e.g., navi21). This will serve as a template for your new architecture.
  - Create a new folder with the name of your target architecture (e.g., navi22).
- Copy Files:
  - Copy all the files from your chosen template folder into your new architecture folder.
- Modify Files:
  - Open the copied files in a code editor (like VS Code or Visual Studio).
  - Search for instances of navi21 and replace them with navi22.
  - Update any gfx1030 references to gfx1031 (or your target GPU's identifier).
  - Find lines containing ISA: [10, 3, 0] and replace them with ISA: [10, 3, 1]. (Remember to adjust the ISA code according to your GPU.)
  - Rename all files within the new folder to reflect your architecture name (e.g., change 'navi21' to 'navi22'). You can use a file renaming tool like 'File Rename APP', a free application available in the Windows Store, for this task.
- If the build fails, the likely cause is that ROCm architectures have different capabilities. You need to ensure your rocblas is tailored to each architecture you're targeting:
  - gfx90c: Doesn't support 4x8II. Delete any logic or files related to 4x8II within the asm_full folder under rocBLAS\library\src\blas3\Tensile\Logic.
  - gfx1010: Doesn't support 8II. Do the same for files related to 8II in the asm_full folder.
  - Checking Logic Files: The newly named logic file is likely a critical place where these operations are defined. Carefully review it and remove any unsupported calculations.
- Use Your New Architecture:
  - In Tensile/Common.py, update "CACHED_ASM_CAPS" or the relevant entries in architectureMap to reference your new navi22 folder.
Important Notes:
- Carefully review the changes you make, as incorrect modifications can lead to errors.
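The copy, search-and-replace, and rename steps can be combined into one script. This is a minimal sketch: the function name `clone_logic_folder` is hypothetical, and it assumes the logic files are UTF-8 text and that the strings to replace appear exactly as written (for example ISA: [10, 3, 0] with those spaces). Check a few real files first, since their formatting may differ.

```python
from pathlib import Path


def clone_logic_folder(asm_full_dir, template, target, replacements):
    """Copy <template> to <target>, rewriting file contents and file names."""
    src = Path(asm_full_dir) / template
    dst = Path(asm_full_dir) / target
    dst.mkdir(exist_ok=False)  # refuse to overwrite an existing folder
    # The folder name itself (e.g. navi21 -> navi22) is always replaced.
    all_replacements = {template: target, **replacements}
    created = []
    for path in sorted(src.iterdir()):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8")
        for old, new in all_replacements.items():
            text = text.replace(old, new)
        new_path = dst / path.name.replace(template, target)
        new_path.write_text(text, encoding="utf-8")
        created.append(new_path.name)
    return created
```

For the navi21 to navi22 example above, the call would be `clone_logic_folder(asm_full, "navi21", "navi22", {"gfx1030": "gfx1031", "ISA: [10, 3, 0]": "ISA: [10, 3, 1]"})`, where `asm_full` is the path to the asm_full folder.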
(Skip this for HIP 5.7; necessary for HIP 6.1.2.)
Key Changes:
- Search for gfx1030: Begin by searching within both the Tensile and rocBLAS folders for instances of gfx1030. This identifier represents the gfx1030 GPU architecture.
- Replace with Your Target Architecture: Replace all occurrences of gfx1030 with the corresponding code for your desired GPU architecture (e.g., gfx1031).
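Before editing anything, it helps to list which files actually mention the identifier. This is a minimal sketch; the function name `files_mentioning` and the default set of file extensions are assumptions chosen for illustration, so widen the list if your search should cover other file types.

```python
from pathlib import Path


def files_mentioning(root, needle="gfx1030",
                     suffixes=(".txt", ".hpp", ".cpp", ".py", ".yaml")):
    """Return the files under `root` whose text contains `needle`."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            # errors="ignore" keeps the scan going on odd encodings.
            text = path.read_text(encoding="utf-8", errors="ignore")
            if needle in text:
                hits.append(path)
    return hits
```

Run it once on the Tensile folder and once on the rocBLAS folder, then review each hit by hand instead of replacing blindly.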
Important Files to Modify:
- Tensile: Within the Tensile folder, make changes to:
  - CMakeLists.txt: This file configures the build process and needs adjustments for new architectures.
  - AMDGPU.hpp: Defines the architecture-specific interface.
  - PlaceholderLibrary.hpp, Predicates.hpp, OclUtils.cpp: These files contain code related to specific functionalities, which might require modifications for your target GPU.
- rocBLAS: In the rocBLAS folder:
  - CMakeLists.txt: Similar to Tensile, update this file for your new architecture.
  - handle.cpp, tensile_host.cpp, handle.hpp: These files are likely involved in communication and interactions between rocBLAS and the GPU.
Caution:
- Modifying these core files can have unintended consequences.
Advanced Usage:
For maximum performance optimization, delve deeper into Tensile's logic files. Examples are provided in rocBLAS\library\src\blas3\Tensile\Logic\asm_full.
For truly optimized libraries, you'll need to fine-tune these logic files specifically for your target hardware. The Tensile Tuning Guide provides practical guidance and techniques for starting this process. Keep in mind that the tuning process requires patience, time, and a willingness to delve into Tensile's inner workings.
More detail can be found in tuning and tensile tuning .tex; a PDF version is available here.
Please feel welcome to edit this post and contribute optimized logic links. Remember to carefully consider the impact of any edits or additions.