Tiny Diffusion
03 May, 2024

What is Tiny Audio Diffusion?

Tiny Audio Diffusion is a repository for generating short audio samples and training waveform diffusion models on a consumer-grade GPU with less than 2GB VRAM.


The purpose of this project is to provide access to stereo high-resolution (44.1kHz) conditional and unconditional audio waveform (1D U-Net) diffusion code for those interested in exploration but who have limited resources. There are many methods for audio generation on low-level hardware, but far fewer specifically for waveform-based diffusion.

The repository is built by heavily adapting code from Archinet’s audio-diffusion-pytorch library. A huge thank you to Flavio Schneider for his incredible open-source work in this field!


Direct waveform diffusion is inherently computationally intensive. For example, audio at the industry-standard 44.1kHz sampling rate requires 44,100 samples for just 1 second of audio - and double that for a stereo file. However, it has a significant advantage over many methods that reduce audio to spectrograms or downsample it: the network retains and learns from phase information. Phase is challenging to represent on its own in visual representations such as spectrograms, where it looks much like random noise. Because of this, many generative methods discard phase information and then estimate and regenerate it afterward. However, phase plays a key role in defining the timbral qualities of sounds and should not be dispensed with so easily.
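To make the cost concrete, here is the sample-count arithmetic in plain Python (no framework needed):

```python
# Sample-count arithmetic for raw waveform audio.
SAMPLE_RATE = 44_100   # industry-standard rate, samples per second
CHANNELS = 2           # stereo
seconds = 1.0

samples_per_channel = int(SAMPLE_RATE * seconds)  # samples the network sees per channel
total_values = samples_per_channel * CHANNELS     # values the network must model per second

print(samples_per_channel, total_values)
```

Every one of those values must be generated by the network, which is why raw waveform diffusion is so much heavier than spectrogram-based approaches.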

Waveform diffusion is able to retain this important feature as it does not perform any transforms on the audio before feeding it into the network. This is how humans perceive sounds, with both amplitude and phase information bundled together in a single signal. As mentioned previously, this comes at the expense of computational requirements and is often reserved for training on a cluster of GPUs with high speeds and lots of memory. Because of this, it is hard to begin to experiment with waveform diffusion with limited resources.

This repository seeks to offer some base code to those looking to experiment with and learn more about waveform diffusion on their own computer without having to purchase cloud resources or upgrade hardware. This goes for not only inference, but training your own models as well!

To make this feasible, however, there must be a tradeoff of quality, speed, and sample length. Because of this, I have focused on training base models for one-shot drum samples - as they are inherently short in sample length.

The current configuration is set up to be able to train ~0.75 second stereo samples at 44.1kHz, allowing for the generation of high-quality one-shot audio samples. The network configuration can be adjusted to improve the resolution, sample rate, training and inference speed, sample length, etc. but, of course, more hardware resources will be required.
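The "~0.75 second" figure is consistent with a power-of-two sample length, which 1D U-Nets typically require so that each downsampling stage divides the input evenly. A quick sketch (2**15 is an assumption for illustration; check the repo's exp/drum_diffusion.yaml for the actual configured length):

```python
# A 1D U-Net with repeated 2x downsampling wants a power-of-two input length.
SAMPLE_RATE = 44_100
length = 2 ** 15                 # 32_768 samples per channel (assumed, see note above)
duration = length / SAMPLE_RATE  # resulting clip duration in seconds

print(f"{duration:.3f} s")       # roughly the ~0.75 s quoted above
```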

Other diffusion methods, such as diffusion in the latent space (Stable Diffusion’s secret sauce), offer different tradeoffs against this repo’s raw waveform diffusion in quality, memory requirements, speed, etc. To remain up-to-date with the latest research in generative audio, I recommend this repo: https://github.com/archinetai/audio-ai-timeline

Also recommended is Harmonai’s community project, Dance Diffusion, which implements similar functionality to this repo on a larger scale with several pre-trained models. Colab notebook available.


Follow these steps to set up an environment for both generating audio samples and training models.

NOTE: To use this repo with a GPU, you must have a CUDA-capable GPU and have the CUDA toolkit installed for your specific system (e.g. Linux, x86_64, WSL-Ubuntu). More information can be found here.

1. Create a Virtual Environment:

Ensure that Anaconda (or Miniconda) is installed and activated. From the command line, cd into the setup/ folder and run the following lines:

conda env create -f environment.yml
conda activate tiny-audio-diffusion

This will create and activate a conda environment from the setup/environment.yml file and install the dependencies in setup/requirements.txt.

2. Install Python Kernel For Jupyter Notebook

Run the following line to register the current environment as a Jupyter kernel for running the inference notebook.

python -m ipykernel install --user --name tiny-audio-diffusion --display-name "tiny-audio-diffusion (Python 3.10)"

3. Define Environment Variables

Rename .env.tmp to .env and replace the entries with your own variables (example values are random).

# Required if using Weights & Biases (W&B) logger
WANDB_PROJECT=tiny_drum_diffusion # Custom W&B name for current project
WANDB_ENTITY=johnsmith # W&B username
WANDB_API_KEY=a21dzbqlybbzccqla4txa21dzbqlybbzccqla4tx # W&B API key
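These variables just need to be present in the process environment when training runs. As a minimal stdlib-only sketch of what loading a .env file involves (a hypothetical helper, not the repo's actual mechanism - the repo may use a library such as python-dotenv):

```python
import os

def load_dotenv(path=".env"):
    """Minimal .env loader: KEY=VALUE lines; '#' comments ignored.

    Simplistic by design (quoted values are not handled); existing
    environment variables are not overwritten.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            value = value.split("#", 1)[0].strip()  # drop any inline comment
            os.environ.setdefault(key.strip(), value)
```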

NOTE: Sign up for a Weights & Biases account to log audio samples, spectrograms, and other metrics while training (it’s free!).

W&B logging example for this repo here.

Pre-trained Models

Pretrained models can be found on Hugging Face (each model contains a .ckpt and .yaml file):

Percussion (all drum types): crlandsc/tiny-audio-diffusion-percussion

Follow current model training progress here (more models will be added as they are trained).

Pre-trained models can be downloaded to generate samples via the inference notebook. They can also be used as a base model to fine-tune on custom data. It is recommended to create subfolders within the saved_models folder to store each model’s .ckpt and .yaml files.
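As a sketch of the suggested layout (the folder name "percussion" here is just an example):

```shell
# One subfolder per model under saved_models/
mkdir -p saved_models/percussion
# Place the files downloaded from Hugging Face inside it:
#   saved_models/percussion/<model>.ckpt
#   saved_models/percussion/config.yaml
ls saved_models
```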


Hugging Face Spaces

Generate samples without code on 🤗 Hugging Face Spaces

Jupyter Notebook

Audio Sample Generation

Current Capabilities:

  • Unconditional Generation

  • Conditional “Style-transfer” Generation

Open Inference.ipynb in Jupyter Notebook and follow the instructions to generate new audio samples. Ensure that the "tiny-audio-diffusion (Python 3.10)" kernel is active in Jupyter and that you have downloaded the pre-trained model of interest from Hugging Face.


The model architecture has been constructed with PyTorch Lightning and Hydra frameworks. All configurations for the model are contained within .yaml files and should be edited there rather than hardcoded.

exp/drum_diffusion.yaml contains the default model configuration. Additional custom model configurations can be added to the exp folder.

Custom models can be trained or fine-tuned on custom datasets. Datasets should consist of a folder of .wav audio files with a 44.1kHz sampling rate.
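A quick way to verify a dataset meets that requirement is Python's built-in wave module. This checker is a hypothetical helper for convenience, not part of the repo:

```python
import wave
from pathlib import Path

def check_dataset(folder, expected_rate=44_100):
    """Return (path, rate) pairs for any .wav whose sample rate is wrong."""
    bad = []
    for path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(path), "rb") as wf:
            rate = wf.getframerate()
        if rate != expected_rate:
            bad.append((path, rate))
    return bad
```

Resampling any flagged files to 44.1kHz before training avoids pitch and duration artifacts in the generated samples.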

To train or fine-tune models, run one of the following commands in the terminal from the repo’s root folder, replacing <path/to/your/train/data> with the path to your custom training set.

Train model from scratch (on CPU, not recommended):

python train.py exp=drum_diffusion datamodule.dataset.path=<path/to/your/train/data>

Train model from scratch (on GPU):

python train.py exp=drum_diffusion trainer.gpus=1 datamodule.dataset.path=<path/to/your/train/data>

NOTE: To train on GPU, you must have a CUDA-capable GPU and have the CUDA toolkit installed for your specific system (e.g. Linux, x86_64, WSL-Ubuntu). More information can be found here.

Resume run from a checkpoint (with GPU):

python train.py exp=drum_diffusion trainer.gpus=1 +ckpt=</path/to/checkpoint.ckpt> datamodule.dataset.path=<path/to/your/train/data>

Repository Structure

The structure of this repository is as follows:

├── main
│   ├── diffusion_module.py - contains pl model, data loading, and logging functionalities for training
│   └── utils.py - contains utility functions for training
├── exp
│   └── *.yaml - Hydra configuration files
├── setup
│   ├── environment.yml - file to set up conda environment
│   └── requirements.txt - contains repo dependencies
├── images - directory containing images for README.md
│   └── *.png
├── samples - directory containing sample outputs from tiny-audio-diffusion models
│   └── *.wav
├── .env.tmp - temporary environment variables (rename to .env)
├── .gitignore
├── README.md
├── Inference.ipynb - Jupyter notebook for running inference to generate new samples
├── config.yaml - Hydra base configs
├── train.py - script for training
├── data - directory to host custom training data
│   └── wav_dataset
│       └── (*.wav)
└── saved_models - directory to host model checkpoints and hyper-parameters for inference
    └── (kicks/snare/etc.)
        ├── (*.ckpt) - pl model checkpoint file
        └── (config.yaml) - pl model hydra hyperparameters (required for inference)