Neutone Blog

Implementing models with overlap-add in Neutone
June 17, 2023

Throughout our workshops and tutorials so far we have extensively covered how to implement Neutone models using causal architectures, such as networks based on causal convolutions or recurrent neural networks. These are a natural fit for a VST, which processes output in real time, one buffer at a time, without being able to see into the future.

However, in some cases we simply don’t have access to the underlying architecture of the network, for example when using pretrained networks. Because these models are not designed to process audio causally, running them one buffer at a time produces discontinuities at the buffer boundaries, heard as clicks and pops in the output audio, as in the example below.

No overlap-add:

With overlap-add:

A common technique to work around this problem is called overlap-add. In a nutshell, at each step we run inference on a buffer that is longer than the one we plan to output, keeping an overlap between subsequent inputs. When outputting the current buffer, we fade out the overlap from the previous output and fade in the overlap from the current output. The picture below makes this easy to visualize.


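To make the crossfade concrete, here is a minimal sketch (illustrative code, not part of the Neutone SDK) of why triangular fades work: the two fades sum to one at every sample, so wherever consecutive outputs agree in the overlap region, the crossfade reconstructs the signal exactly.

```python
import torch

overlap = 32
fade_up = torch.linspace(0.0, 1.0, overlap)
fade_down = torch.linspace(1.0, 0.0, overlap)

# Pretend two consecutive inference calls produced the same audio in the
# overlap region (the ideal case): the crossfade then reconstructs it exactly.
prev_tail = torch.sin(torch.arange(overlap) * 0.1)  # last samples of previous output
curr_head = prev_tail.clone()                       # first samples of current output
crossfaded = prev_tail * fade_down + curr_head * fade_up
print(torch.allclose(crossfaded, prev_tail))  # True: fades sum to 1 everywhere
```

In practice the two outputs will differ slightly in the overlap region, and the crossfade smooths that difference out instead of leaving a hard discontinuity.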
But how do we implement this in Neutone? We start with a simple wrapper, such as the one available in the SDK for the clipper. The model always receives a buffer of a certain size, but if we want additional overlap we have to keep track of this state within the model.

We start off by defining a setup function. This allocates the input and output buffers for the model as well as the triangle-shaped faders. buffer_size is given by the DAW at runtime and is guaranteed to be one of the values defined in get_native_buffer_sizes. When the buffer size in the DAW changes, the wrapper automatically chooses a suitable sample rate and buffer size and calls the set_model_sample_rate_and_buffer_size method on the model, which in turn initializes these buffers. Notice that in this example the input to the network is not only of size buffer_size but also includes the overlap samples. This structure naturally induces a delay of overlap_n_samples samples that we have to report back to the DAW.

def setup(self, buffer_size: int):
    self.buffer_size = int(buffer_size)
    self.overlap_n_samples = 32 # Could be dynamically defined based on the buffer_size
    self.model_segment_size = self.buffer_size + self.overlap_n_samples
    self.fade_up = torch.linspace(0, 1, max(self.overlap_n_samples, 1))
    self.fade_down = torch.linspace(1, 0, max(self.overlap_n_samples, 1))
    self.in_buf = torch.zeros(1 if self.is_input_mono() else 2, self.model_segment_size)
    self.in_buf_tmp = torch.zeros(1 if self.is_input_mono() else 2, self.model_segment_size)
    self.out_buf = torch.zeros(1 if self.is_output_mono() else 2, self.model_segment_size)

def get_native_buffer_sizes(self) -> List[int]:
    return [128, 256]

def set_model_sample_rate_and_buffer_size(self, sample_rate: int, n_samples: int) -> bool:
    self.setup(n_samples)
    return True

def calc_min_delay_samples(self) -> int:
    return self.overlap_n_samples

def reset(self) -> None:
    # Clear the stored state so no stale audio leaks between runs
    self.in_buf.zero_()
    self.in_buf_tmp.zero_()
    self.out_buf.zero_()

With all the buffers set up we can now look at the forward pass. Here we simply wrap the internal call to the model with a couple of operations that rotate the input buffer and apply the faders to the output buffer.

def do_forward_pass(self, x: Tensor, params: Dict[str, Tensor]) -> Tensor:
    # Shift the last overlap samples of the previous input to the front,
    # then append the incoming buffer
    self.in_buf[:, :self.overlap_n_samples] = self.in_buf[:, -self.overlap_n_samples:]
    self.in_buf[:, -self.buffer_size:] = x

    out = self.model(self.in_buf.unsqueeze(0))[0]

    # Crossfade: fade out the tail of the previous output,
    # fade in the head of the current output, and sum them
    self.out_buf[:, -self.overlap_n_samples:] *= self.fade_down
    out[:, :self.overlap_n_samples] *= self.fade_up
    out[:, :self.overlap_n_samples] += self.out_buf[:, -self.overlap_n_samples:]
    self.out_buf = out

    return out[:, :self.buffer_size]
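To see that the rotate-and-crossfade logic above behaves as expected, here is a self-contained sketch of the same operations with an identity "model" standing in for the network (so the correct result is known in advance): the plugin output should simply be the input delayed by overlap_n_samples. The names mirror the wrapper, but this is not the actual Neutone SDK class.

```python
import torch

buffer_size, overlap = 128, 32
segment = buffer_size + overlap
fade_up = torch.linspace(0.0, 1.0, overlap)
fade_down = torch.linspace(1.0, 0.0, overlap)
in_buf = torch.zeros(1, segment)
out_buf = torch.zeros(1, segment)

def process(x: torch.Tensor) -> torch.Tensor:
    global out_buf
    # Same rotate + crossfade as do_forward_pass above
    in_buf[:, :overlap] = in_buf[:, -overlap:].clone()
    in_buf[:, -buffer_size:] = x
    out = in_buf.clone()  # identity "model": output equals the input segment
    out_buf[:, -overlap:] *= fade_down
    out[:, :overlap] *= fade_up
    out[:, :overlap] += out_buf[:, -overlap:]
    out_buf = out
    return out[:, :buffer_size]

signal = torch.randn(1, buffer_size * 4)
chunks = [process(signal[:, i:i + buffer_size])
          for i in range(0, signal.shape[1], buffer_size)]
result = torch.cat(chunks, dim=1)
# After the initial overlap samples of latency, the output matches the input
print(torch.allclose(result[:, overlap:], signal[:, :-overlap], atol=1e-6))  # True
```

The first overlap samples of the output are the zeros from the freshly initialized buffers, which is exactly the delay we report via calc_min_delay_samples.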

There are a few caveats:

  • We cannot combine this approach with cached causal models, such as those based on cached convolutions, because the overlap samples are fed to the network twice
  • It introduces computational overhead for the same reason
  • If the network is unstable, it needs a larger number of overlap samples, which in turn increases both the computational overhead and the delay of the resulting plugin
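To put rough numbers on the last two caveats, here is some back-of-the-envelope arithmetic assuming a 48 kHz sample rate and the sizes used in the setup function above (illustrative values, not plugin measurements):

```python
sample_rate = 48000
buffer_size = 128
overlap = 32

# Samples processed per sample output: every overlap sample is inferred twice
compute_overhead = (buffer_size + overlap) / buffer_size
# Extra latency reported to the DAW, in milliseconds
delay_ms = overlap / sample_rate * 1000
print(compute_overhead)        # 1.25
print(round(delay_ms, 2))      # 0.67
```

A 32-sample overlap is cheap here, but an unstable network needing an overlap of several thousand samples pushes both numbers up quickly.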

As a result, it is preferable to use cached convolutions, RNNs, causal transformers or similar approaches whenever possible. But in the rare cases where these are not an option, overlap-add is a viable alternative.

We are currently testing Demucs in Neutone using this approach, but unfortunately the delay becomes quite large due to the segment sizes required for acceptable quality. We do, however, make the model available at the URL below if you want to give it a try, and we might release it in a later version of the plugin. With the default parameters the model outputs only vocals; each knob controls the amount of one stem in the output. From left to right: bass, drums, vocals and other.