Speech Settings

A guide on utilizing the stability and similarity sliders in Spectral for personalized voice performances. Discover how to find the right balance for emotive and consistent audio outputs.

Our users have discovered various workflows that suit their preferences. The most common approach is to set stability to around 50 and similarity to around 80, then make only minor adjustments. However, how well these values work depends on the original voice and the desired performance style.


It’s essential to understand that the AI operates non-deterministically; setting specific values for the sliders doesn’t guarantee identical results each time. Instead, each slider value defines a range that controls how much the output can vary between generations. A lower stability setting widens that range, which often results in a more expressive performance, although the outcome also depends on the inherent characteristics of the voice.
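The idea of a slider acting as a range can be illustrated with a toy model. The sketch below is not how the model works internally; `sample_expressiveness`, the 0 to 100 scale, and the uniform distribution are all assumptions made purely for illustration. It only shows that a lower stability value lets each take land further from the average.

```python
import random
import statistics
from typing import List


def sample_expressiveness(stability: int, takes: int = 200, seed: int = 0) -> List[float]:
    """Toy model: draw one 'expressiveness' value per generated take.

    The width of the range shrinks as stability rises. This is an
    illustration only, not the real model's behavior.
    """
    if not 0 <= stability <= 100:
        raise ValueError("stability must be between 0 and 100")
    rng = random.Random(seed)
    spread = (100 - stability) / 100  # lower stability -> wider range
    return [rng.uniform(-spread, spread) for _ in range(takes)]


low = statistics.pstdev(sample_expressiveness(20))
high = statistics.pstdev(sample_expressiveness(80))
print(f"variation at stability 20: {low:.3f}")
print(f"variation at stability 80: {high:.3f}")
```

In this toy model the takes at stability 20 vary more than those at stability 80, which mirrors why low stability settings usually call for generating several times.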


Hovering over the ! icon next to the sliders provides additional information.


For a livelier and more dramatic performance, lower the stability slider and generate several times until you get a take that suits you.


Conversely, for a more serious performance, set the stability slider higher; at very high values the delivery can border on monotone. Because higher stability produces more consistent output, you may not need to generate as many times to achieve the desired result. Experimentation will help you determine the optimal settings for your needs!
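The "generate a few times" workflow described above can be sketched as a small loop. Everything here is hypothetical: `generate_takes` is an illustrative helper, `fake_synthesize` stands in for whatever text-to-speech call you actually use, and the stability values 30 and 85 are examples rather than recommendations.

```python
from typing import Callable, List


def generate_takes(
    synthesize: Callable[[str, int], bytes],
    text: str,
    stability: int,
    takes: int,
) -> List[bytes]:
    """Call a (hypothetical) synthesis function several times with the same settings."""
    if takes < 1:
        raise ValueError("takes must be at least 1")
    return [synthesize(text, stability) for _ in range(takes)]


def fake_synthesize(text: str, stability: int) -> bytes:
    """Stand-in for a real text-to-speech call; returns placeholder bytes."""
    return f"{text}|stability={stability}".encode()


# Dramatic read: lower stability, several takes to choose from.
dramatic = generate_takes(fake_synthesize, "Run!", stability=30, takes=4)

# Serious read: higher stability, so a single take may be enough.
serious = generate_takes(fake_synthesize, "The meeting is at noon.", stability=85, takes=1)

print(len(dramatic), len(serious))
```

With a real synthesis function each take would differ, and you would listen to the candidates and keep the one you prefer.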


The stability slider controls how consistent the voice is and how much randomness there is between generations. Decreasing this slider expands the emotional range of the voice, but this is also influenced by the characteristics of the original voice. If set too low, it might produce erratic, excessively random performances and overly fast speech. Conversely, setting it too high can result in a monotonous voice with limited emotion.


The similarity slider determines how faithfully the AI should mimic the original voice. If the original audio is of low quality and the similarity slider is set too high, the AI may replicate artifacts or background noise present in the original recording, potentially affecting the quality of the mimicked voice.
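The guidance for the two sliders can be captured in a small settings object. This is a minimal sketch under stated assumptions: `SliderSettings` and `cautions` are hypothetical names, the 0 to 100 scale is assumed from the values quoted in this guide, and the numeric cut-offs are illustrative only, since the guide gives no exact thresholds.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SliderSettings:
    """Hypothetical container for the two sliders (assumed 0-100 scale)."""
    stability: int = 50   # common starting point mentioned in this guide
    similarity: int = 80  # common starting point mentioned in this guide

    def __post_init__(self) -> None:
        for name in ("stability", "similarity"):
            value = getattr(self, name)
            if not 0 <= value <= 100:
                raise ValueError(f"{name} must be between 0 and 100, got {value}")


# Illustrative cut-offs only; the guide gives no numeric thresholds.
LOW_STABILITY = 25
HIGH_STABILITY = 90
HIGH_SIMILARITY = 90


def cautions(settings: SliderSettings, noisy_source: bool = False) -> List[str]:
    """Return the guide's warnings that apply to these settings."""
    notes = []
    if settings.stability < LOW_STABILITY:
        notes.append("stability is low: output may be erratic or fast-paced")
    if settings.stability > HIGH_STABILITY:
        notes.append("stability is high: output may sound monotonous")
    if noisy_source and settings.similarity > HIGH_SIMILARITY:
        notes.append("similarity is high: artifacts or noise in the source may be reproduced")
    return notes


print(cautions(SliderSettings()))
print(cautions(SliderSettings(stability=10, similarity=95), noisy_source=True))
```

Keeping the warnings next to the values makes it easier to see which trade-off a given combination is likely to hit before spending generations on it.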

Style Exaggeration

This feature aims to amplify the original speaker’s style. Setting it to any value other than 0 consumes more computational resources and may increase latency. Be aware that it may also slightly reduce the model’s stability, because the model is working to emphasize and imitate the original voice’s style.


As a general recommendation, keep this setting at 0 for optimal performance.

Speaker Boost

This setting is straightforward: it increases the similarity to the original speaker. However, it carries a slightly higher computational load, which increases latency. The differences it introduces are generally subtle.
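Both optional settings trade extra latency for their effect, so it can help to check which ones are active before generating. The sketch below assumes a hypothetical `ExtraSettings` container and a hypothetical `latency_sources` helper; the field names and the `False` default for speaker boost are choices made for this example, not documented defaults.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ExtraSettings:
    """Hypothetical container for the two optional settings."""
    style_exaggeration: float = 0.0  # the guide recommends keeping this at 0
    speaker_boost: bool = False      # default chosen for this sketch only


def latency_sources(settings: ExtraSettings) -> List[str]:
    """List the settings that, per the guide, add computational load and latency."""
    sources = []
    if settings.style_exaggeration != 0:
        sources.append("style_exaggeration")
    if settings.speaker_boost:
        sources.append("speaker_boost")
    return sources


print(latency_sources(ExtraSettings()))
print(latency_sources(ExtraSettings(style_exaggeration=0.3, speaker_boost=True)))
```

An empty list means neither setting is contributing extra latency, which matches the recommended configuration with style exaggeration at 0.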