Lotte V edited this page Jan 11, 2026 · 2 revisions

Embeds

Embeds are optional parameters that can be trained into a model to add extra controllable features.

Precautions

  • Training variance embeds together with duration can make the model unstable. If this happens, refer to this document for best practices for training.
  • Variance embeds (including pitch) may make certain types of vocal modes (such as "Soft", "Power", etc.) less relevant. You can still train both, but their effects may overlap, leading to diminishing returns.
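The two precautions above can be turned into a quick pre-training checklist. The following is a minimal sketch, not part of DiffSinger: the function `precaution_notes` and its arguments are hypothetical, and it assumes that pitch counts as a variance embed for both checks.

```python
from typing import Iterable, List

# Embed names come from this page; everything else here is hypothetical.
# Assumption: pitch is treated as a variance embed for both precautions.
VARIANCE_EMBEDS = {"breathiness", "tension", "voicing", "energy", "pitch"}


def precaution_notes(
    embeds: Iterable[str], train_duration: bool, uses_vocal_modes: bool
) -> List[str]:
    """Return a reminder for each precaution that applies to the plan."""
    # Keep only the requested embeds that are variance embeds.
    selected = {e.strip().lower() for e in embeds} & VARIANCE_EMBEDS
    notes: List[str] = []
    if selected and train_duration:
        notes.append(
            "Variance embeds and duration are trained together: "
            "watch for instability."
        )
    if selected and uses_vocal_modes:
        notes.append(
            "Variance embeds may overlap with vocal modes: "
            "expect diminishing returns."
        )
    return notes


if __name__ == "__main__":
    plan = ["tension", "pitch"]
    for note in precaution_notes(plan, train_duration=True, uses_vocal_modes=True):
        print(note)
```

A plan that contains only acoustic embeds, such as velocity, produces no notes, because neither precaution concerns them.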

Variance embeds

Variance embeds, aside from pitch, will all be exported to the variance.onnx file for distribution. Note that pitch uses its own pitch.onnx file.

Breathiness

Controls the breathiness of the voice. A higher value makes the voice breathier; a lower value makes it less breathy. Has less effect on the "power" of the voice than voicing and energy.

Tension

Makes the voice more or less "tense", depending on the parameter value. A higher value also usually adds more power to the voice, along with a somewhat nasal quality.

Voicing

Controls the amount of voicing in the voice. A higher value will add a degree of power to the voice, without the nasal quality of tension. A lower value will create devoicing, in a manner distinct from breathiness, with a more "lax" sound. (Note: In order to increase voicing in the OpenUtau software, you must manually edit the maximum value in the Expressions menu of the program.)

Energy

Controls the energetic quality of the voice. Higher energy will make the voice sound more energetic, whereas a lower value will make the voice sound more "lax". Note: this parameter will likely be deprecated in DiffSinger v3. Voicing has a similar effect and is considered more stable.

Pitch

This embed is trained on the pitch variations in the original data. With this parameter enabled, you can choose to load automatic pitch values into your project file. As mentioned above, pitch is exported to its own ONNX model.

Acoustic embeds

Acoustic embeds are part of the acoustic model and aren't exported separately. This applies to both the checkpoints and the exported ONNX file.

Velocity

Controls the velocity of phonemes, without affecting the duration. A higher value will make phonemes sound more "snappy", whereas a lower value will make them sound more "slurred".

Gender (random pitch shifting)

Controls the formant quality of the voice. A higher value will give a "deeper", more masculine quality to the voice, whereas a lower value will make the voice sound more "childlike".
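The export layout described on this page can be summarised as a small lookup. This is an illustrative sketch only: the helper `export_target` is hypothetical and not part of DiffSinger, and because the page does not name the acoustic model's file, the sketch returns a description instead of a file name for acoustic embeds.

```python
# Illustrative lookup only; this is not part of DiffSinger itself.
# Embed names and the two file names are taken from this page.
VARIANCE_ONNX_EMBEDS = {"breathiness", "tension", "voicing", "energy"}
ACOUSTIC_EMBEDS = {"velocity", "gender"}


def export_target(embed: str) -> str:
    """Return where a trained embed ends up after export, per this page."""
    name = embed.strip().lower()
    if name == "pitch":
        # Pitch is a variance embed but uses its own file.
        return "pitch.onnx"
    if name in VARIANCE_ONNX_EMBEDS:
        return "variance.onnx"
    if name in ACOUSTIC_EMBEDS:
        # Acoustic embeds live inside the acoustic model itself.
        return "acoustic model (no separate file)"
    raise ValueError(f"unknown embed: {embed!r}")


if __name__ == "__main__":
    for embed in ("breathiness", "pitch", "gender"):
        print(f"{embed}: {export_target(embed)}")
```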
