looks like the analogue to Stable Diffusion for voice is WhisperSpeech, which works OK

“quick! to the batporter! the ginseng hives are modulating!” (low quality, baseline voice)