[wrong] Mozilla DeepSpeech stuff was Re: How to isolate/figure out whispers in audio clip?
https://github.com/mozilla/DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.

DeepSpeech is an open-source Speech-To-Text engine, using a model trained by machine learning techniques based on Baidu's Deep Speech research paper <https://arxiv.org/abs/1412.5567>. Project DeepSpeech uses Google's TensorFlow <https://www.tensorflow.org/> to make the implementation easier.

Documentation for installation, usage, and training models is available on deepspeech.readthedocs.io <https://deepspeech.readthedocs.io/?badge=latest>.

For the latest release, including pre-trained models and checkpoints, see the latest release on GitHub <https://github.com/mozilla/DeepSpeech/releases/latest>.

For contribution guidelines, see CONTRIBUTING.rst. For contact and support information, see SUPPORT.rst.
There are two different parts to this: _using deepspeech at all_, and _training models to process new kinds of data_. The latter may need a much beefier system than the former. Training can apparently also happen much faster with much less data (i.e. tuning an existing model), but information on doing that easily, publicly, and normally has not reached me yet.

Welcome to DeepSpeech's documentation! <https://deepspeech.readthedocs.io/en/r0.9/?badge=latest#welcome-to-deepspeech-s-documentation>

To install and use DeepSpeech all you have to do is:

# Create and activate a virtualenv
virtualenv -p python3 $HOME/tmp/deepspeech-venv/
source $HOME/tmp/deepspeech-venv/bin/activate

# Install DeepSpeech
pip3 install deepspeech

# Download pre-trained English model files
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0....
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0....

# Download example audio files
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/audio-0.9.3.t...
tar xvf audio-0.9.3.tar.gz

# Transcribe an audio file
deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio audio/2830-3980-0043.wav

A pre-trained English model is available for use and can be downloaded following the instructions in the usage docs <https://deepspeech.readthedocs.io/en/r0.9/USING.html#usage-docs>. For the latest release, including pre-trained models and checkpoints, see the GitHub releases page <https://github.com/mozilla/DeepSpeech/releases/latest>.

Quicker inference can be performed using a supported NVIDIA GPU on Linux. See the release notes <https://github.com/mozilla/DeepSpeech/releases/latest> to find which GPUs are supported. To run deepspeech on a GPU, install the GPU specific package:

# Create and activate a virtualenv
virtualenv -p python3 $HOME/tmp/deepspeech-gpu-venv/
source $HOME/tmp/deepspeech-gpu-venv/bin/activate

# Install DeepSpeech CUDA enabled package
pip3 install deepspeech-gpu

# Transcribe an audio file.
deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio audio/2830-3980-0043.wav

Please ensure you have the required CUDA dependencies <https://deepspeech.readthedocs.io/en/r0.9/USING.html#cuda-inference-deps>.

See the output of deepspeech -h for more information on the use of deepspeech. (If you experience problems running deepspeech, please check required runtime dependencies <https://deepspeech.readthedocs.io/en/r0.9/USING.html#runtime-deps>.)

Introduction - Using a Pre-trained Model <https://deepspeech.readthedocs.io/en/r0.9/USING.html> - ...
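One practical note for trying this on your own recordings: the released English model expects 16 kHz, 16-bit, mono WAV input, and the deepspeech client will warn (and try to resample) if you hand it anything else. A minimal sketch of converting a file first, assuming sox is installed and using made-up file names; other input formats may need extra libsox-fmt packages:

# downmix/resample a recording to 16 kHz, 16-bit, mono
sox loud-44k-stereo.wav -r 16000 -b 16 -c 1 recording-16k-mono.wav

# then transcribe it as above
deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio recording-16k-mono.wav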
# on my raspberry pi 4, running ubuntu hirsute arm64
$ pip3 install deepspeech
Defaulting to user installation because normal site-packages is not writeable
ERROR: Could not find a version that satisfies the requirement deepspeech (from versions: none)
ERROR: No matching distribution found for deepspeech
# uhhh....
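If you hit the same "no matching distribution" error, it usually just means PyPI has no prebuilt wheel for your interpreter and architecture. A quick sketch of how to see what this pip would accept (the debug subcommand needs a reasonably recent pip):

uname -m                                       # aarch64 here, i.e. 64 bit ARM
python3 --version                              # the 0.9.3 wheels only cover certain CPython versions
python3 -m pip debug --verbose | head -n 40    # prints the wheel tags this pip can install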
Note that there is also a Chinese project called DeepSpeech that likely does the same thing and is slightly more recently updated: https://github.com/paddlepaddle/deepspeech . I'm ignoring it for now.

Mozilla DeepSpeech apparently hasn't maintained support for Raspberry Pis, but it's still doable. If you have a Raspberry Pi 4, remember that you need to install a 64 bit OS if you want to run aarch64 code. This is from 1 week ago: https://github.com/mozilla/DeepSpeech/issues/3667

Summary

Because the DeepSpeech project is not supporting Python bindings for Python 3.8 and later for 64 bit arm systems (aarch64), e.g. used on the Raspberry Pi 4 running Ubuntu server 20.04, DeepSpeech and DeepSpeech-TFLite must be compiled manually.

Instructions

Bazel

You have to install Bazel <https://github.com/bazelbuild/bazel/releases> 3.1.0 first:

wget https://github.com/bazelbuild/bazel/releases/download/3.7.2/bazel-3.7.2-linu...
chmod 755 bazel-3.7.2-linux-arm64
mv bazel-3.7.2-linux-arm64 /usr/local/bin/bazel
cd "/usr/local/lib/bazel/bin" && curl -fLO https://releases.bazel.build/3.1.0/release/bazel-3.1.0-linux-x86_64 && chmod +x bazel-3.1.0-linux-x86_64

Install Python 3.8

Install Python 3.8 locally and set it as default (this should be default on Ubuntu 20.04 and can be skipped):

git clone https://github.com/pyenv/pyenv.git ~/.pyenv
# Follow instructions on https://github.com/pyenv/pyenv
sudo apt-get install libbz2-dev libssl-dev libreadline-dev libsqlite3-dev
pyenv install 3.8.10
pyenv local 3.8.10

Compile DeepSpeech

git clone https://github.com/mozilla/DeepSpeech.git
cd DeepSpeech
git checkout v0.9.3
git submodule sync tensorflow/
git submodule update --init tensorflow/
cd tensorflow

Configure: when configuring TensorFlow, use "-march=armv8-a+crc -Wno-sign-compare" when you are asked:

./configure

Compile: NOTE: this is targeting TFLite, but in the comments the compilation for standard DeepSpeech is also given.
# Use tmux to keep the process running when logged in over ssh; if doing this locally, you can skip it
tmux
bazel clean

# For non TFLite version:
#bazel --host_jvm_args=-Xmx6000m build --workspace_status_command="bash native_client/bazel_workspace_status_cmd.sh" --local_cpu_resources=1 -c opt --copt=-O3 --copt="-D_GLIBCXX_USE_CXX11_ABI=0" --config=monolithic --config=nogcp --config=nohdfs --config=nonccl --copt=-fvisibility=hidden --config=noaws --copt=-ftree-vectorize --copt=-funsafe-math-optimizations --copt=-ftree-loop-vectorize --copt=-fomit-frame-pointer //native_client:libdeepspeech.so

# For TFLite version
bazel --host_jvm_args=-Xmx6000m build --workspace_status_command="bash native_client/bazel_workspace_status_cmd.sh" --local_cpu_resources=1 -c opt --copt=-O3 --copt="-D_GLIBCXX_USE_CXX11_ABI=0" --define=runtime=tflite --config=monolithic --config=nogcp --config=nohdfs --config=nonccl --copt=-fvisibility=hidden --config=noaws --copt=-ftree-vectorize --copt=-funsafe-math-optimizations --copt=-ftree-loop-vectorize --copt=-fomit-frame-pointer //native_client:libdeepspeech.so //native_client:tflite

cd ../native_client
sudo apt-get install -y libsox-dev libpng-dev libgsm1-dev libmagic-dev libltdl-dev liblzma-dev libbz2-dev swig
make deepspeech
# Do not execute `PREFIX=/usr/local sudo make install` like instructed in the manual, otherwise the libdeepspeech.so will not be included in the Python wheel

Python DeepSpeech Bindings

Patch

Apply this patch to use the correct naming for the Python wheel:

diff --cc native_client/definitions.mk
index a3af0970,72c12e3a..00000000
--- a/native_client/definitions.mk
+++ b/native_client/definitions.mk
@@@ -51,7 -52,11 +52,11 @@@ SOX_LDFLAGS := `pkg-config --libs s
  endif # OS others
  PYTHON_PACKAGES := numpy${NUMPY_BUILD_VERSION}
  ifeq ($(OS),Linux)
+ ifeq ($(PROCESSOR),x86_64)
 -PYTHON_PLATFORM_NAME := --plat-name manylinux1_x86_64
 +PYTHON_PLATFORM_NAME ?= --plat-name manylinux1_x86_64
+ else
 -PYTHON_PLATFORM_NAME := --plat-name linux_${PROCESSOR}
++PYTHON_PLATFORM_NAME ?= --plat-name linux_${PROCESSOR}
+ endif
  endif
  endif

Create Python Bindings

NOTE: this is targeting TFLite, but in the comments the compilation for standard DeepSpeech is also given.

cd python
# For non TFLite version
# make bindings
# For TFLite version
make SETUP_FLAGS="--project_name deepspeech_tflite" bindings
pip install dist/deepspeech*.whl
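Assuming the wheel built and installed, a quick smoke test before going further (untested sketch; the distribution name depends on whether you built the plain or the _tflite bindings):

# confirm which wheel actually got installed
pip3 list | grep -i deepspeech

# importing pulls in the freshly built libdeepspeech.so
python3 -c "from deepspeech import Model; print('bindings import ok')"

# the command line client should work too
deepspeech --version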
So, on Ubuntu 21 (the one I have installed), you can actually install a bazel binary from the signed repositories. Maybe earlier Ubuntus too, I dunno.

# installs version 3.5.1 of the standalone bazel build tool, for me
$ sudo apt-get install bazel-bootstrap

Still, it seems incredibly rude of tensorflow to use only such a rare build tool for their project with no reasonable explanation. Bazel doesn't even support 32 bit systems. It is not hard to put all the source files into cmake or autotools, for something so widely used.

git clone https://github.com/mozilla/DeepSpeech.git
cd DeepSpeech
git checkout v0.9
git log
commit ab714a8d1b8aecc92c0a96f85e958e1a617a2f4b (HEAD -> r0.9, origin/r0.9)
Merge: 13a490d1 85d057e1
Author: lissyx <1645737+lissyx@users.noreply.github.com>
Date:   Wed Apr 7 18:35:56 2021 +0200

git submodule update --init --recursive
... long time passes while the entire tensorflow history is downloaded ...

I think there are flags to do it faster if needed. Could also check .gitmodules and point the submodule at another tensorflow clone, to skip the download.
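For next time, a couple of things that might make the submodule step faster; untested, with a placeholder path, and check .gitmodules for the submodule's actual name:

# shallow-fetch the submodule instead of pulling its whole history
# (can fail if the pinned commit isn't near a branch tip)
git submodule update --init --depth 1 tensorflow/

# or reuse a tensorflow checkout you already have on disk
git config submodule.tensorflow.url /path/to/existing/tensorflow/clone
git submodule update --init tensorflow/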
On 7/5/21, Karl Semich <0xloem@gmail.com> wrote:
> https://github.com/mozilla/DeepSpeech
>
> DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
If this thing is already well trained on English, needing not much more than reading maybe ten pages or so worth of a training book that it already knows and that comes with the docs, to get it 90+% tuned to a specific voice... it could be very useful for people needing handsfree dictation, thought capture, transcription, voice message conversion, etc.

What other tools are out there in freeware to do speech to text?

Did Google ever release its voicemail convertor? People have said that service sucks due to nature of needing to be trained for all callers, not to just one.

Doubt any of them would be able to pull out whispers without it being trained on whispers for which there isn't much public corpus, and further tuned by a specific whisperer which is even rarer.
On Tue, Jul 6, 2021, 5:22 AM grarpamp <grarpamp@gmail.com> wrote:
> On 7/5/21, Karl Semich <0xloem@gmail.com> wrote:
>> https://github.com/mozilla/DeepSpeech
>>
>> DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
>
> If this thing is already well trained on English, needing not much more than reading maybe ten pages or so worth of a training book that it already knows and that comes with the docs, to get it 90+% tuned to a specific voice... it could be very useful for people needing handsfree dictation, thought capture, transcription, voice message conversion, etc
It's a little more complicated than that but is still quite useful and in use, for all the things you say.
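Concretely, "tuning to a specific voice" means continuing training from the released checkpoint on CSVs listing your own clips and transcripts, then exporting a new model. A rough sketch from memory of the v0.9 training docs, untested, with hypothetical file names; double-check the flags against the docs:

# inside a DeepSpeech git checkout with the training requirements installed (pip3 install -e .)
# the checkpoint dir is the unpacked deepspeech-0.9.3-checkpoint.tar.gz from the releases page
python3 DeepSpeech.py \
  --n_hidden 2048 \
  --checkpoint_dir path/to/deepspeech-0.9.3-checkpoint/ \
  --train_files my-voice-train.csv \
  --dev_files my-voice-dev.csv \
  --test_files my-voice-test.csv \
  --epochs 3 \
  --learning_rate 0.0001 \
  --export_dir my-voice-model/
# each CSV has wav_filename,wav_filesize,transcript rows pointing at 16 kHz mono WAV clips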
> What other tools are out there in freeware to do
> speech to text?
>
> Did Google ever release its voicemail convertor?
This stuff is all over. All the tools I've found are based on reuse of machine learning models using associated frameworks.

> People have said that service sucks due to nature
> of needing to be trained for all callers, not to just one.
>
> Doubt any of them would be able to pull out whispers without it being trained on whispers for which there isn't much public corpus, and further tuned by a specific whisperer which is even rarer.
I'm surprised this information is new to you. I haven't successfully installed and used one of these myself, and I could really use a dictation service as my ability to direct my hands and such and remember auditory information deteriorates. But you can try them on web portals and/or notebook services like Google Colab.
This is what I bumped into over the last 2 days. I think I've also seen speech-to-text models in model zoos. I have trouble navigating the internet around this.

https://github.com/PaddlePaddle/DeepSpeech
https://github.com/kaldi-asr/kaldi
https://cmusphinx.github.io/