Here is the Python documentation for the soundfile.read function. We'd be reading in chunks that are appropriately sized for the transformer model in use, and resampling them to match the sample rate the model expects.

Help on function read in module soundfile:

read(file, frames=-1, start=0, stop=None, dtype='float64', always_2d=False, fill_value=None, out=None, samplerate=None, channels=None, format=None, subtype=None, endian=None, closefd=True)
    Provide audio data from a sound file as NumPy array.

    By default, the whole file is read from the beginning, but the
    position to start reading can be specified with `start` and the
    number of frames to read can be specified with `frames`.
    Alternatively, a range can be specified with `start` and `stop`.

    If there is less data left in the file than requested, the rest of
    the frames are filled with `fill_value`.
    If no `fill_value` is specified, a smaller array is returned.

    Parameters
    ----------
    file : str or int or file-like object
        The file to read from. See :class:`SoundFile` for details.
    frames : int, optional
        The number of frames to read. If `frames` is negative, the whole
        rest of the file is read. Not allowed if `stop` is given.
    start : int, optional
        Where to start reading. A negative value counts from the end.
    stop : int, optional
        The index after the last frame to be read. A negative value
        counts from the end. Not allowed if `frames` is given.
    dtype : {'float64', 'float32', 'int32', 'int16'}, optional
        Data type of the returned array, by default ``'float64'``.
        Floating point audio data is typically in the range from
        ``-1.0`` to ``1.0``. Integer data is in the range from
        ``-2**15`` to ``2**15-1`` for ``'int16'`` and from ``-2**31`` to
        ``2**31-1`` for ``'int32'``.

        .. note:: Reading int values from a float file will *not*
            scale the data to [-1.0, 1.0). If the file contains
            ``np.array([42.6], dtype='float32')``, you will read
            ``np.array([43], dtype='int32')`` for ``dtype='int32'``.

    Returns
    -------
    audiodata : numpy.ndarray or type(out)
        A two-dimensional (frames x channels) NumPy array is returned.
        If the sound file has only one channel, a one-dimensional array
        is returned. Use ``always_2d=True`` to return a two-dimensional
        array anyway.

        If `out` was specified, it is returned. If `out` has more
        frames than available in the file (or if `frames` is smaller
        than the length of `out`) and no `fill_value` is given, then
        only a part of `out` is overwritten and a view containing all
        valid frames is returned.
    samplerate : int
        The sample rate of the audio file.

    Other Parameters
    ----------------
    always_2d : bool, optional
        By default, reading a mono sound file will return a
        one-dimensional array. With ``always_2d=True``, audio data is
        always returned as a two-dimensional array, even if the audio
        file has only one channel.
    fill_value : float, optional
        If more frames are requested than available in the file, the
        rest of the output is filled with `fill_value`. If
        `fill_value` is not specified, a smaller array is returned.
    out : numpy.ndarray or subclass, optional
        If `out` is specified, the data is written into the given array
        instead of creating a new array. In this case, the arguments
        `dtype` and `always_2d` are silently ignored! If `frames` is
        not given, it is obtained from the length of `out`.
    samplerate, channels, format, subtype, endian, closefd
        See :class:`SoundFile`.