
Speaker Diarization

Documentation for speaker-related tasks can be found at:

Features of NeMo Speaker Diarization

  • Provides pretrained speaker embedding extractor models and VAD models.
  • Does not need to be tuned on a dev set, while generally showing better performance than the AHC+PLDA method.
  • Estimates the number of speakers in the given session.
  • Provides an example script for ASR transcription with speaker labels.

Supported Pretrained Speaker Embedding Extractor models

Supported Pretrained VAD models

Supported ASR models

QuartzNet, Citrinet, and Conformer-CTC models are supported. Recommended models on NGC:

Performance

Clustering Diarizer

Diarization Error Rate (DER) table of the titanet_large.nemo model on well-known evaluation datasets.

Evaluation Condition                  AMI (Lapel)   AMI (MixHeadset)   CH109        NIST SRE 2000
Domain Configuration                  Meeting       Meeting            Telephonic   Telephonic
Oracle VAD, Known # of Speakers       1.28          1.07               0.56         5.62
Oracle VAD, Unknown # of Speakers     1.28          1.4                0.88         4.33
  • All models were tested using the domain-specific .yaml files, which can be found in the conf/inference/ folder.
  • The above results are based on the oracle Voice Activity Detection (VAD) result.
  • These results were obtained with the titanet_large.nemo model.

Neural Diarizer

Diarization Error Rate (DER) table of the Multi-scale Diarization Decoder (MSDD) model diar_msdd_telephonic.nemo on telephonic speech datasets.

CH109                        Forgiving      Fair           Full
(collar, ignore_overlap)     (0.25, True)   (0.25, False)  (0.0, False)
False Alarm                  -              0.62%          1.80%
Miss                         -              2.47%          5.96%
Confusion                    -              0.43%          2.10%
DER                          0.58%          3.52%          9.86%

CALLHOME                     Forgiving      Fair           Full
(collar, ignore_overlap)     (0.25, True)   (0.25, False)  (0.0, False)
False Alarm                  -              1.05%          2.24%
Miss                         -              7.62%          11.09%
Confusion                    -              4.06%          6.03%
DER                          4.15%          12.73%         19.37%
  • Evaluation setting: Oracle VAD with an unknown number of speakers (max. 8)
  • Clustering parameter: max_rp_threshold=0.15
  • All models were tested using the domain-specific .yaml files, which can be found in the conf/inference/ folder.
  • The above results are based on the oracle Voice Activity Detection (VAD) result.

Run Speaker Diarization on Your Audio Files

Example script for the clustering diarizer with system VAD:

  python clustering_diarizer/offline_diar_infer.py \
    diarizer.manifest_filepath=<path to manifest file> \
    diarizer.out_dir='demo_output' \
    diarizer.speaker_embeddings.parameters.save_embeddings=False \
    diarizer.vad.model_path=<pretrained model name or path to .nemo> \
    diarizer.speaker_embeddings.model_path=<pretrained speaker embedding model name or path to .nemo> 

Example script for the neural diarizer with system VAD:

  python neural_diarizer/multiscale_diar_decoder_infer.py \
    diarizer.manifest_filepath=<path to manifest file> \
    diarizer.out_dir='demo_output' \
    diarizer.speaker_embeddings.parameters.save_embeddings=False \
    diarizer.vad.model_path=<pretrained model name or path to .nemo> \
    diarizer.speaker_embeddings.model_path=<pretrained speaker embedding model name or path to .nemo> \
    diarizer.msdd_model.model_path=<pretrained MSDD model name or path to .nemo>

If you have oracle VAD files and ground-truth RTTM files for evaluation, provide the RTTM files in the input manifest file and enable oracle_vad as shown below.

...
    diarizer.oracle_vad=True \
...
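
For reference, a minimal sketch of a full oracle-VAD run with the clustering diarizer is shown below. It assumes each manifest entry contains an rttm_filepath and uses demo_output as a placeholder output directory; with oracle_vad enabled, the VAD labels come from the RTTM files, so the system VAD model path can be omitted.

  python clustering_diarizer/offline_diar_infer.py \
    diarizer.manifest_filepath=<path to manifest file with rttm_filepath entries> \
    diarizer.out_dir='demo_output' \
    diarizer.oracle_vad=True \
    diarizer.speaker_embeddings.model_path=<pretrained speaker embedding model name or path to .nemo>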

Arguments

To run speaker diarization on your audio recordings, you need to prepare the following file.

  • diarizer.manifest_filepath: Path to manifest file

Example: manifest.json

{"audio_filepath": "/path/to/audio_file", "offset": 0, "duration": null, label: "infer", "text": "-", "num_speakers": null, "rttm_filepath": "/path/to/rttm/file", "uem_filepath"="/path/to/uem/filepath"}

Mandatory fields are audio_filepath, offset, duration, label ("infer"), and text (the ground-truth transcript or "-"); the rest are optional keys that can be passed depending on the type of evaluation.
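
As a minimal sketch, a one-entry manifest for inference-only diarization can be written from the shell as below; my_audio1.wav and manifest.json are placeholder names, and the optional reference fields are set to null on the assumption that no RTTM or UEM files are available.

# write a single-entry manifest for a hypothetical recording my_audio1.wav
cat <<EOF > manifest.json
{"audio_filepath": "/path/to/my_audio1.wav", "offset": 0, "duration": null, "label": "infer", "text": "-", "num_speakers": null, "rttm_filepath": null, "uem_filepath": null}
EOF

Each line of the manifest corresponds to one recording, so add one JSON line per audio file.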

Some of the important options in the config file:

  • diarizer.vad.model_path: voice activity detection model name or path to the model

Specify the name of a VAD model, and the script will download it from NGC. Currently, we have 'vad_multilingual_marblenet', 'vad_marblenet' and 'vad_telephony_marblenet' as options for VAD models.

diarizer.vad.model_path='vad_multilingual_marblenet'

Alternatively, you can download vad_multilingual_marblenet, vad_marblenet, or vad_telephony_marblenet yourself and specify the full path to the .nemo file as below.

diarizer.vad.model_path='path/to/vad_multilingual_marblenet.nemo'

  • diarizer.speaker_embeddings.model_path: speaker embedding model name

Specify the name of a speaker embedding model, and the script will download it from NGC. Currently, we support 'titanet_large', 'ecapa_tdnn' and 'speakerverification_speakernet'.

diarizer.speaker_embeddings.model_path='titanet_large'

You can also download *.nemo files and specify the full path to the speaker embedding model file (*.nemo).

diarizer.speaker_embeddings.model_path='path/to/titanet_large.nemo'

  • diarizer.speaker_embeddings.parameters.multiscale_weights: multiscale diarization

The multiscale diarization system employs multiple scales at the same time to obtain a finer temporal resolution. To use the multiscale feature, at least two scales and their scale weights should be provided. Scales should be provided in descending order, from the longest scale to the base scale (the shortest). If multiple scales are provided, multiscale_weights must be provided as a list. The following example shows how the multiscale parameters are specified.
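
For example, a hedged sketch of passing a three-scale setting through the full Hydra paths of these parameters is shown below; the values mirror the multiscale example script later in this section and are not the only recommended setting.

  python offline_diar_infer.py \
     ... <other parameters> ...
     diarizer.speaker_embeddings.parameters.window_length_in_sec=[1.5,1.0,0.5] \
     diarizer.speaker_embeddings.parameters.shift_length_in_sec=[0.75,0.5,0.25] \
     diarizer.speaker_embeddings.parameters.multiscale_weights=[0.33,0.33,0.33]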

  • diarizer.msdd_model.model_path: neural diarizer (multiscale diarization decoder) model name

If you want to use a neural diarizer model (e.g., an MSDD model), specify its name, and the script will download the model from NGC. Currently, we support 'diar_msdd_telephonic'.

Note that you should not specify a scale setting that does not match the MSDD model you are using. For example, the diar_msdd_telephonic model is based on 5 scales, as specified in the model configs.

diarizer.msdd_model.model_path='diar_msdd_telephonic'

You can also download the diar_msdd_telephonic model and specify the full path to the MSDD model file (*.nemo).

diarizer.msdd_model.model_path='path/to/diar_msdd_telephonic.nemo'
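
To make the scale-matching rule above concrete, a hedged sketch of a telephonic MSDD run with a 5-scale setting is shown below. The window, shift, and weight values are illustrative assumptions and should be taken from conf/inference/diar_infer_telephonic.yaml rather than from this snippet.

python neural_diarizer/multiscale_diar_decoder_infer.py \
    ... <other parameters> ...
    diarizer.msdd_model.model_path='diar_msdd_telephonic' \
    diarizer.speaker_embeddings.parameters.window_length_in_sec=[1.5,1.25,1.0,0.75,0.5] \
    diarizer.speaker_embeddings.parameters.shift_length_in_sec=[0.75,0.625,0.5,0.375,0.25] \
    diarizer.speaker_embeddings.parameters.multiscale_weights=[1,1,1,1,1]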

Example script: single-scale and multiscale

Single-scale setting:

  python offline_diar_infer.py \
     ... <other parameters> ...
     parameters.window_length_in_sec=1.5 \
     parameters.shift_length_in_sec=0.75 \
     parameters.multiscale_weights=null

Multiscale setting (base scale: window length 0.5 s and shift length 0.25 s):

  python offline_diar_infer.py \
     ... <other parameters> ...
     parameters.window_length_in_sec=[1.5,1.0,0.5] \
     parameters.shift_length_in_sec=[0.75,0.5,0.25] \
     parameters.multiscale_weights=[0.33,0.33,0.33]

Run Speech Recognition with Speaker Diarization

Using the script offline_diar_with_asr_infer.py, you can transcribe your audio recording with speaker labels as shown below:

[00:03.34 - 00:04.46] speaker_0: back from the gym oh good how's it going 
[00:04.46 - 00:09.96] speaker_1: oh pretty well it was really crowded today yeah i kind of assumed everyone would be at the shore uhhuh
[00:12.10 - 00:13.97] speaker_0: well it's the middle of the week or whatever so

Currently, NeMo offline diarization inference supports QuartzNet English models and Conformer-CTC ASR models (e.g., QuartzNet15x5Base-En, stt_en_conformer_ctc_large).

Example script

python offline_diar_with_asr_infer.py \
    diarizer.manifest_filepath=<path to manifest file> \
    diarizer.out_dir='demo_asr_output' \
    diarizer.speaker_embeddings.model_path=<pretrained model name or path to .nemo> \
    diarizer.asr.model_path=<pretrained model name or path to .nemo> \
    diarizer.speaker_embeddings.parameters.save_embeddings=False \
    diarizer.asr.parameters.asr_based_vad=True

If you have reference RTTM files or oracle number-of-speakers information, you can provide the RTTM file paths and the number of speakers in the manifest file and pass diarizer.clustering.parameters.oracle_num_speakers=True as shown in the following example.

python offline_diar_with_asr_infer.py \
    diarizer.manifest_filepath=<path to manifest file> \
    diarizer.out_dir='demo_asr_output' \
    diarizer.speaker_embeddings.model_path=<pretrained model name or path to .nemo> \
    diarizer.asr.model_path=<pretrained model name or path to .nemo> \
    diarizer.speaker_embeddings.parameters.save_embeddings=False \
    diarizer.asr.parameters.asr_based_vad=True \
    diarizer.clustering.parameters.oracle_num_speakers=True

Output folders

The above script will create a folder named ./demo_asr_output/, where you can check the results as shown below.

./demo_asr_output
├── pred_rttms
│   ├── my_audio1.json
│   ├── my_audio1.txt
│   ├── my_audio1.rttm
│   └── my_audio1_gecko.json
└── speaker_outputs
    ├── oracle_vad_manifest.json
    ├── subsegments_scale2_cluster.label
    ├── subsegments_scale0.json
    ├── subsegments_scale1.json
    └── subsegments_scale2.json
...

The my_audio1.json file contains word-by-word JSON output with speaker labels and timestamps. We also provide a JSON output file for the Gecko tool, where you can visualize the diarization result along with the ASR output.

Example: ./demo_asr_output/pred_rttms/my_audio1.json

{
    "status": "Success",
    "session_id": "my_audio1",
    "transcription": "back from the gym oh good ...",
    "speaker_count": 2,
    "words": [
        {
            "word": "back",
            "start_time": 0.44,
            "end_time": 0.56,
            "speaker_label": "speaker_0"
        },
...
        {
            "word": "oh",
            "start_time": 1.74,
            "end_time": 1.88,
            "speaker_label": "speaker_1"
        },
        {
            "word": "good",
            "start_time": 2.08,
            "end_time": 3.28,
            "speaker_label": "speaker_1"
        },

The *.txt files in the pred_rttms folder contain transcriptions with speaker labels and the corresponding times.

Example: ./demo_asr_output/pred_rttms/my_audio1.txt

[00:03.34 - 00:04.46] speaker_0: back from the gym oh good how's it going
[00:04.46 - 00:09.96] speaker_1: pretty well it was really crowded today yeah i kind of assumed everyone would be at the shore uhhuh
[00:12.10 - 00:13.97] speaker_0: well it's the middle of the week or whatever so
[00:13.97 - 00:15.78] speaker_1: but it's the fourth of july mm
[00:16.90 - 00:21.80] speaker_0: so yeah people still work tomorrow do you have to work tomorrow did you drive off yesterday

The speaker_outputs folder contains three kinds of files:

  • oracle_vad_manifest.json file contains oracle VAD labels that are extracted from RTTM files.
  • subsegments_scale<scale_index>.json is a manifest file for subsegments, which includes segment-by-segment start and end times with the original wav file path. In multi-scale mode, this file is generated for each <scale_index>.
  • subsegments_scale<scale_index>_cluster.label file contains the estimated cluster labels for each segment. This file is only generated for the base scale index in multi-scale diarization mode.

Optional Features for Speech Recognition with Speaker Diarization

Beam Search Decoder

A beam-search decoder can be applied to CTC-based ASR models. To use this feature, pyctcdecode should be installed. pyctcdecode supports word-timestamp generation and can be applied to speaker diarization. pyctcdecode also requires KenLM, which is recommended to be installed via pip. Install pyctcdecode in your environment with the following commands:

pip install pyctcdecode
pip install https://github.com/kpu/kenlm/archive/master.zip

You should provide a trained KenLM language model to use pyctcdecode. A model in binary or .arpa format can be provided to the Hydra configuration as below.

  python offline_diar_with_asr_infer.py \
    ... <other parameters> ...
    diarizer.asr.ctc_decoder_parameters.pretrained_language_model="/path/to/kenlm_language_model.binary"

You can download publicly available language models (.arpa files) from the Kaldi Tedlium Language Models page. Download the 4-gram Big ARPA model and provide the model path.

The following CTC decoder parameters can be modified to optimize performance:

  • diarizer.asr.ctc_decoder_parameters.beam_width (default: 32)
  • diarizer.asr.ctc_decoder_parameters.alpha (default: 0.5)
  • diarizer.asr.ctc_decoder_parameters.beta (default: 2.5)
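
For instance, a sketch combining the KenLM path with these decoder parameters could look like the following; the numeric values are simply the defaults listed above, not tuned settings.

  python offline_diar_with_asr_infer.py \
    ... <other parameters> ...
    diarizer.asr.ctc_decoder_parameters.pretrained_language_model="/path/to/4gram_big.arpa" \
    diarizer.asr.ctc_decoder_parameters.beam_width=32 \
    diarizer.asr.ctc_decoder_parameters.alpha=0.5 \
    diarizer.asr.ctc_decoder_parameters.beta=2.5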

Realign Words with a Language Model (Experimental)

The diarization result with the ASR transcript can be enhanced by applying a language model. To use this feature, the Python package arpa should be installed.

pip install arpa

The diarizer.asr.realigning_lm_parameters.logprob_diff_threshold parameter can be modified to optimize diarization performance (the default value is 1.2). The lower the threshold, the more changes are expected in the output transcript.

The arpa package also uses KenLM language models, as pyctcdecode does. You can download the publicly available 4-gram Big ARPA model and provide the model path to the Hydra configuration as follows.

python offline_diar_with_asr_infer.py \
    ... <other parameters> ...
    diarizer.asr.realigning_lm_parameters.logprob_diff_threshold=1.2 \
    diarizer.asr.realigning_lm_parameters.arpa_language_model="/path/to/4gram_big.arpa"