{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n",
    "\n",
    "Instructions for setting up Colab are as follows:\n",
    "1. Open a new Python 3 notebook.\n",
    "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n",
    "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n",
    "4. Run this cell to set up dependencies.\n",
    "5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect\n",
    "\"\"\"\n",
    "\n",
    "NEMO_DIR_PATH = \"NeMo\"\n",
    "BRANCH = 'r1.17.0'\n",
    "\n",
    "! git clone https://github.com/NVIDIA/NeMo\n",
    "%cd NeMo\n",
    "! python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n",
    "%cd .."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install dependencies\n",
    "!apt-get install sox libsndfile1 ffmpeg\n",
    "!pip install wget\n",
    "!pip install text-unidecode\n",
    "!pip install \"matplotlib>=3.3.2\"\n",
    "!pip install seaborn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Multispeaker Simulator\n",
    "\n",
    "This tutorial shows how to use the speech data simulator to generate synthetic multispeaker audio sessions that can be used to train or evaluate models for multispeaker ASR or speaker diarization. This tool aims to address the lack of labelled multispeaker training data and to help models deal with overlapping speech."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step 1: Download Required Datasets\n",
    "\n",
    "The LibriSpeech dataset and corresponding word alignments are required for generating synthetic multispeaker audio sessions. For simplicity, only the dev-clean dataset is used for generating synthetic sessions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# download scripts if not already there \n",
    "if not os.path.exists('NeMo/scripts'):\n",
    "  print(\"Downloading necessary scripts\")\n",
    "  !mkdir -p NeMo/scripts/dataset_processing\n",
    "  !mkdir -p NeMo/scripts/speaker_tasks\n",
    "  !wget -P NeMo/scripts/dataset_processing/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/dataset_processing/get_librispeech_data.py\n",
    "  !wget -P NeMo/scripts/speaker_tasks/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/speaker_tasks/create_alignment_manifest.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p LibriSpeech\n",
    "!python {NEMO_DIR_PATH}/scripts/dataset_processing/get_librispeech_data.py \\\n",
    "  --data_root LibriSpeech \\\n",
    "  --data_sets dev_clean"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The LibriSpeech forced word alignments are from [this repository](https://github.com/CorentinJ/librispeech-alignments). You can access to the whole LibriSpeech splits at this google drive [link](https://drive.google.com/file/d/1WYfgr31T-PPwMcxuAq09XZfHQO5Mw8fE/view?usp=sharing). We will download the dev-clean part for demo purpose."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget -nc https://dldata-public.s3.us-east-2.amazonaws.com/LibriSpeech_Alignments.tar.gz\n",
    "!tar -xzf LibriSpeech_Alignments.tar.gz\n",
    "!rm -f LibriSpeech_Alignments.tar.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step 2: Produce Manifest File with Forced Alignments\n",
    "\n",
    "The LibriSpeech manifest file and LibriSpeech forced alignments will now be merged into one manifest file for ease of use when generating synthetic data. \n",
    "\n",
    "Here, the input LibriSpeech alignments are first converted to CTM files, and the CTM files are then combined with the base manifest in order to produce the manifest file with word alignments. When using another dataset, the --use_ctm argument can be used to generate the manifest file using alignments in CTM files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python NeMo/scripts/speaker_tasks/create_alignment_manifest.py \\\n",
    "  --input_manifest_filepath LibriSpeech/dev_clean.json \\\n",
    "  --base_alignment_path LibriSpeech_Alignments \\\n",
    "  --output_manifest_filepath ./dev-clean-align.json \\\n",
    "  --ctm_output_directory ./ctm_out \\\n",
    "  --libri_dataset_split dev-clean"
   ]
  },
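  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check, the cell below prints the first entry of the merged manifest so the word alignment fields can be seen. This inspection cell is an addition for illustration; the field names shown are simply whatever create_alignment_manifest.py wrote above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect the first entry of the merged alignment manifest (one JSON object per line)\n",
    "import json\n",
    "\n",
    "with open('dev-clean-align.json', 'r') as f:\n",
    "    entry = json.loads(f.readline())\n",
    "\n",
    "# print each field, truncating long values such as the word and alignment lists\n",
    "for key, value in entry.items():\n",
    "    print(f\"{key}: {str(value)[:120]}\")"
   ]
  },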
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step 3: Download Background Noise Data\n",
    "\n",
    "The background noise is constructed from [here](https://www.openslr.org/resources/28/rirs_noises.zip) (although it can be constructed from other background noise datasets instead)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget -nc https://www.openslr.org/resources/28/rirs_noises.zip\n",
    "!unzip -o rirs_noises.zip\n",
    "!rm -f rirs_noises.zip"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Simulator Overview\n",
    "\n",
    "The simulator creates a speech session using utterances from a desired number of speakers. The simulator first selects the LibriSpeech speaker IDs that will be used for the current session, and sets speaker dominance values for each speaker to determine how often each speaker will talk in the session. The session is then constructed by iterating through the following steps:\n",
    "\n",
    "* The next speaker is selected (which could be the same speaker again with some probability, and which accounts for the speaker dominance values).\n",
    "* The sentence length is determined using a probability distribution, and an utterance of the desired length is then constructed by concatenating together (or truncating) LibriSpeech sentences corresponding to the desired speaker. Individual word alignments are used to truncate the last LibriSpeech sentence such that the entire utterance has the desired length.\n",
    "* Next, either the current utterance is overlapped with a previous utterance or silence is introduced before inserting the current utterance. \n",
    "\n",
    "The simulator includes a multi-microphone far-field mode that incorporates synthetic room impulse response generation in order to simulate multi-microphone multispeaker sessions. When using RIR generation, the RIR is computed once per batch of sessions, and then each constructed utterance is convolved with the RIR in order to get the sound recorded by each microphone before adding the utterance to the audio session. In this tutorial, only near field sessions will be generated.\n",
    "\n",
    "The simulator also has a speaker enforcement mode which ensures that the correct number of speakers appear in each session, since it is possible that fewer than the desired number may be present since speaker turns are probabilistic. In speaker enforcement mode, the length of the session or speaker probabilities may be adjusted to ensure all speakers are included before the session finishes."
   ]
  },
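  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the turn-taking step concrete, the cell below is a minimal, self-contained sketch of dominance-weighted speaker selection. It is an illustration only, not the simulator's actual implementation: the variable names and the 0.875 turn probability are assumptions chosen for demonstration (the simulator exposes related knobs, such as turn_prob, in the config loaded in the next section)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Minimal sketch of dominance-weighted turn taking (illustration only,\n",
    "# not the simulator's actual implementation)\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(42)\n",
    "\n",
    "num_speakers = 4\n",
    "# per-speaker dominance values, normalized into a probability distribution\n",
    "dominance = rng.random(num_speakers)\n",
    "dominance = dominance / dominance.sum()\n",
    "\n",
    "turn_prob = 0.875  # assumed probability of switching to a new speaker\n",
    "current = rng.choice(num_speakers, p=dominance)\n",
    "turns = [current]\n",
    "\n",
    "for _ in range(20):\n",
    "    if rng.random() < turn_prob:\n",
    "        # pick the next speaker by dominance, excluding the current speaker\n",
    "        weights = dominance.copy()\n",
    "        weights[current] = 0.0\n",
    "        weights = weights / weights.sum()\n",
    "        current = rng.choice(num_speakers, p=weights)\n",
    "    turns.append(current)\n",
    "\n",
    "print('dominance:', np.round(dominance, 3))\n",
    "print('speaker turns:', turns)"
   ]
  },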
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from omegaconf import OmegaConf\n",
    "ROOT = os.getcwd()\n",
    "conf_dir = os.path.join(ROOT,'conf')\n",
    "!mkdir -p $conf_dir\n",
    "CONFIG_PATH = os.path.join(conf_dir, 'data_simulator.yaml')\n",
    "if not os.path.exists(CONFIG_PATH):\n",
    "  !wget -P $conf_dir https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/tools/speech_data_simulator/conf/data_simulator.yaml\n",
    "\n",
    "config = OmegaConf.load(CONFIG_PATH)\n",
    "print(OmegaConf.to_yaml(config))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step 4: Create Background Noise Manifest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!find RIRS_NOISES/real_rirs_isotropic_noises/*.wav > bg_noise.list \n",
    "\n",
    "# this function can also be run using the pathfiles_to_diarize_manifest.py script\n",
    "from nemo.collections.asr.parts.utils.manifest_utils import create_manifest\n",
    "create_manifest('bg_noise.list', 'bg_noise.json')"
   ]
  },
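  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cell below prints the first entry of the generated background noise manifest as a quick sanity check (the exact set of fields is whatever create_manifest wrote)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sanity check: print the first entry of the background noise manifest\n",
    "import json\n",
    "\n",
    "with open('bg_noise.json', 'r') as f:\n",
    "    print(json.loads(f.readline()))"
   ]
  },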
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step 5: Generate Simulated Audio Session\n",
    "\n",
    "A single 4-speaker session of 60 seconds is generated as an example. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nemo.collections.asr.data.data_simulation import MultiSpeakerSimulator\n",
    "\n",
    "ROOT = os.getcwd()\n",
    "data_dir = os.path.join(ROOT,'simulated_data')\n",
    "config.data_simulator.random_seed=42\n",
    "config.data_simulator.manifest_filepath=\"./dev-clean-align.json\"\n",
    "config.data_simulator.outputs.output_dir=data_dir\n",
    "config.data_simulator.session_config.num_sessions=1\n",
    "config.data_simulator.session_config.session_length=30\n",
    "config.data_simulator.background_noise.add_bg=True\n",
    "config.data_simulator.background_noise.background_manifest=\"./bg_noise.json\"\n",
    "config.data_simulator.session_params.mean_silence=0.2\n",
    "config.data_simulator.session_params.mean_silence_var=0.02\n",
    "config.data_simulator.session_params.mean_overlap=0.15\n",
    "\n",
    "lg = MultiSpeakerSimulator(cfg=config)\n",
    "lg.generate_sessions()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step 5: Listen to and Visualize Session\n",
    "\n",
    "Listen to the audio and visualize the corresponding speaker timestamps (recorded in a RTTM file for each session)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import wget\n",
    "import IPython\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import librosa\n",
    "from nemo.collections.asr.parts.utils.speaker_utils import rttm_to_labels, labels_to_pyannote_object\n",
    "\n",
    "ROOT = os.getcwd()\n",
    "data_dir = os.path.join(ROOT,'simulated_data')\n",
    "audio = os.path.join(data_dir,'multispeaker_session_0.wav')\n",
    "rttm = os.path.join(data_dir,'multispeaker_session_0.rttm')\n",
    "\n",
    "sr = 16000\n",
    "signal, sr = librosa.load(audio,sr=sr) \n",
    "\n",
    "fig,ax = plt.subplots(1,1)\n",
    "fig.set_figwidth(20)\n",
    "fig.set_figheight(2)\n",
    "plt.plot(np.arange(len(signal)),signal,'gray')\n",
    "fig.suptitle('Synthetic Audio Session', fontsize=16)\n",
    "plt.xlabel('Time (s)', fontsize=18)\n",
    "ax.margins(x=0)\n",
    "plt.ylabel('Signal Strength', fontsize=16);\n",
    "a,_ = plt.xticks();plt.xticks(a,a/sr);\n",
    "\n",
    "IPython.display.Audio(audio)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The visualization is useful for seeing both the distribution of utterance lengths, the differing speaker dominance values, and the amount of overlap in the session."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#display speaker labels for reference\n",
    "labels = rttm_to_labels(rttm)\n",
    "reference = labels_to_pyannote_object(labels)\n",
    "reference"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step 6: Get Simulated Data Statistics "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if not os.path.exists(\"multispeaker_data_analysis.py\"):\n",
    "  !wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/speaker_tasks/multispeaker_data_analysis.py\n",
    "\n",
    "from multispeaker_data_analysis import run_multispeaker_data_analysis\n",
    "\n",
    "session_dur = 30\n",
    "silence_mean = 0.2\n",
    "silence_var = 0.1\n",
    "overlap_mean = 0.15\n",
    "run_multispeaker_data_analysis(data_dir, session_dur=session_dur, silence_var=silence_var, overlap_mean=overlap_mean, precise=True);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Files Produced\n",
    "\n",
    "The following files are produced by the simulator:\n",
    "\n",
    "* wav files (one per audio session) - the output audio sessions\n",
    "* rttm files (one per audio session) - the speaker timestamps for the corresponding audio session (used for diarization training)\n",
    "* json files (one per audio session) - the output manifest file for the corresponding audio session (containing text transcriptions, utterance durations, full paths to audio files, words, and word alignments)\n",
    "* ctm files (one per audio session) - contains word-by-word alignments, speaker ID, and word\n",
    "* txt files (one per audio session) - contains the full text transcription for a given session\n",
    "* list files (one per file type per batch of sessions) - a list of generated files of the given type (wav, rttm, json, ctm, or txt), used primarily for manifest creation\n"
   ]
  },
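  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see the RTTM and CTM formats concretely, the cell below prints the first few lines of each file for the generated session. This assumes the CTM file shares the multispeaker_session_0 base name used by the wav and rttm files above; adjust the path if your output naming differs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Print the first few lines of the RTTM and CTM files for the generated session\n",
    "for ext in ('rttm', 'ctm'):\n",
    "    path = os.path.join(data_dir, f'multispeaker_session_0.{ext}')\n",
    "    print(f'--- {path} ---')\n",
    "    with open(path) as f:\n",
    "        for line in f.readlines()[:5]:\n",
    "            print(line.rstrip())\n",
    "    print()"
   ]
  },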
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls simulated_data"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "metadata": {
     "collapsed": false
    },
    "source": []
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}