{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n",
"\n",
"Instructions for setting up Colab are as follows:\n",
"1. Open a new Python 3 notebook.\n",
"2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n",
"3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n",
"4. Run this cell to set up dependencies.\n",
"5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect\n",
"\"\"\"\n",
"\n",
"NEMO_DIR_PATH = \"NeMo\"\n",
"BRANCH = 'r1.17.0'\n",
"\n",
"! git clone https://github.com/NVIDIA/NeMo\n",
"%cd NeMo\n",
"! python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n",
"%cd .."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install dependencies\n",
"!apt-get install sox libsndfile1 ffmpeg\n",
"!pip install wget\n",
"!pip install text-unidecode\n",
"!pip install \"matplotlib>=3.3.2\"\n",
"!pip install seaborn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Multispeaker Simulator\n",
"\n",
"This tutorial shows how to use the speech data simulator to generate synthetic multispeaker audio sessions that can be used to train or evaluate models for multispeaker ASR or speaker diarization. This tool aims to address the lack of labelled multispeaker training data and to help models deal with overlapping speech."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step 1: Download Required Datasets\n",
"\n",
"The LibriSpeech dataset and corresponding word alignments are required for generating synthetic multispeaker audio sessions. For simplicity, only the dev-clean dataset is used for generating synthetic sessions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# download scripts if not already there \n",
"if not os.path.exists('NeMo/scripts'):\n",
" print(\"Downloading necessary scripts\")\n",
" !mkdir -p NeMo/scripts/dataset_processing\n",
" !mkdir -p NeMo/scripts/speaker_tasks\n",
" !wget -P NeMo/scripts/dataset_processing/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/dataset_processing/get_librispeech_data.py\n",
" !wget -P NeMo/scripts/speaker_tasks/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/speaker_tasks/create_alignment_manifest.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!mkdir -p LibriSpeech\n",
"!python {NEMO_DIR_PATH}/scripts/dataset_processing/get_librispeech_data.py \\\n",
" --data_root LibriSpeech \\\n",
" --data_sets dev_clean"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The LibriSpeech forced word alignments are from [this repository](https://github.com/CorentinJ/librispeech-alignments). You can access to the whole LibriSpeech splits at this google drive [link](https://drive.google.com/file/d/1WYfgr31T-PPwMcxuAq09XZfHQO5Mw8fE/view?usp=sharing). We will download the dev-clean part for demo purpose."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!wget -nc https://dldata-public.s3.us-east-2.amazonaws.com/LibriSpeech_Alignments.tar.gz\n",
"!tar -xzf LibriSpeech_Alignments.tar.gz\n",
"!rm -f LibriSpeech_Alignments.tar.gz"
]
},
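{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, the cell below prints the first line of one of the downloaded alignment files so you can see the raw format before it is converted to CTM in the next step. The recursive glob only assumes that the extracted archive contains plain-text alignment files somewhere under `LibriSpeech_Alignments`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Peek at one raw alignment file (directory layout inside the archive is not assumed; a recursive glob is used)\n",
"import glob\n",
"\n",
"alignment_files = sorted(glob.glob('LibriSpeech_Alignments/**/*.txt', recursive=True))\n",
"print(f\"Found {len(alignment_files)} alignment files\")\n",
"if alignment_files:\n",
"    with open(alignment_files[0], 'r') as f:\n",
"        print(alignment_files[0])\n",
"        print(f.readline().strip())"
]
},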
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step 2: Produce Manifest File with Forced Alignments\n",
"\n",
"The LibriSpeech manifest file and LibriSpeech forced alignments will now be merged into one manifest file for ease of use when generating synthetic data. \n",
"\n",
"Here, the input LibriSpeech alignments are first converted to CTM files, and the CTM files are then combined with the base manifest in order to produce the manifest file with word alignments. When using another dataset, the --use_ctm argument can be used to generate the manifest file using alignments in CTM files."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python NeMo/scripts/speaker_tasks/create_alignment_manifest.py \\\n",
" --input_manifest_filepath LibriSpeech/dev_clean.json \\\n",
" --base_alignment_path LibriSpeech_Alignments \\\n",
" --output_manifest_filepath ./dev-clean-align.json \\\n",
" --ctm_output_directory ./ctm_out \\\n",
" --libri_dataset_split dev-clean"
]
},
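{
"cell_type": "markdown",
"metadata": {},
"source": [
"To verify that the merge worked, the cell below prints the first entry of the combined manifest. Nothing is assumed about the field names beyond the usual NeMo manifest layout of one JSON object per line."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect the first entry of the merged alignment manifest (one JSON object per line)\n",
"import json\n",
"\n",
"with open('dev-clean-align.json', 'r') as f:\n",
"    first_entry = json.loads(f.readline())\n",
"\n",
"print(json.dumps(first_entry, indent=2)[:1000])"
]
},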
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step 3: Download Background Noise Data\n",
"\n",
"The background noise is constructed from [here](https://www.openslr.org/resources/28/rirs_noises.zip) (although it can be constructed from other background noise datasets instead)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!wget -nc https://www.openslr.org/resources/28/rirs_noises.zip\n",
"!unzip -o rirs_noises.zip\n",
"!rm -f rirs_noises.zip"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Simulator Overview\n",
"\n",
"The simulator creates a speech session using utterances from a desired number of speakers. The simulator first selects the LibriSpeech speaker IDs that will be used for the current session, and sets speaker dominance values for each speaker to determine how often each speaker will talk in the session. The session is then constructed by iterating through the following steps:\n",
"\n",
"* The next speaker is selected (which could be the same speaker again with some probability, and which accounts for the speaker dominance values).\n",
"* The sentence length is determined using a probability distribution, and an utterance of the desired length is then constructed by concatenating together (or truncating) LibriSpeech sentences corresponding to the desired speaker. Individual word alignments are used to truncate the last LibriSpeech sentence such that the entire utterance has the desired length.\n",
"* Next, either the current utterance is overlapped with a previous utterance or silence is introduced before inserting the current utterance. \n",
"\n",
"The simulator includes a multi-microphone far-field mode that incorporates synthetic room impulse response generation in order to simulate multi-microphone multispeaker sessions. When using RIR generation, the RIR is computed once per batch of sessions, and then each constructed utterance is convolved with the RIR in order to get the sound recorded by each microphone before adding the utterance to the audio session. In this tutorial, only near field sessions will be generated.\n",
"\n",
"The simulator also has a speaker enforcement mode which ensures that the correct number of speakers appear in each session, since it is possible that fewer than the desired number may be present since speaker turns are probabilistic. In speaker enforcement mode, the length of the session or speaker probabilities may be adjusted to ensure all speakers are included before the session finishes."
]
},
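{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell below is a simplified, illustrative sketch of the construction loop described above (speaker dominance, probabilistic speaker turns, sentence-length sampling, and the overlap-or-silence decision). All parameter names and distributions in it are toy choices for illustration only; the real simulator is driven by the YAML configuration used later in this tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy sketch of the session-construction loop described above (illustrative only, not the simulator's implementation)\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(42)\n",
"\n",
"num_speakers = 4\n",
"session_length = 30.0   # seconds (toy value)\n",
"turn_prob = 0.875       # probability of switching to a different speaker at each step\n",
"mean_overlap, mean_silence = 0.15, 0.2\n",
"\n",
"# Speaker dominance: how often each speaker tends to talk (normalized to a probability distribution)\n",
"dominance = rng.dirichlet(np.ones(num_speakers))\n",
"\n",
"timeline, t, current = [], 0.0, None\n",
"while t < session_length:\n",
"    # 1) pick the next speaker (possibly the same one again), weighted by dominance\n",
"    if current is None or rng.random() < turn_prob:\n",
"        current = rng.choice(num_speakers, p=dominance)\n",
"    # 2) sample an utterance length (the real simulator concatenates/truncates source sentences to this length)\n",
"    utt_len = float(np.clip(rng.normal(3.0, 1.0), 0.5, 8.0))\n",
"    # 3) either overlap with the previous utterance or insert silence before the new one\n",
"    if timeline and rng.random() < 0.5:\n",
"        offset = -mean_overlap * utt_len          # start before the previous utterance ends\n",
"    else:\n",
"        offset = rng.exponential(mean_silence)    # silence gap before the new utterance\n",
"    start = max(0.0, t + offset)\n",
"    timeline.append((round(start, 2), round(start + utt_len, 2), f\"speaker_{current}\"))\n",
"    t = start + utt_len\n",
"\n",
"for segment in timeline[:8]:\n",
"    print(segment)"
]
},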
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Parameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from omegaconf import OmegaConf\n",
"ROOT = os.getcwd()\n",
"conf_dir = os.path.join(ROOT,'conf')\n",
"!mkdir -p $conf_dir\n",
"CONFIG_PATH = os.path.join(conf_dir, 'data_simulator.yaml')\n",
"if not os.path.exists(CONFIG_PATH):\n",
" !wget -P $conf_dir https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/tools/speech_data_simulator/conf/data_simulator.yaml\n",
"\n",
"config = OmegaConf.load(CONFIG_PATH)\n",
"print(OmegaConf.to_yaml(config))"
]
},
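{
"cell_type": "markdown",
"metadata": {},
"source": [
"The full configuration is long, so it can help to look at just the groups used in this tutorial. The group names below (`session_config` and `session_params`) are taken from the printout above, and `num_speakers` is used as an example override; any value set this way could equally be edited in the YAML file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print only the session-level groups of the config (names as shown in the full printout above)\n",
"print(OmegaConf.to_yaml(config.data_simulator.session_config))\n",
"print(OmegaConf.to_yaml(config.data_simulator.session_params))\n",
"\n",
"# Values can be overridden by attribute access (as done in Step 5 below) or via OmegaConf.update\n",
"OmegaConf.update(config, \"data_simulator.session_config.num_speakers\", 4)\n",
"print(config.data_simulator.session_config.num_speakers)"
]
},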
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step 4: Create Background Noise Manifest"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!find RIRS_NOISES/real_rirs_isotropic_noises/*.wav > bg_noise.list \n",
"\n",
"# this function can also be run using the pathfiles_to_diarize_manifest.py script\n",
"from nemo.collections.asr.parts.utils.manifest_utils import create_manifest\n",
"create_manifest('bg_noise.list', 'bg_noise.json')"
]
},
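{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check, the cell below counts the entries in the background noise manifest and prints the first one."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick check of the generated background noise manifest (JSON lines, one entry per noise file)\n",
"import json\n",
"\n",
"with open('bg_noise.json', 'r') as f:\n",
"    bg_entries = [json.loads(line) for line in f]\n",
"\n",
"print(f\"{len(bg_entries)} background noise entries\")\n",
"print(bg_entries[0])"
]
},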
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step 5: Generate Simulated Audio Session\n",
"\n",
"A single 4-speaker session of 60 seconds is generated as an example. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from nemo.collections.asr.data.data_simulation import MultiSpeakerSimulator\n",
"\n",
"ROOT = os.getcwd()\n",
"data_dir = os.path.join(ROOT,'simulated_data')\n",
"config.data_simulator.random_seed=42\n",
"config.data_simulator.manifest_filepath=\"./dev-clean-align.json\"\n",
"config.data_simulator.outputs.output_dir=data_dir\n",
"config.data_simulator.session_config.num_sessions=1\n",
"config.data_simulator.session_config.session_length=30\n",
"config.data_simulator.background_noise.add_bg=True\n",
"config.data_simulator.background_noise.background_manifest=\"./bg_noise.json\"\n",
"config.data_simulator.session_params.mean_silence=0.2\n",
"config.data_simulator.session_params.mean_silence_var=0.02\n",
"config.data_simulator.session_params.mean_overlap=0.15\n",
"\n",
"lg = MultiSpeakerSimulator(cfg=config)\n",
"lg.generate_sessions()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step 5: Listen to and Visualize Session\n",
"\n",
"Listen to the audio and visualize the corresponding speaker timestamps (recorded in a RTTM file for each session)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"import os\n",
"import wget\n",
"import IPython\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import librosa\n",
"from nemo.collections.asr.parts.utils.speaker_utils import rttm_to_labels, labels_to_pyannote_object\n",
"\n",
"ROOT = os.getcwd()\n",
"data_dir = os.path.join(ROOT,'simulated_data')\n",
"audio = os.path.join(data_dir,'multispeaker_session_0.wav')\n",
"rttm = os.path.join(data_dir,'multispeaker_session_0.rttm')\n",
"\n",
"sr = 16000\n",
"signal, sr = librosa.load(audio,sr=sr) \n",
"\n",
"fig,ax = plt.subplots(1,1)\n",
"fig.set_figwidth(20)\n",
"fig.set_figheight(2)\n",
"plt.plot(np.arange(len(signal)),signal,'gray')\n",
"fig.suptitle('Synthetic Audio Session', fontsize=16)\n",
"plt.xlabel('Time (s)', fontsize=18)\n",
"ax.margins(x=0)\n",
"plt.ylabel('Signal Strength', fontsize=16);\n",
"a,_ = plt.xticks();plt.xticks(a,a/sr);\n",
"\n",
"IPython.display.Audio(audio)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The visualization is useful for seeing both the distribution of utterance lengths, the differing speaker dominance values, and the amount of overlap in the session."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#display speaker labels for reference\n",
"labels = rttm_to_labels(rttm)\n",
"reference = labels_to_pyannote_object(labels)\n",
"reference"
]
},
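{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same labels can be used to quantify what the plot shows: total speaking time per speaker (dominance) and the amount of overlapped speech. The cell below is a small self-contained computation that assumes each label returned by `rttm_to_labels` is a `\"start end speaker\"` string."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Per-speaker speaking time and overlapped-speech time, computed from the RTTM-derived labels above.\n",
"# Assumes each label is a \"start end speaker\" string, as produced by rttm_to_labels.\n",
"from collections import defaultdict\n",
"\n",
"segments = []\n",
"speaking_time = defaultdict(float)\n",
"for label in labels:\n",
"    start, end, speaker = label.split()\n",
"    start, end = float(start), float(end)\n",
"    segments.append((start, end))\n",
"    speaking_time[speaker] += end - start\n",
"\n",
"print(\"Speaking time per speaker (s):\")\n",
"for speaker, dur in sorted(speaking_time.items()):\n",
"    print(f\"  {speaker}: {dur:.2f}\")\n",
"\n",
"# Sweep over segment boundaries to measure time with two or more speakers active\n",
"events = sorted([(s, 1) for s, _ in segments] + [(e, -1) for _, e in segments])\n",
"active, prev_t, overlap = 0, None, 0.0\n",
"for t, delta in events:\n",
"    if prev_t is not None and active >= 2:\n",
"        overlap += t - prev_t\n",
"    active += delta\n",
"    prev_t = t\n",
"print(f\"Overlapped speech: {overlap:.2f} s\")"
]
},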
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step 6: Get Simulated Data Statistics "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if not os.path.exists(\"multispeaker_data_analysis.py\"):\n",
" !wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/speaker_tasks/multispeaker_data_analysis.py\n",
"\n",
"from multispeaker_data_analysis import run_multispeaker_data_analysis\n",
"\n",
"session_dur = 30\n",
"silence_mean = 0.2\n",
"silence_var = 0.1\n",
"overlap_mean = 0.15\n",
"run_multispeaker_data_analysis(data_dir, session_dur=session_dur, silence_var=silence_var, overlap_mean=overlap_mean, precise=True);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Files Produced\n",
"\n",
"The following files are produced by the simulator:\n",
"\n",
"* wav files (one per audio session) - the output audio sessions\n",
"* rttm files (one per audio session) - the speaker timestamps for the corresponding audio session (used for diarization training)\n",
"* json files (one per audio session) - the output manifest file for the corresponding audio session (containing text transcriptions, utterance durations, full paths to audio files, words, and word alignments)\n",
"* ctm files (one per audio session) - contains word-by-word alignments, speaker ID, and word\n",
"* txt files (one per audio session) - contains the full text transcription for a given session\n",
"* list files (one per file type per batch of sessions) - a list of generated files of the given type (wav, rttm, json, ctm, or txt), used primarily for manifest creation\n"
]
},
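{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before listing the output directory, the cell below opens the per-session manifest to show what one generated entry looks like. The `multispeaker_session_0.json` filename is assumed to follow the same naming pattern as the wav and rttm files used earlier in this tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Peek at the per-session output manifest (filename assumed to match the wav/rttm naming used above)\n",
"import json\n",
"\n",
"session_manifest = os.path.join(data_dir, 'multispeaker_session_0.json')\n",
"with open(session_manifest, 'r') as f:\n",
"    first_entry = json.loads(f.readline())\n",
"\n",
"print(json.dumps(first_entry, indent=2)[:800])"
]
},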
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!ls simulated_data"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"metadata": {
"collapsed": false
},
"source": []
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}