In [None]:
BRANCH = 'r1.17.0'

In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# If you're using Google Colab and not running locally, run this cell

# install NeMo
BRANCH = 'r1.17.0'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[nlp]

In [None]:
import os
import wget
from nemo.collections import nlp as nemo_nlp
from nemo.collections import common as nemo_common
from omegaconf import OmegaConf

# Tokenizers Background

For Natural Language Processing, tokenization is an essential part of data preprocessing. It is the process of splitting a string into a list of tokens. A token is a unit of text; for example, each word in a sentence can be a token.
Depending on the application, some tokenizers are more suitable than others.


For example, a WordTokenizer, which splits the string on any whitespace, would tokenize a string as follows:

"My first program, Hello World." -> ["My", "first", "program,", "Hello", "World."]

To turn the tokens into numerical model input, the standard method is to use a vocabulary and one-hot vectors for [word embeddings](https://en.wikipedia.org/wiki/Word_embedding). If a token appears in the vocabulary, its index is returned; if not, the index of the unknown token is returned, which handles out-of-vocabulary (OOV) tokens.
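
The whitespace split and the vocabulary lookup described above can be sketched in plain Python. This is a minimal illustration, not NeMo's implementation: the helper names `build_vocab`, `tokens_to_ids`, and `one_hot`, as well as the `<UNK>` string, are assumptions made for this example.

```python
from typing import Dict, List

UNK = "<UNK>"  # hypothetical unknown-token string


def build_vocab(tokens: List[str]) -> Dict[str, int]:
    """Map the unknown token to index 0 and each distinct token to the next free index."""
    vocab = {UNK: 0}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


def tokens_to_ids(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Look up each token; fall back to the unknown-token index for OOV tokens."""
    return [vocab.get(token, vocab[UNK]) for token in tokens]


def one_hot(index: int, size: int) -> List[int]:
    """Return a vector of zeros with a single 1 at position `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector


tokens = "My first program, Hello World.".split()  # whitespace tokenization
vocab = build_vocab(tokens)
ids = tokens_to_ids(["Hello", "World.", "again"], vocab)

print(tokens)                        # ['My', 'first', 'program,', 'Hello', 'World.']
print(ids)                           # [4, 5, 0] -> "again" is out of vocabulary
print(one_hot(ids[0], len(vocab)))   # [0, 0, 0, 0, 1, 0]
```

In practice the one-hot vector is rarely materialized; the index is typically used directly to select a row of an embedding table.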




# Tokenizers in NeMo

In NeMo, we support the most commonly used tokenization algorithms. We offer a wrapper around [Hugging Face's AutoTokenizer](https://huggingface.co/transformers/model_doc/auto.html#autotokenizer), a factory class that gives access to all Hugging Face tokenizers. This includes, in particular, all BERT-like model tokenizers, such as BertTokenizer, AlbertTokenizer, RobertaTokenizer, and GPT2Tokenizer. Apart from that, we also support other tokenizers such as WordTokenizer, CharTokenizer, and [Google's SentencePieceTokenizer](https://github.com/google/sentencepiece).


We make sure that all tokenizers are compatible with BERT-like models, e.g. BERT, RoBERTa, ALBERT, and Megatron. For that, we provide a high-level user API `get_tokenizer()`, which allows the user to instantiate a tokenizer model with only four input arguments:
* `tokenizer_name: str`
* `tokenizer_model: Optional[str] = None`
* `vocab_file: Optional[str] = None`
* `special_tokens: Optional[Dict[str, str]] = None`

Hugging Face and Megatron tokenizers (which use Hugging Face underneath) can be instantiated automatically from `tokenizer_name` alone, which downloads the corresponding `vocab_file` from the internet.

For SentencePieceTokenizer, WordTokenizer, and CharTokenizer, the `tokenizer_model` and/or `vocab_file` can be generated offline in advance using [`scripts/tokenizers/process_asr_text_tokenizer.py`](https://github.com/NVIDIA/NeMo/blob/stable/scripts/tokenizers/process_asr_text_tokenizer.py).

The tokenizers in NeMo are designed to be used interchangeably, especially when
used in combination with a BERT-based model.
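
To illustrate what "interchangeable" means here, the sketch below defines two toy tokenizers that expose the same `text_to_tokens` and `tokens_to_ids` methods, so the calling code does not need to know which one it received. The classes `ToyWordTokenizer` and `ToyCharTokenizer` and the `encode` helper are hypothetical and only mimic the shape of the interface; they are not NeMo classes.

```python
from typing import Dict, List


class ToyWordTokenizer:
    """Hypothetical word-level tokenizer: splits on whitespace."""

    def __init__(self, vocab: List[str]):
        # index 0 is reserved for the unknown token
        self.vocab: Dict[str, int] = {tok: i for i, tok in enumerate(["<UNK>"] + vocab)}

    def text_to_tokens(self, text: str) -> List[str]:
        return text.split()

    def tokens_to_ids(self, tokens: List[str]) -> List[int]:
        return [self.vocab.get(tok, 0) for tok in tokens]


class ToyCharTokenizer(ToyWordTokenizer):
    """Hypothetical character-level tokenizer: every character is a token."""

    def text_to_tokens(self, text: str) -> List[str]:
        return list(text)


def encode(tokenizer, text: str) -> List[int]:
    """Works with any object that offers text_to_tokens and tokens_to_ids."""
    return tokenizer.tokens_to_ids(tokenizer.text_to_tokens(text))


word_ids = encode(ToyWordTokenizer(["hello", "world"]), "hello world")
char_ids = encode(ToyCharTokenizer(list("helo wrd")), "hello")
print(word_ids)  # [1, 2]
print(char_ids)  # [1, 2, 3, 3, 4]
```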

Let's take a look at the list of available tokenizers:

In [None]:
nemo_nlp.modules.get_tokenizer_list()

# Hugging Face AutoTokenizer

In [None]:
# instantiate tokenizer wrapper using pretrained model name only
tokenizer1 = nemo_nlp.modules.get_tokenizer(tokenizer_name="bert-base-cased")

# the wrapper has a reference to the original HuggingFace tokenizer
print(tokenizer1.tokenizer)

In [None]:
# check vocabulary (this can be very long)
print(tokenizer1.tokenizer.vocab)

In [None]:
# show all special tokens if it has any
print(tokenizer1.tokenizer.all_special_tokens)

In [None]:
# instantiate tokenizer using custom vocabulary
vocab_file = "myvocab.txt"
vocab = ["he", "llo", "world"]
with open(vocab_file, 'w', encoding='utf-8') as vocab_fp:
    vocab_fp.write("\n".join(vocab))

In [None]:
tokenizer2 = nemo_nlp.modules.get_tokenizer(tokenizer_name="bert-base-cased", vocab_file=vocab_file)

In [None]:
# Since we did not overwrite the special tokens, they should be the same as before
print(tokenizer1.tokenizer.all_special_tokens == tokenizer2.tokenizer.all_special_tokens)

## Adding Special tokens

We do not recommend overwriting special tokens for Hugging Face pretrained models, 
since these are the commonly used default values. 

If you still want to overwrite the special tokens, specify some of the following keys:

In [None]:
# Illustrative token strings; any strings that do not clash with regular vocabulary entries work
special_tokens_dict = {"unk_token": "<UNK>",
                       "sep_token": "<SEP>",
                       "pad_token": "<PAD>",
                       "bos_token": "<BOS>",
                       "mask_token": "<MASK>",
                       "eos_token": "<EOS>",
                       "cls_token": "<CLS>"}
tokenizer3 = nemo_nlp.modules.get_tokenizer(tokenizer_name="bert-base-cased",
                                            vocab_file=vocab_file,
                                            special_tokens=special_tokens_dict)

# print newly set special tokens
print(tokenizer3.tokenizer.all_special_tokens)
# the special tokens should be different from the previous special tokens
print(tokenizer3.tokenizer.all_special_tokens != tokenizer1.tokenizer.all_special_tokens )

Note that if you specify tokens that were not previously included in the tokenizer's vocabulary file, the new tokens will be added to the vocabulary. You will see a message like this:

`['', '', '', '', '', '', ''] 
 will be added to the vocabulary.
 Please resize your model accordingly`
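
The message asks you to resize the model because every vocabulary entry needs its own row in the embedding table. The framework-free sketch below shows the idea only: existing rows are kept and new rows are appended for the added tokens. The function name `resize_embeddings` and the initialization (Gaussian with standard deviation 0.02) are assumptions for this illustration, not a description of what any particular library does internally.

```python
import random
from typing import List


def resize_embeddings(embeddings: List[List[float]],
                      new_vocab_size: int,
                      seed: int = 0) -> List[List[float]]:
    """Return a copy of the table grown to `new_vocab_size` rows.

    Existing rows are preserved; new rows are randomly initialized
    (assumed scheme) and would be learned during training/finetuning.
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    resized = [row[:] for row in embeddings]
    while len(resized) < new_vocab_size:
        resized.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return resized


old_table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 tokens, embedding dimension 2
new_table = resize_embeddings(old_table, new_vocab_size=5)  # 2 tokens were added
print(len(new_table), len(new_table[0]))  # 5 2
```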

In [None]:
# A safer way to add special tokens is the following:

# define your model
pretrained_model_name = 'bert-base-uncased'
config = {"language_model": {"pretrained_model_name": pretrained_model_name}, "tokenizer": {}}
omega_conf = OmegaConf.create(config)
model = nemo_nlp.modules.get_lm_model(cfg=omega_conf)

# define pretrained tokenizer
tokenizer_default = nemo_nlp.modules.get_tokenizer(tokenizer_name=pretrained_model_name)

In [None]:
tokenizer_default.text_to_tokens('<MY_NEW_TOKEN> and another word')

As you can see above, the tokenizer splits `<MY_NEW_TOKEN>` into subtokens. Let's add it to the special tokens to make sure the tokenizer does not split it into subtokens.

In [None]:
# Illustrative token strings
special_tokens = {'bos_token': '<BOS>',
                  'cls_token': '<CLS>',
                  'additional_special_tokens': ['<MY_NEW_TOKEN>', '<ANOTHER_TOKEN>']}
tokenizer_default.add_special_tokens(special_tokens_dict=special_tokens)

# resize your model so that the embeddings for newly added tokens are updated during training/finetuning
model.resize_token_embeddings(tokenizer_default.vocab_size)

# let's make sure the tokenizer doesn't split our special tokens into subtokens
tokenizer_default.text_to_tokens('<MY_NEW_TOKEN> and another word')

Now, the tokenizer no longer breaks our special token into subtokens.
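
Why registering a token as special keeps it intact can be sketched without any library: the text is first cut around the special tokens, and only the remaining chunks are passed to the subword splitter. Everything in this example is an assumption for illustration, including the token string `<MY_NEW_TOKEN>`, the helper names, and the toy splitter that cuts words into three-character pieces; real tokenizers use a learned subword vocabulary instead.

```python
import re
from typing import List


def toy_subword_split(word: str, piece_len: int = 3) -> List[str]:
    """Stand-in for a real subword tokenizer: cut a word into fixed-size pieces."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


def tokenize(text: str, special_tokens: List[str]) -> List[str]:
    """Keep special tokens whole and split everything else into subword pieces."""
    if special_tokens:
        # longest first, so that overlapping special tokens match greedily
        ordered = sorted(special_tokens, key=len, reverse=True)
        pattern = "(" + "|".join(re.escape(tok) for tok in ordered) + ")"
        chunks = re.split(pattern, text)
    else:
        chunks = [text]

    tokens: List[str] = []
    for chunk in chunks:
        if chunk in special_tokens:
            tokens.append(chunk)  # protected: never split
        else:
            for word in chunk.split():
                tokens.extend(toy_subword_split(word))
    return tokens


text = "<MY_NEW_TOKEN> and another word"
without_special = tokenize(text, special_tokens=[])
with_special = tokenize(text, special_tokens=["<MY_NEW_TOKEN>"])
print(without_special)  # the new token is cut into pieces
print(with_special)     # the new token survives as a single token
```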

## Megatron model tokenizer

In [None]:
# Megatron tokenizers are instances of the Hugging Face BertTokenizer. 
tokenizer4 = nemo_nlp.modules.get_tokenizer(tokenizer_name="megatron-bert-cased")

# Train custom tokenizer model and vocabulary from text file 

We use the [`scripts/tokenizers/process_asr_text_tokenizer.py`](https://github.com/NVIDIA/NeMo/blob/stable/scripts/tokenizers/process_asr_text_tokenizer.py) script to create a custom tokenizer model with its own vocabulary from an input file.

In [None]:
# download tokenizer script
script_file = "process_asr_text_tokenizer.py"

if not os.path.exists(script_file):
    print('Downloading script file...')
    wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/scripts/tokenizers/process_asr_text_tokenizer.py')
else:
    print('Script already exists')

In [None]:
# Let's prepare some small text data for the tokenizer
data_text = "NeMo is a toolkit for creating Conversational AI applications. \
NeMo toolkit makes it possible for researchers to easily compose complex neural network architectures \
for conversational AI using reusable components - Neural Modules. \
Neural Modules are conceptual blocks of neural networks that take typed inputs and produce typed outputs. \
Such modules typically represent data layers, encoders, decoders, language models, loss functions, or methods of combining activations. \
The toolkit comes with extendable collections of pre-built modules and ready-to-use models for automatic speech recognition (ASR), \
natural language processing (NLP) and text synthesis (TTS). \
Built for speed, NeMo can utilize NVIDIA's Tensor Cores and scale out training to multiple GPUs and multiple nodes."

In [None]:
# Write the text data into a file
data_file="data.txt"

with open(data_file, 'w') as data_fp:
    data_fp.write(data_text)

In [None]:
# Some additional parameters for the tokenizer
# To tokenize at unigram, char, or word boundaries instead of using BPE, change --spe_type accordingly.
# For more details, see https://github.com/google/sentencepiece#train-sentencepiece-model

tokenizer_spe_type = "bpe" # <-- Can be `bpe`, `unigram`, `word` or `char`
vocab_size = 32

In [None]:
! python process_asr_text_tokenizer.py --data_file=$data_file --data_root=. --vocab_size=$vocab_size --tokenizer=spe --spe_type=$tokenizer_spe_type

In [None]:
# See created tokenizer model and vocabulary
spe_model_dir=f"tokenizer_spe_{tokenizer_spe_type}_v{vocab_size}"
! ls $spe_model_dir
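
For intuition about what the `bpe` setting learns, here is a minimal sketch of the classic byte-pair-encoding merge procedure: start from characters and repeatedly merge the most frequent adjacent pair of symbols. It is a simplified word-level illustration under stated assumptions (the function name `learn_bpe`, the tiny corpus, and the tie-breaking rule are made up for this example); SentencePiece's actual training works on raw text and differs in many details.

```python
from collections import Counter
from typing import List, Tuple


def learn_bpe(corpus: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of words."""
    # each word starts as a tuple of single characters, stored with its frequency
    words = Counter(tuple(word) for word in corpus)
    merges: List[Tuple[str, str]] = []

    for _ in range(num_merges):
        # count adjacent symbol pairs, weighted by word frequency
        pair_counts: Counter = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is a single symbol already

        # most frequent pair; ties are broken by the pair itself to stay deterministic
        best = max(pair_counts, key=lambda pair: (pair_counts[pair], pair))
        merges.append(best)

        # rewrite every word with the chosen pair merged into one symbol
        merged_words: Counter = Counter()
        for symbols, freq in words.items():
            out = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words

    return merges


corpus = "low low low lower lowest".split()
merges = learn_bpe(corpus, num_merges=2)
print(merges)  # [('o', 'w'), ('l', 'ow')]
```

Each learned merge adds one symbol to the vocabulary, which is why the requested vocabulary size bounds the number of merges.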

# Use custom tokenizer for data preprocessing
## Example: SentencePiece for BPE

In [None]:
# initialize the tokenizer with the created tokenizer model, which inherently includes the vocabulary, and specify optional special tokens
tokenizer_spe = nemo_nlp.modules.get_tokenizer(tokenizer_name="sentencepiece", tokenizer_model=spe_model_dir+"/tokenizer.model", special_tokens=special_tokens_dict)

# the specified special tokens are added to the vocabulary
print(tokenizer_spe.vocab_size)

# Using any tokenizer to tokenize text into BERT compatible input


In [None]:
text="hello world"

# create tokens
tokenized = [tokenizer_spe.bos_token] + tokenizer_spe.text_to_tokens(text) + [tokenizer_spe.eos_token]
print(tokenized)

# turn tokens into input_ids for a neural model, such as BERTModule

print(tokenizer_spe.tokens_to_ids(tokenized))
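
BERT-like models usually expect fixed-length batches, so the ids produced above are typically truncated or padded and accompanied by an attention mask. The sketch below shows one common way to do this with plain lists. The helper name `build_model_input`, the example ids, and the dictionary keys are assumptions for illustration and are not taken from NeMo.

```python
from typing import Dict, List


def build_model_input(token_ids: List[int],
                      bos_id: int,
                      eos_id: int,
                      pad_id: int,
                      max_length: int) -> Dict[str, List[int]]:
    """Wrap ids with bos/eos, then truncate or pad to `max_length` (must be >= 2)."""
    body = token_ids[: max_length - 2]          # leave room for bos and eos
    input_ids = [bos_id] + body + [eos_id]
    attention_mask = [1] * len(input_ids)       # 1 marks real tokens
    padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * padding,
        "attention_mask": attention_mask + [0] * padding,  # 0 marks padding
    }


features = build_model_input([7, 8, 9], bos_id=1, eos_id=2, pad_id=0, max_length=8)
print(features["input_ids"])       # [1, 7, 8, 9, 2, 0, 0, 0]
print(features["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```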