# SentencePieceProcessor C++ API
## Load SentencePiece model
To start working with a SentencePiece model, include the `sentencepiece_processor.h` header file.
Then instantiate the `sentencepiece::SentencePieceProcessor` class and call its `Load` method to load the model from a file path or `std::istream`.
```C++
#include <iostream>
#include <sentencepiece_processor.h>

sentencepiece::SentencePieceProcessor processor;
const auto status = processor.Load("/path/to/model.model");
if (!status.ok()) {
  std::cerr << status.ToString() << std::endl;
  // error
}
// You can also load a serialized model from std::string.
// const std::string str = // Load blob contents from a file.
// auto status = processor.LoadFromSerializedProto(str);
```
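As a sketch of the serialized-model path, you can read the model file into a `std::string` yourself and hand it to `LoadFromSerializedProto`. The file-reading code below is plain standard C++, not part of the SentencePiece API, and the model path is a placeholder.
```C++
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <sentencepiece_processor.h>

// Read the whole model file into a string (plain C++ I/O, not a SentencePiece call).
std::ifstream ifs("/path/to/model.model", std::ios::binary);
std::stringstream buffer;
buffer << ifs.rdbuf();
const std::string blob = buffer.str();

sentencepiece::SentencePieceProcessor processor;
const auto status = processor.LoadFromSerializedProto(blob);
if (!status.ok()) {
  std::cerr << status.ToString() << std::endl;
  // error
}
```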
## Tokenize text (preprocessing)
Call the `SentencePieceProcessor::Encode` method to tokenize text.
```C++
std::vector<std::string> pieces;
processor.Encode("This is a test.", &pieces);
for (const std::string &token : pieces) {
  std::cout << token << std::endl;
}
```
To obtain the sequence of vocab ids instead, pass a `std::vector<int>`:
```C++
std::vector<int> ids;
processor.Encode("This is a test.", &ids);
for (const int id : ids) {
  std::cout << id << std::endl;
}
```
## Detokenize text (postprocessing)
Call the `SentencePieceProcessor::Decode` method to detokenize a sequence of pieces or ids into text. Detokenization is essentially guaranteed to be the inverse of encoding, i.e., `Decode(Encode(Normalize(input))) == Normalize(input)`.
```C++
std::vector<std::string> pieces = { "▁This", "▁is", "▁a", "▁", "te", "st", "." }; // sequence of pieces
std::string text;
processor.Decode(pieces, &text);
std::cout << text << std::endl;

std::vector<int> ids = { 451, 26, 20, 3, 158, 128, 12 }; // sequence of ids
processor.Decode(ids, &text);
std::cout << text << std::endl;
```
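As a quick sanity check of this round-trip property, you can encode a sentence into ids and decode it straight back; for an input that is already in normalized form, the decoded text should match the input. A minimal sketch, assuming `processor` has a model loaded:
```C++
std::vector<int> ids;
processor.Encode("This is a test.", &ids);

std::string decoded;
processor.Decode(ids, &decoded);

// For an already-normalized input, Decode(Encode(input)) reproduces the input.
std::cout << std::boolalpha << (decoded == "This is a test.") << std::endl;
```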
## Sampling (subword regularization)
Call the `SentencePieceProcessor::SampleEncode` method to sample one segmentation.
```C++
std::vector<std::string> pieces;
processor.SampleEncode("This is a test.", &pieces, -1, 0.2);
std::vector<int> ids;
processor.SampleEncode("This is a test.", &ids, -1, 0.2);
```
`SampleEncode` has two sampling parameters, `nbest_size` and `alpha`, which correspond to `l` and `alpha` in the [original paper](https://arxiv.org/abs/1804.10959). When `nbest_size` is -1, one segmentation is sampled from all hypotheses with the forward-filtering and backward-sampling algorithm.
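Because `SampleEncode` draws one segmentation at random, repeated calls on the same input generally return different piece sequences. A small sketch that reuses the call above to print a few samples:
```C++
std::vector<std::string> pieces;
for (int i = 0; i < 5; ++i) {
  pieces.clear();
  // Each call samples a single segmentation; results typically differ between calls.
  processor.SampleEncode("This is a test.", &pieces, -1, 0.2);
  for (const std::string &piece : pieces) {
    std::cout << piece << " ";
  }
  std::cout << std::endl;
}
```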
## Training
Call the `SentencePieceTrainer::Train` function to train a SentencePiece model. You can pass the same parameters as [spm_train](https://github.com/google/sentencepiece#train-sentencepiece-model) in a single string.
```C++
#include <sentencepiece_trainer.h>
sentencepiece::SentencePieceTrainer::Train("--input=test/botchan.txt --model_prefix=m --vocab_size=1000");
```
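With `--model_prefix=m`, training writes `m.model` and `m.vocab` to the current directory, so a common pattern is to train and then load the resulting model in the same program. A sketch, assuming (as with `Load`) that `Train` reports failures through its returned status:
```C++
#include <iostream>
#include <sentencepiece_processor.h>
#include <sentencepiece_trainer.h>

const auto train_status = sentencepiece::SentencePieceTrainer::Train(
    "--input=test/botchan.txt --model_prefix=m --vocab_size=1000");
if (!train_status.ok()) {
  std::cerr << train_status.ToString() << std::endl;
}

sentencepiece::SentencePieceProcessor processor;
const auto load_status = processor.Load("m.model");  // produced by --model_prefix=m
if (!load_status.ok()) {
  std::cerr << load_status.ToString() << std::endl;
}
```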
## ImmutableSentencePieceText
Use the `ImmutableSentencePieceText` class to obtain the pieces and ids at the same time.
This proto also encodes the UTF-8 byte offset of each piece over the user input or the detokenized text.
```C++
#include <sentencepiece_processor.h>
sentencepiece::ImmutableSentencePieceText spt;
// Encode
processor.Encode("This is a test.", spt.mutable_proto());
// or
// spt = processor.EncodeAsImmutableProto("This is a test.");
std::cout << spt.text() << std::endl; // This is the same as the input.
for (const auto &piece : spt.pieces()) {
  std::cout << piece.begin() << std::endl;   // beginning byte offset
  std::cout << piece.end() << std::endl;     // ending byte offset
  std::cout << piece.piece() << std::endl;   // internal representation.
  std::cout << piece.surface() << std::endl; // external representation. spt.text().substr(begin, end - begin) == surface().
  std::cout << piece.id() << std::endl;      // vocab id
}
// Decode
processor.Decode({10, 20, 30}, spt.mutable_proto());
std::cout << spt.text() << std::endl; // This is the same as the decoded string.
for (const auto &piece : spt.pieces()) {
  // the same as above.
}
```
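Since each element of `spt.pieces()` carries both the piece string and its vocab id, you can build the usual piece and id vectors in a single pass; a minimal sketch:
```C++
std::vector<std::string> pieces;
std::vector<int> ids;
for (const auto &piece : spt.pieces()) {
  pieces.push_back(piece.piece());             // internal representation
  ids.push_back(static_cast<int>(piece.id())); // corresponding vocab id
}
```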
## Vocabulary management
Use the following methods to convert between pieces and ids.
```C++
processor.GetPieceSize();   // returns the vocabulary size.
processor.PieceToId("foo"); // returns the vocab id of "foo".
processor.IdToPiece(10);    // returns the string representation of id 10.
processor.IsUnknown(0);     // returns true if the given id is an unknown token, e.g., <unk>.
processor.IsControl(10);    // returns true if the given id is a control token, e.g., <s>, </s>.
```
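Combining `GetPieceSize` and `IdToPiece`, for example, lets you dump the whole vocabulary, and `PieceToId`/`IdToPiece` round-trip a known piece. A sketch (the piece `"▁This"` is only an example and may not exist in your model):
```C++
// Print every piece in the vocabulary together with its id.
for (int id = 0; id < processor.GetPieceSize(); ++id) {
  std::cout << id << "\t" << processor.IdToPiece(id) << std::endl;
}

// Round trip: a known piece maps to an id and back to the same piece.
const int id = processor.PieceToId("▁This"); // example piece; unknown pieces map to <unk>'s id
std::cout << processor.IdToPiece(id) << std::endl;
```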
## Extra Options
Use the `SetEncodeExtraOptions` and `SetDecodeExtraOptions` methods to set extra options for encoding and decoding, respectively. These methods need to be called right after the `Load` method.
```C++
processor.SetEncodeExtraOptions("bos:eos"); // add <s> and </s>.
processor.SetEncodeExtraOptions("reverse:bos:eos"); // reverse the input and then add <s> and </s>.
processor.SetDecodeExtraOptions("reverse"); // the decoder's output is reversed.
```
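A sketch of the typical call order, loading the model first and then setting the option before encoding (the model path is a placeholder, and the actual `<s>`/`</s>` ids depend on the model):
```C++
sentencepiece::SentencePieceProcessor processor;
const auto status = processor.Load("/path/to/model.model");
if (!status.ok()) {
  std::cerr << status.ToString() << std::endl;
}

// Ask the encoder to wrap every input with <s> ... </s>.
processor.SetEncodeExtraOptions("bos:eos");

std::vector<int> ids;
processor.Encode("This is a test.", &ids);
// ids.front() and ids.back() now hold the model's <s> and </s> ids.
```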