mebubo committed on
Commit b68561a · 1 Parent(s): cf0b6c7

Proof-reading

Files changed (2)
  1. README.md +9 -20
  2. SCRATCHPAD.md +14 -0
README.md CHANGED
@@ -12,43 +12,30 @@ app_port: 7860

  ![](img/GPTed.jpeg)

- What I want to cover:
- - The original blog post
- - Improvements that I wanted to make:
- - In addition to highlighting low-probability words, show replacement suggestions that are more likely
- - Operate at the level of whole words, not tokens
- - Justification for using a local model
- - Limitations of the logprobs returned by the APIs
- - Main parts of the project
- - Combining tokens into words to get the probabilities of whole words
- - The batched multi-token expansion with probability budget
- - Testable abstract implementation
-
-
  This post describes my attempt to build an improved version of GPTed from https://vgel.me/posts/gpted-launch/ and what I learned from it.

  Here is what has been done in the original GPTed:
- - Use logprobs returned by the OpenAI API (in particular, the /v1/completions legacy api https://platform.openai.com/docs/api-reference/completions) for tokens _in the existing text_ (as opposed to generated text) to detect the tokens the model is surprised by
+ - Use logprobs returned by the OpenAI API (in particular, the [legacy /v1/completions API](https://platform.openai.com/docs/api-reference/completions)) for tokens _in the existing text_ (as opposed to generated text) to detect the tokens the model is surprised by
  - Provide a basic text editing UI that has a mode in which the tokens with a logprob below a given threshold are highlighted. Not all highlighted tokens are necessarily a mistake, but the idea is that it may be worth checking that a low-probability token is indeed intended.
 
  Here are the improvements that I wanted to make:
- - Operate at the word level, instead of token level, to compute the log prob of whole words even if they are mutli-token, and to highlight whole words
+ - Operate at the word level, instead of the token level, to compute the logprobs of whole words even if they are multi-token, and to highlight whole words
  - Propose replacement words for the highlighted words
  - Specifically, words with a probability higher than the flagging threshold
 
  ### On logprobs in the OpenAI API

- The original GPTed project relied on the 2 features in the legacy OpenAI /v1/completions API:
+ The original GPTed project relied on two features of the [legacy OpenAI /v1/completions API](https://platform.openai.com/docs/api-reference/completions):

  > logprobs: Include the log probabilities on the `logprobs` most likely output tokens, as well the chosen tokens. For example, if `logprobs` is 5, the API will return a list of the 5 most likely tokens. The API will always return the `logprob` of the sampled token, so there may be up to `logprobs+1` elements in the response. The maximum value for `logprobs` is 5.

  > echo: Echo back the prompt in addition to the completion

- The echo parameter doesn't exist anymore in the modern chat completions API /v1/chat/completions, making it impossible to get logprobs for an existing text (as opposed to generated text). The legacy completions API is not available for modern models like GPT4 (FIXME verify this claim).
+ The echo parameter no longer exists in the [modern /v1/chat/completions API](https://platform.openai.com/docs/api-reference/chat), making it impossible to get logprobs for existing text (as opposed to generated text). The legacy completions API is [not available](https://platform.openai.com/docs/models#model-endpoint-compatibility) for modern models like GPT-4.

- Also, the limit of 5 for the number of logprobs is also quite limiting: there may well be more than 5 tokens above the threshold, and I would like to be able to take all of them into account.
+ Also, the cap of 5 on the number of returned logprobs is quite limiting: there may well be more than 5 tokens above the threshold, and I would like to be able to take all of them into account.

- Also, the case of multi-token words meant that it would be convenient to use batching, which is not available over the OpenAI API.
+ Moreover, the case of multi-token words meant that it would be convenient to use batching, which is not available over the OpenAI API.
  For the above 3 reasons, I decided to switch to using local models.

  ### Local models with huggingface transformers
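
The switch described above boils down to computing per-token logprobs for an existing text with a local model. Below is a minimal sketch of that computation with huggingface transformers; it is illustrative only, and the model name `gpt2` and the helper name `token_logprobs` are placeholders rather than the project's actual code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any local causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def token_logprobs(text: str) -> list[tuple[str, float]]:
    """Score an existing text: return (token, logprob) pairs under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits          # (1, seq_len, vocab_size)
    # Logits at position i predict token i+1, so shift the targets by one;
    # the very first token gets no score, since it has no left context.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    scores = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]
    tokens = tokenizer.convert_ids_to_tokens(ids[0].tolist())[1:]
    return list(zip(tokens, scores.tolist()))

# Tokens whose logprob falls below the threshold are the ones to highlight.
for tok, lp in token_logprobs("I drank a cup of coffee and a glass of wtaer."):
    print(f"{tok!r}\t{lp:.2f}")
```

Unlike the legacy completions endpoint, nothing here caps how many alternatives can be inspected per position, and inputs can be batched.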
@@ -196,7 +183,9 @@ They are based on a non-llm expander based on a hardcoded list of possible expan

  ### Limitations of the decoder-only approach

- The main limitation of using decoder-only models like GPT or Llama for this task is the unidirectional attention. It means that we are not using the context on the right of the word. This is especially problematic at the start of the text: the first tokens get very little context, so the the probabilities we get from the model are not very useful. The obvious solution is to use a model with bi-directional attention, such as BERT. This will be covered in the part 2 of the post.
+ The main limitation of using decoder-only models like GPT or Llama for this task is the unidirectional attention. It means that we are not using the context to the right of the word. This is especially problematic at the start of the text: the first tokens get very little context, so the probabilities we get from the model are not very useful. The obvious solution is to use a model with bidirectional attention, such as BERT. This comes with its own set of challenges and will be covered in part 2 of the post.

  ### Other potential possibilities / ideas
  - Instead of using a local model, investigate using an API of a provider that exposes logprobs, e.g. Replicate
+
+ ### Deployment on huggingface spaces
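
The word-level improvement above relies on the fact that a word's probability is the product of the conditional probabilities of its tokens, so its logprob is the sum of the token logprobs. A rough sketch of the grouping step, assuming a GPT-2-style byte-level BPE tokenizer that marks word-initial tokens with `Ġ`; the helper name `word_logprobs` is mine, not the project's:

```python
def word_logprobs(scored_tokens: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Group (token, logprob) pairs into (word, logprob) pairs.

    log P(word) = sum of its tokens' logprobs, since P(word) is the
    product of the conditional probabilities of its tokens.
    """
    words: list[tuple[str, float]] = []
    for tok, lp in scored_tokens:
        piece = tok.replace("Ġ", " ")           # GPT-2 BPE marks a leading space with 'Ġ'
        if piece.startswith(" ") or not words:  # start of a new word
            words.append((piece.strip(), lp))
        else:                                   # continuation token of a multi-token word
            prev, prev_lp = words[-1]
            words[-1] = (prev + piece, prev_lp + lp)
    return words
```

A whole word can then be flagged when its combined logprob is below the threshold, which lets a multi-token typo show up as a single highlighted word.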
 
 
 
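
For the replacement suggestions (words with a probability above the flagging threshold), the simplest building block is the model's distribution at the flagged position. The sketch below keeps only single-token candidates and is purely illustrative; the function name and parameters are mine. Growing a candidate token into a full multi-token word is what the scratchpad's "batched multi-token expansion with probability budget" refers to, and is not shown here.

```python
import torch

def single_token_suggestions(
    logits_at_pos: torch.Tensor,  # model logits at the position that predicts the flagged word
    tokenizer,
    threshold: float,             # the same logprob threshold used for flagging
) -> list[tuple[str, float]]:
    """Return (token_text, logprob) alternatives whose logprob is above the threshold."""
    logprobs = torch.log_softmax(logits_at_pos, dim=-1)
    keep = (logprobs > threshold).nonzero(as_tuple=True)[0]
    candidates = [(tokenizer.decode([i]), logprobs[i].item()) for i in keep.tolist()]
    return sorted(candidates, key=lambda c: c[1], reverse=True)
```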
SCRATCHPAD.md CHANGED
@@ -1,3 +1,17 @@
+ ## Part 1
+
+ What I want to cover:
+ - [ ] The original blog post
+ - [ ] Improvements that I wanted to make:
+ - [ ] In addition to highlighting low-probability words, show replacement suggestions that are more likely
+ - [ ] Operate at the level of whole words, not tokens
+ - [ ] Justification for using a local model
+ - [ ] Limitations of the logprobs returned by the APIs
+ - [ ] Main parts of the project
+ - [ ] Combining tokens into words to get the probabilities of whole words
+ - [ ] The batched multi-token expansion with probability budget
+ - [ ] Testable abstract implementation
+
  ## A digression on encoder vs decoder, unidirectional vs bidirectional attention, and whether we could use bidirectional attention for text generation

  It is a common misconception that autoregressive text generation _requires_ unidirectional attention; in reality it is only a matter of efficiency (at both training and inference time). It is possible to train models with bidirectional attention on next-token prediction and to use them autoregressively at inference, and arguably this would give better quality than unidirectional attention (the bidirectional flow of information between tokens in the current prefix can only be beneficial: e.g. if we are generating the next token in "the quick brown fox jumped over", there is no benefit in not letting "fox" see "jumped"). However, bidirectional attention would mean that we cannot learn from every token in a text by passing a single instance of it through the model; we would have to pass every prefix through individually. And at inference time it would rule out techniques such as KV caches, which are used ubiquitously in modern LLM deployments, because all attention would need to be recomputed for every prefix.
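
To make the efficiency argument concrete, here is an illustrative sketch (with `gpt2` as a stand-in model) of the two inference patterns. The cached loop is only valid with unidirectional attention, because earlier tokens' keys and values never change as the prefix grows; bidirectional attention would force the naive loop, which re-runs attention over the entire prefix at every step.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
input_ids = tokenizer("The quick brown fox jumped over", return_tensors="pt").input_ids

# Naive autoregressive loop: re-encode the whole prefix at every step.
# This is the cost profile that bidirectional attention would impose.
ids = input_ids.clone()
with torch.no_grad():
    for _ in range(5):
        next_id = model(ids).logits[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)

# Cached loop: feed only the new token and reuse the KV cache,
# which works because each earlier token never attends to later ones.
with torch.no_grad():
    out = model(input_ids, use_cache=True)
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    for _ in range(4):
        out = model(next_id, past_key_values=out.past_key_values, use_cache=True)
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
```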