Results on Wikitext-2 with GPT2 don't match paper
#1 by brucardoso2 - opened
Hey, I tested the example code and compared it against the results reported at https://huggingface.co/docs/transformers/perplexity, using `gpt2` and `wikitext-2-raw-v1`.
- The values reported in that post range from 16.44 to 19.64 (depending on the stride size)
- The value achieved using this library is 546.62

The difference is quite large. Am I missing something?
Code:

```python
import datasets
import evaluate

# Load the raw Wikitext-2 test split and drop empty lines.
input_texts = datasets.load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"]
input_texts = [s for s in input_texts if s != '']

perplexity = evaluate.load("perplexity", module_type="measurement")
results = perplexity.compute(model_id='gpt2', data=input_texts)
print(results['mean_perplexity'])