noumanjavaid committed on
Commit af78165 · verified
1 Parent(s): 1e72a0c

Update app.py

Files changed (1)
  1. app.py +50 -294
app.py CHANGED
@@ -1,93 +1,20 @@
1
  import streamlit as st
2
-
3
- # Quickstart
4
-
5
- # %% [markdown]
6
- # This quickstart is intended for developers who are ready to dive into the code and see an example of how to integrate Datasets into their model training workflow. If you're a beginner, we recommend starting with our [tutorials](https://huggingface.co/docs/datasets/main/en/./tutorial), where you'll get a more thorough introduction.
7
-
8
- # Each dataset is unique, and depending on the task, some datasets may require additional steps to prepare it for training. But you can always use datasets tools to load and process a dataset. The fastest and easiest way to get started is by loading an existing dataset from the [Hugging Face Hub](https://huggingface.co/datasets). There are thousands of datasets to choose from, spanning many tasks. Choose the type of dataset you want to work with, and let's get started!
9
-
10
- st.markdown("""
11
- <div class="mt-4">
12
- <div class="w-full flex flex-col space-y-4 md:space-y-0 md:grid md:grid-cols-3 md:gap-y-4 md:gap-x-5">
13
-
14
- <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="#audio">
15
- <div class="w-full text-center bg-gradient-to-r from-violet-300 via-sky-400 to-green-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">Audio</div>
16
- <p class="text-gray-700">Resample an audio dataset and get it ready for a model to classify what type of banking issue a speaker is calling about.</p>
17
- </a>
18
-
19
- <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="#vision">
20
- <div class="w-full text-center bg-gradient-to-r from-pink-400 via-purple-400 to-blue-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">Vision</div>
21
- <p class="text-gray-700">Apply data augmentation to an image dataset and get it ready for a model to diagnose disease in bean plants.</p>
22
- </a>
23
-
24
- <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="#nlp">
25
- <div class="w-full text-center bg-gradient-to-r from-orange-300 via-red-400 to-violet-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">NLP</div>
26
- <p class="text-gray-700">Tokenize a dataset and get it ready for a model to determine whether a pair of sentences have the same meaning.</p>
27
- </a>
28
-
29
- </div>
30
- </div>
31
-
32
- <div class="mt-4"> </div>
33
- <p>
34
- Check out <a href="https://huggingface.co/course/chapter5/1?fw=pt">Chapter 5</a> of the Hugging Face course to learn more about other important topics such as loading remote or local datasets, tools for cleaning up a dataset, and creating your own dataset.
35
- </p>
36
-
37
-
38
- """, unsafe_allow_html=True)
39
-
40
  from datasets import load_dataset, Audio
41
 
42
- num_epochs = 10 # Set the desired number of epochs; this is just an example!
43
-
44
- import transformers
45
- transformers.utils.move_cache()
46
-
47
-
48
  dataset = load_dataset("PolyAI/minds14", "en-US", split="train", trust_remote_code=True)
49
 
50
- # %% [markdown]
51
- # **2**. Next, load a pretrained [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) model and its corresponding feature extractor from the [🤗 Transformers](https://huggingface.co/transformers/) library. It is totally normal to see a warning after you load the model about some weights not being initialized. This is expected because you are loading this model checkpoint for training with another task.
52
-
53
- # %%
54
- from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
55
-
56
  model = AutoModelForAudioClassification.from_pretrained("facebook/wav2vec2-base")
57
  feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
58
 
59
- # %% [markdown]
60
- # **3**. The [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset card indicates the sampling rate is 8kHz, but the Wav2Vec2 model was pretrained on a sampling rate of 16kHZ. You'll need to upsample the `audio` column with the [cast_column()](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.cast_column) function and [Audio](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Audio) feature to match the model's sampling rate.
61
-
62
- # %%
63
  dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
64
- dataset[0]["audio"]
65
 
66
- # %% [markdown]
67
- # **4**. Create a function to preprocess the audio `array` with the feature extractor, and truncate and pad the sequences into tidy rectangular tensors. The most important thing to remember is to call the audio `array` in the feature extractor since the `array` - the actual speech signal - is the model input.
68
- #
69
- # Once you have a preprocessing function, use the [map()](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function to speed up processing by applying the function to batches of examples in the dataset.
70
- import torch
71
-
72
- device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # Use GPU if available
73
- model.to(device) # Move model to the device
74
-
75
- optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5) # Example optimizer
76
-
77
- for epoch in range(num_epochs): # Replace num_epochs
78
-     for batch in dataloader:
79
-         input_values = batch["input_values"].to(device)
80
-         labels = batch["labels"].to(device)
81
-
82
-         optimizer.zero_grad()
83
-         outputs = model(input_values, labels=labels)
84
-         loss = outputs.loss
85
-         loss.backward()
86
-         optimizer.step()
87
-
88
-     print(f"Epoch: {epoch+1}, Loss: {loss.item()}") # Print Loss for monitoring
89
-
90
- # %%
91
  def preprocess_function(examples):
92
      audio_arrays = [x["array"] for x in examples["audio"]]
93
      inputs = feature_extractor(
@@ -99,228 +26,57 @@ def preprocess_function(examples):
99
      )
100
      return inputs
101
 
102
- dataset = dataset.map(preprocess_function, batched=True)
103
-
104
- # %% [markdown]
105
- # **5**. Use the [rename_column()](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.rename_column) function to rename the `intent_class` column to `labels`, which is the expected input name in [Wav2Vec2ForSequenceClassification](https://huggingface.co/docs/transformers/main/en/model_doc/wav2vec2#transformers.Wav2Vec2ForSequenceClassification):
106
 
107
- # %%
108
  dataset = dataset.rename_column("intent_class", "labels")
 
109
 
110
- # %% [markdown]
111
- # **6**. Set the dataset format according to the machine learning framework you're using.
112
- #
113
- # Use the [set_format()](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.set_format) function to set the dataset format to `torch` and specify the columns you want to format. This function applies formatting on-the-fly. After converting to PyTorch tensors, wrap the dataset in [`torch.utils.data.DataLoader`](https://alband.github.io/doc_view/data.html?highlight=torch%20utils%20data%20dataloader#torch.utils.data.DataLoader):
114
-
115
- # %%
116
- from torch.utils.data import DataLoader
117
-
118
- dataset.set_format(type="torch", columns=["input_values", "labels"])
119
- dataloader = DataLoader(dataset, batch_size=4)
120
-
121
- # %% [markdown]
122
- # Use the [prepare_tf_dataset](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.TFPreTrainedModel.prepare_tf_dataset) method from 🤗 Transformers to prepare the dataset to be compatible with
123
- # TensorFlow, and ready to train/fine-tune a model, as it wraps a HuggingFace [Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) as a `tf.data.Dataset`
124
- # with collation and batching, so one can pass it directly to Keras methods like `fit()` without further modification.
125
-
126
- # %%
127
- import tensorflow as tf
128
-
129
- tf_dataset = model.prepare_tf_dataset(
130
- dataset,
131
- batch_size=4,
132
- shuffle=True,
133
- )
134
-
135
- # %% [markdown]
136
- # **7**. Start training with your machine learning framework! Check out the 🤗 Transformers [audio classification guide](https://huggingface.co/docs/transformers/tasks/audio_classification) for an end-to-end example of how to train a model on an audio dataset.
137
-
138
- # %% [markdown]
139
- # ## Vision
140
-
141
- # %% [markdown]
142
- # Image datasets are loaded just like text datasets. However, instead of a tokenizer, you'll need a [feature extractor](https://huggingface.co/docs/transformers/main_classes/feature_extractor#feature-extractor) to preprocess the dataset. Applying data augmentation to an image is common in computer vision to make the model more robust against overfitting. You're free to use any data augmentation library you want, and then you can apply the augmentations with 🤗 Datasets. In this quickstart, you'll load the [Beans](https://huggingface.co/datasets/beans) dataset and get it ready for the model to train on and identify disease from the leaf images.
143
- #
144
- # **1**. Load the Beans dataset by providing the [load_dataset()](https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods#datasets.load_dataset) function with the dataset name and a dataset split:
145
-
146
- # %%
147
- from datasets import load_dataset, Image
148
-
149
- dataset = load_dataset("beans", split="train")
150
 
151
- # %% [markdown]
152
- # **2**. Now you can add some data augmentations with any library ([Albumentations](https://albumentations.ai/), [imgaug](https://imgaug.readthedocs.io/en/latest/), [Kornia](https://kornia.readthedocs.io/en/latest/)) you like. Here, you'll use [torchvision](https://pytorch.org/vision/stable/transforms.html) to randomly change the color properties of an image:
 
153
 
154
- # %%
155
- from torchvision.transforms import Compose, ColorJitter, ToTensor
 
156
 
157
- jitter = Compose(
158
- [ColorJitter(brightness=0.5, hue=0.5), ToTensor()]
159
- )
160
-
161
- # %% [markdown]
162
- # **3**. Create a function to apply your transform to the dataset and generate the model input: `pixel_values`.
163
-
164
- # %%
165
- def transforms(examples):
166
-     examples["pixel_values"] = [jitter(image.convert("RGB")) for image in examples["image"]]
167
-     return examples
168
-
169
- # %% [markdown]
170
- # **4**. Use the [with_transform()](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.with_transform) function to apply the data augmentations on-the-fly:
171
-
172
- # %%
173
- dataset = dataset.with_transform(transforms)
174
-
175
- # %% [markdown]
176
- # **5**. Set the dataset format according to the machine learning framework you're using.
177
- #
178
- # Wrap the dataset in [`torch.utils.data.DataLoader`](https://alband.github.io/doc_view/data.html?highlight=torch%20utils%20data%20dataloader#torch.utils.data.DataLoader). You'll also need to create a collate function to collate the samples into batches:
179
-
180
- # %%
181
- from torch.utils.data import DataLoader
182
-
183
- def collate_fn(examples):
184
-     images = []
185
-     labels = []
186
-     for example in examples:
187
-         images.append((example["pixel_values"]))
188
-         labels.append(example["labels"])
189
-
190
-     pixel_values = torch.stack(images)
191
-     labels = torch.tensor(labels)
192
-     return {"pixel_values": pixel_values, "labels": labels}
193
- dataloader = DataLoader(dataset, collate_fn=collate_fn, batch_size=4)
194
-
195
- # %% [markdown]
196
- # Use the [prepare_tf_dataset](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.TFPreTrainedModel.prepare_tf_dataset) method from 🤗 Transformers to prepare the dataset to be compatible with
197
- # TensorFlow, and ready to train/fine-tune a model, as it wraps a HuggingFace [Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) as a `tf.data.Dataset`
198
- # with collation and batching, so one can pass it directly to Keras methods like `fit()` without further modification.
199
- #
200
- # Before you start, make sure you have up-to-date versions of `albumentations` and `cv2` installed:
201
- #
202
- # ```bash
203
- # pip install -U albumentations opencv-python
204
- # ```
205
-
206
- # %%
207
- import albumentations
208
- import numpy as np
209
-
210
- transform = albumentations.Compose([
211
-     albumentations.RandomCrop(width=256, height=256),
212
-     albumentations.HorizontalFlip(p=0.5),
213
-     albumentations.RandomBrightnessContrast(p=0.2),
214
- ])
215
-
216
- def transforms(examples):
217
-     examples["pixel_values"] = [
218
-         transform(image=np.array(image))["image"] for image in examples["image"]
219
-     ]
220
-     return examples
221
-
222
- dataset.set_transform(transforms)
223
- tf_dataset = model.prepare_tf_dataset(
224
- dataset,
225
- batch_size=4,
226
- shuffle=True,
227
- )
228
-
229
- # %% [markdown]
230
- # **6**. Start training with your machine learning framework! Check out the 🤗 Transformers [image classification guide](https://huggingface.co/docs/transformers/tasks/image_classification) for an end-to-end example of how to train a model on an image dataset.
231
-
232
- # %% [markdown]
233
- # ## NLP
234
-
235
- # %% [markdown]
236
- # Text needs to be tokenized into individual tokens by a [tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer). For the quickstart, you'll load the [Microsoft Research Paraphrase Corpus (MRPC)](https://huggingface.co/datasets/glue/viewer/mrpc) training dataset to train a model to determine whether a pair of sentences mean the same thing.
237
- #
238
- # **1**. Load the MRPC dataset by providing the [load_dataset()](https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods#datasets.load_dataset) function with the dataset name, dataset configuration (not all datasets will have a configuration), and dataset split:
239
-
240
- # %%
241
- from datasets import load_dataset
242
-
243
- dataset = load_dataset("glue", "mrpc", split="train")
244
-
245
- # %% [markdown]
246
- # **2**. Next, load a pretrained [BERT](https://huggingface.co/bert-base-uncased) model and its corresponding tokenizer from the [🤗 Transformers](https://huggingface.co/transformers/) library. It is totally normal to see a warning after you load the model about some weights not being initialized. This is expected because you are loading this model checkpoint for training with another task.
247
-
248
- # %%
249
- from transformers import AutoModelForSequenceClassification, AutoTokenizer
250
-
251
- model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
252
- tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
253
-
254
- # %%
255
- from transformers import TFAutoModelForSequenceClassification, AutoTokenizer
256
-
257
- model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
258
- tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
259
-
260
- # %% [markdown]
261
- # **3**. Create a function to tokenize the dataset, and you should also truncate and pad the text into tidy rectangular tensors. The tokenizer generates three new columns in the dataset: `input_ids`, `token_type_ids`, and an `attention_mask`. These are the model inputs.
262
- #
263
- # Use the [map()](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function to speed up processing by applying your tokenization function to batches of examples in the dataset:
264
-
265
- # %%
266
- def encode(examples):
267
-     return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, padding="max_length")
268
-
269
- dataset = dataset.map(encode, batched=True)
270
- dataset[0]
271
-
272
- # %% [markdown]
273
- # **4**. Rename the `label` column to `labels`, which is the expected input name in [BertForSequenceClassification](https://huggingface.co/docs/transformers/main/en/model_doc/bert#transformers.BertForSequenceClassification):
274
-
275
- # %%
276
- dataset = dataset.map(lambda examples: {"labels": examples["label"]}, batched=True)
277
-
278
- # %% [markdown]
279
- # **5**. Set the dataset format according to the machine learning framework you're using.
280
- #
281
- # Use the [set_format()](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.set_format) function to set the dataset format to `torch` and specify the columns you want to format. This function applies formatting on-the-fly. After converting to PyTorch tensors, wrap the dataset in [`torch.utils.data.DataLoader`](https://alband.github.io/doc_view/data.html?highlight=torch%20utils%20data%20dataloader#torch.utils.data.DataLoader):
282
-
283
- # %%
284
- # prompt: Set the dataset format according to the machine learning framework you're using.
285
- # Use the set_format() function to set the dataset format to torch and specify the columns you want to format. This function applies formatting on-the-fly. After converting to PyTorch tensors, wrap the dataset in torch.utils.data.DataLoader:
286
-
287
- dataset.set_format(type="torch", columns=["input_ids", "labels"])
288
- dataloader = DataLoader(dataset, batch_size=4)
289
-
290
-
291
- # %%
292
- import torch
293
- from torch.utils.data import DataLoader
294
-
295
- # %%
296
- # prompt: Set the dataset format to torch and specify the columns to format:
297
- # set_format(parquet_file, "torch", columns=["column1", "column2"])
298
-
299
- dataset.set_format(type="torch", columns=["input_ids", "labels"])
300
-
301
-
302
- # %% [markdown]
303
- # Use the [prepare_tf_dataset](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.TFPreTrainedModel.prepare_tf_dataset) method from 🤗 Transformers to prepare the dataset to be compatible with
304
- # TensorFlow, and ready to train/fine-tune a model, as it wraps a HuggingFace [Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) as a `tf.data.Dataset`
305
- # with collation and batching, so one can pass it directly to Keras methods like `fit()` without further modification.
306
-
307
- # %%
308
-
309
- tf_dataset = model.prepare_tf_dataset(
310
- dataset,
311
- batch_size=4,
312
- shuffle=True,
313
- trust_remote_code=True)
314
 
315
- # %% [markdown]
316
- # **6**. Start training with your machine learning framework! Check out the 🤗 Transformers [text classification guide](https://huggingface.co/docs/transformers/tasks/sequence_classification) for an end-to-end example of how to train a model on a text dataset.
 
 
 
317
 
318
- # %% [markdown]
319
- # ## What's next?
320
 
321
- # %% [markdown]
322
- # This completes the 🤗 Datasets quickstart! You can load any text, audio, or image dataset with a single function and get it ready for your model to train on.
323
- #
324
- # For your next steps, take a look at our [How-to guides](https://huggingface.co/docs/datasets/main/en/./how_to) and learn how to do more specific things like loading different dataset formats, aligning labels, and streaming large datasets. If you're interested in learning more about 🤗 Datasets core concepts, grab a cup of coffee and read our [Conceptual Guides](https://huggingface.co/docs/datasets/main/en/./about_arrow)!
325
 
326
 
1
  import streamlit as st
2
  from datasets import load_dataset, Audio
3
+ from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
4
+ import torch
5
+ from torch.utils.data import DataLoader
6
 
7
+ # Load the MInDS-14 dataset
8
  dataset = load_dataset("PolyAI/minds14", "en-US", split="train", trust_remote_code=True)
9
 
10
+ # Load pretrained model and feature extractor
11
  model = AutoModelForAudioClassification.from_pretrained("facebook/wav2vec2-base")
12
  feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
13
 
14
+ # Resample audio to 16kHz
15
  dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
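The `cast_column` call above asks 🤗 Datasets to resample MInDS-14 from its 8 kHz sampling rate to the 16 kHz that Wav2Vec2 was pretrained on. To make concrete what upsampling does to the number of samples, here is a minimal pure-Python sketch based on linear interpolation. It is an illustration only: `upsample_linear` is a hypothetical helper, and this is not the resampling algorithm that `datasets` applies internally.

```python
from typing import List, Sequence


def upsample_linear(samples: Sequence[float], src_rate: int, dst_rate: int) -> List[float]:
    """Resample by linear interpolation between neighbouring input samples."""
    n_out = round(len(samples) * dst_rate / src_rate)
    out: List[float] = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in the input signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)     # clamp at the last input sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


# Three samples at 8 kHz become six samples at 16 kHz.
resampled = upsample_linear([0.0, 1.0, 0.0], src_rate=8000, dst_rate=16000)
print(len(resampled), resampled)
```

Doubling the rate doubles the length of every `array`, which is worth keeping in mind when choosing a maximum length for the preprocessing step.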
 
16
 
17
+ # Preprocessing function
18
  def preprocess_function(examples):
19
      audio_arrays = [x["array"] for x in examples["audio"]]
20
      inputs = feature_extractor(
 
26
      )
27
      return inputs
28
 
29
 
30
+ dataset = dataset.map(preprocess_function, batched=True)
31
  dataset = dataset.rename_column("intent_class", "labels")
32
+ dataset.set_format(type="torch", columns=["input_values", "labels"])  # set_format() works in place and returns None, so do not reassign
33
 
34
+ # Create DataLoader
35
+ batch_size = 4 # Adjust as needed
36
+ dataloader = DataLoader(dataset, batch_size=batch_size)
37
 
38
+ # Set device and move model
39
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
40
+ model.to(device)
41
 
42
+ # Training loop (example)
43
+ num_epochs = 2 # Keep small for testing on Spaces!
44
+ optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
45
 
46
+ for epoch in range(num_epochs):
47
+     for batch in dataloader:
48
+         input_values = batch["input_values"].to(device)
49
+         labels = batch["labels"].to(device)
50
 
51
+         optimizer.zero_grad()
52
+         outputs = model(input_values, labels=labels)
53
+         loss = outputs.loss
54
+         loss.backward()
55
+         optimizer.step()
56
 
57
+     print(f"Epoch: {epoch+1}, Loss: {loss.item()}")
 
58
 
59
+ # Streamlit UI
60
+ st.title("Audio Classification with Minds14")
61
+ st.write("Training complete!") # You'll want to add more insightful outputs here eventually
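The training loop reports a single `loss.item()` value per epoch; if that `print` sits at the epoch level, it reflects only the final batch. One possible "more insightful output" is the mean of the batch losses for each epoch. The sketch below is a hedged illustration that uses plain floats in place of tensors; `mean_epoch_loss` and the sample values are hypothetical and not part of app.py.

```python
from typing import Iterable, List


def mean_epoch_loss(batch_losses: Iterable[float]) -> float:
    """Average the per-batch losses collected during one epoch."""
    losses: List[float] = list(batch_losses)
    if not losses:
        raise ValueError("no batch losses were recorded for this epoch")
    return sum(losses) / len(losses)


# Hypothetical per-batch values, standing in for loss.item() inside the loop.
epoch_losses = [2.0, 1.5, 1.0]
summary = f"Epoch 1 mean loss: {mean_epoch_loss(epoch_losses):.4f}"
print(summary)
```

In the app, a string like `summary` could be passed to `st.write` after each epoch instead of the fixed "Training complete!" message.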
 
62
 
63
 
64
+ st.markdown("""
65
+ <div class="mt-4">
66
+ <div class="w-full flex flex-col space-y-4 md:space-y-0 md:grid md:grid-cols-3 md:gap-y-4 md:gap-x-5">
67
+ <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="#audio"> <div class="w-full text-center bg-gradient-to-r from-violet-300 via-sky-400 to-green-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">Audio</div>
68
+ <p class="text-gray-700">Resample an audio dataset and get it ready for a model to classify what type of banking issue a speaker is calling about.</p>
69
+ </a>
70
+ <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="#vision"> <div class="w-full text-center bg-gradient-to-r from-pink-400 via-purple-400 to-blue-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">Vision</div>
71
+ <p class="text-gray-700">Apply data augmentation to an image dataset and get it ready for a model to diagnose disease in bean plants.</p>
72
+ </a>
73
+ <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="#nlp"> <div class="w-full text-center bg-gradient-to-r from-orange-300 via-red-400 to-violet-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">NLP</div>
74
+ <p class="text-gray-700">Tokenize a dataset and get it ready for a model to determine whether a pair of sentences have the same meaning.</p>
75
+ </a>
76
+ </div>
77
+ </div>
78
+ <div class="mt-4"> </div>
79
+ <p>
80
+ Check out <a href="https://huggingface.co/course/chapter5/1?fw=pt">Chapter 5</a> of the Hugging Face course to learn more about other important topics such as loading remote or local datasets, tools for cleaning up a dataset, and creating your own dataset.
81
+ </p>
82
+ """, unsafe_allow_html=True)
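The preprocessing function in app.py relies on the feature extractor to pad and truncate each audio `array` so that the `DataLoader` can stack examples of different lengths into one batch. The arguments of that call are not shown in this diff, so the following is only a pure-Python sketch of the idea, not the feature extractor's implementation; `pad_or_truncate` and the `max_length` of 5 are hypothetical.

```python
from typing import List, Sequence


def pad_or_truncate(samples: Sequence[float], max_length: int, pad_value: float = 0.0) -> List[float]:
    """Return exactly max_length values: cut long inputs, right-pad short ones."""
    clipped = list(samples[:max_length])
    return clipped + [pad_value] * (max_length - len(clipped))


# Two clips of different lengths become rows of equal length.
clips = ([0.1, 0.2, 0.3], [0.5] * 8)
batch = [pad_or_truncate(clip, max_length=5) for clip in clips]
print(batch)
```

Rows of equal length are what allow the default collation in `torch.utils.data.DataLoader` to stack a batch into a single tensor.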