irmchek committed on
Commit 57d40ed · 1 Parent(s): 462fea8

summarizer version 1: used a different model for creating a summary. The summary generated includes the title in the first sentence.

Files changed (3)
  1. enhanced_notebook.ipynb +298 -0
  2. notebook_enhancer.py +48 -48
  3. test.ipynb +104 -0
enhanced_notebook.ipynb ADDED
@@ -0,0 +1,298 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Data Science Analysis Notebook\n",
+    "\n",
+    "This notebook contains some example Python code for data analysis."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": 9,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create a function to summarize the code.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "At first, we will start by importing the pandas and numpy modules.\n",
+    " Then we will use the seaborn library.\n",
+    " Next step is to set the style of the visualization.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Import libraries\n",
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "import matplotlib.pyplot as plt\n",
+    "import seaborn as sns\n",
+    "\n",
+    "# Set visualization style\n",
+    "sns.set(style='whitegrid')\n",
+    "%matplotlib inline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create a function summarize and load the dataset.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "To Load the dataset\n",
+    " To display the basic information, use the print statement in the function.\n",
+    " To print the dataset shape and head method.\n",
+    "\n",
+    " Create a new dataframe with the shape of the dataframe and the head method"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Load the dataset\n",
+    "df = pd.read_csv('housing_data.csv')\n",
+    "\n",
+    "# Display basic information\n",
+    "print(f\"Dataset shape: {df.shape}\")\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": 13,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create a function summarize to perform the data cleaning.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": 12,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "In the for loop we iterate through the dataframe and fill missing values with median.\n",
+    " For each column in the dataframe, we check if the column is float64 or int64 type. If it is then we use the mode() function"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Perform data cleaning\n",
+    "# Fill missing values with median\n",
+    "for column in df.columns:\n",
+    "    if df[column].dtype in ['float64', 'int64']:\n",
+    "        df[column].fillna(df[column].median(), inplace=True)\n",
+    "    else:\n",
+    "        df[column].fillna(df[column].mode()[0], inplace=True)\n",
+    "\n",
+    "# Check for remaining missing values\n",
+    "print(\"Missing values after cleaning:\")\n",
+    "print(df.isnull().sum())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": 15,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create a function to summarize the data.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": 14,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "For each column in the dataframe, create a list of numeric columns.\n",
+    " Then create a correlation matrix.\n",
+    " Next step is to create a function that takes in a dataframe and returns the correlation matrix as an argument."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Exploratory data analysis\n",
+    "# Create correlation matrix\n",
+    "numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns\n",
+    "correlation_matrix = df[numeric_columns].corr()\n",
+    "\n",
+    "# Plot heatmap\n",
+    "plt.figure(figsize=(12, 10))\n",
+    "sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)\n",
+    "plt.title('Correlation Matrix of Numeric Features', fontsize=18)\n",
+    "plt.xticks(rotation=45, ha='right')\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": 17,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create a variable called bedrooms_ratio and rooms_per_household.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": 16,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "If 'bedrooms' in the column and total_rooms is the column then create a new feature and scale it.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Feature engineering\n",
+    "# Create new features\n",
+    "if 'bedrooms' in df.columns and 'total_rooms' in df.columns:\n",
+    "    df['bedrooms_ratio'] = df['bedrooms'] / df['total_rooms']\n",
+    "\n",
+    "if 'total_rooms' in df.columns and 'households' in df.columns:\n",
+    "    df['rooms_per_household'] = df['total_rooms'] / df['households']\n",
+    "\n",
+    "# Scale numeric features\n",
+    "from sklearn.preprocessing import StandardScaler\n",
+    "scaler = StandardScaler()\n",
+    "df[numeric_columns] = scaler.fit_transform(df[numeric_columns])\n",
+    "\n",
+    "# Display transformed data\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": 19,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create a simple prediction model\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": 18,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "This function will build a model that can be used to train and evaluate the model.\n",
+    " Next step is to split the dataframe into training and test data and predict the median_house_value column using the train_test_split function."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Build a simple prediction model\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "from sklearn.linear_model import LinearRegression\n",
+    "from sklearn.metrics import mean_squared_error, r2_score\n",
+    "\n",
+    "# Assume we're predicting median_house_value\n",
+    "if 'median_house_value' in df.columns:\n",
+    "    # Prepare features and target\n",
+    "    X = df.drop('median_house_value', axis=1)\n",
+    "    y = df['median_house_value']\n",
+    "    \n",
+    "    # Split the data\n",
+    "    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
+    "    \n",
+    "    # Train the model\n",
+    "    model = LinearRegression()\n",
+    "    model.fit(X_train, y_train)\n",
+    "    \n",
+    "    # Make predictions\n",
+    "    y_pred = model.predict(X_test)\n",
+    "    \n",
+    "    # Evaluate the model\n",
+    "    mse = mean_squared_error(y_test, y_pred)\n",
+    "    r2 = r2_score(y_test, y_pred)\n",
+    "    \n",
+    "    print(f\"Mean Squared Error: {mse:.2f}\")\n",
+    "    print(f\"R² Score: {r2:.2f}\")\n",
+    "    \n",
+    "    # Plot actual vs predicted values\n",
+    "    plt.figure(figsize=(10, 6))\n",
+    "    plt.scatter(y_test, y_pred, alpha=0.5)\n",
+    "    plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--')\n",
+    "    plt.xlabel('Actual Values')\n",
+    "    plt.ylabel('Predicted Values')\n",
+    "    plt.title('Actual vs Predicted Values')\n",
+    "    plt.tight_layout()\n",
+    "    plt.show()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
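Note on the generated file above: the markdown cells carry an `outputs` list and every `id` is an integer. As far as I know, the nbformat 4.5 schema expects string cell ids and defines `outputs` only for code cells, so strict validation may reject this notebook. A minimal sketch of a cleanup pass over the raw JSON, assuming plain dicts rather than `NotebookNode` objects; `normalize_cells` is a hypothetical helper and not part of this commit:

```python
import copy


def normalize_cells(notebook: dict) -> dict:
    """Return a copy with string cell ids and no code-only keys on other cells."""
    fixed = copy.deepcopy(notebook)
    for cell in fixed.get("cells", []):
        # Integer ids such as 9 become strings such as "9".
        if "id" in cell:
            cell["id"] = str(cell["id"])
        # Only code cells carry outputs and an execution count.
        if cell.get("cell_type") != "code":
            cell.pop("outputs", None)
            cell.pop("execution_count", None)
    return fixed


# Tiny stand-in for the notebook in this diff (hypothetical data).
example = {
    "cells": [
        {"cell_type": "markdown", "id": 9, "metadata": {}, "outputs": [],
         "source": ["# Title\n"]},
        {"cell_type": "code", "execution_count": None, "id": 2, "metadata": {},
         "outputs": [], "source": ["x = 1"]},
    ],
    "nbformat": 4,
    "nbformat_minor": 5,
}
cleaned = normalize_cells(example)
print(cleaned["cells"][0])
```

The same effect could probably be had inside `enhance_notebook` by not assigning `outputs` to markdown cells and by using string ids there; the post-hoc pass is shown only because it needs nothing beyond the standard library.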
notebook_enhancer.py CHANGED
@@ -8,42 +8,52 @@ from transformers import (
     AutoTokenizer,
     AutoConfig,
     pipeline,
-    SummarizationPipeline,
 )
 import re
+import nltk
 
-MODEL_NAME = "sagard21/python-code-explainer"
+PYTHON_CODE_MODEL = "sagard21/python-code-explainer"
+TITLE_SUMMARIZE_MODEL = "fabiochiu/t5-small-medium-title-generation"
 
 
 class NotebookEnhancer:
     def __init__(self):
-        self.config = AutoConfig.from_pretrained(MODEL_NAME)
-        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, padding=True)
-        self.model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
-        self.model.eval()
-        self.pipeline = pipeline(
+        # models + tokenizer for generating titles from code summaries
+        self.title_tokenizer = AutoTokenizer.from_pretrained(TITLE_SUMMARIZE_MODEL)
+        self.title_summarization_model = AutoModelForSeq2SeqLM.from_pretrained(
+            TITLE_SUMMARIZE_MODEL
+        )
+
+        # models + tokenizer for generating summaries from Python code
+        self.python_model = AutoModelForSeq2SeqLM.from_pretrained(PYTHON_CODE_MODEL)
+        self.python_tokenizer = AutoTokenizer.from_pretrained(
+            PYTHON_CODE_MODEL, padding=True
+        )
+        self.python_pipeline = pipeline(
             "summarization",
-            model=MODEL_NAME,
-            config=self.config,
-            tokenizer=self.tokenizer,
+            model=PYTHON_CODE_MODEL,
+            config=AutoConfig.from_pretrained(PYTHON_CODE_MODEL),
+            tokenizer=self.python_tokenizer,
         )
+        # initiate the language model
         self.nlp = spacy.load("en_core_web_sm")
 
-    def generate_title(self, code):
+    def generate_title(self, summary: str):
         """Generate a concise title for a code cell"""
-        # Limit input length to match model constraints
-        max_length = len(code) // 2
-        print("Title Max length", max_length)
-
-        truncated_code = code[:max_length] if len(code) > max_length else code
-        max_length = len(truncated_code) // 2
-        title = self.pipeline(code, min_length=5, max_length=30)[0][
-            "summary_text"
-        ].strip()
-
-        print("Result title", title)
-        # Format as a markdown title
-        return f"# {title.capitalize()}"
+        inputs = self.title_tokenizer.batch_encode_plus(
+            ["summarize: " + summary],
+            max_length=1024,
+            return_tensors="pt",
+            padding=True,
+        )  # Batch size 1
+        output = self.title_summarization_model.generate(
+            **inputs, num_beams=8, do_sample=True, min_length=10, max_length=10
+        )
+        decoded_output = self.title_tokenizer.batch_decode(
+            output, skip_special_tokens=True
+        )[0]
+        predicted_title = nltk.sent_tokenize(decoded_output.strip())[0]
+        return f"# {predicted_title}"
 
     def _count_num_words(self, code):
         words = code.split(" ")
@@ -51,23 +61,16 @@ class NotebookEnhancer:
 
     def generate_summary(self, code):
         """Generate a detailed summary for a code cell"""
-        # result = self.pipeline([code], min_length=3, max_length=len(code // 2))
-        print("Code", code)
-        result = self.pipeline(code, min_length=5, max_length=30)
-        print(result)
+        result = self.python_pipeline(code, min_length=5, max_length=64)
         summary = result[0]["summary_text"].strip()
-        summary = self._postprocess_summary(summary)
-        print("Result summary", summary)
-        # print(self._is_valid_sentence_nlp(summary))
-        # summary = result[0]["summary_text"].strip()
-        return f"{summary}"
+        title, summary = self._postprocess_summary(summary)
+        return f"# {title}", f"{summary}"
 
     def enhance_notebook(self, notebook: nbformat.notebooknode.NotebookNode):
         """Add title and summary markdown cells before each code cell"""
         # Create a new notebook
         enhanced_notebook = nbformat.v4.new_notebook()
         enhanced_notebook.metadata = notebook.metadata
-        print(len(notebook.cells))
         # Process each cell
         i = 0
         id = len(notebook.cells) + 1
@@ -76,14 +79,11 @@ class NotebookEnhancer:
             # For code cells, add title and summary markdown cells
             if cell.cell_type == "code" and cell.source.strip():
                 # Generate summary
-                summary = self.generate_summary(cell.source)
+                title, summary = self.generate_summary(cell.source)
                 summary_cell = nbformat.v4.new_markdown_cell(summary)
                 summary_cell.outputs = []
                 summary_cell.id = id
                 id += 1
-
-                # Generate title based on the summary cell
-                title = self.generate_title(summary)
                 title_cell = nbformat.v4.new_markdown_cell(title)
                 title_cell.outputs = []
                 title_cell.id = id
@@ -91,7 +91,6 @@ class NotebookEnhancer:
 
                 enhanced_notebook.cells.append(title_cell)
                 enhanced_notebook.cells.append(summary_cell)
-
             # Add the original cell
             cell.outputs = []
             enhanced_notebook.cells.append(cell)
@@ -111,14 +110,16 @@ class NotebookEnhancer:
     def _postprocess_summary(self, summary: str):
         doc = self.nlp(summary)
         sentences = list(doc.sents)
-        # ignore the first sentence
-        sentences = sentences[1:]
         # remove the trailing list enumeration
         postprocessed_sentences = []
         for sentence in sentences:
             if self.is_valid(sentence):
-                postprocessed_sentences.append(sentence.text)
-        return " ".join(postprocessed_sentences)
+                sentence_text = sentence.text
+                sentence_text = re.sub("[0-9]+\.", "", sentence_text)
+                postprocessed_sentences.append(sentence_text)
+        title = postprocessed_sentences[0]
+        summary = postprocessed_sentences[1:]
+        return title, " ".join(summary)
 
 
 def process_notebook(file_path):
@@ -129,7 +130,6 @@ def process_notebook(file_path):
         nb = nbformat.read(f, as_version=4)
     # Process the notebook
     enhanced_notebook = enhancer.enhance_notebook(nb)
-    print(enhanced_notebook)
     enhanced_notebook_str = nbformat.writes(enhanced_notebook, version=4)
     # Save to temp file
     output_path = "enhanced_notebook.ipynb"
@@ -168,7 +168,7 @@ def build_gradio_interface():
 
 # This will be the entry point when running the script
 if __name__ == "__main__":
-    file_input = "my_notebook.json"
-    test = process_notebook(file_input)
-    # demo = build_gradio_interface()
-    # demo.launch()
+    # file_input = "my_notebook.json"
+    # test = process_notebook(file_input)
+    demo = build_gradio_interface()
+    demo.launch()
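Note on `_postprocess_summary` in this file: the new version takes `postprocessed_sentences[0]` as the title, which would raise `IndexError` if no sentence passes `is_valid`, and the pattern `"[0-9]+\."` is written as a non-raw string, which newer Python versions flag as an invalid escape sequence. Below is a minimal sketch of the same title/summary split with both points addressed. It is not the commit's code: a naive regex stands in for spaCy sentence segmentation, enumerations are stripped before splitting rather than per sentence, and `split_title_and_summary` and `fallback_title` are hypothetical names.

```python
import re

# Raw string, so the backslash reaches the regex engine unchanged.
ENUMERATION = re.compile(r"[0-9]+\.")
# Naive stand-in for spaCy sentence segmentation (assumption).
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_title_and_summary(text: str, fallback_title: str = "Untitled cell"):
    """Use the first sentence as the title and the rest as the summary."""
    # Drop list enumerations such as "1." before splitting into sentences.
    cleaned = ENUMERATION.sub("", text)
    sentences = [s.strip() for s in SENTENCE_END.split(cleaned) if s.strip()]
    if not sentences:
        # Nothing usable came back from the model: avoid indexing an empty list.
        return fallback_title, ""
    return sentences[0], " ".join(sentences[1:])


title, summary = split_title_and_summary("1. Load data. 2. Print shape.")
print(f"# {title}")
print(summary)
```

Like the pattern in the commit, this one also removes digits that precede a period inside ordinary text (for example in version numbers), so a stricter pattern may be worth considering.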
test.ipynb CHANGED
@@ -124,6 +124,110 @@
     " print(word, word.is_alpha, word.pos_)\n"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": 50,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "['this function will build a model that can be used to train and']\n"
+     ]
+    }
+   ],
+   "source": [
+    "from transformers import T5Tokenizer, T5ForConditionalGeneration\n",
+    "example_text = \"This function will build a model that can be used to train and evaluate the model.\"\n",
+    "tokenizer = T5Tokenizer.from_pretrained('t5-small')\n",
+    "model = T5ForConditionalGeneration.from_pretrained('t5-small')\n",
+    "inputs = tokenizer.batch_encode_plus([\"summarize: \" + example_text], max_length=1024, return_tensors=\"pt\", pad_to_max_length=True) # Batch size 1\n",
+    "outputs = model.generate(inputs['input_ids'], num_beams=2, max_length=15, early_stopping=True)\n",
+    "\n",
+    "print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in outputs])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 59,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Device set to use mps:0\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "[{'summary_text': 'An apple a day, keeps the'}]"
+      ]
+     },
+     "execution_count": 59,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from transformers import pipeline\n",
+    "summarizer = pipeline(\"summarization\", model=\"facebook/bart-large-cnn\", tokenizer=\"facebook/bart-large-cnn\")\n",
+    "summarizer(\"An apple a day, keeps the doctor away\", min_length=5, max_length=10)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 76,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "[nltk_data] Downloading package punkt to /Users/irma/nltk_data...\n",
+      "[nltk_data]   Package punkt is already up-to-date!\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "This function will build a model that can be used to train and evaluate the model.\n",
+      "27\n"
+     ]
+    }
+   ],
+   "source": [
+    "from transformers import AutoTokenizer, AutoModelForSeq2SeqLM\n",
+    "import nltk\n",
+    "nltk.download('punkt')\n",
+    "\n",
+    "tokenizer = AutoTokenizer.from_pretrained(\"fabiochiu/t5-small-medium-title-generation\")\n",
+    "model = AutoModelForSeq2SeqLM.from_pretrained(\"fabiochiu/t5-small-medium-title-generation\")\n",
+    "\n",
+    "text = \"This function will build a model that can be used to train and evaluate the model.\"\n",
+    "\n",
+    "inputs = [\"summarize: \" + text]\n",
+    "\n",
+    "inputs = tokenizer(inputs, max_length=1024, truncation=True, return_tensors=\"pt\")\n",
+    "output = model.generate(**inputs, num_beams=4, do_sample=True, min_length=10, max_length=len(text) // 3)\n",
+    "decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]\n",
+    "predicted_title = nltk.sent_tokenize(decoded_output.strip())[0]\n",
+    "\n",
+    "print(predicted_title)\n",
+    "# Conversational AI: The Future of Customer Service\n",
+    "print(len(text) // 3)"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
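Note on the length settings tried in these test cells: `max_length=len(text) // 3` is derived from a character count, while `generate` interprets `min_length` and `max_length` as token counts. Setting both to the same value, as `generate_title` does with 10, pins the output to a fixed number of tokens, which may help explain titles that stop mid-phrase. A small sketch of one way to derive the bounds from the input token count instead; `title_length_bounds` and its default values are hypothetical choices for illustration, not tuned settings:

```python
def title_length_bounds(num_input_tokens, ratio=0.5, floor=4, ceiling=32):
    """Return (min_length, max_length) in tokens for a generated title."""
    # Scale the cap with the input size, clamped to [floor, ceiling].
    max_length = max(floor, min(ceiling, int(num_input_tokens * ratio)))
    # Keep the lower bound below the cap so generation can stop early.
    min_length = min(floor, max_length - 1)
    return min_length, max_length


print(title_length_bounds(20))
```

The token count would typically come from the tokenizer itself, for example `len(tokenizer(text)["input_ids"])`, and the returned pair would be passed to `generate` as `min_length` and `max_length`.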