raymondEDS committed on
Commit
49e3aec
·
1 Parent(s): fb62875

Updating week 4 content

Reference files/Week_4_content.txt ADDED
@@ -0,0 +1,630 @@
1
+
2
+ In this course, you'll learn the complete NLP workflow by exploring a fascinating real-world question: Do review length and language relate to reviewer ratings and decisions in academic peer review? If so, how?
3
+ Using data from the International Conference on Learning Representations (ICLR), you'll develop practical NLP skills while investigating how reviewers express their opinions. Each module builds upon the previous one, creating a coherent analytical pipeline from raw data to insight.
4
+ Learning Path
5
+ Data Loading and Initial Exploration: Setting up your environment and understanding your dataset
6
+ Text Preprocessing and Normalization: Cleaning and standardizing text data
7
+ Feature Extraction and Measurement: Calculating metrics from text
8
+ Visualization and Pattern Recognition: Creating insightful visualizations
9
+ Drawing Conclusions from Text Analysis: Synthesizing findings into actionable insights
10
+ Let's begin our exploration of how NLP can provide insights into academic peer review!
11
+
12
+ Module 1: Initial Exploration
13
+ The Challenge
14
+ Before we can analyze how review length relates to paper evaluations, we need to understand our dataset. In this module, we'll set up our Python environment and explore the ICLR conference data.
15
+ 1.1: Set up and get to your data
16
+ The first step in any NLP project is loading and understanding your data. Let's set up our environment and examine what we're working with:
17
+ python
18
+ # Import necessary libraries
19
+ import pandas as pd
20
+ import numpy as np
21
+ import matplotlib.pyplot as plt
22
+ import seaborn as sns
23
+ import string
24
+ from nltk.corpus import stopwords
25
+ from nltk.tokenize import word_tokenize, sent_tokenize
26
+ from wordcloud import WordCloud
27
+
28
+ # Load the datasets
29
+ df_reviews = pd.read_csv('../data/reviews.csv')
30
+ df_submissions = pd.read_csv('../data/Submissions.csv')
31
+ df_dec = pd.read_csv('../data/decision.csv')
32
+ df_keyword = pd.read_csv('../data/submission_keyword.csv')
33
+ Let's look at the first few rows of each dataset to understand what information we have:
34
+ python
35
+ # View the first few rows of the submissions dataset
36
+ df_submissions.head()
37
+ # View the first few rows of the reviews dataset
38
+ df_reviews.head()
39
+ # View all columns and rows in the reviews dataset
40
+ df_reviews
41
+ # View the first few rows of the keywords dataset
42
+ df_keyword.head()
43
+ 1.2: Looking at Review Content
44
+ Let's examine an actual review to understand the text we'll be analyzing:
45
+ python
46
+ # Display a sample review
47
+ df_reviews['review'][1]
48
+ Think about: What kinds of information do you see in this review? What language patterns do you notice?
49
+ 1.3: Calculating Basic Metrics
50
+ Let's calculate our first simple metric - the average review score for each paper:
51
+ python
52
+ # Get the average review score for each paper
53
+ df_average_review_score = df_reviews.groupby('forum')['rating_int'].mean().reset_index()
54
+ df_average_review_score
55
+ Key Insight: Each paper (identified by 'forum') receives multiple reviews with different scores. The average score gives us an overall assessment of each paper.
56
+ Module 2: Data Integration
57
+ In this module, we'll merge datasets for later analysis.
58
+ 2.1 Understanding the Need for Data Integration
59
+ In many NLP projects, the data we need is spread across multiple files or tables. In our case:
60
+ The df_reviews dataset contains the review text and ratings
61
+ The df_dec dataset contains the final decisions for each paper
62
+ To analyze how review text relates to paper decisions, we need to merge these datasets.
63
+ 2.2 Performing a Dataset Merge
64
+ Let's combine our review data with the decision data:
65
+ python
66
+ # Step 1 - Merge the reviews dataframe with the decisions dataframe
67
+ df_rev_dec = pd.merge(
68
+ df_reviews, # First dataframe (reviews)
69
+ df_dec, # Second dataframe (decisions)
70
+ left_on='forum', # Join key in the first dataframe
71
+ right_on='forum', # Join key in the second dataframe
72
+ how='inner' # Keep only matching rows
73
+ )[['review','decision','conf_name_y','rating_int','forum']] # Select only these columns
74
+ # Display the first few rows of the merged dataframe
75
+ df_rev_dec.head()
76
+ 2.3 Understanding Merge Concepts
77
+ Join Key: The 'forum' column identifies the paper and connects our datasets
78
+ Inner Join: Only keeps papers that appear in both datasets
79
+ Column Selection: We keep only relevant columns for our analysis
80
+ How to Verify: Always check the shape of your merged dataset to ensure you haven't lost data unexpectedly
81
+ Try it yourself: How many rows does the merged dataframe have compared to the original review dataframe? What might explain any differences?
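+ As a quick check (a minimal sketch, assuming the dataframes loaded and merged above), you can compare row counts before and after the inner join:
+ python
+ # Compare row counts before and after the merge
+ print("Reviews:", df_reviews.shape[0])
+ print("Decisions:", df_dec.shape[0])
+ print("Merged (inner join):", df_rev_dec.shape[0])
+
+ # Forum ids present in the reviews but missing from the decisions
+ missing = set(df_reviews['forum']) - set(df_dec['forum'])
+ print("Papers with reviews but no recorded decision:", len(missing))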
82
+
83
+ Module 3: Basic Text Preprocessing
84
+ In this module, you'll learn essential data preprocessing techniques for NLP projects. We'll standardize text through case folding, clean up categorical variables, and prepare our review text for analysis.
85
+ 3.1 Case Folding (Lowercase Conversion)
86
+ A fundamental text preprocessing step is converting all text to lowercase to ensure consistency:
87
+ python
88
+ # Convert all review text to lowercase (case folding)
89
+ df_rev_dec['review'] = df_rev_dec['review'].str.lower()
90
+ # Display the updated dataframe
91
+ df_rev_dec
92
+ Why Case Folding Matters
93
+ Consistency: "Novel" and "novel" will be treated as the same word
94
+ Reduced Dimensionality: Fewer unique tokens to process
95
+ Improved Pattern Recognition: Easier to identify word frequencies and patterns
96
+ Note: While case folding is generally helpful, it can sometimes remove meaningful distinctions (e.g., "US" vs. "us"). For our academic review analysis, lowercase conversion is appropriate.
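+ A tiny, self-contained sketch of the dimensionality point (the example sentence is made up):
+ python
+ # Without case folding, "Novel" and "novel" count as two different tokens
+ words = "Novel approach with novel results".split()
+ print(len(set(words)))                     # 5 distinct tokens
+ print(len(set(w.lower() for w in words)))  # 4 distinct tokens after lowercasing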
97
+
98
+ 3.2 Examining Categorical Values
99
+ Let's first check what unique decision categories exist in our dataset:
100
+ python
101
+ # Display the unique decision categories
102
+ df_rev_dec['decision'].unique()
103
+ 3.3 Standardizing Decision Categories
104
+ We can see that there are multiple "Accept" categories with different presentation formats. Let's standardize these:
105
+ python
106
+ # Define a function to clean up and standardize decision categories
107
+ def clean_up_decision(text):
108
+     if text in ['Accept (Poster)', 'Accept (Spotlight)', 'Accept (Oral)', 'Accept (Talk)']:
109
+         return 'Accept'
110
+     else:
111
+         return text
112
+ # Apply the function to create a new standardized decision column
113
+ df_rev_dec['decision_clean'] = df_rev_dec['decision'].apply(clean_up_decision)
114
+ # Check our new standardized decision categories
115
+ df_rev_dec['decision_clean'].unique()
116
+ Why Standardization Matters
117
+ Simplified Analysis: Reduces the number of categories to analyze
118
+ Clearer Patterns: Makes it easier to identify trends by decision outcome
119
+ Better Visualization: Creates more meaningful and readable plots
120
+ Consistent Terminology: Aligns with how conferences typically report accept/reject decisions
121
+ Try it yourself: What other ways could you group or standardize these decision categories? What information might be lost in our current approach?
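+ Before experimenting with alternative groupings, it helps to see what the current one does. A minimal sketch (assuming df_rev_dec with the columns created above):
+ python
+ # Category counts before and after standardization
+ print(df_rev_dec['decision'].value_counts())
+ print(df_rev_dec['decision_clean'].value_counts())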
122
+
123
+ Module 4: Text Tokenization
124
+ 4.1 Introduction to Tokenization
125
+ Tokenization is the process of breaking text into smaller units like sentences or words. Let's examine a review:
126
+ python
127
+ # Display a sample review
128
+ df_reviews['review'][1]
129
+ 4.2 Sentence Tokenization
130
+ Let's break this review into sentences using NLTK's sentence tokenizer:
131
+ python
132
+ # Import the necessary library if not already imported
133
+ from nltk.tokenize import sent_tokenize
134
+ # Tokenize the review into sentences
135
+ sent_tokenize(df_reviews['review'][1])
136
+ 4.3 Counting Sentences
137
+ Now let's count the number of sentences in the review:
138
+ python
139
+ # Count the number of sentences
140
+ len(sent_tokenize(df_reviews['review'][1]))
141
+ 4.4 Creating a Reusable Function
142
+ Let's create a function to count sentences in any text:
143
+ python
144
+ # Define a function to count sentences in a text
145
+ def sentence_count(text):
146
+     return len(sent_tokenize(text))
147
+ 4.5 Applying Our Function to All Reviews
148
+ Now we'll apply our function to all reviews to get sentence counts:
149
+ python
150
+ # Add a new column with the sentence count for each review
151
+ df_rev_dec['sent_count'] = df_rev_dec['review'].apply(sentence_count)
152
+ # Display the updated dataframe
153
+ df_rev_dec.head()
154
+ Key Insight: Sentence count is a simple yet effective way to quantify review length. The number of sentences can indicate how thoroughly a reviewer has evaluated a paper.
155
+
156
+ Module 5: Visualization of Text Metrics
157
+ 5.1 Creating a 2D Histogram
158
+ Let's visualize the relationship between review length (in sentences), rating, and decision outcome:
159
+ python
160
+ # Create a 2D histogram with sentence count, rating, and decision
161
+ ax = sns.histplot(data=df_rev_dec, x='sent_count',
162
+ y='rating_int',
163
+ hue='decision_clean',
164
+ kde=True,
165
+ log_scale=(True,False),
166
+ legend=True)
167
+ 5.2 Enhancing Our Visualization
168
+ Let's improve our visualization with better labels and formatting:
169
+ python
170
+ # Set axis labels
171
+ ax.set(xlabel='Review Length (# Sentences)', ylabel='Review Rating')
172
+ # Move the legend outside the plot for better visibility
173
+ sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
174
+ # Ensure the layout is properly configured
175
+ plt.tight_layout()
176
+ # Display the plot
177
+ plt.show()
178
+ 5.3 Interpreting the Visualization
179
+ This visualization reveals several interesting patterns:
180
+ Length-Rating Relationship: Is review length correlated with the rating a reviewer gives?
181
+ Decision Patterns: Are there visible clusters for accepted vs. rejected papers?
182
+ Density Distribution: Where are most reviews concentrated in terms of length and rating?
183
+ Outliers: Are there unusually long or short reviews at certain rating levels?
184
+ Discussion Question: Based on this visualization, do reviewers tend to write longer reviews when they're more positive or more critical? What might explain this pattern?
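+ To move from eyeballing to a number, you can compute a simple correlation and group means. A minimal sketch (assuming df_rev_dec with the sent_count column built above):
+ python
+ # Correlation between review length (in sentences) and rating
+ print(df_rev_dec[['sent_count', 'rating_int']].corr())
+
+ # Average review length by decision outcome
+ print(df_rev_dec.groupby('decision_clean')['sent_count'].mean())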
185
+
186
+ Module 6: Additional Text Processing - Tokenization
187
+ Tokenization is the process of breaking text into smaller units (tokens) that serve as the building blocks for natural language processing. In this lesson, we'll explore how to tokenize text, remove stopwords and punctuation, and analyze the results.
188
+ 6.1 Text Cleaning
189
+ Before tokenization, we often clean the text to remove unwanted characters. Let's start by removing punctuation:
190
+ python
191
+ # Removing punctuation
192
+ df_rev_dec['clean_review_word'] = df_rev_dec['review'].str.translate(str.maketrans('', '', string.punctuation))
193
+ What's happening here?
194
+ string.punctuation contains all punctuation characters (.,!?;:'"()[]{}-_)
195
+ str.maketrans('', '', string.punctuation) creates a translation table to remove these characters
196
+ df_rev_dec['review'].str.translate() applies this translation to all review texts
197
+ 6.2 Word Tokenization
198
+ After cleaning, we can tokenize the text into individual words:
199
+ python
200
+ # Tokenizing the text
201
+ df_rev_dec['tokens'] = df_rev_dec['clean_review_word'].apply(word_tokenize)
202
+
203
+ # Example: Look at tokens for the 6th review
204
+ df_rev_dec['tokens'][5]
205
+ What's happening here?
206
+ word_tokenize() is an NLTK function that splits text into a list of words
207
+ We apply this function to each review using pandas' apply() method
208
+ The result is a new column containing lists of words for each review
209
+ 6.3 Removing Stopwords
210
+ Stopwords are common words like "the," "and," "is" that often don't add meaningful information for analysis:
211
+ python
212
+ # Getting the list of English stopwords
213
+ stop_words = set(stopwords.words('english'))
214
+
215
+ # Removing stopwords from our tokens
216
+ df_rev_dec['tokens'] = df_rev_dec['tokens'].apply(lambda x: [word for word in x if word not in stop_words])
217
+ What's happening here?
218
+ stopwords.words('english') returns a list of common English stopwords
219
+ We convert it to a set for faster lookup
220
+ The lambda function filters each token list, keeping only words that aren't stopwords
221
+ This creates more meaningful token lists focused on content words
222
+ 6.4 Counting Tokens
223
+ Now that we have our cleaned and filtered tokens, let's count them to measure review length:
224
+ python
225
+ # Count tokens for each review
226
+ df_rev_dec['tokens_counts'] = df_rev_dec['tokens'].apply(len)
227
+
228
+ # View the token counts
229
+ df_rev_dec['tokens_counts']
230
+ What's happening here?
231
+ We use apply(len) to count the number of tokens in each review
232
+ This gives us a quantitative measure of review length after removing stopwords
233
+ The difference between this and raw word count shows the prevalence of stopwords
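+ To quantify how much the stopword filter removes, here is a small sketch (assuming the columns created above; raw_word_count is an illustrative name, not part of the dataset):
+ python
+ # Raw word count before stopword removal vs. filtered token count
+ df_rev_dec['raw_word_count'] = df_rev_dec['clean_review_word'].str.split().str.len()
+ print((df_rev_dec['raw_word_count'] - df_rev_dec['tokens_counts']).describe())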
234
+ 6.5 Visualizing Token Counts vs. Ratings
235
+ Let's visualize the relationship between token count, rating, and decision:
236
+ python
237
+ # Create a 2D histogram with token count, rating, and decision
238
+ ax = sns.histplot(data=df_rev_dec, x='tokens_counts',
239
+ y='rating_int',
240
+ hue='decision_clean',
241
+ kde=True,
242
+ log_scale=(True,False),
243
+ legend=True)
244
+
245
+ # Set axis labels
246
+ ax.set(xlabel='Review Length (# Tokens)', ylabel='Review Rating')
247
+
248
+ # Move the legend outside the plot
249
+ sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
250
+
251
+ plt.tight_layout()
252
+ plt.show()
253
+ What's happening here?
254
+ We create a 2D histogram showing the distribution of token counts and ratings
255
+ Colors distinguish between accepted and rejected papers
256
+ Log scale on the x-axis helps visualize the wide range of token counts
257
+ Kernel density estimation (KDE) shows the concentration of reviews
258
+ Module 7: Aggregating Data by Paper
259
+ 7.1 Understanding Data Aggregation
260
+ So far, we've been analyzing individual reviews. However, each paper (identified by 'forum') may have multiple reviews. To understand paper-level patterns, we need to aggregate our data.
261
+ 7.2 Calculating Paper-Level Metrics
262
+ Let's aggregate our review metrics to the paper level by calculating means:
263
+ python
264
+ # Aggregate reviews to paper level (mean of metrics for each paper)
265
+ df_rev_dec_ave = df_rev_dec.groupby(['forum','decision_clean'])[['rating_int','tokens_counts','sent_count']].mean().reset_index()
266
+ What's happening here?
267
+ We're grouping reviews by both 'forum' (paper ID) and 'decision_clean' (accept/reject)
268
+ For each group, we calculate the mean of 'rating_int', 'tokens_counts', and 'sent_count'
269
+ The reset_index() turns the result back into a regular DataFrame
270
+ The result is a paper-level dataset with average metrics for each paper
271
+ Try it yourself: How many papers do we have in our dataset compared to reviews? What does this tell us about the review process?
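+ A quick sketch to answer that (assuming the dataframes built above):
+ python
+ # Number of individual reviews vs. number of distinct papers
+ print("Reviews:", len(df_rev_dec))
+ print("Papers:", df_rev_dec['forum'].nunique())
+ print("Average reviews per paper:", round(len(df_rev_dec) / df_rev_dec['forum'].nunique(), 2))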
272
+ Module 8: Visualizing Token Count vs. Rating
273
+ 8.1 Creating an Advanced Visualization
274
+ Now let's visualize the relationship between token count and rating at the paper level:
275
+ python
276
+ # Create a 2D histogram with token count, rating, and decision
277
+ ax = sns.histplot(data=df_rev_dec_ave, x='tokens_counts',
278
+ y='rating_int',
279
+ hue='decision_clean',
280
+ kde=True,
281
+ log_scale=(True,False),
282
+ legend=True)
283
+
284
+ # Set axis labels
285
+ ax.set(xlabel='Review Length (# Tokens)', ylabel='Review Rating')
286
+
287
+ # Move the legend outside the plot
288
+ sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
289
+
290
+ plt.tight_layout()
291
+ plt.show()
292
+ 8.2 Interpreting the Visualization
293
+ This visualization reveals important patterns in our data:
294
+ Decision Boundaries: Notice where the color changes from one decision to another
295
+ Length-Rating Relationship: Is there a correlation between review length and rating?
296
+ Clustering: Are there natural clusters in the data?
297
+ Outliers: What papers received unusually long or short reviews?
298
+ Key Insight: At the paper level, we can see if the average review length for a paper relates to its likelihood of acceptance.
299
+ Module 9: Comparing Token Count and Sentence Count
300
+ 9.1 Visualizing Sentence Count vs. Rating
301
+ Let's create a similar visualization using sentence count instead of token count:
302
+ python
303
+ # Create a 2D histogram with sentence count, rating, and decision
304
+ ax = sns.histplot(data=df_rev_dec_ave, x='sent_count',
305
+ y='rating_int',
306
+ hue='decision_clean',
307
+ kde=True,
308
+ log_scale=(True,False),
309
+ legend=True)
310
+
311
+ # Set axis labels
312
+ ax.set(xlabel='Review Length (# Sentences)', ylabel='Review Rating')
313
+
314
+ # Move the legend outside the plot
315
+ sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
316
+
317
+ plt.tight_layout()
318
+ plt.show()
319
+ 9.2 Comparing Token vs. Sentence Metrics
320
+ By comparing these two visualizations, we can understand:
321
+ Which Metric is More Informative: Do token counts or sentence counts better differentiate accepted vs. rejected papers?
322
+ Different Patterns: Do some papers have many short sentences while others have fewer long ones?
323
+ Consistency: Are the patterns consistent across both metrics?
324
+ Discussion Question: Which metric—tokens or sentences—seems to be a better predictor of paper acceptance? Why might that be?
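+ One way to compare the two metrics quantitatively is to check how strongly each correlates with the average rating. A minimal sketch (assuming df_rev_dec_ave from Module 7):
+ python
+ # Correlation of each length metric with the paper-level average rating
+ print(df_rev_dec_ave[['tokens_counts', 'sent_count', 'rating_int']].corr()['rating_int'])
+
+ # Average length metrics by decision outcome
+ print(df_rev_dec_ave.groupby('decision_clean')[['tokens_counts', 'sent_count']].mean())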
325
+ Module 10: Word Cloud Visualizations
326
+ 10.1 Creating a Word Cloud from Review Text
327
+ Word clouds are a powerful way to visualize the most frequent words in a text corpus:
328
+ python
329
+ # Concatenate all review text
330
+ text = ' '.join(df_rev_dec['clean_review_word'])
331
+
332
+ # Generate word cloud
333
+ wordcloud = WordCloud().generate(text)
334
+
335
+ # Display word cloud
336
+ plt.figure(figsize=(8, 6))
337
+ plt.imshow(wordcloud, interpolation='bilinear')
338
+ plt.axis('off')
339
+ plt.show()
340
+ 10.2 Visualizing Paper Keywords
341
+ Now let's visualize the primary keywords associated with the papers:
342
+ python
343
+ # Concatenate all primary keywords
344
+ text = ' '.join(df_keyword['primary_keyword'])
345
+
346
+ # Generate word cloud
347
+ wordcloud = WordCloud().generate(text)
348
+
349
+ # Display word cloud
350
+ plt.figure(figsize=(8, 6))
351
+ plt.imshow(wordcloud, interpolation='bilinear')
352
+ plt.axis('off')
353
+ plt.show()
354
+ 10.3 Visualizing Paper Abstracts
355
+ Finally, let's create a word cloud from paper abstracts:
356
+ python
357
+ # Concatenate all abstracts
358
+ text = ' '.join(df_submissions['abstract'])
359
+
360
+ # Generate word cloud
361
+ wordcloud = WordCloud().generate(text)
362
+
363
+ # Display word cloud
364
+ plt.figure(figsize=(8, 6))
365
+ plt.imshow(wordcloud, interpolation='bilinear')
366
+ plt.axis('off')
367
+ plt.show()
368
+ Interpreting Word Clouds
369
+ Word clouds provide insights about:
370
+ Dominant Themes: The most frequent words appear largest
371
+ Vocabulary Differences: Compare terms across different sources (reviews vs. abstracts)
372
+ Field-Specific Terminology: Technical terms reveal the focus of the conference
373
+ Sentiment Indicators: Evaluative words in reviews reveal assessment patterns
374
+ Try it yourself: What differences do you notice between the word clouds from reviews, keywords, and abstracts? What do these differences tell you about academic communication?
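+ To go beyond the visual impression, you can compare the most frequent words directly. A small sketch (assuming the dataframes used above; dropna() guards against missing abstracts):
+ python
+ from collections import Counter
+
+ # Top 10 words in review text vs. paper abstracts
+ review_counts = Counter(' '.join(df_rev_dec['clean_review_word']).split())
+ abstract_counts = Counter(' '.join(df_submissions['abstract'].dropna()).lower().split())
+ print(review_counts.most_common(10))
+ print(abstract_counts.most_common(10))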
375
+
376
+
377
+
378
+
379
+
380
+
381
+
382
+
383
+
384
+
385
+
386
+
387
+
388
+
389
+
390
+
391
+
392
+
393
+
394
+
395
+
396
+ V1.1 Week 4 - Intro to NLP
397
+ Course Overview
398
+ In this course, you'll learn fundamental Natural Language Processing (NLP) concepts by exploring a fascinating real-world question: What is the effect of releasing a preprint of a paper before it is submitted for peer review?
399
+ Using the ICLR (International Conference on Learning Representations) database - which contains submissions, reviews, and author profiles from 2017-2022 - you'll develop practical NLP skills while investigating potential biases and patterns in academic publishing.
400
+ Learning Path
401
+ Understanding Text as Data: How computers represent and work with text
402
+ Text Processing Fundamentals: Basic cleaning and normalization
403
+ Quantitative Text Analysis: Measuring and comparing text features
404
+ Tokenization Approaches: Breaking text into meaningful units
405
+ Text Visualization Techniques: Creating insightful visual representations
406
+ From Analysis to Insights: Drawing evidence-based conclusions
407
+ Let's dive in!
408
+
409
+ Step 4: Text Cleaning and Normalization for Academic Content
410
+ Academic papers contain specialized vocabulary, citations, equations, and other elements that require careful normalization.
411
+ Key Concept: Scientific text normalization preserves meaningful technical content while standardizing format.
412
+ Stop Words Removal
413
+ Definition: Stop words are extremely common words that appear frequently in text but typically carry little meaningful information for analysis purposes. In English, these include articles (the, a, an), conjunctions (and, but, or), prepositions (in, on, at), and certain pronouns (I, you, it).
414
+ Stop words removal is the process of filtering these words out before analysis to:
415
+ Reduce noise in the data
416
+ Decrease the dimensionality of the text representation
417
+ Focus analysis on the content-bearing words
418
+ In academic text, we often extend standard stop word lists to include domain-specific terms that are ubiquitous but not analytically useful (e.g., "paper," "method," "result").
419
+ python
420
+ # Load standard English stop words
421
+ from nltk.corpus import stopwords
422
+ standard_stop_words = set(stopwords.words('english'))
423
+
424
+ # Add academic-specific stop words
425
+ academic_stop_words = ['et', 'al', 'fig', 'table', 'paper', 'using', 'used',
426
+ 'method', 'result', 'show', 'propose', 'use']
427
+ all_stop_words = standard_stop_words.union(academic_stop_words)
428
+
429
+ # Apply stop word removal
430
+ def remove_stop_words(text):
431
+     words = text.split()
432
+     filtered_words = [word for word in words if word.lower() not in all_stop_words]
433
+     return ' '.join(filtered_words)
434
+
435
+ # Compare before and after
436
+ example = "We propose a novel method that shows impressive results on the benchmark dataset."
437
+ filtered = remove_stop_words(example)
438
+
439
+ print("Original:", example)
440
+ print("After stop word removal:", filtered)
441
+ # Output: "novel shows impressive results benchmark dataset." (note that "propose" and "method" are filtered out by the academic stop word list)
442
+ Stemming and Lemmatization
443
+ Definition: Stemming and lemmatization are text normalization techniques that reduce words to their root or base forms, allowing different inflections or derivations of the same word to be treated as equivalent.
444
+ Stemming is a simpler, rule-based approach that works by truncating words to their stems, often by removing suffixes. For example:
445
+ "running," "runs," and "runner" might all be reduced to "run"
446
+ "connection," "connected," and "connecting" might all become "connect"
447
+ Stemming is faster but can sometimes produce non-words or incorrect reductions.
448
+ Lemmatization is a more sophisticated approach that uses vocabulary and morphological analysis to return the dictionary base form (lemma) of a word. For example:
449
+ "better" becomes "good"
450
+ "was" and "were" become "be"
451
+ "studying" becomes "study"
452
+ Lemmatization generally produces more accurate results but requires more computational resources.
453
+ python
454
+ from nltk.stem import PorterStemmer, WordNetLemmatizer
455
+ import nltk
456
+ nltk.download('wordnet')
457
+
458
+ # Initialize stemmer and lemmatizer
459
+ stemmer = PorterStemmer()
460
+ lemmatizer = WordNetLemmatizer()
461
+
462
+ # Example words
463
+ academic_terms = ["algorithms", "computing", "learning", "trained",
464
+ "networks", "better", "studies", "analyzed"]
465
+
466
+ # Compare stemming and lemmatization
467
+ for term in academic_terms:
468
+     print(f"Original: {term}")
469
+     print(f"Stemmed: {stemmer.stem(term)}")
470
+     print(f"Lemmatized: {lemmatizer.lemmatize(term)}")
471
+     print()
472
+
473
+ # Demonstration in context
474
+ academic_sentence = "The training algorithms performed better than expected when analyzing multiple neural networks."
475
+
476
+ # Apply stemming
477
+ stemmed_words = [stemmer.stem(word) for word in academic_sentence.lower().split()]
478
+ stemmed_sentence = ' '.join(stemmed_words)
479
+
480
+ # Apply lemmatization
481
+ lemmatized_words = [lemmatizer.lemmatize(word) for word in academic_sentence.lower().split()]
482
+ lemmatized_sentence = ' '.join(lemmatized_words)
483
+
484
+ print("Original:", academic_sentence)
485
+ print("Stemmed:", stemmed_sentence)
486
+ print("Lemmatized:", lemmatized_sentence)
487
+ When to use which approach:
488
+ For academic text analysis:
489
+ Stemming is useful when processing speed is important and approximate matching is sufficient
490
+ Lemmatization is preferred when precision is crucial, especially for technical terms where preserving meaning is essential
491
+ In our ICLR paper analysis, lemmatization would likely be more appropriate since technical terminology often carries specific meanings that should be preserved accurately.
492
+ Challenge Question: How might stemming versus lemmatization affect our analysis of technical innovation in ICLR papers? Can you think of specific machine learning terms where these approaches would yield different results?
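+ As a starting point for that question, here is a small sketch (reusing the stemmer and lemmatizer initialized above) on a few machine learning terms where the two approaches tend to diverge:
+ python
+ ml_terms = ["convolutional", "embeddings", "regularization", "generalizes"]
+ for term in ml_terms:
+     print(f"{term} -> stem: {stemmer.stem(term)}, lemma: {lemmatizer.lemmatize(term)}")
+ # Stemming tends to truncate technical terms to non-word stems,
+ # while lemmatization usually leaves unfamiliar vocabulary intact.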
493
+
494
+
495
+ V1.0 Week 4 - Intro to NLP
496
+ The Real-World Problem
497
+ Imagine you're part of a small business team that has just launched a new product. You've received hundreds of customer reviews across various platforms, and your manager has asked you to make sense of this feedback. Looking at the mountain of text data, you realize you need a systematic way to understand what customers are saying without reading each review individually.
498
+ Your challenge: How can you efficiently analyze customer feedback to identify common themes, sentiments, and specific product issues?
499
+ Our Approach
500
+ In this module, we'll learn how to transform unstructured text feedback into structured insights using Natural Language Processing. Here's our journey:
501
+ Understanding text as data
502
+ Basic processing of text information
503
+ Measuring text properties
504
+ Cleaning and normalizing customer feedback
505
+ Visualizing patterns in the feedback
506
+ Analyzing words vs. tokens
507
+ Let's begin!
508
+ Step 1: Text as Data - A New Perspective
509
+ When we look at customer reviews like:
510
+ "Love this product! So easy to use and the battery lasts forever."
511
+ "Terrible design. Buttons stopped working after two weeks."
512
+ We naturally understand the meaning and sentiment. But how can a computer understand this?
513
+ Key Concept: Text can be treated as data that we can analyze quantitatively.
514
+ Unlike numerical data (age, price, temperature) that has inherent mathematical properties, text data needs to be transformed before we can analyze it.
515
+ Interactive Exercise: Look at these two reviews. As a human, what information can you extract? Now think about how a computer might "see" this text without any processing.
516
+ Challenge Question: What types of information might we want to extract from customer reviews? List at least three analytical goals.
517
+ Step 2: Basic Text Processing - Breaking Down Language
518
+ Before we can analyze text, we need to break it down into meaningful units.
519
+ Key Concept: Tokenization is the process of splitting text into smaller pieces (tokens) such as words, phrases, or characters.
520
+ For example, the review "Love this product!" can be tokenized into ["Love", "this", "product!"] (splitting on whitespace) or ["Love", "this", "product", "!"] (separating punctuation), depending on our approach.
521
+ Interactive Example: Let's tokenize these customer reviews:
522
+ python
523
+ # Simple word tokenization
524
+ review = "Battery life is amazing but the app crashes frequently."
525
+ tokens = review.split() # Results in ["Battery", "life", "is", "amazing", "but", "the", "app", "crashes", "frequently."]
526
+ Notice how "frequently." includes the period. Basic tokenization has limitations!
527
+ Challenge Question: How might we handle contractions like "doesn't" or hyphenated words like "user-friendly" when tokenizing?
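+ One common answer is to use a tokenizer that handles punctuation, contractions, and hyphens explicitly. A minimal sketch using NLTK's word_tokenize (assumes NLTK and its punkt data are installed):
+ python
+ from nltk.tokenize import word_tokenize
+
+ review = "The app doesn't feel user-friendly, but battery life is great."
+ print(word_tokenize(review))
+ # word_tokenize splits "doesn't" into "does" and "n't", keeps "user-friendly" as one token,
+ # and separates the comma and final period into their own tokens.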
528
+ Step 3: Measuring Text - Quantifying Feedback
529
+ Now that we've broken text into pieces, we can start measuring properties of our customer feedback.
530
+ Key Concept: Text metrics help us quantify and compare text data.
531
+ Common metrics include:
532
+ Length (words, characters)
533
+ Complexity (average word length, unique words ratio)
534
+ Sentiment scores (positive/negative)
535
+ Interactive Example: Let's calculate basic metrics for customer reviews:
536
+ python
537
+ # Word count
538
+ review = "The interface is intuitive and responsive."
539
+ word_count = len(review.split()) # 6 words
540
+
541
+ # Character count (including spaces)
542
+ char_count = len(review) # 42 characters
543
+
544
+ # Unique words ratio
545
+ unique_words = len(set(review.lower().split()))
546
+ unique_ratio = unique_words / word_count # 1.0 (all words are unique)
547
+ Challenge Question: Why might longer reviews not necessarily contain more information than shorter ones? What other metrics beyond length might better capture information content?
548
+ Step 4: Text Cleaning and Normalization
549
+ Customer feedback often contains inconsistencies: spelling variations, punctuation, capitalization, etc.
550
+ Key Concept: Text normalization creates a standardized format for analysis.
551
+ Common normalization steps:
552
+ Converting to lowercase
553
+ Removing punctuation
554
+ Correcting spelling
555
+ Removing stop words (common words like "the", "is")
556
+ Stemming or lemmatizing (reducing words to their base form)
557
+ Interactive Example: Let's normalize a review:
558
+ python
559
+ # Original review
560
+ review = "The battery LIFE is amazing!!! Works for days."
561
+
562
+ # Lowercase
563
+ review = review.lower() # "the battery life is amazing!!! works for days."
564
+
565
+ # Remove punctuation
566
+ import re
567
+ review = re.sub(r'[^\w\s]', '', review) # "the battery life is amazing works for days"
568
+
569
+ # Remove stop words
570
+ stop_words = ["the", "is", "for"]
571
+ words = review.split()
572
+ filtered_words = [word for word in words if word not in stop_words]
573
+ # Result: ["battery", "life", "amazing", "works", "days"]
574
+ Challenge Question: How might normalization affect sentiment analysis? Could removing punctuation or stop words change the perceived sentiment of a review?
575
+ Step 5: Text Visualization - Seeing Patterns
576
+ Visual representations help us identify patterns across many reviews.
577
+ Key Concept: Text visualization techniques reveal insights that are difficult to see in raw text.
578
+ Common visualization methods:
579
+ Word clouds
580
+ Frequency distributions
581
+ Sentiment over time
582
+ Topic clusters
583
+ Interactive Example: Creating a simple word frequency chart:
584
+ python
585
+ from collections import Counter
586
+
587
+ # Combined reviews
588
+ reviews = ["Battery life is amazing", "Battery drains too quickly",
589
+ "Great battery performance", "Screen is too small"]
590
+
591
+ # Count word frequencies
592
+ all_words = " ".join(reviews).lower().split()
593
+ word_counts = Counter(all_words)
594
+ # Result: {'battery': 3, 'life': 1, 'is': 2, 'amazing': 1, 'drains': 1, 'too': 2, 'quickly': 1, 'great': 1, 'performance': 1, 'screen': 1, 'small': 1}
595
+
596
+ # We could visualize this as a bar chart
597
+ # Most frequent: 'battery' (3), 'is' (2), 'too' (2)
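+ For instance, a minimal matplotlib sketch of that bar chart (assuming the word_counts computed above):
+ python
+ import matplotlib.pyplot as plt
+
+ # Plot the five most frequent words
+ labels, counts = zip(*word_counts.most_common(5))
+ plt.bar(labels, counts)
+ plt.ylabel('Frequency')
+ plt.title('Most frequent words in reviews')
+ plt.show()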
598
+ Challenge Question: Why might a word cloud be misleading for understanding customer sentiment? What additional information would make the visualization more informative?
599
+ Step 6: Words vs. Tokens - Making Choices
600
+ As we advance in NLP, we face an important decision: should we analyze whole words or more sophisticated tokens?
601
+ Key Concept: Different tokenization approaches have distinct advantages and limitations.
602
+ Word-based analysis:
603
+ Intuitive and interpretable
604
+ Misses connections between related words (run/running/ran)
605
+ Struggles with compound words and new terms
606
+ Token-based analysis:
607
+ Can capture subword information
608
+ Handles unknown words better
609
+ May lose some human interpretability
610
+ Interactive Example: Comparing approaches:
611
+ python
612
+ # Word-based
613
+ review = "The touchscreen is unresponsive"
614
+ words = review.lower().split() # ['the', 'touchscreen', 'is', 'unresponsive']
615
+
616
+ # Subword tokenization (simplified example)
617
+ subwords = ['the', 'touch', 'screen', 'is', 'un', 'responsive']
618
+ Challenge Question: For our customer feedback analysis, which approach would be better: analyzing whole words or subword tokens? What factors would influence this decision?
619
+ Putting It All Together: Solving Our Problem
620
+ Now that we've learned these fundamental NLP concepts, let's return to our original challenge: analyzing customer feedback at scale.
621
+ Here's how we'd approach it:
622
+ Collect and tokenize all customer reviews
623
+ Clean and normalize the text
624
+ Calculate key metrics (length, sentiment scores)
625
+ Visualize common terms and topics
626
+ Identify positive and negative feedback themes
627
+ Generate an automated summary for the product team
628
+ By applying these NLP fundamentals, we've transformed an overwhelming mass of text into actionable insights that can drive product improvements!
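+ As a compressed sketch of the first few steps on a toy list of reviews (all names and data here are illustrative):
+ python
+ import re
+ from collections import Counter
+
+ reviews = ["Love this product! So easy to use.",
+            "Terrible design. Buttons stopped working after two weeks."]
+
+ def normalize(text):
+     # Lowercase and strip punctuation
+     return re.sub(r'[^\w\s]', '', text.lower())
+
+ stop_words = {"this", "to", "so", "after"}
+ tokens = [w for r in reviews for w in normalize(r).split() if w not in stop_words]
+ print(Counter(tokens).most_common(5))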
629
+ Final Challenge: How could we extend this analysis to track customer sentiment over time as we release product updates? What additional NLP techniques might be helpful?
630
+
app/__pycache__/main.cpython-311.pyc CHANGED
Binary files a/app/__pycache__/main.cpython-311.pyc and b/app/__pycache__/main.cpython-311.pyc differ
 
app/main.py CHANGED
@@ -16,7 +16,7 @@ from app.components.login import login
16
  from app.pages import week_1
17
  from app.pages import week_2
18
  from app.pages import week_3
19
-
20
  # Page configuration
21
  st.set_page_config(
22
  page_title="Data Science Course App",
@@ -139,6 +139,8 @@ def show_week_content():
139
  week_2.show()
140
  elif st.session_state.current_week == 3:
141
  week_3.show()
 
 
142
  else:
143
  st.warning("Content for this week is not yet available.")
144
 
@@ -151,7 +153,7 @@ def main():
151
  return
152
 
153
  # User is logged in, show course content
154
- if st.session_state.current_week in [1, 2, 3]:
155
  show_week_content()
156
  else:
157
  st.title("Data Science Research Paper Course")
 
16
  from app.pages import week_1
17
  from app.pages import week_2
18
  from app.pages import week_3
19
+ from app.pages import week_4
20
  # Page configuration
21
  st.set_page_config(
22
  page_title="Data Science Course App",
 
139
  week_2.show()
140
  elif st.session_state.current_week == 3:
141
  week_3.show()
142
+ elif st.session_state.current_week == 4:
143
+ week_4.show()
144
  else:
145
  st.warning("Content for this week is not yet available.")
146
 
 
153
  return
154
 
155
  # User is logged in, show course content
156
+ if st.session_state.current_week in [1, 2, 3, 4]:
157
  show_week_content()
158
  else:
159
  st.title("Data Science Research Paper Course")
app/pages/__pycache__/week_2.cpython-311.pyc CHANGED
Binary files a/app/pages/__pycache__/week_2.cpython-311.pyc and b/app/pages/__pycache__/week_2.cpython-311.pyc differ
 
app/pages/__pycache__/week_4.cpython-311.pyc ADDED
Binary file (11 kB).
 
app/pages/week_4.py ADDED
@@ -0,0 +1,217 @@
1
+ import streamlit as st
2
+ import pandas as pd
3
+ import numpy as np
4
+ import matplotlib.pyplot as plt
5
+ import seaborn as sns
6
+ import nltk
7
+ from nltk.corpus import stopwords
8
+ from nltk.tokenize import word_tokenize, sent_tokenize
9
+ nltk.download('punkt_tab')
10
+ nltk.download('stopwords')
11
+ from nltk.stem import PorterStemmer, WordNetLemmatizer
12
+ from wordcloud import WordCloud
13
+ import string
14
+ import io
15
+ from contextlib import redirect_stdout
16
+
17
+ # Initialize session state for notebook-like cells
18
+ if 'cells' not in st.session_state:
19
+ st.session_state.cells = []
20
+ if 'df' not in st.session_state:
21
+ st.session_state.df = None
22
+
23
+ def capture_output(code, df=None):
24
+ """Helper function to capture print output"""
25
+ f = io.StringIO()
26
+ with redirect_stdout(f):
27
+ try:
28
+ # Create a dictionary of variables to use in exec
29
+ variables = {'pd': pd, 'np': np, 'plt': plt, 'sns': sns, 'nltk': nltk}
30
+ if df is not None:
31
+ variables['df'] = df
32
+ exec(code, variables)
33
+ except Exception as e:
34
+ return f"Error: {str(e)}"
35
+ return f.getvalue()
36
+
37
+ def show():
38
+ st.title("Week 4: Introduction to Natural Language Processing")
39
+
40
+ # Introduction Section
41
+ st.header("Course Overview")
42
+ st.write("""
43
+ In this course, you'll learn fundamental Natural Language Processing (NLP) concepts by exploring a fascinating real-world question:
44
+ What is the effect of releasing a preprint of a paper before it is submitted for peer review?
45
+
46
+ Using the ICLR (International Conference on Learning Representations) database - which contains submissions, reviews, and author profiles
47
+ from 2017-2022 - you'll develop practical NLP skills while investigating potential biases and patterns in academic publishing.
48
+ """)
49
+
50
+ # Learning Path
51
+ st.subheader("Learning Path")
52
+ st.write("""
53
+ 1. Understanding Text as Data: How computers represent and work with text
54
+ 2. Text Processing Fundamentals: Basic cleaning and normalization
55
+ 3. Quantitative Text Analysis: Measuring and comparing text features
56
+ 4. Tokenization Approaches: Breaking text into meaningful units
57
+ 5. Text Visualization Techniques: Creating insightful visual representations
58
+ 6. From Analysis to Insights: Drawing evidence-based conclusions
59
+ """)
60
+
61
+ # Module 1: Text as Data
62
+ st.header("Module 1: Text as Data")
63
+ st.write("""
64
+ When we look at text like customer reviews or academic papers, we naturally understand the meaning.
65
+ But how can a computer understand this?
66
+
67
+ Key Concept: Text can be treated as data that we can analyze quantitatively.
68
+ Unlike numerical data (age, price, temperature) that has inherent mathematical properties,
69
+ text data needs to be transformed before we can analyze it.
70
+ """)
71
+
72
+ # Interactive Example
73
+ st.subheader("Interactive Example: Text Tokenization")
74
+ st.write("Let's try tokenizing some text:")
75
+
76
+ example_text = st.text_area(
77
+ "Enter some text to tokenize:",
78
+ "The quick brown fox jumps over the lazy dog."
79
+ )
80
+
81
+ if st.button("Tokenize Text"):
82
+ tokens = word_tokenize(example_text)
83
+ st.write("Tokens:", tokens)
84
+ st.write("Number of tokens:", len(tokens))
85
+
86
+ # Module 2: Text Processing
87
+ st.header("Module 2: Text Processing")
88
+ st.write("""
89
+ Before we can analyze text, we need to clean and normalize it. This includes:
90
+ - Converting to lowercase
91
+ - Removing punctuation
92
+ - Removing stop words
93
+ - Stemming or lemmatization
94
+ """)
95
+
96
+ # Interactive Text Processing
97
+ st.subheader("Try Text Processing")
98
+ st.write("""
99
+ Let's process some text using different techniques:
100
+ """)
101
+
102
+ process_text = st.text_area(
103
+ "Enter text to process:",
104
+ "The quick brown fox jumps over the lazy dog.",
105
+ key="process_text"
106
+ )
107
+
108
+ col1, col2 = st.columns(2)
109
+
110
+ with col1:
111
+ if st.button("Remove Stop Words"):
112
+ stop_words = set(stopwords.words('english'))
113
+ words = word_tokenize(process_text.lower())
114
+ filtered_words = [word for word in words if word not in stop_words]
115
+ st.write("After removing stop words:", filtered_words)
116
+
117
+ with col2:
118
+ if st.button("Remove Punctuation"):
119
+ no_punct = process_text.translate(str.maketrans('', '', string.punctuation))
120
+ st.write("After removing punctuation:", no_punct)
121
+
122
+ # Module 3: Text Visualization
123
+ st.header("Module 3: Text Visualization")
124
+ st.write("""
125
+ Visual representations help us identify patterns across text data.
126
+ Common visualization methods include:
127
+ - Word clouds
128
+ - Frequency distributions
129
+ - Sentiment over time
130
+ - Topic clusters
131
+ """)
132
+
133
+ # Interactive Word Cloud
134
+ st.subheader("Create a Word Cloud")
135
+ st.write("""
136
+ Let's create a word cloud from some text:
137
+ """)
138
+
139
+ wordcloud_text = st.text_area(
140
+ "Enter text for word cloud:",
141
+ "The quick brown fox jumps over the lazy dog. The fox is quick and brown. The dog is lazy.",
142
+ key="wordcloud_text"
143
+ )
144
+
145
+ if st.button("Generate Word Cloud"):
146
+ # Create and generate a word cloud image
147
+ wordcloud = WordCloud().generate(wordcloud_text)
148
+
149
+ # Display the word cloud
150
+ plt.figure(figsize=(10, 6))
151
+ plt.imshow(wordcloud, interpolation='bilinear')
152
+ plt.axis('off')
153
+ st.pyplot(plt)
154
+
155
+ # Practice Exercises
156
+ st.header("Practice Exercises")
157
+
158
+ with st.expander("Exercise 1: Text Processing"):
159
+ st.write("""
160
+ 1. Load a sample text
161
+ 2. Remove stop words and punctuation
162
+ 3. Create a word cloud
163
+ 4. Analyze word frequencies
164
+ """)
165
+
166
+ st.code("""
167
+ # Solution
168
+ import nltk
169
+ from nltk.corpus import stopwords
170
+ from wordcloud import WordCloud
171
+ import string
172
+
173
+ # Sample text
174
+ text = "Your text here"
175
+
176
+ # Remove punctuation
177
+ text = text.translate(str.maketrans('', '', string.punctuation))
178
+
179
+ # Remove stop words
180
+ stop_words = set(stopwords.words('english'))
181
+ words = text.split()
182
+ filtered_words = [word for word in words if word.lower() not in stop_words]
183
+
184
+ # Create word cloud
185
+ wordcloud = WordCloud().generate(' '.join(filtered_words))
186
+ plt.imshow(wordcloud)
187
+ plt.axis('off')
188
+ plt.show()
189
+ """)
190
+
191
+ with st.expander("Exercise 2: Text Analysis"):
192
+ st.write("""
193
+ 1. Calculate basic text metrics (word count, unique words)
194
+ 2. Perform stemming and lemmatization
195
+ 3. Compare the results
196
+ 4. Visualize the differences
197
+ """)
198
+
199
+ st.code("""
200
+ # Solution
201
+ from nltk.stem import PorterStemmer, WordNetLemmatizer
202
+
203
+ # Initialize stemmer and lemmatizer
204
+ stemmer = PorterStemmer()
205
+ lemmatizer = WordNetLemmatizer()
206
+
207
+ # Sample words
208
+ words = ["running", "runs", "ran", "better", "good"]
209
+
210
+ # Apply stemming and lemmatization
211
+ stemmed = [stemmer.stem(word) for word in words]
212
+ lemmatized = [lemmatizer.lemmatize(word) for word in words]
213
+
214
+ # Compare results
215
+ for word, stem, lemma in zip(words, stemmed, lemmatized):
216
+ print(f"Original: {word}, Stemmed: {stem}, Lemmatized: {lemma}")
217
+ """)
requirements.txt CHANGED
@@ -4,4 +4,6 @@ numpy==1.26.4
4
  scikit-learn==1.4.0
5
  matplotlib==3.8.3
6
  seaborn==0.13.2
7
- plotly==5.18.0
 
 
 
4
  scikit-learn==1.4.0
5
  matplotlib==3.8.3
6
  seaborn==0.13.2
7
+ plotly==5.18.0
8
+ nltk==3.8.1
9
+ wordcloud==1.9.3