raymondEDS committed
Commit 49e3aec
Parent(s): fb62875
Updating week 4 content
Files changed:
- Reference files/Week_4_content.txt +630 -0
- app/__pycache__/main.cpython-311.pyc +0 -0
- app/main.py +4 -2
- app/pages/__pycache__/week_2.cpython-311.pyc +0 -0
- app/pages/__pycache__/week_4.cpython-311.pyc +0 -0
- app/pages/week_4.py +217 -0
- requirements.txt +3 -1
Reference files/Week_4_content.txt
ADDED
@@ -0,0 +1,630 @@
In this course, you'll learn the complete NLP workflow by exploring a fascinating real-world question: Does review length and language relate to reviewer ratings and decisions in academic peer review? If so, how?
Using data from the International Conference on Learning Representations (ICLR), you'll develop practical NLP skills while investigating how reviewers express their opinions. Each module builds upon the previous one, creating a coherent analytical pipeline from raw data to insight.
Learning Path
Data Loading and Initial Exploration: Setting up your environment and understanding your dataset
Text Preprocessing and Normalization: Cleaning and standardizing text data
Feature Extraction and Measurement: Calculating metrics from text
Visualization and Pattern Recognition: Creating insightful visualizations
Drawing Conclusions from Text Analysis: Synthesizing findings into actionable insights
Let's begin our exploration of how NLP can provide insights into academic peer review!

Module 1: Initial Exploration
The Challenge
Before we can analyze how review length relates to paper evaluations, we need to understand our dataset. In this module, we'll set up our Python environment and explore the ICLR conference data.
1.1: Set up and get to your data
The first step in any NLP project is loading and understanding your data. Let's set up our environment and examine what we're working with:
python
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from wordcloud import WordCloud

# Load the datasets
df_reviews = pd.read_csv('../data/reviews.csv')
df_submissions = pd.read_csv('../data/Submissions.csv')
df_dec = pd.read_csv('../data/decision.csv')
df_keyword = pd.read_csv('../data/submission_keyword.csv')
Let's look at the first few rows of each dataset to understand what information we have:
python
# View the first few rows of the submissions dataset
df_submissions.head()
# View the first few rows of the reviews dataset
df_reviews.head()
# View all columns and rows in the reviews dataset
df_reviews
# View the first few rows of the keywords dataset
df_keyword.head()
1.2: Looking at Review Content
Let's examine an actual review to understand the text we'll be analyzing:
python
# Display a sample review
df_reviews['review'][1]
Think about: What kinds of information do you see in this review? What language patterns do you notice?
1.3: Calculating Basic Metrics
Let's calculate our first simple metric - the average review score for each paper:
python
# Get the average review score for each paper
df_average_review_score = df_reviews.groupby('forum')['rating_int'].mean().reset_index()
df_average_review_score
Key Insight: Each paper (identified by 'forum') receives multiple reviews with different scores. The average score gives us an overall assessment of each paper.
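As a quick check of how many reviews each paper received, here is a small sketch (not part of the original notebook) that reuses the df_reviews dataframe loaded above:
python
# Count the number of reviews per paper and summarize the distribution
reviews_per_paper = df_reviews.groupby('forum')['rating_int'].count()
print(reviews_per_paper.describe())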
Module 2: Data Integration
In this module, we'll merge datasets for later analysis.
2.1 Understanding the Need for Data Integration
In many NLP projects, the data we need is spread across multiple files or tables. In our case:
The df_reviews dataset contains the review text and ratings
The df_dec dataset contains the final decisions for each paper
To analyze how review text relates to paper decisions, we need to merge these datasets.
2.2 Performing a Dataset Merge
Let's combine our review data with the decision data:
python
# Step 1 - Merge the reviews dataframe with the decisions dataframe
df_rev_dec = pd.merge(
    df_reviews,        # First dataframe (reviews)
    df_dec,            # Second dataframe (decisions)
    left_on='forum',   # Join key in the first dataframe
    right_on='forum',  # Join key in the second dataframe
    how='inner'        # Keep only matching rows
)[['review','decision','conf_name_y','rating_int','forum']]  # Select only these columns
# Display the first few rows of the merged dataframe
df_rev_dec.head()
2.3 Understanding Merge Concepts
Join Key: The 'forum' column identifies the paper and connects our datasets
Inner Join: Only keeps papers that appear in both datasets
Column Selection: We keep only relevant columns for our analysis
How to Verify: Always check the shape of your merged dataset to ensure you haven't lost data unexpectedly
Try it yourself: How many rows does the merged dataframe have compared to the original review dataframe? What might explain any differences?
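One way to run that check is the following minimal sketch, assuming the df_reviews and df_rev_dec dataframes defined above:
python
# Compare row counts before and after the merge
print("Reviews:", df_reviews.shape)
print("Merged reviews + decisions:", df_rev_dec.shape)
# With an inner join, rows drop out when a review's paper has no matching decision record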

Module 3: Basic Text Preprocessing
In this module, you'll learn essential data preprocessing techniques for NLP projects. We'll standardize text through case folding, clean up categorical variables, and prepare our review text for analysis.
3.1 Case Folding (Lowercase Conversion)
A fundamental text preprocessing step is converting all text to lowercase to ensure consistency:
python
# Convert all review text to lowercase (case folding)
df_rev_dec['review'] = df_rev_dec['review'].str.lower()
# Display the updated dataframe
df_rev_dec
Why Case Folding Matters
Consistency: "Novel" and "novel" will be treated as the same word
Reduced Dimensionality: Fewer unique tokens to process
Improved Pattern Recognition: Easier to identify word frequencies and patterns
Note: While case folding is generally helpful, it can sometimes remove meaningful distinctions (e.g., "US" vs. "us"). For our academic review analysis, lowercase conversion is appropriate.

3.2 Examining Categorical Values
Let's first check what unique decision categories exist in our dataset:
python
# Display the unique decision categories
df_rev_dec['decision'].unique()
3.3 Standardizing Decision Categories
We can see that there are multiple "Accept" categories with different presentation formats. Let's standardize these:
python
# Define a function to clean up and standardize decision categories
def clean_up_decision(text):
    if text in ['Accept (Poster)','Accept (Spotlight)', 'Accept (Oral)','Accept (Talk)']:
        return 'Accept'
    else:
        return text
# Apply the function to create a new standardized decision column
df_rev_dec['decision_clean'] = df_rev_dec['decision'].apply(clean_up_decision)
# Check our new standardized decision categories
df_rev_dec['decision_clean'].unique()
Why Standardization Matters
Simplified Analysis: Reduces the number of categories to analyze
Clearer Patterns: Makes it easier to identify trends by decision outcome
Better Visualization: Creates more meaningful and readable plots
Consistent Terminology: Aligns with how conferences typically report accept/reject decisions
Try it yourself: What other ways could you group or standardize these decision categories? What information might be lost in our current approach?
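To see exactly what the grouping collapses, a small sketch using df_rev_dec from above compares the category counts before and after standardization:
python
# Count reviews per category before and after standardization
print(df_rev_dec['decision'].value_counts())
print(df_rev_dec['decision_clean'].value_counts())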

Module 4: Text Tokenization
4.1 Introduction to Tokenization
Tokenization is the process of breaking text into smaller units like sentences or words. Let's examine a review:
python
# Display a sample review
df_reviews['review'][1]
4.2 Sentence Tokenization
Let's break this review into sentences using NLTK's sentence tokenizer:
python
# Import the necessary library if not already imported
from nltk.tokenize import sent_tokenize
# Tokenize the review into sentences
sent_tokenize(df_reviews['review'][1])
4.3 Counting Sentences
Now let's count the number of sentences in the review:
python
# Count the number of sentences
len(sent_tokenize(df_reviews['review'][1]))
4.4 Creating a Reusable Function
Let's create a function to count sentences in any text:
python
# Define a function to count sentences in a text
def sentence_count(text):
    return len(sent_tokenize(text))
4.5 Applying Our Function to All Reviews
Now we'll apply our function to all reviews to get sentence counts:
python
# Add a new column with the sentence count for each review
df_rev_dec['sent_count'] = df_rev_dec['review'].apply(sentence_count)
# Display the updated dataframe
df_rev_dec.head()
Key Insight: Sentence count is a simple yet effective way to quantify review length. The number of sentences can indicate how thoroughly a reviewer has evaluated a paper.

Module 5: Visualization of Text Metrics
5.1 Creating a 2D Histogram
Let's visualize the relationship between review length (in sentences), rating, and decision outcome:
python
# Create a 2D histogram with sentence count, rating, and decision
ax = sns.histplot(data=df_rev_dec, x='sent_count',
                  y='rating_int',
                  hue='decision_clean',
                  kde=True,
                  log_scale=(True,False),
                  legend=True)
5.2 Enhancing Our Visualization
Let's improve our visualization with better labels and formatting:
python
# Set axis labels
ax.set(xlabel='Review Length (# Sentences)', ylabel='Review Rating')
# Move the legend outside the plot for better visibility
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
# Ensure the layout is properly configured
plt.tight_layout()
# Display the plot
plt.show()
5.3 Interpreting the Visualization
This visualization reveals several interesting patterns:
Length-Rating Relationship: Is there a pattern in how review length correlates with rating?
Decision Patterns: Are there visible clusters for accepted vs. rejected papers?
Density Distribution: Where are most reviews concentrated in terms of length and rating?
Outliers: Are there unusually long or short reviews at certain rating levels?
Discussion Question: Based on this visualization, do reviewers tend to write longer reviews when they're more positive or more critical? What might explain this pattern?

Module 6: Additional Text Processing - Tokenization
Tokenization is the process of breaking text into smaller units (tokens) that serve as the building blocks for natural language processing. In this lesson, we'll explore how to tokenize text, remove stopwords and punctuation, and analyze the results.
6.1 Text Cleaning
Before tokenization, we often clean the text to remove unwanted characters. Let's start by removing punctuation:
python
# Removing punctuation
df_rev_dec['clean_review_word'] = df_rev_dec['review'].str.translate(str.maketrans('', '', string.punctuation))
What's happening here?
string.punctuation contains all punctuation characters (.,!?;:'"()[]{}-_)
str.maketrans('', '', string.punctuation) creates a translation table to remove these characters
df_rev_dec['review'].str.translate() applies this translation to all review texts
6.2 Word Tokenization
After cleaning, we can tokenize the text into individual words:
python
# Tokenizing the text
df_rev_dec['tokens'] = df_rev_dec['clean_review_word'].apply(word_tokenize)

# Example: Look at tokens for the 6th review
df_rev_dec['tokens'][5]
What's happening here?
word_tokenize() is an NLTK function that splits text into a list of words
We apply this function to each review using pandas' apply() method
The result is a new column containing lists of words for each review
6.3 Removing Stopwords
Stopwords are common words like "the," "and," "is" that often don't add meaningful information for analysis:
python
# Getting the list of English stopwords
stop_words = set(stopwords.words('english'))

# Removing stopwords from our tokens
df_rev_dec['tokens'] = df_rev_dec['tokens'].apply(lambda x: [word for word in x if word not in stop_words])
What's happening here?
stopwords.words('english') returns a list of common English stopwords
We convert it to a set for faster lookup
The lambda function filters each token list, keeping only words that aren't stopwords
This creates more meaningful token lists focused on content words
6.4 Counting Tokens
Now that we have our cleaned and filtered tokens, let's count them to measure review length:
python
# Count tokens for each review
df_rev_dec['tokens_counts'] = df_rev_dec['tokens'].apply(len)

# View the token counts
df_rev_dec['tokens_counts']
What's happening here?
We use apply(len) to count the number of tokens in each review
This gives us a quantitative measure of review length after removing stopwords
The difference between this and raw word count shows the prevalence of stopwords
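To make that comparison concrete, here is a small sketch (the raw_word_count column is an illustrative addition, not part of the original notebook):
python
# Raw whitespace-based word count vs. token count after stopword removal
df_rev_dec['raw_word_count'] = df_rev_dec['review'].str.split().apply(len)
print(df_rev_dec[['raw_word_count', 'tokens_counts']].describe())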
6.5 Visualizing Token Counts vs. Ratings
Let's visualize the relationship between token count, rating, and decision:
python
# Create a 2D histogram with token count, rating, and decision
ax = sns.histplot(data=df_rev_dec, x='tokens_counts',
                  y='rating_int',
                  hue='decision_clean',
                  kde=True,
                  log_scale=(True,False),
                  legend=True)

# Set axis labels
ax.set(xlabel='Review Length (# Tokens)', ylabel='Review Rating')

# Move the legend outside the plot
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

plt.tight_layout()
plt.show()
What's happening here?
We create a 2D histogram showing the distribution of token counts and ratings
Colors distinguish between accepted and rejected papers
Log scale on the x-axis helps visualize the wide range of token counts
Kernel density estimation (KDE) shows the concentration of reviews
Module 7: Aggregating Data by Paper
7.1 Understanding Data Aggregation
So far, we've been analyzing individual reviews. However, each paper (identified by 'forum') may have multiple reviews. To understand paper-level patterns, we need to aggregate our data.
7.2 Calculating Paper-Level Metrics
Let's aggregate our review metrics to the paper level by calculating means:
python
# Aggregate reviews to paper level (mean of metrics for each paper)
df_rev_dec_ave = df_rev_dec.groupby(['forum','decision_clean'])[['rating_int','tokens_counts','sent_count']].mean().reset_index()
What's happening here?
We're grouping reviews by both 'forum' (paper ID) and 'decision_clean' (accept/reject)
For each group, we calculate the mean of 'rating_int', 'tokens_counts', and 'sent_count'
The reset_index() turns the result back into a regular DataFrame
The result is a paper-level dataset with average metrics for each paper
Try it yourself: How many papers do we have in our dataset compared to reviews? What does this tell us about the review process?
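A quick way to answer that, as a minimal sketch using the dataframes defined above:
python
# Number of individual reviews vs. number of distinct papers
print("Reviews:", len(df_rev_dec))
print("Distinct papers:", df_rev_dec['forum'].nunique())
print("Paper-level rows after aggregation:", len(df_rev_dec_ave))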
Module 8: Visualizing Token Count vs. Rating
8.1 Creating an Advanced Visualization
Now let's visualize the relationship between token count and rating at the paper level:
python
# Create a 2D histogram with token count, rating, and decision
ax = sns.histplot(data=df_rev_dec_ave, x='tokens_counts',
                  y='rating_int',
                  hue='decision_clean',
                  kde=True,
                  log_scale=(True,False),
                  legend=True)

# Set axis labels
ax.set(xlabel='Review Length (# Tokens)', ylabel='Review Rating')

# Move the legend outside the plot
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

plt.tight_layout()
plt.show()
8.2 Interpreting the Visualization
This visualization reveals important patterns in our data:
Decision Boundaries: Notice where the color changes from one decision to another
Length-Rating Relationship: Is there a correlation between review length and rating?
Clustering: Are there natural clusters in the data?
Outliers: What papers received unusually long or short reviews?
Key Insight: At the paper level, we can see if the average review length for a paper relates to its likelihood of acceptance.
Module 9: Comparing Token Count and Sentence Count
9.1 Visualizing Sentence Count vs. Rating
Let's create a similar visualization using sentence count instead of token count:
python
# Create a 2D histogram with sentence count, rating, and decision
ax = sns.histplot(data=df_rev_dec_ave, x='sent_count',
                  y='rating_int',
                  hue='decision_clean',
                  kde=True,
                  log_scale=(True,False),
                  legend=True)

# Set axis labels
ax.set(xlabel='Review Length (# Sentences)', ylabel='Review Rating')

# Move the legend outside the plot
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

plt.tight_layout()
plt.show()
9.2 Comparing Token vs. Sentence Metrics
By comparing these two visualizations, we can understand:
Which Metric is More Informative: Do token counts or sentence counts better differentiate accepted vs. rejected papers?
Different Patterns: Do some papers have many short sentences while others have fewer long ones?
Consistency: Are the patterns consistent across both metrics?
Discussion Question: Which metric, tokens or sentences, seems to be a better predictor of paper acceptance? Why might that be?
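One way to probe that question numerically (a rough sketch, not part of the original analysis) is to check simple correlations at the paper level:
python
# Correlation of the two length metrics with the average rating
print(df_rev_dec_ave[['tokens_counts', 'sent_count', 'rating_int']].corr())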
Module 10: Word Cloud Visualizations
10.1 Creating a Word Cloud from Review Text
Word clouds are a powerful way to visualize the most frequent words in a text corpus:
python
# Concatenate all review text
text = ' '.join(df_rev_dec['clean_review_word'])

# Generate word cloud
wordcloud = WordCloud().generate(text)

# Display word cloud
plt.figure(figsize=(8, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
10.2 Visualizing Paper Keywords
Now let's visualize the primary keywords associated with the papers:
python
# Concatenate all primary keywords
text = ' '.join(df_keyword['primary_keyword'])

# Generate word cloud
wordcloud = WordCloud().generate(text)

# Display word cloud
plt.figure(figsize=(8, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
10.3 Visualizing Paper Abstracts
Finally, let's create a word cloud from paper abstracts:
python
# Concatenate all abstracts
text = ' '.join(df_submissions['abstract'])

# Generate word cloud
wordcloud = WordCloud().generate(text)

# Display word cloud
plt.figure(figsize=(8, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Interpreting Word Clouds
Word clouds provide insights about:
Dominant Themes: The most frequent words appear largest
Vocabulary Differences: Compare terms across different sources (reviews vs. abstracts)
Field-Specific Terminology: Technical terms reveal the focus of the conference
Sentiment Indicators: Evaluative words in reviews reveal assessment patterns
Try it yourself: What differences do you notice between the word clouds from reviews, keywords, and abstracts? What do these differences tell you about academic communication?

V1.1 Week 4 - Intro to NLP
Course Overview
In this course, you'll learn fundamental Natural Language Processing (NLP) concepts by exploring a fascinating real-world question: What is the effect of releasing a preprint of a paper before it is submitted for peer review?
Using the ICLR (International Conference on Learning Representations) database - which contains submissions, reviews, and author profiles from 2017-2022 - you'll develop practical NLP skills while investigating potential biases and patterns in academic publishing.
Learning Path
Understanding Text as Data: How computers represent and work with text
Text Processing Fundamentals: Basic cleaning and normalization
Quantitative Text Analysis: Measuring and comparing text features
Tokenization Approaches: Breaking text into meaningful units
Text Visualization Techniques: Creating insightful visual representations
From Analysis to Insights: Drawing evidence-based conclusions
Let's dive in!
…
Step 4: Text Cleaning and Normalization for Academic Content
Academic papers contain specialized vocabulary, citations, equations, and other elements that require careful normalization.
Key Concept: Scientific text normalization preserves meaningful technical content while standardizing format.
Stop Words Removal
Definition: Stop words are extremely common words that appear frequently in text but typically carry little meaningful information for analysis purposes. In English, these include articles (the, a, an), conjunctions (and, but, or), prepositions (in, on, at), and certain pronouns (I, you, it).
Stop words removal is the process of filtering these words out before analysis to:
Reduce noise in the data
Decrease the dimensionality of the text representation
Focus analysis on the content-bearing words
In academic text, we often extend standard stop word lists to include domain-specific terms that are ubiquitous but not analytically useful (e.g., "paper," "method," "result").
python
# Load standard English stop words
from nltk.corpus import stopwords
standard_stop_words = set(stopwords.words('english'))

# Add academic-specific stop words
academic_stop_words = ['et', 'al', 'fig', 'table', 'paper', 'using', 'used',
                       'method', 'result', 'show', 'propose', 'use']
all_stop_words = standard_stop_words.union(academic_stop_words)

# Apply stop word removal
def remove_stop_words(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in all_stop_words]
    return ' '.join(filtered_words)

# Compare before and after
example = "We propose a novel method that shows impressive results on the benchmark dataset."
filtered = remove_stop_words(example)

print("Original:", example)
print("After stop word removal:", filtered)
# Output: "novel shows impressive results benchmark dataset."
Stemming and Lemmatization
Definition: Stemming and lemmatization are text normalization techniques that reduce words to their root or base forms, allowing different inflections or derivations of the same word to be treated as equivalent.
Stemming is a simpler, rule-based approach that works by truncating words to their stems, often by removing suffixes. For example:
"running," "runs," and "runner" might all be reduced to "run"
"connection," "connected," and "connecting" might all become "connect"
Stemming is faster but can sometimes produce non-words or incorrect reductions.
Lemmatization is a more sophisticated approach that uses vocabulary and morphological analysis to return the dictionary base form (lemma) of a word. For example:
"better" becomes "good"
"was" and "were" become "be"
"studying" becomes "study"
Lemmatization generally produces more accurate results but requires more computational resources.
python
from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk
nltk.download('wordnet')

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Example words
academic_terms = ["algorithms", "computing", "learning", "trained",
                  "networks", "better", "studies", "analyzed"]

# Compare stemming and lemmatization
for term in academic_terms:
    print(f"Original: {term}")
    print(f"Stemmed: {stemmer.stem(term)}")
    print(f"Lemmatized: {lemmatizer.lemmatize(term)}")
    print()

# Demonstration in context
academic_sentence = "The training algorithms performed better than expected when analyzing multiple neural networks."

# Apply stemming
stemmed_words = [stemmer.stem(word) for word in academic_sentence.lower().split()]
stemmed_sentence = ' '.join(stemmed_words)

# Apply lemmatization
lemmatized_words = [lemmatizer.lemmatize(word) for word in academic_sentence.lower().split()]
lemmatized_sentence = ' '.join(lemmatized_words)

print("Original:", academic_sentence)
print("Stemmed:", stemmed_sentence)
print("Lemmatized:", lemmatized_sentence)
When to use which approach:
For academic text analysis:
Stemming is useful when processing speed is important and approximate matching is sufficient
Lemmatization is preferred when precision is crucial, especially for technical terms where preserving meaning is essential
In our ICLR paper analysis, lemmatization would likely be more appropriate since technical terminology often carries specific meanings that should be preserved accurately.
Challenge Question: How might stemming versus lemmatization affect our analysis of technical innovation in ICLR papers? Can you think of specific machine learning terms where these approaches would yield different results?

V1.0 Week 4 - Intro to NLP
The Real-World Problem
Imagine you're part of a small business team that has just launched a new product. You've received hundreds of customer reviews across various platforms, and your manager has asked you to make sense of this feedback. Looking at the mountain of text data, you realize you need a systematic way to understand what customers are saying without reading each review individually.
Your challenge: How can you efficiently analyze customer feedback to identify common themes, sentiments, and specific product issues?
Our Approach
In this module, we'll learn how to transform unstructured text feedback into structured insights using Natural Language Processing. Here's our journey:
Understanding text as data
Basic processing of text information
Measuring text properties
Cleaning and normalizing customer feedback
Visualizing patterns in the feedback
Analyzing words vs. tokens
Let's begin!
Step 1: Text as Data - A New Perspective
When we look at customer reviews like:
"Love this product! So easy to use and the battery lasts forever."
"Terrible design. Buttons stopped working after two weeks."
We naturally understand the meaning and sentiment. But how can a computer understand this?
Key Concept: Text can be treated as data that we can analyze quantitatively.
Unlike numerical data (age, price, temperature) that has inherent mathematical properties, text data needs to be transformed before we can analyze it.
Interactive Exercise: Look at these two reviews. As a human, what information can you extract? Now think about how a computer might "see" this text without any processing.
Challenge Question: What types of information might we want to extract from customer reviews? List at least three analytical goals.
Step 2: Basic Text Processing - Breaking Down Language
Before we can analyze text, we need to break it down into meaningful units.
Key Concept: Tokenization is the process of splitting text into smaller pieces (tokens) such as words, phrases, or characters.
For example, the review "Love this product!" can be tokenized into ["Love", "this", "product", "!"] or ["Love", "this", "product!"] depending on our approach.
Interactive Example: Let's tokenize these customer reviews:
python
# Simple word tokenization
review = "Battery life is amazing but the app crashes frequently."
tokens = review.split()  # Results in ["Battery", "life", "is", "amazing", "but", "the", "app", "crashes", "frequently."]
Notice how "frequently." includes the period. Basic tokenization has limitations!
Challenge Question: How might we handle contractions like "doesn't" or hyphenated words like "user-friendly" when tokenizing?
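One option (a small sketch using NLTK's word_tokenize, which the simple split() example above does not use) is to let a rule-based tokenizer make those decisions:
python
from nltk.tokenize import word_tokenize

# NLTK's tokenizer typically splits contractions ("doesn't" -> "does", "n't")
# and keeps hyphenated words such as "user-friendly" as a single token
print(word_tokenize("This doesn't feel user-friendly."))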
Step 3: Measuring Text - Quantifying Feedback
Now that we've broken text into pieces, we can start measuring properties of our customer feedback.
Key Concept: Text metrics help us quantify and compare text data.
Common metrics include:
Length (words, characters)
Complexity (average word length, unique words ratio)
Sentiment scores (positive/negative)
Interactive Example: Let's calculate basic metrics for customer reviews:
python
# Word count
review = "The interface is intuitive and responsive."
word_count = len(review.split())  # 6 words

# Character count (including spaces)
char_count = len(review)  # 42 characters

# Unique words ratio
unique_words = len(set(review.lower().split()))
unique_ratio = unique_words / word_count  # 1.0 (all words are unique)
Challenge Question: Why might longer reviews not necessarily contain more information than shorter ones? What other metrics beyond length might better capture information content?
Step 4: Text Cleaning and Normalization
Customer feedback often contains inconsistencies: spelling variations, punctuation, capitalization, etc.
Key Concept: Text normalization creates a standardized format for analysis.
Common normalization steps:
Converting to lowercase
Removing punctuation
Correcting spelling
Removing stop words (common words like "the", "is")
Stemming or lemmatizing (reducing words to their base form)
Interactive Example: Let's normalize a review:
python
# Original review
review = "The battery LIFE is amazing!!! Works for days."

# Lowercase
review = review.lower()  # "the battery life is amazing!!! works for days."

# Remove punctuation and extra spaces
import re
review = re.sub(r'[^\w\s]', '', review)  # "the battery life is amazing works for days"

# Remove stop words
stop_words = ["the", "is", "for"]
words = review.split()
filtered_words = [word for word in words if word not in stop_words]
# Result: ["battery", "life", "amazing", "works", "days"]
Challenge Question: How might normalization affect sentiment analysis? Could removing punctuation or stop words change the perceived sentiment of a review?
Step 5: Text Visualization - Seeing Patterns
Visual representations help us identify patterns across many reviews.
Key Concept: Text visualization techniques reveal insights that are difficult to see in raw text.
Common visualization methods:
Word clouds
Frequency distributions
Sentiment over time
Topic clusters
Interactive Example: Creating a simple word frequency chart:
python
from collections import Counter

# Combined reviews
reviews = ["Battery life is amazing", "Battery drains too quickly",
           "Great battery performance", "Screen is too small"]

# Count word frequencies
all_words = " ".join(reviews).lower().split()
word_counts = Counter(all_words)
# Result: {'battery': 3, 'life': 1, 'is': 2, 'amazing': 1, 'drains': 1, 'too': 2, 'quickly': 1, 'great': 1, 'performance': 1, 'screen': 1, 'small': 1}

# We could visualize this as a bar chart
# Most frequent: 'battery' (3), 'is' (2), 'too' (2)
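To draw the bar chart mentioned in the comment above, here is a minimal sketch assuming the word_counts Counter from the previous block:
python
import matplotlib.pyplot as plt

# Plot the five most common words as a simple bar chart
words, counts = zip(*word_counts.most_common(5))
plt.bar(words, counts)
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.show()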
Challenge Question: Why might a word cloud be misleading for understanding customer sentiment? What additional information would make the visualization more informative?
Step 6: Words vs. Tokens - Making Choices
As we advance in NLP, we face an important decision: should we analyze whole words or more sophisticated tokens?
Key Concept: Different tokenization approaches have distinct advantages and limitations.
Word-based analysis:
Intuitive and interpretable
Misses connections between related words (run/running/ran)
Struggles with compound words and new terms
Token-based analysis:
Can capture subword information
Handles unknown words better
May lose some human interpretability
Interactive Example: Comparing approaches:
python
# Word-based
review = "The touchscreen is unresponsive"
words = review.lower().split()  # ['the', 'touchscreen', 'is', 'unresponsive']

# Subword tokenization (simplified example)
subwords = ['the', 'touch', 'screen', 'is', 'un', 'responsive']
Challenge Question: For our customer feedback analysis, which approach would be better: analyzing whole words or subword tokens? What factors would influence this decision?
Putting It All Together: Solving Our Problem
Now that we've learned these fundamental NLP concepts, let's return to our original challenge: analyzing customer feedback at scale.
Here's how we'd approach it:
Collect and tokenize all customer reviews
Clean and normalize the text
Calculate key metrics (length, sentiment scores)
Visualize common terms and topics
Identify positive and negative feedback themes
Generate an automated summary for the product team
By applying these NLP fundamentals, we've transformed an overwhelming mass of text into actionable insights that can drive product improvements!
Final Challenge: How could we extend this analysis to track customer sentiment over time as we release product updates? What additional NLP techniques might be helpful?
app/__pycache__/main.cpython-311.pyc
CHANGED
Binary files a/app/__pycache__/main.cpython-311.pyc and b/app/__pycache__/main.cpython-311.pyc differ
app/main.py
CHANGED
@@ -16,7 +16,7 @@ from app.components.login import login
 from app.pages import week_1
 from app.pages import week_2
 from app.pages import week_3
-
+from app.pages import week_4
 # Page configuration
 st.set_page_config(
     page_title="Data Science Course App",
@@ -139,6 +139,8 @@ def show_week_content():
         week_2.show()
     elif st.session_state.current_week == 3:
         week_3.show()
+    elif st.session_state.current_week == 4:
+        week_4.show()
     else:
         st.warning("Content for this week is not yet available.")

@@ -151,7 +153,7 @@ def main():
         return

     # User is logged in, show course content
-    if st.session_state.current_week in [1, 2, 3]:
+    if st.session_state.current_week in [1, 2, 3, 4]:
         show_week_content()
     else:
         st.title("Data Science Research Paper Course")
app/pages/__pycache__/week_2.cpython-311.pyc
CHANGED
Binary files a/app/pages/__pycache__/week_2.cpython-311.pyc and b/app/pages/__pycache__/week_2.cpython-311.pyc differ
app/pages/__pycache__/week_4.cpython-311.pyc
ADDED
Binary file (11 kB).
app/pages/week_4.py
ADDED
@@ -0,0 +1,217 @@
import streamlit as st
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt_tab')
nltk.download('stopwords')
from nltk.stem import PorterStemmer, WordNetLemmatizer
from wordcloud import WordCloud
import string
import io
from contextlib import redirect_stdout

# Initialize session state for notebook-like cells
if 'cells' not in st.session_state:
    st.session_state.cells = []
if 'df' not in st.session_state:
    st.session_state.df = None

def capture_output(code, df=None):
    """Helper function to capture print output"""
    f = io.StringIO()
    with redirect_stdout(f):
        try:
            # Create a dictionary of variables to use in exec
            variables = {'pd': pd, 'np': np, 'plt': plt, 'sns': sns, 'nltk': nltk}
            if df is not None:
                variables['df'] = df
            exec(code, variables)
        except Exception as e:
            return f"Error: {str(e)}"
    return f.getvalue()

def show():
    st.title("Week 4: Introduction to Natural Language Processing")

    # Introduction Section
    st.header("Course Overview")
    st.write("""
    In this course, you'll learn fundamental Natural Language Processing (NLP) concepts by exploring a fascinating real-world question:
    What is the effect of releasing a preprint of a paper before it is submitted for peer review?

    Using the ICLR (International Conference on Learning Representations) database - which contains submissions, reviews, and author profiles
    from 2017-2022 - you'll develop practical NLP skills while investigating potential biases and patterns in academic publishing.
    """)

    # Learning Path
    st.subheader("Learning Path")
    st.write("""
    1. Understanding Text as Data: How computers represent and work with text
    2. Text Processing Fundamentals: Basic cleaning and normalization
    3. Quantitative Text Analysis: Measuring and comparing text features
    4. Tokenization Approaches: Breaking text into meaningful units
    5. Text Visualization Techniques: Creating insightful visual representations
    6. From Analysis to Insights: Drawing evidence-based conclusions
    """)

    # Module 1: Text as Data
    st.header("Module 1: Text as Data")
    st.write("""
    When we look at text like customer reviews or academic papers, we naturally understand the meaning.
    But how can a computer understand this?

    Key Concept: Text can be treated as data that we can analyze quantitatively.
    Unlike numerical data (age, price, temperature) that has inherent mathematical properties,
    text data needs to be transformed before we can analyze it.
    """)

    # Interactive Example
    st.subheader("Interactive Example: Text Tokenization")
    st.write("Let's try tokenizing some text:")

    example_text = st.text_area(
        "Enter some text to tokenize:",
        "The quick brown fox jumps over the lazy dog."
    )

    if st.button("Tokenize Text"):
        tokens = word_tokenize(example_text)
        st.write("Tokens:", tokens)
        st.write("Number of tokens:", len(tokens))

    # Module 2: Text Processing
    st.header("Module 2: Text Processing")
    st.write("""
    Before we can analyze text, we need to clean and normalize it. This includes:
    - Converting to lowercase
    - Removing punctuation
    - Removing stop words
    - Stemming or lemmatization
    """)

    # Interactive Text Processing
    st.subheader("Try Text Processing")
    st.write("""
    Let's process some text using different techniques:
    """)

    process_text = st.text_area(
        "Enter text to process:",
        "The quick brown fox jumps over the lazy dog.",
        key="process_text"
    )

    col1, col2 = st.columns(2)

    with col1:
        if st.button("Remove Stop Words"):
            stop_words = set(stopwords.words('english'))
            words = word_tokenize(process_text.lower())
            filtered_words = [word for word in words if word not in stop_words]
            st.write("After removing stop words:", filtered_words)

    with col2:
        if st.button("Remove Punctuation"):
            no_punct = process_text.translate(str.maketrans('', '', string.punctuation))
            st.write("After removing punctuation:", no_punct)

    # Module 3: Text Visualization
    st.header("Module 3: Text Visualization")
    st.write("""
    Visual representations help us identify patterns across text data.
    Common visualization methods include:
    - Word clouds
    - Frequency distributions
    - Sentiment over time
    - Topic clusters
    """)

    # Interactive Word Cloud
    st.subheader("Create a Word Cloud")
    st.write("""
    Let's create a word cloud from some text:
    """)

    wordcloud_text = st.text_area(
        "Enter text for word cloud:",
        "The quick brown fox jumps over the lazy dog. The fox is quick and brown. The dog is lazy.",
        key="wordcloud_text"
    )

    if st.button("Generate Word Cloud"):
        # Create and generate a word cloud image
        wordcloud = WordCloud().generate(wordcloud_text)

        # Display the word cloud
        plt.figure(figsize=(10, 6))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis('off')
        st.pyplot(plt)

    # Practice Exercises
    st.header("Practice Exercises")

    with st.expander("Exercise 1: Text Processing"):
        st.write("""
        1. Load a sample text
        2. Remove stop words and punctuation
        3. Create a word cloud
        4. Analyze word frequencies
        """)

        st.code("""
# Solution
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud
import string

# Sample text
text = "Your text here"

# Remove punctuation
text = text.translate(str.maketrans('', '', string.punctuation))

# Remove stop words
stop_words = set(stopwords.words('english'))
words = text.split()
filtered_words = [word for word in words if word.lower() not in stop_words]

# Create word cloud
wordcloud = WordCloud().generate(' '.join(filtered_words))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
        """)

    with st.expander("Exercise 2: Text Analysis"):
        st.write("""
        1. Calculate basic text metrics (word count, unique words)
        2. Perform stemming and lemmatization
        3. Compare the results
        4. Visualize the differences
        """)

        st.code("""
# Solution
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Sample words
words = ["running", "runs", "ran", "better", "good"]

# Apply stemming and lemmatization
stemmed = [stemmer.stem(word) for word in words]
lemmatized = [lemmatizer.lemmatize(word) for word in words]

# Compare results
for word, stem, lemma in zip(words, stemmed, lemmatized):
    print(f"Original: {word}, Stemmed: {stem}, Lemmatized: {lemma}")
        """)
requirements.txt
CHANGED
@@ -4,4 +4,6 @@ numpy==1.26.4
 scikit-learn==1.4.0
 matplotlib==3.8.3
 seaborn==0.13.2
-plotly==5.18.0
+plotly==5.18.0
+nltk==3.8.1
+wordcloud==1.9.3