import streamlit as st
import altair as alt
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re
from tqdm import tqdm  # console progress bar; tqdm.notebook only renders inside Jupyter
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax

# pip install torch torchvision torchaudio
# pip install transformers

st.title("Final Project Part 2 - Jason Wu | Expert Visualizations")

url = "https://www.kaggle.com/datasets/rahulgoel1106/xenophobia-on-twitter-during-covid19"
st.write("Dataset Link to Download -> [Kaggle Covid-19 Xenophobic Datatset](%s)" % url)

plt.style.use('ggplot')

multi = '''The dataset chosen is called Xenophobia and, as the name suggests, it highlights xenophobic posts on Twitter
during the early stages of Covid-19. Here we conduct sentiment analysis on it using a pretrained Twitter sentiment model.
#### To follow: '''
st.markdown('''### About:''')
st.markdown(multi)
st.code('''# run these installs in your terminal or workspace
pip install torch torchvision torchaudio # required to run the trained model
pip install transformers # provides the tokenizer and model classes''')

df = pd.read_csv('Xenophobia.csv', encoding='latin1', nrows=5000)
cols_to_drop = ['status_id', 'created_at', 'location']
df.drop(cols_to_drop, axis=1, inplace=True)

# Convert text to string type
df['text'] = df['text'].astype(str)

st.markdown('''#### Loading Data & Removing Unwanted Data: ''')
st.code('''df = pd.read_csv('Xenophobia.csv', encoding='latin1', nrows=5000)
cols_to_drop = ['status_id', 'created_at', 'location']
df.drop(cols_to_drop, axis=1, inplace=True)''')

multi1 = ''' #### Next Steps:
The next step is to run the sentiment analysis on the dataset. Since the analysis takes a long time to run, I am only testing 5,000 of the millions of rows.
1. The first step is to initialize the model by loading it from Hugging Face'''

st.markdown(multi1)
st.code('''MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)''')
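
st.markdown('''Note: the first `from_pretrained` call downloads the model weights from the Hugging Face Hub and caches them locally, so the initial run needs an internet connection.''')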

st.markdown('''2. Then we have to clean the text by stripping non-ASCII characters and extra whitespace''')
st.code(r'''def clean_text(text):
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)  # remove non-ASCII characters
    text = re.sub(r'\s+', ' ', text).strip()    # collapse extra whitespace
    return text''')
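
st.markdown('''For example (illustrative input, not from the dataset), `clean_text("Stay safe 😷   everyone!")` returns `"Stay safe everyone!"` - the emoji is non-ASCII, so it is replaced, and the extra spaces collapse to one.''')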
st.markdown('''3. Lastly, we run the model on the cleaned text and write the sentiment scores to a new CSV file for later use''')

st.code('''def examine_text(example):
    try:  # guard against texts the model fails to process
        encoded_text = tokenizer(  # tokenize, truncating/padding to the model's 512-token limit
            example,
            return_tensors='pt', 
            truncation=True,
            max_length=512,
            padding="max_length"
        )
        output = model(**encoded_text)
        scores = output.logits[0].detach().numpy()
        scores = softmax(scores)  # softmax converts the raw logits into probabilities that sum to 1
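        # e.g. softmax([1.0, 2.0, 3.0]) ~ [0.090, 0.245, 0.665] (illustrative values summing to 1)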
        return {
            'neg': scores[0],
            'neu': scores[1],
            'pos': scores[2]
        }
    except Exception as e: # log and skip any text that fails
        print(f"Error processing text: {example}\nError: {e}")
        return None ''')
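
st.markdown('''On a single tweet, `examine_text` returns the three class probabilities as a dict, e.g. roughly `{'neg': 0.02, 'neu': 0.11, 'pos': 0.87}` for a clearly positive text (illustrative numbers, not a real output).''')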

st.code('''results = []

# Process each text
for i, row in tqdm(df.iterrows(), total=len(df)):
    text = clean_text(row['text'])
    scores = examine_text(text)
    if scores:
        # Append scores to results
        results.append({'index': i, 'neg': scores['neg'], 'neu': scores['neu'], 'pos': scores['pos']})
    else:
        print(f"Skipped problematic text: {text}")

# Convert results to a DataFrame
results_df = pd.DataFrame(results)

# Save to CSV
results_df.to_csv('sentiment_scores.csv', index=False)
print("Saved sentiment scores to 'sentiment_scores.csv'")
# prints when done - it took me 20 minutes for 5,000 rows, so imagine a million''')
# MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
# tokenizer = AutoTokenizer.from_pretrained(MODEL)
# model = AutoModelForSequenceClassification.from_pretrained(MODEL)

# # Clean Text Function
# def clean_text(text):
#     """Remove non-ASCII characters and excess whitespace."""
#     text = re.sub(r'[^\x00-\x7F]+', ' ', text)  # Remove non-ASCII
#     text = re.sub(r'\s+', ' ', text).strip()    # Remove excess whitespace
#     return text


# Sentiment analysis function
# def examine_text(example):
#     try:
#         encoded_text = tokenizer(
#             example,
#             return_tensors='pt',
#             truncation=True,
#             max_length=512,
#             padding="max_length"
#         )
#         output = model(**encoded_text)
#         scores = output.logits[0].detach().numpy()
#         scores = softmax(scores)
#         return {
#             'neg': scores[0],
#             'neu': scores[1],
#             'pos': scores[2]
#         }
#     except Exception as e:
#         print(f"Error processing text: {example}\nError: {e}")
#         return None


# Prepare list for saving results
# results = []

# # Process each text
# for i, row in tqdm(df.iterrows(), total=len(df)):
#     text = clean_text(row['text'])
#     scores = examine_text(text)
#     if scores:
#         # Append scores to results
#         results.append({'index': i, 'neg': scores['neg'], 'neu': scores['neu'], 'pos': scores['pos']})
#     else:
#         print(f"Skipped problematic text: {text}")

# # Convert results to a DataFrame
# results_df = pd.DataFrame(results)

# # Save to CSV
# results_df.to_csv('sentiment_scores.csv', index=False)
# print("Saved sentiment scores to 'sentiment_scores.csv'")


st.markdown(''' ### Plotting in Altair
We then load the sentiment scores from the CSV file, merge them back into the DataFrame, and build the plots''')
st.code('''# Load sentiment scores
sentiment_scores = pd.read_csv('sentiment_scores.csv')
df = df.reset_index().merge(sentiment_scores, on='index')''')

# Load sentiment scores
sentiment_scores = pd.read_csv('sentiment_scores.csv')
df = df.reset_index().merge(sentiment_scores, on='index')
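# reset_index() exposes the original row index as an 'index' column, matching the
# 'index' key saved by the scoring loop, so the merge re-attaches each row's scores.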

# Clean text function (same as shown above)
def clean_text(text):
    """Remove non-ASCII characters and collapse excess whitespace."""
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)  # remove non-ASCII
    text = re.sub(r'\s+', ' ', text).strip()    # collapse whitespace
    return text

df['cleaned_text'] = df['text'].apply(clean_text)

# Determine the highest sentiment score for each row
df['highest_score'] = df[['neg', 'neu', 'pos']].max(axis=1)
df['sentiment_type'] = df[['neg', 'neu', 'pos']].idxmax(axis=1)  # neg/neu/pos as categories
df['sentiment_type'] = df['sentiment_type'].replace({
    'neg': 'Negative',
    'neu': 'Neutral',
    'pos': 'Positive'
})
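
# Illustrative example (toy values, not from the dataset): a row with neg=0.10,
# neu=0.25, pos=0.65 gets highest_score=0.65 and sentiment_type='Positive'.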

# Filters
sentiment_filter = st.multiselect(
    "Select Sentiment Types to Display:",
    options=['Negative', 'Neutral', 'Positive'],
    default=['Negative', 'Neutral', 'Positive']
)
score_filter = st.slider(
    "Select Minimum Sentiment Score:",
    min_value=0.0000,
    max_value=1.0000,
    value=0.0000,
    step=0.0001
)

# Filter the DataFrame to only include points that meet criteria
filtered_df = df[
    (df['sentiment_type'].isin(sentiment_filter)) &  # match the selected sentiment types
    (df['highest_score'] >= score_filter)            # meet the minimum score threshold
]

filtered_counts = filtered_df['sentiment_type'].value_counts()

# Generate a summary message for the counts
filtered_summary = (
    f"**Filtered DataFrame:**\n"
    f"- **Negative Sentiments Count:** {filtered_counts.get('Negative', 0)}\n"
    f"- **Neutral Sentiments Count:** {filtered_counts.get('Neutral', 0)}\n"
    f"- **Positive Sentiments Count:** {filtered_counts.get('Positive', 0)}"
)


# Create a brush to link scatter plot and bar chart
brush = alt.selection_interval(encodings=['x', 'y'])
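# The brushed interval on the scatter plot is applied to the bar chart below via
# transform_filter(brush), so the bars aggregate only the currently brushed points.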

# Scatter plot with brush
scatter_plot = alt.Chart(filtered_df).mark_circle(size=60).encode(
    x=alt.X('index:Q', title='Index'),
    y=alt.Y('highest_score:Q', title='Highest Sentiment Score'),
    color=alt.Color('sentiment_type:N', title='Sentiment Type', scale=alt.Scale(scheme='tableau20')),
    tooltip=['index', 'sentiment_type', 'highest_score', 'cleaned_text', 'text']
).add_params(
    brush
).properties(
    width=800,
    height=400,
    title="Scatter Plot of Sentiment Scores (Filtered) - Brush Feature to show Bar Chart"
)  # drag is reserved for the brush; pan/zoom via .interactive() would conflict with it

# Bar chart linked to scatter plot
bar_chart = alt.Chart(filtered_df).transform_filter(
    brush
).transform_filter(
    alt.FieldOneOfPredicate(field='sentiment_type', oneOf=sentiment_filter)  # Apply multiselect filter
).transform_aggregate(
    total_score='sum(highest_score)',  # Aggregate the highest_score
    groupby=['sentiment_type']         # Group by sentiment type
).mark_bar().encode(
    x=alt.X('sentiment_type:N', title='Sentiment Type'),
    y=alt.Y('total_score:Q', title='Sum of Highest Scores'),
    color=alt.Color('sentiment_type:N', scale=alt.Scale(scheme='tableau20'))
    
).properties(
    width=800,
    height=200,
    title="Bar Chart of Sentiment Sums (Linked to Scatter Plot)"
)

# Combine scatter and bar charts
combined_chart = alt.vconcat(
    scatter_plot,
    bar_chart
)

# Display the combined chart
st.altair_chart(combined_chart, use_container_width=True)

# Display the filtered DataFrame and counts
st.write(filtered_summary)
st.dataframe(
    filtered_df[['sentiment_type', 'cleaned_text', 'highest_score', 'text']]
)

st.header('''Contextual Dataset''')
st.image("amazon.jpeg")
url1 = "https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews"
st.write("Dataset Link to Download -> [Kaggle Amazon Reviews Dataset](%s)" % url1)

multi4 = '''The dataset I chose as a contextual dataset is the Amazon Reviews dataset on Kaggle. I chose it for its similar nature, covering reviews of products instead of tweets. The dataset provides more sentiment data, for products, and could
be used to fine-tune the model to be more effective at identifying xenophobic tweets. In the future, I want to train the model in order to do better analysis on the Twitter dataset.'''

st.markdown(multi4)
st.header('''Write Up''')
multi2 = '''As mentioned in the beginning, as an Asian American I wanted to highlight the xenophobic tweets during
Covid-19, and using a trained sentiment analysis model to analyze and visualize the tweets was an instant idea
when I found the dataset. In the first plot, I decided to use a scatter plot to better visualize all the tweets, plotting each tweet's highest sentiment score.
To do that, I had to compare the negative, neutral, and positive scores and find the highest one. Then, depending on which filters you select,
the dataframe and the scatter plot update accordingly. As an additional layer, to make the scatter plot efficient for comparing sentiments, the data related
to a point appears when you hover over it.

For the second plot, I wanted the expert to easily view the dataframe and use it as a secondary reference to the scatter plot for making insights, because in a table format
you're able to see all the columns better. Additionally, I adjusted the column order to make it as efficient as possible. As an additional layer, there is a count of how many
data points are showing, and it updates according to the filters.

For the interactivity, I wanted two types of filters: the multiselect first lets the expert easily manage the points, and the slider then filters the points again by sentiment score.

Overall, I am happy with the plots, but if I had more time I would definitely load and analyze more data points, which is easy to do by just increasing the number of rows to parse and
letting the program run unattended. However, I don't know how long that would take; I already explored options like the tqdm module to add a progress bar, but it doesn't really work in the terminal locally for me (likely because I was importing the notebook version).'''
st.markdown(multi2)