import streamlit as st
import altair as alt
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re
from tqdm.notebook import tqdm
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax
# pip install torch torchvision torchaudio
# pip install transformers
st.title("Final Project Part 2 - Jason Wu | Expert Visualizations")
url = "https://www.kaggle.com/datasets/rahulgoel1106/xenophobia-on-twitter-during-covid19"
st.write("Dataset Link to Download -> [Kaggle Covid-19 Xenophobia Dataset](%s)" % url)
plt.style.use('ggplot')
multi = '''The dataset chosen is called Xenophobia and, true to its name, it highlights xenophobic posts on Twitter
during the beginning stages of Covid-19. Today we are conducting sentiment analysis on it using a pretrained Twitter sentiment model.
#### To follow: '''
st.markdown('''### About:''')
st.markdown(multi)
st.code('''# pip install these packages into your terminal or workspace
pip install torch torchvision torchaudio # to work with the trained model
pip install transformers # to work with the trained model''')
df = pd.read_csv('Xenophobia.csv', encoding='latin1', nrows=5000)
cols_to_drop = ['status_id', 'created_at', 'location']
df.drop(cols_to_drop, axis=1, inplace=True)
# Convert text to string type
df['text'] = df['text'].astype(str)
st.markdown('''#### Loading Data & Removing Unwanted Data: ''')
st.code('''df = pd.read_csv('Xenophobia.csv', encoding='latin1', nrows=5000)
cols_to_drop = ['status_id', 'created_at', 'location']
df.drop(cols_to_drop, axis=1, inplace=True)''')
multi1 = ''' #### Next Steps:
The next step is to run the sentiment analysis on the dataset; however, the analysis takes a long time to run, so I am only going to test 5000 rows out of the millions available.
1. The first step is to initialize the model and load it from Hugging Face'''
st.markdown(multi1)
st.code('''MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)''')
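# Note: this checkpoint classifies each text into three classes, ordered
# negative / neutral / positive, which is why the scores are read out of
# indices 0, 1, and 2 further below.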
st.markdown('''2. Then we have to clean the text, stripping non-ASCII characters and extra whitespace''')
st.code(r'''def clean_text(text):
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text''')
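# For example (illustrative input, not from the dataset):
# clean_text("Fight 😷  the   virus") -> "Fight the virus"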
st.markdown('''3. Lastly, we run the model on the cleaned text and save the sentiment scores to a new csv file for later use''')
st.code(r'''def examine_text(example):
    try: # Use try/except to handle errors such as text that fails to encode
        encoded_text = tokenizer( # tokenize with truncation/padding to 512 tokens
            example,
            return_tensors='pt',
            truncation=True,
            max_length=512,
            padding="max_length"
        )
        output = model(**encoded_text)
        scores = output.logits[0].detach().numpy()
        scores = softmax(scores) # softmax turns the raw logits into probabilities that sum to 1
        return {
            'neg': scores[0],
            'neu': scores[1],
            'pos': scores[2]
        }
    except Exception as e: # handling errors
        print(f"Error processing text: {example}\nError: {e}")
        return None''')
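# Illustrative sanity check of softmax (made-up logits, not model output):
# it turns [1.0, 2.0, 3.0] into roughly [0.09, 0.245, 0.665], and the
# resulting probabilities always sum to 1.
_demo = softmax(np.array([1.0, 2.0, 3.0]))
assert abs(_demo.sum() - 1.0) < 1e-6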
st.code('''results = []
# Process each text
for i, row in tqdm(df.iterrows(), total=len(df)):
    text = clean_text(row['text'])
    scores = examine_text(text)
    if scores:
        # Append scores to results
        results.append({'index': i, 'neg': scores['neg'], 'neu': scores['neu'], 'pos': scores['pos']})
    else:
        print(f"Skipped problematic text: {text}")
# Convert results to a DataFrame
results_df = pd.DataFrame(results)
# Save to CSV
results_df.to_csv('sentiment_scores.csv', index=False)
print("Saved sentiment scores to 'sentiment_scores.csv'")
# prints out when done - took me 20 minutes for 5000 rows, so imagine a million''')
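# (A possible speed-up, sketched as an assumption rather than tested code:
# the tokenizer accepts a list of texts, so scoring tweets in batches with
# one model(**encoded_batch) call per batch avoids the per-row overhead.)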
# One-time preprocessing pipeline (kept commented out because it takes ~20
# minutes for 5000 rows): uncomment to regenerate 'sentiment_scores.csv'.
# MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
# tokenizer = AutoTokenizer.from_pretrained(MODEL)
# model = AutoModelForSequenceClassification.from_pretrained(MODEL)
# # Clean text function
# def clean_text(text):
#     """Remove non-ASCII characters and excess whitespace."""
#     text = re.sub(r'[^\x00-\x7F]+', ' ', text)  # Remove non-ASCII
#     text = re.sub(r'\s+', ' ', text).strip()  # Remove excess whitespace
#     return text
# # Sentiment analysis function
# def examine_text(example):
#     try:
#         encoded_text = tokenizer(
#             example,
#             return_tensors='pt',
#             truncation=True,
#             max_length=512,
#             padding="max_length"
#         )
#         output = model(**encoded_text)
#         scores = output.logits[0].detach().numpy()
#         scores = softmax(scores)
#         return {
#             'neg': scores[0],
#             'neu': scores[1],
#             'pos': scores[2]
#         }
#     except Exception as e:
#         print(f"Error processing text: {example}\nError: {e}")
#         return None
# # Prepare list for saving results
# results = []
# # Process each text
# for i, row in tqdm(df.iterrows(), total=len(df)):
#     text = clean_text(row['text'])
#     scores = examine_text(text)
#     if scores:
#         # Append scores to results
#         results.append({'index': i, 'neg': scores['neg'], 'neu': scores['neu'], 'pos': scores['pos']})
#     else:
#         print(f"Skipped problematic text: {text}")
# # Convert results to a DataFrame
# results_df = pd.DataFrame(results)
# # Save to CSV
# results_df.to_csv('sentiment_scores.csv', index=False)
# print("Saved sentiment scores to 'sentiment_scores.csv'")
st.markdown(''' ### Plotting in Altair
We then load the sentiment scores back in from the csv file, merge them into the DataFrame, and create plots with them''')
st.code('''# Load sentiment scores
sentiment_scores = pd.read_csv('sentiment_scores.csv')
df = df.reset_index().merge(sentiment_scores, on='index')''')
# Load sentiment scores
sentiment_scores = pd.read_csv('sentiment_scores.csv')
df = df.reset_index().merge(sentiment_scores, on='index')
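# (The 'index' column saved by the preprocessing loop lines up with
# reset_index() here because both come from the same 0..n-1 row order.)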
# Clean text function
def clean_text(text):
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text
df['cleaned_text'] = df['text'].apply(clean_text)
# Determine the highest sentiment score for each row
df['highest_score'] = df[['neg', 'neu', 'pos']].max(axis=1)
df['sentiment_type'] = df[['neg', 'neu', 'pos']].idxmax(axis=1)  # neg/neu/pos as categories
df['sentiment_type'] = df['sentiment_type'].replace({
    'neg': 'Negative',
    'neu': 'Neutral',
    'pos': 'Positive'
})
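# For example, a row with neg=0.10, neu=0.25, pos=0.65 gets
# highest_score=0.65 and sentiment_type='Positive'.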
# Filters
sentiment_filter = st.multiselect(
    "Select Sentiment Types to Display:",
    options=['Negative', 'Neutral', 'Positive'],
    default=['Negative', 'Neutral', 'Positive']
)
score_filter = st.slider(
    "Select Minimum Sentiment Score:",
    min_value=0.0000,
    max_value=1.0000,
    value=0.0000,
    step=0.0001
)
# Filter the DataFrame to only include points that meet the criteria
filtered_df = df[
    (df['sentiment_type'].isin(sentiment_filter)) &  # Match selected sentiment types
    (df['highest_score'] >= score_filter)  # Match slider score threshold
]
filtered_counts = filtered_df['sentiment_type'].value_counts()
# Generate a summary message for the counts
filtered_summary = (
    f"**Filtered DataFrame:**\n"
    f"- **Negative Sentiments Count:** {filtered_counts.get('Negative', 0)}\n"
    f"- **Neutral Sentiments Count:** {filtered_counts.get('Neutral', 0)}\n"
    f"- **Positive Sentiments Count:** {filtered_counts.get('Positive', 0)}"
)
# Create a brush to link the scatter plot and bar chart
brush = alt.selection_interval(encodings=['x', 'y'])
# Scatter plot with brush
scatter_plot = alt.Chart(filtered_df).mark_circle(size=60).encode(
    x=alt.X('index:Q', title='Index'),
    y=alt.Y('highest_score:Q', title='Highest Sentiment Score'),
    color=alt.Color('sentiment_type:N', title='Sentiment Type', scale=alt.Scale(scheme='tableau20')),
    tooltip=['index', 'sentiment_type', 'highest_score', 'cleaned_text', 'text']
).add_params(
    brush
).properties(
    width=800,
    height=400,
    title="Scatter Plot of Sentiment Scores (Filtered) - Brush Feature to Show Bar Chart"
).interactive()
# Bar chart linked to the scatter plot
bar_chart = alt.Chart(filtered_df).transform_filter(
    brush
).transform_filter(
    alt.FieldOneOfPredicate(field='sentiment_type', oneOf=sentiment_filter)  # Apply multiselect filter
).transform_aggregate(
    total_score='sum(highest_score)',  # Aggregate the highest_score
    groupby=['sentiment_type']  # Group by sentiment type
).mark_bar().encode(
    x=alt.X('sentiment_type:N', title='Sentiment Type'),
    y=alt.Y('total_score:Q', title='Sum of Highest Scores'),
    color=alt.Color('sentiment_type:N', scale=alt.Scale(scheme='tableau20'))
).properties(
    width=800,
    height=200,
    title="Bar Chart of Sentiment Sums (Linked to Scatter Plot)"
)
# Combine scatter and bar charts
combined_chart = alt.vconcat(
    scatter_plot,
    bar_chart
)
# Display the combined chart
st.altair_chart(combined_chart, use_container_width=True)
# Display the filtered DataFrame and counts
st.write(filtered_summary)
st.dataframe(
    filtered_df[['sentiment_type', 'cleaned_text', 'highest_score', 'text']]
)
st.header('''Contextual Dataset''')
st.image("amazon.jpeg")
url1 = "https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews"
st.write("Dataset Link to Download -> [Kaggle Amazon Reviews Dataset](%s)" % url1)
multi4 = '''The dataset I chose as a contextual dataset is the Amazon Reviews dataset on Kaggle. I chose it because of its similar nature: it is also review-style text, just about products instead. The dataset provides more sentiment for products, but it could
also be used to train the model to be more effective at identifying xenophobic tweets. In the future, I want to fine-tune the model in order to do better analysis on the Twitter dataset.'''
st.markdown(multi4)
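# A rough sketch of how the Amazon reviews could later be used to fine-tune
# the model (assumptions: the column names 'text' and 'label' and the filename
# are hypothetical, and this uses the generic transformers Trainer API rather
# than anything project-specific):
#
# import torch
# from transformers import Trainer, TrainingArguments
#
# class ReviewDataset(torch.utils.data.Dataset):
#     def __init__(self, texts, labels):
#         self.enc = tokenizer(texts, truncation=True, padding=True, max_length=512)
#         self.labels = labels
#     def __len__(self):
#         return len(self.labels)
#     def __getitem__(self, i):
#         item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
#         item['labels'] = torch.tensor(self.labels[i])
#         return item
#
# reviews = pd.read_csv('amazon_reviews.csv')  # hypothetical filename
# train_ds = ReviewDataset(reviews['text'].tolist(), reviews['label'].tolist())
# trainer = Trainer(
#     model=model,
#     args=TrainingArguments(output_dir='finetuned', num_train_epochs=1),
#     train_dataset=train_ds,
# )
# trainer.train()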
st.header('''Write Up''')
multi2 = '''As mentioned in the beginning, as an Asian American I wanted to highlight the xenophobic tweets during
Covid-19, and using a trained sentiment analysis model to analyze and visualize the tweets was an instant idea
when I found the dataset. In the first plot, I decided to use a scatter plot to better visualize all the tweets, with each tweet's highest sentiment score plotted.
To do that, I had to compare the negative, neutral, and positive scores and find the highest one. Then, depending on which filters/interactivity options you select,
the dataframe and the scatter plot update accordingly. As an additional layer, I wanted the scatter plot to be efficient for comparing sentiments, so the data related
to a point appears when it is hovered over.
For the second plot, I wanted the expert to easily view the dataframe and use it as a secondary reference to the scatter plot, since the table format gives a better
view for making insights: you're able to see all the columns better. Additionally, I adjusted the column order to make it as efficient as possible. As an additional layer, there is a count of how many
data points are showing, and it updates according to the filters.
For the interactivity, I wanted two types of filters: the multiselect first lets the expert easily manage the points, and then the slider filters the points again by sentiment score.
Overall, I am happy with the plots, but if I had more time, I would definitely load and analyze more data points, which is easy to do by just changing the number of rows to parse and
letting the program run unattended. However, I just don't know how long that would take; I already explored options like the tqdm module to add a progress bar, but it doesn't really work in the terminal locally for me.
'''
st.markdown(multi2)
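# Note on the progress-bar issue mentioned above: tqdm.notebook is built for
# Jupyter widgets, which is likely why it degrades in a plain terminal. The
# standard import works there instead (a minimal sketch, same loop as above):
#
# from tqdm import tqdm
# for i, row in tqdm(df.iterrows(), total=len(df)):
#     ...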