import streamlit as st
import altair as alt
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re
from tqdm import tqdm  # console progress bar; tqdm.notebook only renders inside Jupyter
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax

# pip install torch torchvision torchaudio
# pip install transformers

st.title("Final Project Part 2 - Jason Wu | Expert Visualizations")

url = "https://www.kaggle.com/datasets/rahulgoel1106/xenophobia-on-twitter-during-covid19"
st.write("Dataset Link to Download -> [Kaggle Covid-19 Xenophobic Datatset](%s)" % url)

plt.style.use('ggplot')

multi = '''The dataset chosen is called Xenophobia and, as the name suggests, it highlights xenophobic posts on Twitter
during the early stages of Covid-19. Here we conduct sentiment analysis on it using a pretrained Twitter sentiment model.
#### To follow: '''
st.markdown('''### About:''')
st.markdown(multi)
st.code('''# run these installs in your terminal or workspace
pip install torch torchvision torchaudio # required to run the trained model
pip install transformers # provides the tokenizer and model classes''')

df = pd.read_csv('Xenophobia.csv', encoding='latin1', nrows=5000)
cols_to_drop = ['status_id', 'created_at', 'location']
df.drop(cols_to_drop, axis=1, inplace=True)

# Convert text to string type
df['text'] = df['text'].astype(str)

st.markdown('''#### Loading Data & Removing Unwanted Data: ''')
st.code('''df = pd.read_csv('Xenophobia.csv', encoding='latin1', nrows=5000)
cols_to_drop = ['status_id', 'created_at', 'location']
df.drop(cols_to_drop, axis=1, inplace=True)''')

multi1 = ''' #### Next Steps:
The next step is to run the sentiment analysis on the dataset. Since the analysis takes a long time to run, I am only testing 5,000 of the millions of rows.
1. The first step is to initialize the model by loading it from Hugging Face'''

st.markdown(multi1)
st.code('''MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)''')
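
st.markdown('''Note: the first `from_pretrained` call downloads the model weights from the Hugging Face Hub and caches them locally, so the initial run needs an internet connection.''')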

st.markdown('''2. Then we have to clean the text by stripping non-ASCII characters and extra whitespace''')
st.code(r'''def clean_text(text):
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)  # remove non-ASCII characters
    text = re.sub(r'\s+', ' ', text).strip()    # collapse extra whitespace
    return text''')
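
st.markdown('''For example (illustrative input, not from the dataset), `clean_text("Stay safe 😷   everyone!")` returns `"Stay safe everyone!"` - the emoji is non-ASCII, so it is replaced, and the extra spaces collapse to one.''')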
st.markdown('''3. Lastly, we run the model on the cleaned text and write the sentiment scores to a new CSV file for later use''')

st.code('''def examine_text(example):
    try:  # guard against texts the model fails to process
        encoded_text = tokenizer(  # tokenize, truncating/padding to the model's 512-token limit
            example,
            return_tensors='pt', 
            truncation=True,
            max_length=512,
            padding="max_length"
        )
        output = model(**encoded_text)
        scores = output.logits[0].detach().numpy()
        scores = softmax(scores)  # softmax converts the raw logits into probabilities that sum to 1
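        # e.g. softmax([1.0, 2.0, 3.0]) ~ [0.090, 0.245, 0.665] (illustrative values summing to 1)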
        return {
            'neg': scores[0],
            'neu': scores[1],
            'pos': scores[2]
        }
    except Exception as e: # log and skip any text that fails
        print(f"Error processing text: {example}\nError: {e}")
        return None ''')
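
st.markdown('''On a single tweet, `examine_text` returns the three class probabilities as a dict, e.g. roughly `{'neg': 0.02, 'neu': 0.11, 'pos': 0.87}` for a clearly positive text (illustrative numbers, not a real output).''')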

st.code('''results = []

# Process each text
for i, row in tqdm(df.iterrows(), total=len(df)):
    text = clean_text(row['text'])
    scores = examine_text(text)
    if scores:
        # Append scores to results
        results.append({'index': i, 'neg': scores['neg'], 'neu': scores['neu'], 'pos': scores['pos']})
    else:
        print(f"Skipped problematic text: {text}")

# Convert results to a DataFrame
results_df = pd.DataFrame(results)

# Save to CSV
results_df.to_csv('sentiment_scores.csv', index=False)
print("Saved sentiment scores to 'sentiment_scores.csv'")
# prints when done - it took me 20 minutes for 5,000 rows, so imagine a million''')
# MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
# tokenizer = AutoTokenizer.from_pretrained(MODEL)
# model = AutoModelForSequenceClassification.from_pretrained(MODEL)

# # Clean Text Function
# def clean_text(text):
#     """Remove non-ASCII characters and excess whitespace."""
#     text = re.sub(r'[^\x00-\x7F]+', ' ', text)  # Remove non-ASCII
#     text = re.sub(r'\s+', ' ', text).strip()    # Remove excess whitespace
#     return text


# Sentiment analysis function
# def examine_text(example):
#     try:
#         encoded_text = tokenizer(
#             example,
#             return_tensors='pt',
#             truncation=True,
#             max_length=512,
#             padding="max_length"
#         )
#         output = model(**encoded_text)
#         scores = output.logits[0].detach().numpy()
#         scores = softmax(scores)
#         return {
#             'neg': scores[0],
#             'neu': scores[1],
#             'pos': scores[2]
#         }
#     except Exception as e:
#         print(f"Error processing text: {example}\nError: {e}")
#         return None


# Prepare list for saving results
# results = []

# # Process each text
# for i, row in tqdm(df.iterrows(), total=len(df)):
#     text = clean_text(row['text'])
#     scores = examine_text(text)
#     if scores:
#         # Append scores to results
#         results.append({'index': i, 'neg': scores['neg'], 'neu': scores['neu'], 'pos': scores['pos']})
#     else:
#         print(f"Skipped problematic text: {text}")

# # Convert results to a DataFrame
# results_df = pd.DataFrame(results)

# # Save to CSV
# results_df.to_csv('sentiment_scores.csv', index=False)
# print("Saved sentiment scores to 'sentiment_scores.csv'")


st.markdown(''' ### Plotting in Altair
We then load the sentiment scores from the CSV file, merge them back into the DataFrame, and build the plots''')
st.code('''# Load sentiment scores
sentiment_scores = pd.read_csv('sentiment_scores.csv')
df = df.reset_index().merge(sentiment_scores, on='index')''')

# Load sentiment scores
sentiment_scores = pd.read_csv('sentiment_scores.csv')
df = df.reset_index().merge(sentiment_scores, on='index')
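# reset_index() exposes the original row index as an 'index' column, matching the
# 'index' key saved by the scoring loop, so the merge re-attaches each row's scores.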

# Clean text function (same as shown above)
def clean_text(text):
    """Remove non-ASCII characters and collapse excess whitespace."""
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)  # remove non-ASCII
    text = re.sub(r'\s+', ' ', text).strip()    # collapse whitespace
    return text

df['cleaned_text'] = df['text'].apply(clean_text)

# Determine the highest sentiment score for each row
df['highest_score'] = df[['neg', 'neu', 'pos']].max(axis=1)
df['sentiment_type'] = df[['neg', 'neu', 'pos']].idxmax(axis=1)  # neg/neu/pos as categories
df['sentiment_type'] = df['sentiment_type'].replace({
    'neg': 'Negative',
    'neu': 'Neutral',
    'pos': 'Positive'
})
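
# Illustrative example (toy values, not from the dataset): a row with neg=0.10,
# neu=0.25, pos=0.65 gets highest_score=0.65 and sentiment_type='Positive'.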

# Filters
sentiment_filter = st.multiselect(
    "Select Sentiment Types to Display:",
    options=['Negative', 'Neutral', 'Positive'],
    default=['Negative', 'Neutral', 'Positive']
)
score_filter = st.slider(
    "Select Minimum Sentiment Score:",
    min_value=0.0000,
    max_value=1.0000,
    value=0.0000,
    step=0.0001
)

# Filter the DataFrame to only include points that meet criteria
filtered_df = df[
    (df['sentiment_type'].isin(sentiment_filter)) &  # match the selected sentiment types
    (df['highest_score'] >= score_filter)            # meet the minimum score threshold
]

filtered_counts = filtered_df['sentiment_type'].value_counts()

# Generate a summary message for the counts
filtered_summary = (
    f"**Filtered DataFrame:**\n"
    f"- **Negative Sentiments Count:** {filtered_counts.get('Negative', 0)}\n"
    f"- **Neutral Sentiments Count:** {filtered_counts.get('Neutral', 0)}\n"
    f"- **Positive Sentiments Count:** {filtered_counts.get('Positive', 0)}"
)


# Create a brush to link scatter plot and bar chart
brush = alt.selection_interval(encodings=['x', 'y'])
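# The brushed interval on the scatter plot is applied to the bar chart below via
# transform_filter(brush), so the bars aggregate only the currently brushed points.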

# Scatter plot with brush
scatter_plot = alt.Chart(filtered_df).mark_circle(size=60).encode(
    x=alt.X('index:Q', title='Index'),
    y=alt.Y('highest_score:Q', title='Highest Sentiment Score'),
    color=alt.Color('sentiment_type:N', title='Sentiment Type', scale=alt.Scale(scheme='tableau20')),
    tooltip=['index', 'sentiment_type', 'highest_score', 'cleaned_text', 'text']
).add_params(
    brush
).properties(
    width=800,
    height=400,
    title="Scatter Plot of Sentiment Scores (Filtered) - Brush Feature to show Bar Chart"
)  # drag is reserved for the brush; pan/zoom via .interactive() would conflict with it

# Bar chart linked to scatter plot
bar_chart = alt.Chart(filtered_df).transform_filter(
    brush
).transform_filter(
    alt.FieldOneOfPredicate(field='sentiment_type', oneOf=sentiment_filter)  # Apply multiselect filter
).transform_aggregate(
    total_score='sum(highest_score)',  # Aggregate the highest_score
    groupby=['sentiment_type']         # Group by sentiment type
).mark_bar().encode(
    x=alt.X('sentiment_type:N', title='Sentiment Type'),
    y=alt.Y('total_score:Q', title='Sum of Highest Scores'),
    color=alt.Color('sentiment_type:N', scale=alt.Scale(scheme='tableau20'))
    
).properties(
    width=800,
    height=200,
    title="Bar Chart of Sentiment Sums (Linked to Scatter Plot)"
)

# Combine scatter and bar charts
combined_chart = alt.vconcat(
    scatter_plot,
    bar_chart
)

# Display the combined chart
st.altair_chart(combined_chart, use_container_width=True)

# Display the filtered DataFrame and counts
st.write(filtered_summary)
st.dataframe(
    filtered_df[['sentiment_type', 'cleaned_text', 'highest_score', 'text']]
)

st.header('''Contextual Dataset''')
st.image("amazon.jpeg")
url1 = "https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews"
st.write("Dataset Link to Download -> [Kaggle Amazon Reviews Dataset](%s)" % url1)

multi4 = '''The dataset I chose as a contextual dataset is the Amazon Reviews dataset on Kaggle. I chose it for its similar nature, covering reviews of products instead of tweets. The dataset provides more sentiment data, for products, and could
be used to fine-tune the model to be more effective at identifying xenophobic tweets. In the future, I want to train the model in order to do better analysis on the Twitter dataset.'''

st.markdown(multi4)
st.header('''Write Up''')
multi2 = '''As mentioned in the beginning, as an Asian American I wanted to highlight the xenophobic tweets during
Covid-19, and using a trained sentiment analysis model to analyze and visualize the tweets was an instant idea
when I found the dataset. In the first plot, I decided to use a scatter plot to better visualize all the tweets, plotting each tweet's highest sentiment score.
To do that, I had to compare the negative, neutral, and positive scores and find the highest one. Then, depending on which filters you select,
the dataframe and the scatter plot update accordingly. As an additional layer, to make the scatter plot efficient for comparing sentiments, the data related
to a point appears when you hover over it.

For the second plot, I wanted the expert to easily view the dataframe and use it as a secondary reference to the scatter plot for making insights, because in a table format
you're able to see all the columns better. Additionally, I adjusted the column order to make it as efficient as possible. As an additional layer, there is a count of how many
data points are showing, and it updates according to the filters.

For the interactivity, I wanted two types of filters: the multiselect first lets the expert easily manage the points, and the slider then filters the points again by sentiment score.

Overall, I am happy with the plots, but if I had more time I would definitely load and analyze more data points, which is easy to do by just increasing the number of rows to parse and
letting the program run unattended. However, I don't know how long that would take; I already explored options like the tqdm module to add a progress bar, but it doesn't really work in the terminal locally for me (likely because I was importing the notebook version).'''
st.markdown(multi2)