Spaces: Sleeping
Jason Wu committed · faa1d38 · 1 Parent(s): 636549a
finish
Browse files
app.py CHANGED
@@ -1,73 +1,250 @@
- # INSTRUCTIONS:
- # 1. Open a "Terminal" by: View --> Terminal OR just the "Terminal" through the hamburger menu
- # 2. run in terminal with: streamlit run app.py
- # 3. click the "Open in Browser" link that pops up OR click on "Ports" and copy the URL
- # 4. Open a Simple Browser with View --> Command Palette --> Simple Browser: Show
- # 5. use the URL from prior steps as input into this simple browser
-
  import streamlit as st
  import altair as alt
-
- st.text("The URL for this app is: https://huggingface.co/spaces/jwu249/is445_demo")
-
- click = alt.selection_point(encodings=["color"])
-
- # Top panel is scatter plot of temperature vs time
- points = (
-     alt.Chart()
-     .mark_point()
-     .encode(
-         alt.X("monthdate(date):T", title="Date (Month Year)"),
-         alt.Y(
-             "temp_max:Q",
-             title="Maximum Daily Temperature (C)",
-             scale=alt.Scale(domain=[-5, 40]),
-         ),
-         color=alt.condition(brush, color, alt.value("lightgray")),
-         size=alt.Size("precipitation:Q", scale=alt.Scale(range=[5, 200])),
-     )
-     .properties(width=550, height=300)
-     .add_params(brush)
-     .transform_filter(click)
- )
-
- chart = alt.vconcat(points, bars, data=source, title="Seattle Weather - 2002 to 2012")
-
- with
+ import pandas as pd
+ import matplotlib.pyplot as plt
+ import numpy as np
+ import re
+ from tqdm.notebook import tqdm
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ from scipy.special import softmax
+
+ # pip install torch torchvision torchaudio
+ # pip install transformers
+
+ st.title("Final Project Part 2 - jwu249 | Expert Visualizations")
+
+ url = "https://www.kaggle.com/datasets/rahulgoel1106/xenophobia-on-twitter-during-covid19"
+ st.write("Dataset Link to Download -> [Kaggle Covid-19 Xenophobic Dataset](%s)" % url)
+
+ plt.style.use('ggplot')
+
+ multi = '''The dataset chosen is called Xenophobia and, as the name suggests, it highlights xenophobic posts on Twitter
+ during the beginning stages of Covid-19; today we are conducting sentiment analysis on them using a trained Twitter sentiment model.
+ #### To follow: '''
+ st.markdown('''### About:''')
+ st.markdown(multi)
+ st.code('''# pip install these packages into your terminal or workspace
+ pip install torch torchvision torchaudio # to work with the trained model
+ pip install transformers # to work with the trained model''')
+
+ df = pd.read_csv('Xenophobia.csv', encoding='latin1', nrows=5000)
+ cols_to_drop = ['status_id', 'created_at', 'location']
+ df.drop(cols_to_drop, axis=1, inplace=True)
+
+ # Convert text to string type
+ df['text'] = df['text'].astype(str)
+
+ st.markdown('''#### Loading Data & Removing Unwanted Data: ''')
+ st.code('''df = pd.read_csv('Xenophobia.csv', encoding='latin1', nrows=5000)
+ cols_to_drop = ['status_id', 'created_at', 'location'] ''')
+
+ multi1 = ''' #### Next Steps:
+ The next step is to run the sentiment analysis on the dataset; however, the analysis takes a long time to run, so I am only going to test 5000 rows out of the millions of rows.
+ 1. The first step is to initialize the model and call on it from HuggingFace'''
+
+ st.markdown(multi1)
+ st.code('''MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
+ tokenizer = AutoTokenizer.from_pretrained(MODEL)
+ model = AutoModelForSequenceClassification.from_pretrained(MODEL)''')
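Rather than hard-coding which logit index is negative/neutral/positive, the model config can confirm it; `id2label` is the standard field on Hugging Face sequence-classification configs (the printed mapping below is what this model is expected to report, not something verified here):

    print(model.config.id2label)  # e.g. {0: 'negative', 1: 'neutral', 2: 'positive'}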
+
+ st.markdown('''2. Then we have to clean the text (strip non-ASCII characters and excess whitespace)''')
+ st.code('''def clean_text(text):
+     text = re.sub(r'[^\x00-\x7F]+', ' ', text)
+     text = re.sub(r'\s+', ' ', text).strip()
+     return text''')
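For concreteness, here is what that cleaner does to a sample string (a toy input made up for illustration; runs of non-ASCII characters become single spaces, then whitespace is collapsed):

    sample = "Héllo   wörld!"
    print(clean_text(sample))  # -> "H llo w rld!"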
+ st.markdown('''3. Lastly, we will have to run the model on the cleaned data and write the sentiment scores out to a new csv file to be used''')
+
+ st.code('''def examine_text(example):
+     try: # use try/except to handle errors, such as text that is too long
+         encoded_text = tokenizer( # setting conditions
+             example,
+             return_tensors='pt',
+             truncation=True,
+             max_length=512,
+             padding="max_length"
+         )
+         output = model(**encoded_text)
+         scores = output.logits[0].detach().numpy()
+         scores = softmax(scores) # softmax turns the logits into probabilities: the exponential of each element divided by the sum of the exponentials
+         return {
+             'neg': scores[0],
+             'neu': scores[1],
+             'pos': scores[2]
+         }
+     except Exception as e: # handling errors
+         print(f"Error processing text: {example}\nError: {e}")
+         return None ''')
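As a quick sanity check of that softmax step, a minimal sketch with made-up logits:

    import numpy as np
    from scipy.special import softmax

    logits = np.array([2.0, 0.5, -1.0])  # hypothetical raw model outputs
    probs = softmax(logits)              # exp(x) / sum(exp(x))
    print(probs, probs.sum())            # ~[0.79 0.18 0.04], sums to 1.0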
+
+ st.code('''results = []
+
+ # Process each text
+ for i, row in tqdm(df.iterrows(), total=len(df)):
+     text = clean_text(row['text'])
+     scores = examine_text(text)
+     if scores:
+         # Append scores to results
+         results.append({'index': i, 'neg': scores['neg'], 'neu': scores['neu'], 'pos': scores['pos']})
+     else:
+         print(f"Skipped problematic text: {text}")
+
+ # Convert results to a DataFrame
+ results_df = pd.DataFrame(results)
+
+ # Save to CSV
+ results_df.to_csv('sentiment_scores.csv', index=False)
+ print("Saved sentiment scores to 'sentiment_scores.csv'")
+ # prints out when done - took me 20 minutes for 5000 rows so imagine a million''')
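Row-by-row inference is the bottleneck behind that 20-minute runtime; a batched variant along these lines would likely be much faster. This is only a sketch under assumptions: PyTorch is installed, a batch size of 32 fits in memory, and `texts`/`examine_batch` are hypothetical names, not part of the committed app:

    import torch

    def examine_batch(texts, batch_size=32):
        all_scores = []
        for start in range(0, len(texts), batch_size):
            batch = texts[start:start + batch_size]
            encoded = tokenizer(batch, return_tensors='pt', truncation=True,
                                max_length=512, padding=True)
            with torch.no_grad():  # inference only, no gradients needed
                logits = model(**encoded).logits
            all_scores.extend(softmax(logits.numpy(), axis=1))  # each row: [neg, neu, pos]
        return all_scores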
+ st.markdown(''' ### Plotting in Altair
+ We then just load in the data from the csv file with the sentiment scores and create plots with it''')
+ st.code('''# Load sentiment scores
+ sentiment_scores = pd.read_csv('sentiment_scores.csv')
+ df = df.reset_index().merge(sentiment_scores, on='index')''')
+
+ # Load sentiment scores
+ sentiment_scores = pd.read_csv('sentiment_scores.csv')
+ df = df.reset_index().merge(sentiment_scores, on='index')
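The merge works because `reset_index()` materializes the row index as a column literally named 'index', matching the 'index' column saved into sentiment_scores.csv. A tiny illustration with made-up data:

    import pandas as pd

    left = pd.DataFrame({'text': ['a', 'b']})                    # row index 0, 1
    scores = pd.DataFrame({'index': [0, 1], 'pos': [0.9, 0.2]})
    print(left.reset_index().merge(scores, on='index'))          # columns: index, text, pos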
+
+ # Clean text function
+ def clean_text(text):
+     text = re.sub(r'[^\x00-\x7F]+', ' ', text)
+     text = re.sub(r'\s+', ' ', text).strip()
+     return text
+
+ df['cleaned_text'] = df['text'].apply(clean_text)
+
+ # Determine the highest sentiment score for each row
+ df['highest_score'] = df[['neg', 'neu', 'pos']].max(axis=1)
+ df['sentiment_type'] = df[['neg', 'neu', 'pos']].idxmax(axis=1) # neg/neu/pos as categories
+ df['sentiment_type'] = df['sentiment_type'].replace({
+     'neg': 'Negative',
+     'neu': 'Neutral',
+     'pos': 'Positive'
+ })
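To see what `max(axis=1)` and `idxmax(axis=1)` produce here, a toy example with invented scores:

    import pandas as pd

    toy = pd.DataFrame({'neg': [0.7, 0.1], 'neu': [0.2, 0.3], 'pos': [0.1, 0.6]})
    print(toy.max(axis=1))     # 0.7, 0.6  -> highest score per row
    print(toy.idxmax(axis=1))  # 'neg', 'pos' -> which column held it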
+
+ # Sidebar: Filters
+ st.sidebar.header("Filters")
+ sentiment_filter = st.sidebar.multiselect(
+     "Select Sentiment Types to Display:",
+     options=['Negative', 'Neutral', 'Positive'],
+     default=['Negative', 'Neutral', 'Positive']
+ )
+ score_filter = st.sidebar.slider(
+     "Select Minimum Sentiment Score:",
+     min_value=0.0000,
+     max_value=1.0000,
+     value=0.0000,
+     step=0.0001
+ )
+
+ # Filter the DataFrame to only include points that meet the criteria
+ filtered_df = df[
+     (df['sentiment_type'].isin(sentiment_filter)) & # match the selected sentiment types
+     (df['highest_score'] >= score_filter)           # match the slider's minimum score
+ ]
+
+ filtered_counts = filtered_df['sentiment_type'].value_counts()
+
+ # Generate a summary message for the counts
+ filtered_summary = (
+     f"**Filtered DataFrame:**\n"
+     f"- **Negative Sentiments Count:** {filtered_counts.get('Negative', 0)}\n"
+     f"- **Neutral Sentiments Count:** {filtered_counts.get('Neutral', 0)}\n"
+     f"- **Positive Sentiments Count:** {filtered_counts.get('Positive', 0)}"
+ )
+
+ # Build the scatter plot
+ scatter_plot = alt.Chart(filtered_df).mark_circle(size=60).encode(
+     x=alt.X('index:Q', title='Index'),
+     y=alt.Y('highest_score:Q', title='Highest Sentiment Score'),
+     color=alt.Color('sentiment_type:N', title='Sentiment Type', scale=alt.Scale(scheme='tableau20')),
+     tooltip=['index', 'sentiment_type', 'highest_score', 'cleaned_text', 'text']
+ ).properties(
+     width=800,
+     height=400,
+     title="Scatter Plot of Sentiment Scores (Filtered)"
+ ).interactive()
+
+ # Display the scatter plot
+ st.altair_chart(scatter_plot, use_container_width=True)
+
+ # Display the filtered DataFrame and counts
+ st.write(filtered_summary)
+ st.dataframe(
+     filtered_df[['sentiment_type', 'cleaned_text', 'highest_score', 'text']]
+ )
+
+ st.header('''Write Up''')
+ multi2 = '''As mentioned at the beginning, as an Asian American I wanted to highlight the xenophobic tweets during
+ Covid-19, and using a trained sentiment analysis model to analyze and visualize the tweets was an instant idea
+ when I found the dataset. In the first plot, I decided to use a scatter plot to better visualize all the tweets, plotting each tweet's highest sentiment score.
+ To do that, I had to compare the negative, neutral, and positive scores and find the highest one. Then, depending on which filters/interactivity options you select,
+ the dataframe and the scatter plot update accordingly. As an additional layer, I wanted the scatter plot to be efficient for comparing sentiments, so the data related
+ to a point appears when it is hovered over.
+
+ For the second plot, I wanted the expert to easily view the dataframe and use it as a secondary reference to the scatter plot, a better view for making insights, because in a table format
+ you're able to see all the columns better. Additionally, I adjusted the column order to make it as efficient as possible. As an additional layer, there is a count of how many
+ data points are showing, and it is updated according to the filters.
+
+ For the interactivity, I wanted two types of filters: the multiselect first lets the expert easily manage the points, and the slider then filters the points again by sentiment score.
+
+ Overall, I am happy with the plots, but if I had more time, I would definitely load and analyze more data points, which is an easy task: just change the number of rows to parse and
+ let the program run unattended. However, I don't know how long that would take; I already explored options like the tqdm module to add a progress bar, but it doesn't really work in the terminal locally for me.
+ '''
+ st.markdown(multi2)
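On that last point: the file imports `tqdm.notebook`, whose widget-based bar only renders inside Jupyter; the plain import usually does show a progress bar in a terminal. A minimal sketch (the loop body is a made-up stand-in for the scoring call):

    import time
    from tqdm import tqdm  # plain tqdm, not tqdm.notebook

    for _ in tqdm(range(100), desc="scoring"):
        time.sleep(0.01)   # stand-in for examine_text(...)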