import streamlit as st
import altair as alt
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re
from tqdm.notebook import tqdm
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax
# pip install torch torchvision torchaudio
# pip install transformers
st.title("Final Project Part 2 - Jason Wu | Expert Visualizations")
url = "https://www.kaggle.com/datasets/rahulgoel1106/xenophobia-on-twitter-during-covid19"
st.write("Dataset Link to Download -> [Kaggle Covid-19 Xenophobia Dataset](%s)" % url)
plt.style.use('ggplot')
multi = '''The dataset chosen is called Xenophobia and, true to its name, it highlights xenophobic posts on Twitter
during the beginning stages of Covid-19. Today we are conducting sentiment analysis on it using a pretrained Twitter sentiment model.
#### To follow: '''
st.markdown('''### About:''')
st.markdown(multi)
st.code('''# pip install these packages into your terminal or workspace
pip install torch torchvision torchaudio # to work with the trained model
pip install transformers # to work with the trained model''')
df = pd.read_csv('Xenophobia.csv', encoding='latin1', nrows=5000)
cols_to_drop = ['status_id', 'created_at', 'location']
df.drop(cols_to_drop, axis=1, inplace=True)
# Convert text to string type
df['text'] = df['text'].astype(str)
st.markdown('''#### Loading Data & Removing Unwanted Data: ''')
st.code('''df = pd.read_csv('Xenophobia.csv', encoding='latin1', nrows=5000)
cols_to_drop = ['status_id', 'created_at', 'location']
df.drop(cols_to_drop, axis=1, inplace=True)''')
multi1 = ''' #### Next Steps:
The next step is to run the sentiment analysis on the dataset; however, the analysis takes a long time to run, so I am only going to test 5000 rows out of the millions available.
1. The first step is to initialize the model and load it from Hugging Face'''
st.markdown(multi1)
st.code('''MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)''')
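# Note: this checkpoint classifies each text into three classes, ordered
# negative / neutral / positive, which is why the scores are read out of
# indices 0, 1, and 2 further below.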
st.markdown('''2. Then we have to clean the text, stripping non-ASCII characters and extra whitespace''')
st.code(r'''def clean_text(text):
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text''')
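# For example (illustrative input, not from the dataset):
# clean_text("Fight 😷  the   virus") -> "Fight the virus"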
st.markdown('''3. Lastly, we run the model on the cleaned text and save the sentiment scores to a new csv file for later use''')
st.code(r'''def examine_text(example):
    try: # Use try/except to handle errors such as text that fails to encode
        encoded_text = tokenizer( # tokenize with truncation/padding to 512 tokens
            example,
            return_tensors='pt',
            truncation=True,
            max_length=512,
            padding="max_length"
        )
        output = model(**encoded_text)
        scores = output.logits[0].detach().numpy()
        scores = softmax(scores) # softmax turns the raw logits into probabilities that sum to 1
        return {
            'neg': scores[0],
            'neu': scores[1],
            'pos': scores[2]
        }
    except Exception as e: # handling errors
        print(f"Error processing text: {example}\nError: {e}")
        return None''')
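# Illustrative sanity check of softmax (made-up logits, not model output):
# it turns [1.0, 2.0, 3.0] into roughly [0.09, 0.245, 0.665], and the
# resulting probabilities always sum to 1.
_demo = softmax(np.array([1.0, 2.0, 3.0]))
assert abs(_demo.sum() - 1.0) < 1e-6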
st.code('''results = []
# Process each text
for i, row in tqdm(df.iterrows(), total=len(df)):
    text = clean_text(row['text'])
    scores = examine_text(text)
    if scores:
        # Append scores to results
        results.append({'index': i, 'neg': scores['neg'], 'neu': scores['neu'], 'pos': scores['pos']})
    else:
        print(f"Skipped problematic text: {text}")
# Convert results to a DataFrame
results_df = pd.DataFrame(results)
# Save to CSV
results_df.to_csv('sentiment_scores.csv', index=False)
print("Saved sentiment scores to 'sentiment_scores.csv'")
# prints out when done - took me 20 minutes for 5000 rows, so imagine a million''')
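# (A possible speed-up, sketched as an assumption rather than tested code:
# the tokenizer accepts a list of texts, so scoring tweets in batches with
# one model(**encoded_batch) call per batch avoids the per-row overhead.)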
# One-time preprocessing pipeline (kept commented out because it takes ~20
# minutes for 5000 rows): uncomment to regenerate 'sentiment_scores.csv'.
# MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
# tokenizer = AutoTokenizer.from_pretrained(MODEL)
# model = AutoModelForSequenceClassification.from_pretrained(MODEL)
# # Clean text function
# def clean_text(text):
#     """Remove non-ASCII characters and excess whitespace."""
#     text = re.sub(r'[^\x00-\x7F]+', ' ', text)  # Remove non-ASCII
#     text = re.sub(r'\s+', ' ', text).strip()  # Remove excess whitespace
#     return text
# # Sentiment analysis function
# def examine_text(example):
#     try:
#         encoded_text = tokenizer(
#             example,
#             return_tensors='pt',
#             truncation=True,
#             max_length=512,
#             padding="max_length"
#         )
#         output = model(**encoded_text)
#         scores = output.logits[0].detach().numpy()
#         scores = softmax(scores)
#         return {
#             'neg': scores[0],
#             'neu': scores[1],
#             'pos': scores[2]
#         }
#     except Exception as e:
#         print(f"Error processing text: {example}\nError: {e}")
#         return None
# # Prepare list for saving results
# results = []
# # Process each text
# for i, row in tqdm(df.iterrows(), total=len(df)):
#     text = clean_text(row['text'])
#     scores = examine_text(text)
#     if scores:
#         # Append scores to results
#         results.append({'index': i, 'neg': scores['neg'], 'neu': scores['neu'], 'pos': scores['pos']})
#     else:
#         print(f"Skipped problematic text: {text}")
# # Convert results to a DataFrame
# results_df = pd.DataFrame(results)
# # Save to CSV
# results_df.to_csv('sentiment_scores.csv', index=False)
# print("Saved sentiment scores to 'sentiment_scores.csv'")
st.markdown(''' ### Plotting in Altair
We then load the sentiment scores back in from the csv file, merge them into the DataFrame, and create plots with them''')
st.code('''# Load sentiment scores
sentiment_scores = pd.read_csv('sentiment_scores.csv')
df = df.reset_index().merge(sentiment_scores, on='index')''')
# Load sentiment scores
sentiment_scores = pd.read_csv('sentiment_scores.csv')
df = df.reset_index().merge(sentiment_scores, on='index')
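# (The 'index' column saved by the preprocessing loop lines up with
# reset_index() here because both come from the same 0..n-1 row order.)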
# Clean text function
def clean_text(text):
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text
df['cleaned_text'] = df['text'].apply(clean_text)
# Determine the highest sentiment score for each row
df['highest_score'] = df[['neg', 'neu', 'pos']].max(axis=1)
df['sentiment_type'] = df[['neg', 'neu', 'pos']].idxmax(axis=1)  # neg/neu/pos as categories
df['sentiment_type'] = df['sentiment_type'].replace({
    'neg': 'Negative',
    'neu': 'Neutral',
    'pos': 'Positive'
})
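# For example, a row with neg=0.10, neu=0.25, pos=0.65 gets
# highest_score=0.65 and sentiment_type='Positive'.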
# Filters
sentiment_filter = st.multiselect(
    "Select Sentiment Types to Display:",
    options=['Negative', 'Neutral', 'Positive'],
    default=['Negative', 'Neutral', 'Positive']
)
score_filter = st.slider(
    "Select Minimum Sentiment Score:",
    min_value=0.0000,
    max_value=1.0000,
    value=0.0000,
    step=0.0001
)
# Filter the DataFrame to only include points that meet the criteria
filtered_df = df[
    (df['sentiment_type'].isin(sentiment_filter)) &  # Match selected sentiment types
    (df['highest_score'] >= score_filter)  # Match slider score threshold
]
filtered_counts = filtered_df['sentiment_type'].value_counts()
# Generate a summary message for the counts
filtered_summary = (
    f"**Filtered DataFrame:**\n"
    f"- **Negative Sentiments Count:** {filtered_counts.get('Negative', 0)}\n"
    f"- **Neutral Sentiments Count:** {filtered_counts.get('Neutral', 0)}\n"
    f"- **Positive Sentiments Count:** {filtered_counts.get('Positive', 0)}"
)
# Create a brush to link the scatter plot and bar chart
brush = alt.selection_interval(encodings=['x', 'y'])
# Scatter plot with brush
scatter_plot = alt.Chart(filtered_df).mark_circle(size=60).encode(
    x=alt.X('index:Q', title='Index'),
    y=alt.Y('highest_score:Q', title='Highest Sentiment Score'),
    color=alt.Color('sentiment_type:N', title='Sentiment Type', scale=alt.Scale(scheme='tableau20')),
    tooltip=['index', 'sentiment_type', 'highest_score', 'cleaned_text', 'text']
).add_params(
    brush
).properties(
    width=800,
    height=400,
    title="Scatter Plot of Sentiment Scores (Filtered) - Brush Feature to Show Bar Chart"
).interactive()
# Bar chart linked to the scatter plot
bar_chart = alt.Chart(filtered_df).transform_filter(
    brush
).transform_filter(
    alt.FieldOneOfPredicate(field='sentiment_type', oneOf=sentiment_filter)  # Apply multiselect filter
).transform_aggregate(
    total_score='sum(highest_score)',  # Aggregate the highest_score
    groupby=['sentiment_type']  # Group by sentiment type
).mark_bar().encode(
    x=alt.X('sentiment_type:N', title='Sentiment Type'),
    y=alt.Y('total_score:Q', title='Sum of Highest Scores'),
    color=alt.Color('sentiment_type:N', scale=alt.Scale(scheme='tableau20'))
).properties(
    width=800,
    height=200,
    title="Bar Chart of Sentiment Sums (Linked to Scatter Plot)"
)
# Combine scatter and bar charts
combined_chart = alt.vconcat(
    scatter_plot,
    bar_chart
)
# Display the combined chart
st.altair_chart(combined_chart, use_container_width=True)
# Display the filtered DataFrame and counts
st.write(filtered_summary)
st.dataframe(
    filtered_df[['sentiment_type', 'cleaned_text', 'highest_score', 'text']]
)
st.header('''Contextual Dataset''')
st.image("amazon.jpeg")
url1 = "https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews"
st.write("Dataset Link to Download -> [Kaggle Amazon Reviews Dataset](%s)" % url1)
multi4 = '''The dataset I chose as a contextual dataset is the Amazon Reviews dataset on Kaggle. I chose it because of its similar nature: it is also review-style text, just about products instead. The dataset provides more sentiment for products, but it could
also be used to train the model to be more effective at identifying xenophobic tweets. In the future, I want to fine-tune the model in order to do better analysis on the Twitter dataset.'''
st.markdown(multi4)
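# A rough sketch of how the Amazon reviews could later be used to fine-tune
# the model (assumptions: the column names 'text' and 'label' and the filename
# are hypothetical, and this uses the generic transformers Trainer API rather
# than anything project-specific):
#
# import torch
# from transformers import Trainer, TrainingArguments
#
# class ReviewDataset(torch.utils.data.Dataset):
#     def __init__(self, texts, labels):
#         self.enc = tokenizer(texts, truncation=True, padding=True, max_length=512)
#         self.labels = labels
#     def __len__(self):
#         return len(self.labels)
#     def __getitem__(self, i):
#         item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
#         item['labels'] = torch.tensor(self.labels[i])
#         return item
#
# reviews = pd.read_csv('amazon_reviews.csv')  # hypothetical filename
# train_ds = ReviewDataset(reviews['text'].tolist(), reviews['label'].tolist())
# trainer = Trainer(
#     model=model,
#     args=TrainingArguments(output_dir='finetuned', num_train_epochs=1),
#     train_dataset=train_ds,
# )
# trainer.train()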
st.header('''Write Up''')
multi2 = '''As mentioned in the beginning, as an Asian American I wanted to highlight the xenophobic tweets during
Covid-19, and using a trained sentiment analysis model to analyze and visualize the tweets was an instant idea
when I found the dataset. In the first plot, I decided to use a scatter plot to better visualize all the tweets, with each tweet's highest sentiment score plotted.
To do that, I had to compare the negative, neutral, and positive scores and find the highest one. Then, depending on which filters/interactivity options you select,
the dataframe and the scatter plot update accordingly. As an additional layer, I wanted the scatter plot to be efficient for comparing sentiments, so the data related
to a point appears when it is hovered over.
For the second plot, I wanted the expert to easily view the dataframe and use it as a secondary reference to the scatter plot, since the table format gives a better
view for making insights: you're able to see all the columns better. Additionally, I adjusted the column order to make it as efficient as possible. As an additional layer, there is a count of how many
data points are showing, and it updates according to the filters.
For the interactivity, I wanted two types of filters: the multiselect first lets the expert easily manage the points, and then the slider filters the points again by sentiment score.
Overall, I am happy with the plots, but if I had more time, I would definitely load and analyze more data points, which is easy to do by just changing the number of rows to parse and
letting the program run unattended. However, I just don't know how long that would take; I already explored options like the tqdm module to add a progress bar, but it doesn't really work in the terminal locally for me.
'''
st.markdown(multi2)
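# Note on the progress-bar issue mentioned above: tqdm.notebook is built for
# Jupyter widgets, which is likely why it degrades in a plain terminal. The
# standard import works there instead (a minimal sketch, same loop as above):
#
# from tqdm import tqdm
# for i, row in tqdm(df.iterrows(), total=len(df)):
#     ...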