Commit c884791 · Parent: 2aa262e
Initial commit for Toxic Detection project

ai_portfolio.py ADDED (+274 -0)
@@ -0,0 +1,274 @@
# -*- coding: utf-8 -*-
"""ai-portfolio.ipynb

Automatically generated by Colab.

Original file is located at
    https://colab.research.google.com/drive/1XN71Q8R5ctujwjQB0XsGHB7KBp4hP6wR

# Project: Portfolio - Final Project

**Instructions for Students:**

Please carefully follow these steps to complete and submit your assignment:

1. **Completing the Assignment**: You are required to work on and complete all tasks in the provided assignment. Be disciplined and ensure that you thoroughly engage with each task.

2. **Creating a Google Drive Folder**: If you don't already have a folder for collecting assignments, you must create a new folder in your Google Drive. This will be a repository for all your completed assignment files, helping you keep your work organized and easy to access.

3. **Uploading Completed Assignment**: Upon completion of your assignment, make sure to upload all necessary files, including code, reports, and related documents, into the created Google Drive folder. Save this link in the 'Student Identity' section and also provide it as the last parameter of the provided `submit` function.

4. **Sharing Folder Link**: You are required to share the link to your assignment's Google Drive folder. This is crucial for the submission and evaluation of your assignment.

5. **Setting Permission to Public**: Please make sure your **Google Drive folder is set to public**. This allows your instructor to access your solutions and assess your work correctly.

Adhering to these procedures will facilitate a smooth assignment process for you and the reviewers.

**Description:**

Welcome to your final portfolio project assignment for AI Bootcamp. This is your chance to put all the skills and knowledge you've learned throughout the bootcamp into action by creating a real-world AI application.

You have the freedom to create any application or model: text-based, image-based, voice-based, or even multimodal.

To get you started, here are some ideas:

1. **Sentiment Analysis Application:** Develop an application that can determine sentiment (positive, negative, neutral) from text data like reviews or social media posts. You can use Natural Language Processing (NLP) libraries like NLTK or TextBlob, or more advanced pre-trained models from the Hugging Face transformers library, for your sentiment analysis model.

2. **Chatbot:** Design a chatbot serving a specific purpose, such as customer service for a certain industry, a personal fitness coach, or a study helper. Libraries like ChatterBot or Dialogflow can assist in designing conversational agents.

3. **Predictive Text Application:** Develop a model that suggests the next word or sentence, similar to predictive text on smartphone keyboards. You could use the Hugging Face transformers library, which includes pre-trained models like GPT-2.

4. **Image Classification Application:** Create a model to distinguish between different types of flowers or fruits. For this type of image classification task, pre-trained models like ResNet or VGG from PyTorch or TensorFlow can be utilized.

5. **News Article Classifier:** Develop a text classification model that categorizes news articles into predefined categories. NLTK, spaCy, and scikit-learn are valuable libraries for text pre-processing, feature extraction, and building classification models.

6. **Recommendation System:** Create a simplified recommendation system, for instance a book or movie recommender based on user preferences. Python's Surprise library can assist in building effective recommendation systems.

7. **Plant Disease Detection:** Develop a model to identify diseases in plants using leaf images. This project requires a good understanding of convolutional neural networks (CNNs) and image processing. PyTorch, TensorFlow, and OpenCV are all great tools to use.

8. **Facial Expression Recognition:** Develop a model to classify human facial expressions. This involves complex feature extraction and classification algorithms. You might want to leverage deep learning libraries like TensorFlow or PyTorch, along with OpenCV for processing facial images.

9. **Chest X-Ray Interpretation:** Develop a model to detect abnormalities in chest X-ray images. This task may require understanding of specific features in such images. Again, TensorFlow and PyTorch for deep learning, and libraries like scikit-image or PIL for image processing, could be of use.

10. **Food Classification:** Develop a model to classify a variety of foods, such as local Indonesian food. Pre-trained models like ResNet or VGG from PyTorch or TensorFlow can be a good starting point.

11. **Traffic Sign Recognition:** Design a model to recognize different traffic signs. This project has real-world applicability in self-driving car technology. Once more, you might utilize PyTorch or TensorFlow for the deep learning aspect, and OpenCV for image processing tasks.

**Submission:**

Please upload both your model and application to Hugging Face or your own GitHub account for submission.

**Presentation:**

You are required to create a presentation to showcase your project, including the following details:

- The objective of your model.
- A comprehensive description of your model.
- The specific metrics used to measure your model's effectiveness.
- A brief overview of the dataset used, including its source, pre-processing steps, and any insights.
- An explanation of the methodology used in developing the model.
- A discussion of the challenges faced, how they were handled, and your learnings from them.
- Suggestions for potential future improvements to the model.
- A functioning link to a demo of your model in action.

**Grading:**

Submissions will be manually graded, with a select few given the opportunity to present their projects in front of a panel of judges. This will provide valuable feedback, further enhancing your project and expanding your knowledge base.

Remember, consistent practice is the key to mastering these concepts. Apply your knowledge, ask questions when in doubt, and above all, enjoy the process. Best of luck to you all!
"""
# Commented out IPython magic to ensure Python compatibility.
# %pip install rggrader

"""## Working Space"""

import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')  # To support multilingual antonyms

"""## Submit Notebook"""
import random
import string

import nltk
from nltk.corpus import wordnet
from transformers import pipeline

# Download WordNet resources
nltk.download("wordnet")
nltk.download("omw-1.4")

# Load GPT-2 to generate replacement words
text_generator = pipeline("text-generation", model="gpt2")

# Load a pretrained hate speech detection model
hate_speech_classifier = pipeline("text-classification", model="unitary/toxic-bert")

# Confidence threshold for flagging toxicity
CONFIDENCE_THRESHOLD = 0.5

# Initialize the toxic-submission counter
toxic_counter = {"count": 0}

# File path for the negative-to-positive word mapping
filepath = "extended_negative_to_positive_words.txt"

# List of positive words used as a fallback
positive_words = ["kind", "friendly", "smart", "brilliant", "amazing", "wonderful", "great", "excellent"]
|
122 |
+
# Fungsi untuk mencari antonim menggunakan WordNet
|
123 |
+
def find_opposite(word):
|
124 |
+
antonyms = []
|
125 |
+
for syn in wordnet.synsets(word):
|
126 |
+
for lemma in syn.lemmas():
|
127 |
+
if lemma.antonyms(): # Cek apakah ada antonim
|
128 |
+
antonyms.append(lemma.antonyms()[0].name())
|
129 |
+
return antonyms[0] if antonyms else None
|
130 |
+
|
# Generate a random replacement word using GPT-2
def generate_random_antonym(word):
    prompt = f"Generate a random positive word to replace the toxic word '{word}':"
    try:
        response = text_generator(prompt, max_new_tokens=5, truncation=True, num_return_sequences=1)
        generated_text = response[0]['generated_text']
        # Take the first word of the generated completion
        random_antonym = generated_text.split(":")[-1].strip().split()[0]
        # Validate that the result is purely alphabetic
        if random_antonym.isalpha():
            return random_antonym
        else:
            return random.choice(positive_words)
    except Exception as e:
        print(f"Error in generating random antonym for '{word}': {e}")
        # Fall back to a random positive word
        return random.choice(positive_words)
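The parsing above takes the first whitespace-separated token after the prompt's final colon and keeps it only if it is purely alphabetic. That logic can be exercised without loading GPT-2 by stubbing a completion string; `extract_candidate` and the stub texts below are illustrative, not part of the project:

```python
import random

positive_words = ["kind", "friendly", "smart"]

def extract_candidate(generated_text):
    # Mirror generate_random_antonym's parsing: first token after the
    # last colon, validated with isalpha(); otherwise fall back.
    candidate = generated_text.split(":")[-1].strip().split()[0]
    return candidate if candidate.isalpha() else random.choice(positive_words)

# A clean completion: the first token after the colon is alphabetic
print(extract_candidate("toxic word 'ugly': gentle and warm"))  # gentle

# A messy completion: "42," fails isalpha(), so a fallback word is chosen
print(extract_candidate("toxic word 'ugly': 42, maybe") in positive_words)  # True
```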
# Load the negative-to-positive mapping from file
def load_neg_to_pos_map(filepath):
    neg_to_pos_map = {}
    with open(filepath, "r") as file:
        for line_number, line in enumerate(file, start=1):
            if line.strip():  # Skip empty lines
                parts = line.strip().split(":")
                if len(parts) == 2:  # Make sure the format is correct
                    neg, pos = parts
                    neg_to_pos_map[neg.strip().lower()] = pos.strip()
                else:
                    print(f"Warning: Invalid format on line {line_number}: {line.strip()}")
    return neg_to_pos_map

# Append a new pair to the mapping file
def update_neg_to_pos_file(filepath, word, opposite_word):
    with open(filepath, "a") as file:
        file.write(f"{word} : {opposite_word}\n")
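The file format these two helpers agree on is one `negative : positive` pair per line, with keys lower-cased on load. A minimal sketch of that round trip against a throwaway file (the path is arbitrary):

```python
import os
import tempfile

# Write two "negative : positive" pairs in the same format
# update_neg_to_pos_file produces, then parse them back the way
# load_neg_to_pos_map does.
path = os.path.join(tempfile.mkdtemp(), "neg_to_pos.txt")
with open(path, "a") as f:
    f.write("ugly : beautiful\n")
    f.write("Stupid : smart\n")

mapping = {}
with open(path) as f:
    for line in f:
        if line.strip():
            parts = line.strip().split(":")
            if len(parts) == 2:
                neg, pos = parts
                mapping[neg.strip().lower()] = pos.strip()

print(mapping)  # {'ugly': 'beautiful', 'stupid': 'smart'}
```

Note that "Stupid" is stored under the lower-cased key, which matches the lower-cased lookup done during replacement.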
# Replace toxic words in the text
def replace_toxic_words(text, neg_to_pos_map, filepath="extended_negative_to_positive_words.txt"):
    words = text.split()
    replaced_words = []
    updates = []
    unresolved = []

    for word in words:
        # Strip punctuation from the word
        clean_word = word.strip(string.punctuation).lower()

        # Use the model to detect toxicity
        result = hate_speech_classifier(clean_word)
        label = result[0]['label']
        confidence = result[0]['score']

        if "toxic" in label.lower() and confidence >= CONFIDENCE_THRESHOLD:
            # The word is toxic; check whether a replacement already exists
            if clean_word in neg_to_pos_map:
                replacement = neg_to_pos_map[clean_word]
                replaced_words.append(word.replace(clean_word, replacement))
            else:
                # Look up an antonym, or generate one at random
                antonym = find_opposite(clean_word) or generate_random_antonym(clean_word)
                if antonym and antonym.isalpha():  # Validate the replacement
                    neg_to_pos_map[clean_word] = antonym
                    update_neg_to_pos_file(filepath, clean_word, antonym)
                    updates.append((clean_word, antonym))
                    replaced_words.append(word.replace(clean_word, antonym))
                else:
                    # No valid antonym found: record it, then fall back to a random positive word
                    unresolved.append(clean_word)
                    fallback_word = random.choice(positive_words)
                    neg_to_pos_map[clean_word] = fallback_word
                    update_neg_to_pos_file(filepath, clean_word, fallback_word)
                    updates.append((clean_word, fallback_word))
                    replaced_words.append(word.replace(clean_word, fallback_word))
        else:
            # Keep non-toxic words unchanged
            replaced_words.append(word)

    return " ".join(replaced_words), updates, unresolved
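The token handling above strips punctuation only for the lookup and then substitutes inside the original token, so trailing punctuation survives; the substitution is case-sensitive, however, so a capitalized toxic word passes through unchanged. A small sketch of just that step (the mapping and `swap_word` are illustrative):

```python
import string

mapping = {"ugly": "beautiful"}

def swap_word(word, mapping):
    # Same approach as replace_toxic_words: strip punctuation for the
    # lookup, then replace inside the original token.
    clean = word.strip(string.punctuation).lower()
    if clean in mapping:
        return word.replace(clean, mapping[clean])
    return word

print(swap_word("ugly!", mapping))  # beautiful!
print(swap_word("Ugly!", mapping))  # Ugly!  (case mismatch, left as-is)
print(swap_word("nice.", mapping))  # nice.
```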
# Detect toxic text and paraphrase it, banning repeat offenders
def detect_and_paraphrase_with_ban(text, neg_to_pos_map, filepath="extended_negative_to_positive_words.txt"):
    # Check whether the user has already been banned
    if toxic_counter["count"] >= 3:
        return "You have been banned for submitting toxic content multiple times. Please refresh to try again."

    # Detect toxic content
    result = hate_speech_classifier(text)
    label = result[0]['label']
    confidence = result[0]['score']

    detection_info = f"Detection: {label} (Confidence: {confidence:.2f})\n"

    # If the text is detected as toxic
    if "toxic" in label.lower() and confidence >= CONFIDENCE_THRESHOLD:
        toxic_counter["count"] += 1
        detection_info += "Detected toxic content. Rewriting...\n"

        if toxic_counter["count"] >= 3:
            return "You have been banned for submitting toxic content multiple times. Please refresh to try again."

        # Replace the toxic words
        rewritten_text, updates, unresolved = replace_toxic_words(text, neg_to_pos_map, filepath)

        # Log the updates and any unresolved words
        if updates:
            detection_info += "Updates made:\n" + "\n".join(
                [f"- '{word}' updated with antonym '{opposite}'" for word, opposite in updates]
            ) + "\n"
        if unresolved:
            detection_info += "Unresolved words (no antonyms found): " + ", ".join(unresolved) + "\n"

        return detection_info + f"Rewritten Text: {rewritten_text}"
    else:
        detection_info += "Content is not toxic or confidence is too low.\n"
        return detection_info + f"Original Text: {text}"

# Load the negative-to-positive map
neg_to_pos_map = load_neg_to_pos_map(filepath)
import gradio as gr

# Wrapper function for Gradio
def detect_and_rewrite_chatbot(input_text):
    global neg_to_pos_map
    if not neg_to_pos_map:
        neg_to_pos_map = load_neg_to_pos_map(filepath)
    return detect_and_paraphrase_with_ban(input_text, neg_to_pos_map, filepath)

# Build the Gradio interface
with gr.Blocks() as chatbot_interface:
    gr.Markdown("## Toxicity Detection")
    with gr.Row():
        input_text = gr.Textbox(label="Input Text", placeholder="Type something...", lines=2)
        output_text = gr.Textbox(label="Output Text", interactive=False)
    submit_button = gr.Button("Submit")
    submit_button.click(detect_and_rewrite_chatbot, inputs=input_text, outputs=output_text)

# Launch Gradio
if __name__ == "__main__":
    chatbot_interface.launch()