dragonities committed on
Commit
c884791
·
1 Parent(s): 2aa262e

Initial commit for Toxic Detection project

Files changed (1)
  1. ai_portfolio.py +274 -0
ai_portfolio.py ADDED
@@ -0,0 +1,274 @@
+ # -*- coding: utf-8 -*-
+ """ai-portfolio.ipynb
+
+ Automatically generated by Colab.
+
+ Original file is located at
+     https://colab.research.google.com/drive/1XN71Q8R5ctujwjQB0XsGHB7KBp4hP6wR
+
+ # Project: Portfolio - Final Project
+
+ **Instructions for Students:**
+
+ Please carefully follow these steps to complete and submit your assignment:
+
+ 1. **Completing the Assignment**: You are required to work on and complete all tasks in the provided assignment. Be disciplined and ensure that you thoroughly engage with each task.
+
+ 2. **Creating a Google Drive Folder**: If you don't already have a folder for collecting assignments, you must create a new folder in your Google Drive. This will be a repository for all your completed assignment files, helping you keep your work organized and easy to access.
+
+ 3. **Uploading Completed Assignment**: Upon completion of your assignment, make sure to upload all necessary files, including code, reports, and related documents, into the created Google Drive folder. Save this link in the 'Student Identity' section and also provide it as the last parameter in the `submit` function that has been provided.
+
+ 4. **Sharing Folder Link**: You're required to share the link to your assignment Google Drive folder. This is crucial for the submission and evaluation of your assignment.
+
+ 5. **Setting Permission to Public**: Please make sure your **Google Drive folder is set to public**. This allows your instructor to access your solutions and assess your work correctly.
+
+ Adhering to these procedures will facilitate a smooth assignment process for you and the reviewers.
+
+ **Description:**
+
+ Welcome to your final portfolio project assignment for AI Bootcamp. This is your chance to put all the skills and knowledge you've learned throughout the bootcamp into action by creating a real-world AI application.
+
+ You have the freedom to create any application or model, be it text-based, image-based, voice-based, or multimodal.
+
+ To get you started, here are some ideas:
+
+ 1. **Sentiment Analysis Application:** Develop an application that can determine sentiment (positive, negative, neutral) from text data like reviews or social media posts. You can use Natural Language Processing (NLP) libraries like NLTK or TextBlob, or more advanced pre-trained models from the transformers library by Hugging Face, for your sentiment analysis model.
+
+ 2. **Chatbot:** Design a chatbot serving a specific purpose such as customer service for a certain industry, a personal fitness coach, or a study helper. Libraries like ChatterBot or Dialogflow can assist in designing conversational agents.
+
+ 3. **Predictive Text Application:** Develop a model that suggests the next word or sentence, similar to predictive text on smartphone keyboards. You could use the transformers library by Hugging Face, which includes pre-trained models like GPT-2.
+
+ 4. **Image Classification Application:** Create a model to distinguish between different types of flowers or fruits. For this type of image classification task, pre-trained models like ResNet or VGG from PyTorch or TensorFlow can be utilized.
+
+ 5. **News Article Classifier:** Develop a text classification model that categorizes news articles into predefined categories. NLTK, SpaCy, and sklearn are valuable libraries for text pre-processing, feature extraction, and building classification models.
+
+ 6. **Recommendation System:** Create a simplified recommendation system, for instance a book or movie recommender based on user preferences. Python's Surprise library can assist in building effective recommendation systems.
+
+ 7. **Plant Disease Detection:** Develop a model to identify diseases in plants using leaf images. This project requires a good understanding of convolutional neural networks (CNNs) and image processing. PyTorch, TensorFlow, and OpenCV are all great tools to use.
+
+ 8. **Facial Expression Recognition:** Develop a model to classify human facial expressions. This involves complex feature extraction and classification algorithms. You might want to leverage deep learning libraries like TensorFlow or PyTorch, along with OpenCV for processing facial images.
+
+ 9. **Chest X-Ray Interpretation:** Develop a model to detect abnormalities in chest X-ray images. This task may require understanding of specific features in such images. Again, TensorFlow and PyTorch for deep learning, and libraries like scikit-image or PIL for image processing, could be of use.
+
+ 10. **Food Classification:** Develop a model to classify a variety of foods such as local Indonesian food. Pre-trained models like ResNet or VGG from PyTorch or TensorFlow can be a good starting point.
+
+ 11. **Traffic Sign Recognition:** Design a model to recognize different traffic signs. This project has real-world applicability in self-driving car technology. Once more, you might utilize PyTorch or TensorFlow for the deep learning aspect, and OpenCV for image processing tasks.
+
+ **Submission:**
+
+ Please upload both your model and application to Hugging Face or your own GitHub account for submission.
+
+ **Presentation:**
+
+ You are required to create a presentation to showcase your project, including the following details:
+
+ - The objective of your model.
+ - A comprehensive description of your model.
+ - The specific metrics used to measure your model's effectiveness.
+ - A brief overview of the dataset used, including its source, pre-processing steps, and any insights.
+ - An explanation of the methodology used in developing the model.
+ - A discussion of the challenges faced, how they were handled, and what you learned from them.
+ - Suggestions for potential future improvements to the model.
+ - A functioning link to a demo of your model in action.
+
+ **Grading:**
+
+ Submissions will be manually graded, with a select few given the opportunity to present their projects in front of a panel of judges. This will provide valuable feedback, further enhancing your project and expanding your knowledge base.
+
+ Remember, consistent practice is the key to mastering these concepts. Apply your knowledge, ask questions when in doubt, and above all, enjoy the process. Best of luck to you all!
+ """
+
+
+ # Commented out IPython magic to ensure Python compatibility.
+ # %pip install rggrader
+
+
+ """## Working Space"""
+
+ import nltk
+ nltk.download('wordnet')
+ nltk.download('omw-1.4')  # Supports multilingual antonyms
+
+ """## Submit Notebook"""
+
+ import random
+ from transformers import pipeline
+ import string
+ from nltk.corpus import wordnet
+ import nltk
+
+ # Download WordNet resources
+ nltk.download("wordnet")
+ nltk.download("omw-1.4")
+
+ # Load GPT-2 to generate replacement words
+ text_generator = pipeline("text-generation", model="gpt2")
+
+ # Load pretrained hate speech detection model
+ hate_speech_classifier = pipeline("text-classification", model="unitary/toxic-bert")
+
+ # Confidence threshold for detecting toxicity
+ CONFIDENCE_THRESHOLD = 0.5
+
+ # Initialize toxic counter
+ toxic_counter = {"count": 0}
+
+ # File path for storing the negative-to-positive mapping
+ filepath = "extended_negative_to_positive_words.txt"
+
+ # List of positive words used as a fallback
+ positive_words = ["kind", "friendly", "smart", "brilliant", "amazing", "wonderful", "great", "excellent"]
+
+ # Find an antonym using WordNet
+ def find_opposite(word):
+     antonyms = []
+     for syn in wordnet.synsets(word):
+         for lemma in syn.lemmas():
+             if lemma.antonyms():  # Check whether an antonym exists
+                 antonyms.append(lemma.antonyms()[0].name())
+     return antonyms[0] if antonyms else None
+
+ # Generate a random replacement word using GPT-2
+ def generate_random_antonym(word):
+     prompt = f"Generate a random positive word to replace the toxic word '{word}':"
+     try:
+         response = text_generator(prompt, max_new_tokens=5, truncation=True, num_return_sequences=1)
+         generated_text = response[0]['generated_text']
+         # Take the first word of the generated output
+         random_antonym = generated_text.split(":")[-1].strip().split()[0]
+         # Validate that the result is purely alphabetic
+         if random_antonym.isalpha():
+             return random_antonym
+         else:
+             return random.choice(positive_words)
+     except Exception as e:
+         print(f"Error in generating random antonym for '{word}': {e}")
+         # Fall back to a random positive word
+         return random.choice(positive_words)
+
+ # Load the negative-to-positive mapping from a file
+ def load_neg_to_pos_map(filepath):
+     neg_to_pos_map = {}
+     try:
+         with open(filepath, "r") as file:
+             for line_number, line in enumerate(file, start=1):
+                 if line.strip():  # Skip empty lines
+                     parts = line.strip().split(":")
+                     if len(parts) == 2:  # Make sure the format is correct
+                         neg, pos = parts
+                         neg_to_pos_map[neg.strip().lower()] = pos.strip()
+                     else:
+                         print(f"Warning: Invalid format on line {line_number}: {line.strip()}")
+     except FileNotFoundError:
+         # No mapping file yet (e.g. first run): start with an empty map
+         pass
+     return neg_to_pos_map
+
+ # Append a new pair to the mapping file
+ def update_neg_to_pos_file(filepath, word, opposite_word):
+     with open(filepath, "a") as file:
+         file.write(f"{word} : {opposite_word}\n")
+
+ # Swap the matched word while keeping its surrounding punctuation
+ def swap_word(word, clean_word, replacement):
+     # clean_word is lowercased, so search the lowercased token for it
+     start = word.lower().find(clean_word)
+     if start == -1:
+         return replacement
+     return word[:start] + replacement + word[start + len(clean_word):]
+
+ # Replace toxic words
+ def replace_toxic_words(text, neg_to_pos_map, filepath="extended_negative_to_positive_words.txt"):
+     words = text.split()
+     replaced_words = []
+     updates = []
+     unresolved = []
+
+     for word in words:
+         # Strip punctuation from the word
+         clean_word = word.strip(string.punctuation).lower()
+         if not clean_word:
+             # Punctuation-only token: nothing to classify
+             replaced_words.append(word)
+             continue
+
+         # Use the model to detect toxicity
+         result = hate_speech_classifier(clean_word)
+         label = result[0]['label']
+         confidence = result[0]['score']
+
+         if "toxic" in label.lower() and confidence >= CONFIDENCE_THRESHOLD:
+             # If the word is toxic, check whether a replacement already exists
+             if clean_word in neg_to_pos_map:
+                 replacement = neg_to_pos_map[clean_word]
+             else:
+                 # Look up an antonym or generate one at random
+                 replacement = find_opposite(clean_word) or generate_random_antonym(clean_word)
+                 if not (replacement and replacement.isalpha()):
+                     # On failure, fall back to a random positive word
+                     replacement = random.choice(positive_words)
+                 neg_to_pos_map[clean_word] = replacement
+                 update_neg_to_pos_file(filepath, clean_word, replacement)
+                 updates.append((clean_word, replacement))
+             replaced_words.append(swap_word(word, clean_word, replacement))
+         else:
+             # Non-toxic words are kept as-is
+             replaced_words.append(word)
+
+     return " ".join(replaced_words), updates, unresolved
+
+ # Detect toxic text and paraphrase it
+ def detect_and_paraphrase_with_ban(text, neg_to_pos_map, filepath="extended_negative_to_positive_words.txt"):
+     # Check whether the user is already banned
+     if toxic_counter["count"] >= 3:
+         return "You have been banned for submitting toxic content multiple times. Please refresh to try again."
+
+     # Detect toxic content
+     result = hate_speech_classifier(text)
+     label = result[0]['label']
+     confidence = result[0]['score']
+
+     detection_info = f"Detection: {label} (Confidence: {confidence:.2f})\n"
+
+     # If the text is detected as toxic
+     if "toxic" in label.lower() and confidence >= CONFIDENCE_THRESHOLD:
+         toxic_counter["count"] += 1
+         detection_info += "Detected toxic content. Rewriting...\n"
+
+         if toxic_counter["count"] >= 3:
+             return "You have been banned for submitting toxic content multiple times. Please refresh to try again."
+
+         # Replace toxic words
+         rewritten_text, updates, unresolved = replace_toxic_words(text, neg_to_pos_map, filepath)
+
+         # Log the changes and any unresolved words
+         if updates:
+             detection_info += "Updates made:\n" + "\n".join(
+                 [f"- '{word}' updated with antonym '{opposite}'" for word, opposite in updates]
+             ) + "\n"
+         if unresolved:
+             detection_info += "Unresolved words (no antonyms found): " + ", ".join(unresolved) + "\n"
+
+         return detection_info + f"Rewritten Text: {rewritten_text}"
+     else:
+         detection_info += "Content is not toxic or confidence is too low.\n"
+         return detection_info + f"Original Text: {text}"
+
+ # Load the negative-to-positive map
+ neg_to_pos_map = load_neg_to_pos_map(filepath)
+
+ import gradio as gr
+
+ # Function for Gradio
+ def detect_and_rewrite_chatbot(input_text):
+     global neg_to_pos_map
+     if not neg_to_pos_map:
+         neg_to_pos_map = load_neg_to_pos_map(filepath)
+     return detect_and_paraphrase_with_ban(input_text, neg_to_pos_map, filepath)
+
+ # Build the Gradio interface
+ with gr.Blocks() as chatbot_interface:
+     gr.Markdown("## Toxicity Detection")
+     with gr.Row():
+         input_text = gr.Textbox(label="Input Text", placeholder="Type something...", lines=2)
+         output_text = gr.Textbox(label="Output Text", interactive=False)
+     submit_button = gr.Button("Submit")
+     submit_button.click(detect_and_rewrite_chatbot, inputs=input_text, outputs=output_text)
+
+ # Launch Gradio
+ if __name__ == "__main__":
+     chatbot_interface.launch()
+
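Note on the mapping file: `update_neg_to_pos_file` appends lines in the form `word : opposite`, and `load_neg_to_pos_map` splits each line on `:` and lowercases the key. The standalone sketch below illustrates that round trip without loading any model. The names `append_pair` and `load_map` and the file name `mapping.txt` are hypothetical stand-ins, and treating a missing file as an empty map is an assumption of this sketch.

```python
import os
import tempfile


def append_pair(path, word, opposite):
    """Append one 'word : opposite' line, the format the script writes."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{word} : {opposite}\n")


def load_map(path):
    """Read 'neg : pos' lines into a dict; later lines override earlier ones."""
    mapping = {}
    if not os.path.exists(path):  # assumption: a missing file means an empty map
        return mapping
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split(":")
            if len(parts) == 2:  # skip blank or malformed lines
                neg, pos = parts
                mapping[neg.strip().lower()] = pos.strip()
    return mapping


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mapping.txt")
    empty = load_map(path)             # file does not exist yet
    append_pair(path, "Rude", "polite")
    append_pair(path, "bad", "good")
    roundtrip = load_map(path)         # keys come back lowercased

print(empty)
print(roundtrip)
```

Because the separator is a bare `:`, a word or replacement that itself contains a colon would be rejected as malformed on reload, which may be worth keeping in mind if the file is ever edited by hand.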