reab5555 committed
Commit a18de95 · verified · 1 Parent(s): 24a4384

Update app.py

Files changed (1)
  1. app.py +104 -97
app.py CHANGED
@@ -54,10 +54,15 @@ def process_and_show_completion(video_input_path, anomaly_threshold_input, fps,
54
  return [error_message] + [None] * 16
55
 
56
  def on_button_click(video, threshold, fps):
57
- # Display execution time immediately
58
- yield {execution_time: gr.update(visible=True, value=0)}
59
-
60
  start_time = time.time()
61
  results = process_and_show_completion(video, threshold, fps)
62
  end_time = time.time()
63
  exec_time = end_time - start_time
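
> The body of `process_and_show_completion` is not shown in this hunk, but the Description text later in this diff attributes the facial pipeline to MTCNN (face extraction) and InceptionResnetV1 (feature embeddings), with the `fps` parameter controlling how many frames are analyzed. Purely as a hedged illustration of that per-frame embedding step (function and variable names here are hypothetical, not taken from app.py):
>
> ```python
> # Illustrative sketch only: sample frames at a target FPS and embed each detected face.
> # Assumes facenet-pytorch and OpenCV are installed; names are placeholders.
> import cv2
> import torch
> from facenet_pytorch import MTCNN, InceptionResnetV1
>
> mtcnn = MTCNN(image_size=160, margin=0)                    # face detector / cropper
> resnet = InceptionResnetV1(pretrained='vggface2').eval()   # 512-d face embeddings
>
> def extract_face_embeddings(video_path, target_fps=10):
>     cap = cv2.VideoCapture(video_path)
>     native_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
>     step = max(1, round(native_fps / target_fps))           # keep every Nth frame
>     embeddings, frame_idx = [], 0
>     while True:
>         ok, frame = cap.read()
>         if not ok:
>             break
>         if frame_idx % step == 0:
>             rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
>             face = mtcnn(rgb)                               # cropped face tensor or None
>             if face is not None:
>                 with torch.no_grad():
>                     emb = resnet(face.unsqueeze(0))         # shape (1, 512)
>                 embeddings.append(emb.squeeze(0))
>         frame_idx += 1
>     cap.release()
>     return torch.stack(embeddings) if embeddings else torch.empty(0, 512)
> ```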
@@ -104,102 +109,104 @@ with gr.Blocks() as iface:
104
  progress_bar = gr.Progress()
105
 
106
  execution_time = gr.Number(label="Execution Time (seconds)", visible=False)
107
-
108
  with gr.Tabs() as tabs:
109
- with gr.TabItem("Description", id="description_tab") as description_tab:
 
110
  with gr.Column():
111
- gr.Markdown("""
112
- # Multimodal Behavioral Anomalies Detection
113
-
114
- The purpose of this tool is to detect anomalies in facial expressions, body language, and voice over the timeline of a video.
115
-
116
- It extracts faces, body postures, and voice from video frames, computes facial feature embeddings, posture landmarks, and speaker embeddings, and analyzes them with time series analysis, specifically a variational autoencoder (VAE), to identify anomalies.
117
-
118
- ## Applications
119
-
120
- - Identify suspicious behavior in surveillance footage.
121
- - Analyze micro-expressions.
122
- - Monitor and assess emotional states in communications.
123
- - Evaluate changes in vocal tone and speech patterns.
124
-
125
- ## Features
126
-
127
- - **Face Extraction**: Extracts faces from video frames using the MTCNN model.
128
- - **Feature Embeddings**: Extracts facial feature embeddings using the InceptionResnetV1 model.
129
- - **Body Posture Analysis**: Evaluates body postures using MediaPipe Pose.
130
- **Voice Analysis**: Extracts and segments speaker embeddings from audio using PyAnnote.
131
- **Anomaly Detection**: Uses a Variational Autoencoder (VAE) to detect anomalies in facial expressions, body postures, and voice features over time.
132
- - **Visualization**: Represents changes in facial expressions, body postures, and vocal tone over time, marking anomaly key points.
133
-
134
- <img src="appendix/Anomay Detection.png" width="1050" alt="alt text">
135
-
136
- ## InceptionResnetV1
137
- The InceptionResnetV1 model is a deep convolutional neural network used for facial recognition and facial attribute extraction.
138
-
139
- - **Accuracy and Reliability**: Pre-trained on the VGGFace2 dataset, it achieves high accuracy in recognizing and differentiating between faces.
140
- - **Feature Richness**: The embeddings capture rich facial details, essential for recognizing subtle expressions and variations.
141
- - **Global Recognition**: Widely adopted in various facial recognition applications, demonstrating reliability and robustness across different scenarios.
142
-
143
- ## MediaPipe Pose
144
- MediaPipe Pose is a versatile machine learning library designed for high-accuracy, real-time posture estimation. It uses a deep learning model to detect body landmarks and infer body posture.
145
-
146
- - **Real-Time Performance**: Capable of processing video frames at real-time speeds, making it suitable for live video analysis.
147
- - **Accuracy and Precision**: Detects 33 body landmarks, including important joints and key points, enabling detailed posture and movement analysis.
148
- - **Integration**: Easily integrates with other machine learning frameworks and tools, enhancing its versatility for various applications.
149
-
150
- ## Voice Analysis
151
- The voice analysis module involves extracting and analyzing vocal features using speaker diarization and embedding models to capture key characteristics of the speaker's voice.
152
-
153
- PyAnnote is a toolkit for speaker diarization and voice analysis.
154
- - **Speaker Diarization**: Identifies voice segments and classifies them by speaker.
155
- - **Speaker Embeddings**: Captures voice characteristics using a pre-trained embedding model.
156
-
157
- ## Variational Autoencoder (VAE)
158
- A Variational Autoencoder (VAE) is a type of neural network that learns to encode input data (like facial embeddings or posture scores) into a latent space and then reconstructs the data from this latent representation. VAEs not only learn to compress data but also to generate new data, making them particularly useful for anomaly detection.
159
-
160
- - **Probabilistic Nature**: VAEs introduce a probabilistic approach to encoding, where the encoded representations are not single fixed points but distributions. This allows the model to learn a more robust representation of the data.
161
- - **Reconstruction and Generation**: By comparing the reconstructed data to the original, VAEs can measure reconstruction errors. High errors indicate anomalies, as such data points do not conform well to the learned normal patterns.
162
-
163
- ## Setup Parameters
164
- - **Frames Per Second (FPS)**: Frames per second to analyze (lower for faster processing).
165
- - **Anomaly Detection Threshold**: Threshold for detecting anomalies (Standard Deviation).
166
-
167
- ## Micro-Expressions
168
- Paul Ekman’s work on facial expressions of emotion identified universal micro-expressions that reveal true emotions. These fleeting expressions, lasting only milliseconds, are challenging to detect but can be captured and analyzed frame by frame using computer vision algorithms.
169
-
170
- ### Micro-Expressions and Frame Rate Analysis
171
- Micro-expressions are brief, involuntary facial expressions that typically last between 1/25 and 1/5 of a second (40-200 milliseconds). To capture these fleeting expressions, a high frame rate is essential.
172
-
173
- ### 10 fps
174
-
175
- **Frame Interval**: Each frame is captured every 100 milliseconds.
176
- **Effectiveness**: Given that micro-expressions can be as brief as 40 milliseconds, a frame rate of 10 fps is insufficient. Many micro-expressions would begin and end between frames, making it highly likely that they would be missed entirely.
177
-
178
- ### 20 fps
179
-
180
- **Frame Interval**: Each frame is captured every 50 milliseconds.
181
- **Effectiveness**: While 20 fps is better than 10 fps, it is still inadequate. Micro-expressions can still occur and resolve within the 50-millisecond interval between frames, leading to incomplete or missed captures.
182
-
183
- ### High-Speed Cameras
184
-
185
- Effective capture of micro-expressions generally requires frame rates above 100 fps. High-speed video systems designed for micro-expression detection often operate at 118 fps or higher, with some systems reaching up to 200 fps.
186
-
187
- ## Limitations
188
-
189
- - **Evaluation Challenges**: Since this is an unsupervised method, there is no labeled data to compare against. This makes it difficult to quantitatively evaluate the accuracy or effectiveness of the anomaly detection.
190
- - **Subjectivity**: The concept of what constitutes an "anomaly" can be subjective and context-dependent. This can lead to false positives or negatives depending on the situation.
191
- - **Lighting and Resolution**: Variability in lighting conditions, camera resolution, and frame rate can affect the quality of detected features and postures, leading to inconsistent results.
192
- - **Audio Quality**: Background noise, poor audio quality, and overlapping speech can affect the accuracy of speaker diarization and voice embeddings.
193
- - **Generalization**: The model may not generalize well to all types of videos and contexts. For example, trained embeddings may work well for a specific demographic but poorly for another.
194
- - **Computationally Intensive**: Real-time processing of high-resolution video frames can be computationally demanding, requiring significant hardware resources.
195
- - **Micro-Expressions and Frame Rate Limitations**: Videos recorded at 10 or 20 fps are not suitable for reliably capturing micro-expressions due to their rapid onset and brief duration. Higher frame rates, typically above 100 fps, are essential to ensure that these fleeting expressions are accurately captured and analyzed.
196
-
197
- ## Conclusion
198
- This tool detects anomalies in facial expressions, body language, and voice in video, which can be useful for both forensic analysis and HUMINT operations. However, users should be aware of its limitations and the challenges inherent in unsupervised anomaly detection methodologies. By leveraging advanced computer vision techniques and the power of autoencoders, it provides crucial insights into human behavior in a timely manner, but results should be interpreted with caution and, where possible, supplemented with additional context and expert analysis.
200
- """)
201
-
202
- with gr.TabItem("Results", id="results_tab", visible=False) as results_tab:
 
203
  with gr.Tabs():
204
  with gr.TabItem("Facial Features"):
205
  video_display_facial = gr.Video(label="Input Video")
 
54
  return [error_message] + [None] * 16
55
 
56
  def on_button_click(video, threshold, fps):
 
 
 
57
  start_time = time.time()
58
+
59
+ # Show execution time and hide description tab immediately
60
+ yield {
61
+ execution_time: gr.update(visible=True, value=0),
62
+ description_tab: gr.update(visible=False),
63
+ results_tab: gr.update(visible=True)
64
+ }
65
+
66
  results = process_and_show_completion(video, threshold, fps)
67
  end_time = time.time()
68
  exec_time = end_time - start_time
 
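
> The added lines above turn `on_button_click` into a generator that first yields a dict of `gr.update(...)` changes (reveal the execution timer, hide the Description tab, show the Results tab) before the long-running analysis begins. Below is a minimal, self-contained sketch of that pattern; it mirrors the diff but uses placeholder components, and assumes a Gradio version that supports updating `gr.TabItem` visibility, as the diff itself does:
>
> ```python
> # Minimal sketch of a generator click-handler that pushes UI updates before the heavy work.
> # Placeholder components and function names; not the actual app.py wiring.
> import time
> import gradio as gr
>
> def slow_analysis(video):
>     time.sleep(2)                          # stand-in for the real processing
>     return "done"
>
> with gr.Blocks() as demo:
>     video_in = gr.Video(label="Input Video")
>     run_btn = gr.Button("Analyze")
>     execution_time = gr.Number(label="Execution Time (seconds)", visible=False)
>     with gr.Tabs():
>         description_tab = gr.TabItem("Description", visible=True)
>         with description_tab:
>             gr.Markdown("Tool description goes here.")
>         results_tab = gr.TabItem("Results", visible=False)
>         with results_tab:
>             result_box = gr.Textbox(label="Result")
>
>     def on_click(video):
>         start = time.time()
>         # First yield: flip visibility immediately so the user sees progress.
>         yield {
>             execution_time: gr.update(visible=True, value=0),
>             description_tab: gr.update(visible=False),
>             results_tab: gr.update(visible=True),
>         }
>         result = slow_analysis(video)
>         # Final yield: fill in the result and the measured wall-clock time.
>         yield {
>             execution_time: gr.update(value=round(time.time() - start, 2)),
>             description_tab: gr.update(visible=False),
>             results_tab: gr.update(visible=True),
>             result_box: result,
>         }
>
>     run_btn.click(on_click, inputs=video_in,
>                   outputs=[execution_time, description_tab, results_tab, result_box])
>
> demo.launch()
> ```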
109
  progress_bar = gr.Progress()
110
 
111
  execution_time = gr.Number(label="Execution Time (seconds)", visible=False)
112
+
113
  with gr.Tabs() as tabs:
114
+ description_tab = gr.TabItem("Description", visible=True)
115
+ with description_tab:
116
  with gr.Column():
117
+ gr.Markdown("""
118
+ # Multimodal Behavioral Anomalies Detection
119
+
120
+ The purpose of this tool is to detect anomalies in facial expressions, body language, and voice over the timeline of a video.
121
+
122
+ It extracts faces, body postures, and voice from video frames, computes facial feature embeddings, posture landmarks, and speaker embeddings, and analyzes them with time series analysis, specifically a variational autoencoder (VAE), to identify anomalies.
123
+
124
+ ## Applications
125
+
126
+ - Identify suspicious behavior in surveillance footage.
127
+ - Analyze micro-expressions.
128
+ - Monitor and assess emotional states in communications.
129
+ - Evaluate changes in vocal tone and speech patterns.
130
+
131
+ ## Features
132
+
133
+ - **Face Extraction**: Extracts faces from video frames using the MTCNN model.
134
+ - **Feature Embeddings**: Extracts facial feature embeddings using the InceptionResnetV1 model.
135
+ - **Body Posture Analysis**: Evaluates body postures using MediaPipe Pose.
136
+ - **Voice Analysis**: Extracts and segments speaker embeddings from audio using PyAnnote.
137
+ - **Anomaly Detection**: Uses a Variational Autoencoder (VAE) to detect anomalies in facial expressions, body postures, and voice features over time.
138
+ - **Visualization**: Represents changes in facial expressions, body postures, and vocal tone over time, marking anomaly key points.
139
+
140
+ <img src="appendix/Anomay Detection.png" width="1050" alt="alt text">
141
+
142
+ ## InceptionResnetV1
143
+ The InceptionResnetV1 model is a deep convolutional neural network used for facial recognition and facial attribute extraction.
144
+
145
+ - **Accuracy and Reliability**: Pre-trained on the VGGFace2 dataset, it achieves high accuracy in recognizing and differentiating between faces.
146
+ - **Feature Richness**: The embeddings capture rich facial details, essential for recognizing subtle expressions and variations.
147
+ - **Global Recognition**: Widely adopted in various facial recognition applications, demonstrating reliability and robustness across different scenarios.
148
+
149
+ ## MediaPipe Pose
150
+ MediaPipe Pose is a versatile machine learning library designed for high-accuracy, real-time posture estimation. It uses a deep learning model to detect body landmarks and infer body posture.
151
+
152
+ - **Real-Time Performance**: Capable of processing video frames at real-time speeds, making it suitable for live video analysis.
153
+ - **Accuracy and Precision**: Detects 33 body landmarks, including important joints and key points, enabling detailed posture and movement analysis.
154
+ - **Integration**: Easily integrates with other machine learning frameworks and tools, enhancing its versatility for various applications.
155
+
156
+ ## Voice Analysis
157
+ The voice analysis module involves extracting and analyzing vocal features using speaker diarization and embedding models to capture key characteristics of the speaker's voice.
158
+
159
+ PyAnnote is a toolkit for speaker diarization and voice analysis.
160
+ - **Speaker Diarization**: Identifies voice segments and classifies them by speaker.
161
+ - **Speaker Embeddings**: Captures voice characteristics using a pre-trained embedding model.
162
+
163
+ ## Variational Autoencoder (VAE)
164
+ A Variational Autoencoder (VAE) is a type of neural network that learns to encode input data (like facial embeddings or posture scores) into a latent space and then reconstructs the data from this latent representation. VAEs not only learn to compress data but also to generate new data, making them particularly useful for anomaly detection.
165
+
166
+ - **Probabilistic Nature**: VAEs introduce a probabilistic approach to encoding, where the encoded representations are not single fixed points but distributions. This allows the model to learn a more robust representation of the data.
167
+ - **Reconstruction and Generation**: By comparing the reconstructed data to the original, VAEs can measure reconstruction errors. High errors indicate anomalies, as such data points do not conform well to the learned normal patterns.
168
+
169
+ ## Setup Parameters
170
+ - **Frames Per Second (FPS)**: Frames per second to analyze (lower for faster processing).
171
+ - **Anomaly Detection Threshold**: Threshold for detecting anomalies (Standard Deviation).
172
+
173
+ ## Micro-Expressions
174
+ Paul Ekman’s work on facial expressions of emotion identified universal micro-expressions that reveal true emotions. These fleeting expressions, lasting only milliseconds, are challenging to detect but can be captured and analyzed frame by frame using computer vision algorithms.
175
+
176
+ ### Micro-Expressions and Frame Rate Analysis
177
+ Micro-expressions are brief, involuntary facial expressions that typically last between 1/25 and 1/5 of a second (40-200 milliseconds). To capture these fleeting expressions, a high frame rate is essential.
178
+
179
+ ### 10 fps
180
+
181
+ - **Frame Interval**: Each frame is captured every 100 milliseconds.
182
+ - **Effectiveness**: Given that micro-expressions can be as brief as 40 milliseconds, a frame rate of 10 fps is insufficient. Many micro-expressions would begin and end between frames, making it highly likely that they would be missed entirely.
183
+
184
+ ### 20 fps
185
+
186
+ - **Frame Interval**: Each frame is captured every 50 milliseconds.
187
+ - **Effectiveness**: While 20 fps is better than 10 fps, it is still inadequate. Micro-expressions can still occur and resolve within the 50-millisecond interval between frames, leading to incomplete or missed captures.
188
+
189
+ ### High-Speed Cameras
190
+
191
+ Effective capture of micro-expressions generally requires frame rates above 100 fps. High-speed video systems designed for micro-expression detection often operate at 118 fps or higher, with some systems reaching up to 200 fps.
192
+
193
+ ## Limitations
194
+
195
+ - **Evaluation Challenges**: Since this is an unsupervised method, there is no labeled data to compare against. This makes it difficult to quantitatively evaluate the accuracy or effectiveness of the anomaly detection.
196
+ - **Subjectivity**: The concept of what constitutes an "anomaly" can be subjective and context-dependent. This can lead to false positives or negatives depending on the situation.
197
+ - **Lighting and Resolution**: Variability in lighting conditions, camera resolution, and frame rate can affect the quality of detected features and postures, leading to inconsistent results.
198
+ - **Audio Quality**: Background noise, poor audio quality, and overlapping speech can affect the accuracy of speaker diarization and voice embeddings.
199
+ - **Generalization**: The model may not generalize well to all types of videos and contexts. For example, trained embeddings may work well for a specific demographic but poorly for another.
200
+ - **Computationally Intensive**: Real-time processing of high-resolution video frames can be computationally demanding, requiring significant hardware resources.
201
+ - **Micro-Expressions and Frame Rate Limitations**: Videos recorded at 10 or 20 fps are not suitable for reliably capturing micro-expressions due to their rapid onset and brief duration. Higher frame rates, typically above 100 fps, are essential to ensure that these fleeting expressions are accurately captured and analyzed.
202
+
203
+ ## Conclusion
204
+ This tool detects anomalies in facial expressions, body language, and voice in video, which can be useful for both forensic analysis and HUMINT operations. However, users should be aware of its limitations and the challenges inherent in unsupervised anomaly detection methodologies. By leveraging advanced computer vision techniques and the power of autoencoders, it provides crucial insights into human behavior in a timely manner, but results should be interpreted with caution and, where possible, supplemented with additional context and expert analysis.
206
+ """)
207
+
208
+ results_tab = gr.TabItem("Results", visible=False)
209
+ with results_tab:
210
  with gr.Tabs():
211
  with gr.TabItem("Facial Features"):
212
  video_display_facial = gr.Video(label="Input Video")
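
> The Description tab above attributes anomaly detection to a VAE whose reconstruction error is thresholded in units of standard deviation (the "Anomaly Detection Threshold" parameter). As a hedged sketch of that idea only (not the model in app.py; layer sizes, loss weighting, and names are made up), a per-frame scoring step could look like this:
>
> ```python
> # Illustrative only: a tiny VAE over per-frame feature vectors, flagging frames whose
> # reconstruction error exceeds mean + threshold * std. Not the actual model from app.py.
> import torch
> import torch.nn as nn
>
> class TinyVAE(nn.Module):
>     def __init__(self, dim=512, latent=16):
>         super().__init__()
>         self.enc = nn.Sequential(nn.Linear(dim, 128), nn.ReLU())
>         self.mu = nn.Linear(128, latent)
>         self.logvar = nn.Linear(128, latent)
>         self.dec = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(), nn.Linear(128, dim))
>
>     def forward(self, x):
>         h = self.enc(x)
>         mu, logvar = self.mu(h), self.logvar(h)
>         z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
>         return self.dec(z), mu, logvar
>
> def train_and_score(features, threshold=3.0, epochs=200, lr=1e-3):
>     """features: (num_frames, dim) tensor; returns per-frame errors and anomaly mask."""
>     model = TinyVAE(dim=features.shape[1])
>     opt = torch.optim.Adam(model.parameters(), lr=lr)
>     for _ in range(epochs):
>         recon, mu, logvar = model(features)
>         recon_loss = ((recon - features) ** 2).mean()
>         kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
>         loss = recon_loss + 1e-3 * kld
>         opt.zero_grad()
>         loss.backward()
>         opt.step()
>     with torch.no_grad():
>         recon, _, _ = model(features)
>         errors = ((recon - features) ** 2).mean(dim=1)             # per-frame error
>     # Frames whose error exceeds mean + threshold * std are flagged as anomalies.
>     anomalies = errors > errors.mean() + threshold * errors.std()
>     return errors, anomalies
> ```
>
> For example, `train_and_score(torch.randn(300, 512), threshold=3.0)` flags the frames whose reconstruction error lies more than three standard deviations above the mean, matching the "Standard Deviation" interpretation of the threshold parameter.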