# Feelings to Emoji: Technical Reference

This document provides technical details about the implementation of the Feelings to Emoji application.

## Project Structure

The application is organized into several Python modules:

- `app.py` - Main application file with Gradio interface
- `emoji_processor.py` - Core processing logic for emoji matching
- `config.py` - Configuration settings
- `utils.py` - Utility functions
- `generate_embeddings.py` - Standalone tool to pre-generate embeddings

## Embedding Models

The system uses the following sentence embedding models from the Sentence Transformers library:

| Model Key | Model ID | Parameters | Description |
|-----------|----------|------|-------------|
| mpnet | all-mpnet-base-v2 | 110M | Balanced, great general-purpose model |
| gte | thenlper/gte-large | 335M | Context-rich, good for emotion & nuance |
| bge | BAAI/bge-large-en-v1.5 | 350M | Tuned for ranking & high-precision similarity |

## Emoji Matching Algorithm

The application uses cosine similarity between sentence embeddings to match text with emojis (see the sketch after this list):

1. For each emoji category (emotion and event):
   - Embed descriptions using the selected model
   - Calculate cosine similarity between the input text embedding and each emoji description embedding
   - Return the emoji with the highest similarity score

2. The embeddings are pre-computed and cached to improve performance:
   - Stored as pickle files in the `embeddings/` directory
   - Generated using `generate_embeddings.py`
   - Loaded at startup to minimize processing time
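
A minimal sketch of the matching step, assuming the Sentence Transformers library; the description entries below are illustrative placeholders, not lines from the real dictionaries:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

# Illustrative description entries; the real ones come from the emoji data files
emotion_emojis = {"😀": "happy, joyful, smiling", "😢": "sad, tearful, crying"}
emoji_embeddings = {e: model.encode(d) for e, d in emotion_emojis.items()}

text_embedding = model.encode("I'm feeling happy today!")

# Return the emoji whose description embedding is closest to the input text
best = max(
    emoji_embeddings,
    key=lambda e: util.cos_sim(text_embedding, emoji_embeddings[e]).item(),
)
print(best)  # 😀
```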

## Module Reference

### `config.py`

Contains configuration settings including:

- `CONFIG`: Dictionary with basic application settings (model name, file paths, etc.)
- `EMBEDDING_MODELS`: Dictionary defining the available embedding models
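
A hypothetical sketch of the two dictionaries, inferred from the model table above; every key name here is an assumption, not the real schema:

```python
# Assumed shapes; the actual keys in config.py may differ
CONFIG = {
    "default_model": "mpnet",         # hypothetical key
    "embeddings_dir": "embeddings/",  # hypothetical key
}

EMBEDDING_MODELS = {
    "mpnet": {"id": "all-mpnet-base-v2"},
    "gte": {"id": "thenlper/gte-large"},
    "bge": {"id": "BAAI/bge-large-en-v1.5"},
}
```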

### `utils.py`

Utility functions including:

- `setup_logging()`: Configures application logging
- `kitchen_txt_to_dict(filepath)`: Parses emoji dictionary files
- `save_embeddings_to_pickle(embeddings, filepath)`: Saves embeddings to pickle files
- `load_embeddings_from_pickle(filepath)`: Loads embeddings from pickle files
- `get_embeddings_pickle_path(model_id, emoji_type)`: Generates consistent paths for embedding files
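
A sketch of how these helpers might compose, assuming they wrap the standard `pickle` module; the call pattern is inferred from the signatures above, and the toy embedding is purely illustrative:

```python
from utils import (
    get_embeddings_pickle_path,
    load_embeddings_from_pickle,
    save_embeddings_to_pickle,
)

path = get_embeddings_pickle_path("all-mpnet-base-v2", "emotion")
save_embeddings_to_pickle({"😀": [0.1, 0.2, 0.3]}, path)  # toy embedding vector
restored = load_embeddings_from_pickle(path)
```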

### `emoji_processor.py`

Core processing logic:

- `EmojiProcessor`: Main class for emoji matching and processing
  - `__init__(model_name=None, model_key=None, use_cached_embeddings=True)`: Initializes the processor with a specific model
  - `load_emoji_dictionaries(emotion_file, item_file)`: Loads emoji dictionaries from text files
  - `switch_model(model_key)`: Switches to a different embedding model
  - `sentence_to_emojis(sentence)`: Processes text to find matching emojis and generate the mashup image
  - `find_top_emojis(embedding, emoji_embeddings, top_n=1)`: Finds top matching emojis using cosine similarity
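
A hedged sketch of using `find_top_emojis` to get multiple candidates via `top_n`; the `model` and `emotion_embeddings` attribute names on the processor are assumptions made for illustration and should be checked against the class:

```python
from emoji_processor import EmojiProcessor

processor = EmojiProcessor()  # default model (mpnet)
processor.load_emoji_dictionaries()

# Attribute names below are hypothetical; check the class for the real ones
embedding = processor.model.encode("I'm nervous about the interview")
top_three = processor.find_top_emojis(embedding, processor.emotion_embeddings, top_n=3)
```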

### `app.py`

Gradio interface:

- `EmojiMashupApp`: Main application class
  - `create_interface()`: Creates the Gradio interface
  - `process_with_model(model_selection, text, use_cached_embeddings)`: Processes text with the selected model
  - `get_random_example()`: Gets a random example sentence for demonstration
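
A plausible way to launch the app, assuming `create_interface()` returns a standard Gradio object (whose `launch()` method is the usual Gradio entry point):

```python
from app import EmojiMashupApp

app = EmojiMashupApp()
demo = app.create_interface()  # assumed to return a gr.Blocks / gr.Interface
demo.launch()
```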

### `generate_embeddings.py`

Standalone utility to pre-generate embeddings:

- `generate_embeddings_for_model(model_key, model_info)`: Generates embeddings for a specific model
- `main()`: Main function that processes all models and saves embeddings
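
The cache can presumably be (re)built by running the script directly (`python generate_embeddings.py`); a programmatic equivalent, based on the functions listed above, would be:

```python
from config import EMBEDDING_MODELS
from generate_embeddings import generate_embeddings_for_model

# Regenerate the embedding cache for every configured model
for model_key, model_info in EMBEDDING_MODELS.items():
    generate_embeddings_for_model(model_key, model_info)
```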

## Emoji Data Files

- `google-emoji-kitchen-emotion.txt`: Emotion emojis with descriptions
- `google-emoji-kitchen-item.txt`: Event/object emojis with descriptions
- `google-emoji-kitchen-compatible.txt`: Compatibility information for emoji combinations
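
The parsing contract for these files lives in `kitchen_txt_to_dict`; as a purely hypothetical illustration, each line presumably pairs an emoji with a comma-separated description, e.g. `😀 happy, joyful, smiling`.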

## Embedding Cache Structure

The `embeddings/` directory contains pre-generated embeddings in pickle format:

- `[model_id]_emotion.pkl`: Embeddings for emotion emojis
- `[model_id]_event.pkl`: Embeddings for event/object emojis
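
For example, with the mpnet model the cache would contain `embeddings/all-mpnet-base-v2_emotion.pkl` and `embeddings/all-mpnet-base-v2_event.pkl`, assuming the model ID maps directly into the filename (IDs containing `/`, such as `thenlper/gte-large`, would presumably be sanitized by `get_embeddings_pickle_path`).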

## API Usage Examples

### Using the EmojiProcessor Directly

```python
from emoji_processor import EmojiProcessor

# Initialize with default model (mpnet)
processor = EmojiProcessor()
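# Load the emoji dictionaries (paths presumably default to the bundled txt files)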
processor.load_emoji_dictionaries()

# Process a sentence
emotion, event, image = processor.sentence_to_emojis("I'm feeling happy today!")
print(f"Emotion emoji: {emotion}")
print(f"Event emoji: {event}")
# image contains the PIL Image object of the mashup
```

### Switching Models

```python
# Switch to a different model
processor.switch_model("gte")

# Process with the new model
emotion, event, image = processor.sentence_to_emojis("I'm feeling anxious about tomorrow.")
```

## Performance Considerations

- Embedding generation is computationally intensive but only happens once per model
- Using cached embeddings significantly improves response time
- Larger models (GTE, BGE) may provide better accuracy but require more resources
- The MPNet model offers a good balance of performance and accuracy for most use cases