jansowa committed
Commit 387f1e8 · verified · 1 Parent(s): fa79b35

Update app.py

Files changed (1)
  1. app.py +88 -49
app.py CHANGED
@@ -164,7 +164,11 @@ st.markdown("""
  tab1, tab2 = st.tabs([RESULTS_COLUMN_NAME, "Description"])

  with tab1:
- st.write("This benchmark is designed to evaluate the ability of language models to correctly interpret complex Polish texts, including sarcasm, phraseological compounds, and implicatures. Models are assessed not only on traditional sentiment analysis but also on their ability to understand and interpret more complex language forms. The focus is on how well models can uncover the intended meaning in texts that require going beyond literal word meanings to recognize deeper, context-dependent interpretations.")

  # Prepare data
  data = load_data('data.json')
@@ -364,60 +368,91 @@ with tab1:
  with tab2:
  st.markdown("""
  ### <span style='text-decoration: #FDA428 wavy underline;'>**Cause of Creation**</span>
- 1. **Need**: Models face significant challenges when dealing with understanding complex, context-reliant texts that involve meanings implied beyond the literal content of a statement. Such cases include sarcasm, implicatures, and phraseological compounds.
- Traditional sentiment classifiers typically rely on word-based features (e.g., identifying positive or negative words) to assess sentiment. However, with sarcasm, the literal meaning of words often contradicts the intended sentiment, making it difficult for models to accurately gauge tone. Sarcasm's context-dependence further complicates matters, as these classifiers typically lack the ability to grasp nuanced cues in context, especially when sarcasm is subtle.
- Similarly, classifiers struggle with implicatures, where the underlying intent is implied rather than explicitly stated. Here, models fail to capture the full sentiment because they rely heavily on surface-level words, missing the non-literal meaning that often drives the sentiment.
- Phraseological compounds add another layer of difficulty. These are fixed or semi-fixed expressions whose meanings cannot be directly inferred from the individual words. Language models, trained on word-level patterns, often misinterpret these expressions because they fail to recognize the idiomatic or non-literal meaning, leading to inaccurate sentiment analysis.
- In addition to sentiment analysis, we decided to include the understanding of more complex texts in the benchmark, which was measured by the ability to uncover the intended meaning.
  ### <span style='text-decoration: #FDA428 wavy underline;'>**Dataset Information**</span>
- The dataset contains 200 examples, all written in Polish. Each example consists of the following:
- - **Main Text**: This is a statement (often an opinion) on any topic that includes a certain type of implicature, often several simultaneously, such as sarcasm or phraseological compounds.
- - **Reference Sentiment**: The sentiment associated with the main text. We use three categories: negative, neutral, and positive. Ambiguous examples were labeled as "neutral" to exclude them from sentiment classification testing.
- - **Reference phraseological compounds**: A list of phraseological compounds found in the main text.
- - **Reference Explanation**: An explanation of the underlying intentions that the author of the main text might have had.
  ### <span style='text-decoration: #FDA428 wavy underline;'>**Evaluation Procedure**</span>
- We distinguish between two models in the evaluation process:
- - **Evaluated Model**: The model that performs specific tasks, is then assessed based on its performance, and added to a ranking.
- - **Judge Metamodel**: One of the currently strongest, most versatile LLMs.
- ### <span style='text-decoration: #FDA428 wavy underline;'>**GENERATING RESPONSES FROM THE EVALUATED MODEL**</span>
- 1. For each text in the dataset, the evaluated model was required to list the following in three points:
- - The sentiment (only positive/negative).
- - The underlying intentions of the author of the text.
- - All phraseological compounds present in the text along with their meanings in the given context.
- 2. No system prompt is used. The prompt provided to the evaluated model is written in Polish, as we are testing the models in this language. It contains:
- - **User Prompt**: 3 elements, each consisting of a header written in capital letters and content enclosed in triple quotes:
- - Information about the role of a careful linguist with extensive experience.
- - The instruction to perform the three previously described tasks.
- - The first example of a text that could be included in the dataset.
- - **Assistant Prompt**: A human-written example answer for the first example text.
- - **User Prompt**: A second example of a text that could be included in the dataset.
- - **Assistant Prompt**: A human-written example answer for the second example text.
- - **User Prompt**: The target text, based on which the evaluated model will be assessed.
- 3. The decision to split the examples into user prompts and assistant prompts was made due to the better results achieved by the vast majority of models. The two examples were selected based on diversity: one has a negative sentiment and several phraseological compounds, while the other is positive and lacks phraseological compounds.
- ### <span style='text-decoration: #FDA428 wavy underline;'>**GENERATING METAMODEL EVALUATIONS**</span>
- 1. The purpose of the metamodel is to return the following evaluations:
- - **Understanding of the Text**: A comparison of the evaluated model's response description to the reference explanation.
- - **Sentiment Analysis**: An optional evaluation, only if the reference sentiment is "positive" or "negative." We made this decision to exclude texts that people might interpret ambiguously.
- - **phraseological compounds**: The model is penalized for phrases not included in the reference phraseological compounds. In cases where there are no phraseological compounds, the highest score is awarded only if the model indicates the absence of such expressions — one point is deducted for each excess phrase until the score reaches zero.
- 2. Each evaluation is provided in JSON format. Example of a full response from the metamodel:
  ```json
  {"WYDŹWIĘK": "5"}
  {"OCENA": "4"}
  {"ZWIĄZKI": "3"}
  ```
- 3. The judge metamodel's prompt structure is similar to that of the evaluated model's prompt. No system prompt is used. The prompt includes:
- - **User Prompt**: 3 elements, each consisting of a header written in capital letters and content enclosed in triple quotes:
- - **Role**: A reliable assistant who adheres to the instructions and does not perform any other tasks, nor enters any additional text in the response.
- - **Task**: According to the description in point 1. The comparison of phraseological compounds has the most guidelines, so we noted that the model should focus on this as it is the most challenging step, and that its work will be evaluated based on this point.
- - The first example of a potential response from the evaluated model along with the references.
- - **Assistant Prompt**: An example response containing the evaluations.
- - **User Prompt**: A second example of a potential response from the evaluated model along with the references.
- - **Assistant Prompt**: An example response containing the evaluations for the second example.
- - **User Prompt**: The actual response from the evaluated model and the references on which the metamodel will base its evaluations included in the benchmark.
- 4. Here, the examples were also selected based on diversity. One includes a reference with a positive sentiment, while the other contains no reference sentiment at all (an example labeled as "neutral" in the dataset).
- 5. It is worth explaining why we chose this particular process for evaluating phraseological compounds. Initially, we intended to check only those phrases included in the reference and ignore others in the evaluation. Unfortunately, this procedure favored models that provided many phrases that were not phraseological compounds.
- Therefore, we decided to penalize models for phrases not included in the reference. We aimed to ensure that models were not penalized for providing phraseological compounds we had not included in the reference. After generating the responses, we collected phrases noted by several models and manually reviewed all references to identify phraseological compounds we might have missed.
- A similar procedure was applied to sentiment analysis—we listed all examples where several models consistently recorded a different sentiment than the reference and reconsidered whether the examples could be interpreted differently than initially assumed.
  """, unsafe_allow_html=True)

@@ -427,6 +462,8 @@ st.markdown("<hr style='border: 1px solid #A85E00;'>", unsafe_allow_html=True)
  st.markdown("""
  ### Authors:
  - [Jan Sowa](https://www.linkedin.com/in/janpiotrsowa) - leadership, writing texts, benchmark code
  - [Agnieszka Kosiak](https://www.linkedin.com/in/agn-kosiak/) - writing texts
  - [Magdalena Krawczyk](https://www.linkedin.com/in/magdalena-krawczyk-7810942ab/) - writing texts, labeling
  - [Marta Matylda Kania](https://www.linkedin.com/in/martamatyldakania/) - prompt engineering
@@ -435,8 +472,10 @@ st.markdown("""
  - [Szymon Baczyński](https://www.linkedin.com/in/szymon-baczynski/) - front-end / streamlit assistant
  - [Artur Słomowski](https://www.linkedin.com/in/arturslomowski/) - front-end / streamlit assistant
  - [Maria Filipkowska](https://www.linkedin.com/in/maria-filipkowska/) - writing text, linguistic support
  """)

  st.divider()

- # Run the app with `streamlit run your_script.py`
 
@@ -164,7 +164,11 @@ st.markdown("""
  tab1, tab2 = st.tabs([RESULTS_COLUMN_NAME, "Description"])

  with tab1:
+ st.markdown("""
+ This benchmark is designed to evaluate how well language models interpret complex Polish texts. It comprises two distinct components:
+ 1. *Implicatures*: This part evaluates models on their capacity to interpret implied meanings, including sarcasm, idiomatic expressions, and varying levels of linguistic complexity. Beyond conventional sentiment analysis, models are assessed on their ability to discern implicit meanings that extend beyond literal interpretations, which requires sensitivity to nuanced, context-dependent inferences.
+ 2. *Tricky Questions*: This part assesses a model's ability to accurately address challenging questions involving logical puzzles, semantic ambiguity, logical inconsistencies, absurdity, and humor. The emphasis is on the model's reasoning skills and its flexibility in handling unconventional linguistic constructs.
+ """)

  # Prepare data
  data = load_data('data.json')
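Note: the `load_data` helper referenced in this hunk is defined elsewhere in app.py and is not part of the change. For orientation, here is a minimal sketch of what such a loader typically looks like in a Streamlit app; the caching decorator and the assumption that `data.json` holds a flat list of result records are illustrative, not taken from the actual file:

```python
import json

import pandas as pd
import streamlit as st


@st.cache_data
def load_data(path: str) -> pd.DataFrame:
    """Read the benchmark results file once and cache it for subsequent reruns."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    # Hypothetical: assumes a flat list of per-model result records.
    return pd.DataFrame(records)
```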
 
@@ -364,60 +368,91 @@ with tab1:
  with tab2:
  st.markdown("""
  ### <span style='text-decoration: #FDA428 wavy underline;'>**Cause of Creation**</span>
+ LLMs face multiple challenges that significantly affect their practical use. This benchmark was created to evaluate two distinct but equally critical aspects of their performance:
+ #### 1. **Implicatures and Phraseological Compounds**
+ Language models frequently struggle with complex, context-dependent meanings that extend beyond the literal reading of a text. Such linguistic phenomena include sarcasm, implicatures, and idiomatic or phraseological expressions.
+ - **Sarcasm:** Traditional sentiment analysis often fails on sarcastic statements because the literal meaning directly contradicts the intended sentiment, and the strong context-dependence of sarcasm makes it particularly hard to detect.
+ - **Implicatures:** These implied, unstated meanings often elude models that rely heavily on surface-level text analysis.
+ - **Phraseological Compounds:** Fixed or semi-fixed expressions whose meanings cannot be inferred from their individual components pose additional challenges, leading to misinterpretation by models trained primarily on word-level semantics.
+ The goal of this part of the benchmark is therefore to test whether LLMs can correctly interpret implied meanings, detect sarcasm, and identify idiomatic phrases, and to evaluate their overall understanding of text beyond literal semantics.
+ #### 2. **Tricky Questions and Hallucination Detection**
+ Another critical problem observed in commercial LLM deployments is the tendency to provide incorrect or hallucinated answers, especially when faced with logically inconsistent, ambiguous, or absurd questions.
+ Key reasons to evaluate this aspect include:
+ - **Hallucination Detection:** Preventing models from generating confident-sounding but entirely incorrect answers.
+ - **Logical Consistency:** Identifying and correctly responding (or explicitly refusing to respond) to internally inconsistent or logically flawed questions.
+ - **Protecting Model Reputation:** Ensuring the model avoids absurd or nonsensical answers, which significantly reduce user trust and the credibility of the solution.
+ ---
  ### <span style='text-decoration: #FDA428 wavy underline;'>**Dataset Information**</span>
+ All samples were written by humans. The dataset is divided into two subsets, one per testing area; illustrative records are sketched at the end of this section.
+ #### 1. **Implicatures and Phraseological Compounds Dataset**
+ - **Language:** Polish
+ - **Size:** 200 carefully selected examples
+ - **Structure of Each Example:**
+   - **Main Text:** Contains sarcasm, implicatures, and/or idiomatic phrases.
+   - **Reference Sentiment:** Annotated sentiment label (*positive*, *neutral*, *negative*).
+   - **Reference Phraseological Compounds:** Explicit list of phraseological expressions in the text.
+   - **Reference Explanation:** Clear explanation of the author's intended meaning.
+ #### 2. **Tricky Questions Dataset**
+ - **Language:** Polish
+ - **Size:** 178 examples (to be specified)
+ - **Types of Included Questions:**
+   - Logical riddles and puzzles.
+   - Questions based on semantic ambiguity (e.g., "How much sugar is needed to have a sweet voice?").
+   - Logically flawed questions (e.g., non-existent events or dates: "In what year did Poland witness a battle between the Vatican and South Africa?").
+   - Absurd or humorous questions (e.g., "How can I join a snail chess club?").
+ ---
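For concreteness, the two record layouts described above might look roughly like the following sketch. The field names are illustrative assumptions based on this section, not the actual schema of data.json:

```python
# Hypothetical record layouts inferred from the dataset description above;
# the real field names and file layout in the benchmark may differ.
implicature_record = {
    "main_text": "...",  # an opinion (in Polish) containing sarcasm, implicatures, and/or idioms
    "reference_sentiment": "negative",  # one of: positive / neutral / negative
    "reference_phraseological_compounds": ["..."],  # fixed expressions found in the text
    "reference_explanation": "...",  # what the author actually meant
}

tricky_question_record = {
    # e.g. the logically flawed question quoted above
    "question": "In what year did Poland witness a battle between the Vatican and South Africa?",
    "reference_answer": "...",  # points out the false premise instead of inventing a date
}
```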
  ### <span style='text-decoration: #FDA428 wavy underline;'>**Evaluation Procedure**</span>
+ The evaluation procedure likewise consists of two clearly separated approaches:
+ #### 1. **Implicatures and Phraseological Compounds Evaluation**
+ ##### **Evaluated Model**
+ - For each text in the dataset, the evaluated model is required to list the following three points:
+   1. **Sentiment** (*positive* or *negative* only).
+   2. **The underlying intentions of the author.**
+   3. **All phraseological compounds present in the text, along with their meanings in the given context.**
+ - **Prompt Structure** (see the sketch after this list):
+   - Written entirely in Polish, without a system prompt.
+   - Contains three main elements, each consisting of a header in capital letters and content enclosed in triple quotes:
+     1. Information defining the evaluated model's role as a careful linguist with extensive experience.
+     2. Explicit instructions for the three tasks to be performed.
+     3. Two diverse human-labeled examples:
+       - Example inputs are presented as **User Prompts**.
+       - The corresponding example responses are presented separately as **Assistant Prompts**.
+   - One target text per evaluation follows these examples.
+ - **Selection of Examples**:
+   - The two examples provided in the prompt were chosen for diversity:
+     - One with negative sentiment and multiple phraseological compounds.
+     - One with positive sentiment and no phraseological compounds.
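A minimal sketch of how the few-shot layout described above can be assembled into a chat-style message list. The function name, the Polish `TEKST` header, and the exact formatting are assumptions for illustration, not taken from the benchmark code:

```python
def build_messages(task_prompt: str,
                   example_1: str, answer_1: str,
                   example_2: str, answer_2: str,
                   target_text: str) -> list[dict]:
    """Two worked examples split into user/assistant turns, then the target text.
    No system prompt is used, matching the procedure described above."""
    return [
        {"role": "user", "content": f"{task_prompt}\n\nTEKST: '''{example_1}'''"},
        {"role": "assistant", "content": answer_1},
        {"role": "user", "content": f"TEKST: '''{example_2}'''"},
        {"role": "assistant", "content": answer_2},
        {"role": "user", "content": f"TEKST: '''{target_text}'''"},
    ]
```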
+ ##### **Judge Metamodel**
+ - The prompt structure is similar to that of the evaluated model, but it defines a distinct role (a "reliable assistant") focused exclusively on evaluation.
+ - The judge metamodel's prompt includes several diverse examples (with both positive and neutral sentiment references), each separated into a **User Prompt** (containing the evaluated model's response and the references) and an **Assistant Prompt** (containing example evaluations written by humans).
+ - The judge returns three separate evaluations in JSON format:
+   1. **Understanding of the Text** (`OCENA`) – a comparison of the evaluated model's explanation to the reference explanation.
+   2. **Sentiment Analysis** (`WYDŹWIĘK`) – an optional evaluation performed only for samples explicitly labeled with "positive" or "negative" sentiment. Neutral samples are deliberately skipped to avoid ambiguity.
+   3. **Phraseological Compounds** (`ZWIĄZKI`) – scored with a penalty rule to discourage "phrase spamming": one point is deducted for each non-reference or incorrect phrase, down to a minimum score of zero (see the sketch after the example output).
+ - **Example scoring output**:
  ```json
  {"WYDŹWIĘK": "5"}
  {"OCENA": "4"}
  {"ZWIĄZKI": "3"}
  ```
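The judge emits one JSON object per line, as shown above. The sketch below illustrates how such output can be merged into a single score dictionary and how the penalty rule for `ZWIĄZKI` behaves; it is an illustration of the stated rule, not the benchmark's actual scoring code, and the 0-5 scale is assumed from the example output:

```python
import json


def parse_judge_output(raw: str) -> dict:
    """Merge the judge's one-JSON-object-per-line output into a single score dict."""
    scores = {}
    for line in raw.splitlines():
        line = line.strip()
        if line:
            scores.update(json.loads(line))
    return {key: int(value) for key, value in scores.items()}


def compound_score(predicted: list[str], reference: list[str], max_score: int = 5) -> int:
    """Illustration of the penalty rule: one point off per phrase outside the
    reference list, never dropping below zero. When the reference is empty,
    full marks are only possible if the model reports no compounds at all."""
    extra = [phrase for phrase in predicted if phrase not in reference]
    return max(max_score - len(extra), 0)


# Example with the output shown above:
print(parse_judge_output('{"WYDŹWIĘK": "5"}\n{"OCENA": "4"}\n{"ZWIĄZKI": "3"}'))
# {'WYDŹWIĘK': 5, 'OCENA': 4, 'ZWIĄZKI': 3}
```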
+ #### 2. **Tricky Questions Evaluation**
+ ##### **Evaluated Model**
+ - Receives only the tricky question itself as input, with **no system prompt** and **no additional instructions**.
+ ##### **Judge Metamodel**
+ - Uses GPT-4o with structured JSON outputs.
+ - The prompt includes **3 diverse examples** of tricky questions together with the reference and evaluated-model responses, without splitting them into separate user and assistant prompts.
+ - The structured JSON evaluation contains two fields (see the sketch after this list):
+   - `"think"`: a detailed comparison of the evaluated model's answer to the reference, explaining the reasoning behind the score.
+   - `"mark"`: an integer from 0 to 5, where high scores correspond to accurately detecting logical flaws, ambiguity, or absurdity and refusing to hallucinate an answer.
+ - Example scoring output:
+ ```json
+ {
+   "think": "Detailed reasoning comparing the evaluated model's response to the reference answer, noting whether the evaluated model correctly identified the logical inconsistency and avoided hallucination.",
+   "mark": 4
+ }
+ ```
+ - The `"think"` field significantly simplifies analysis of the judge's evaluations by making the reasoning behind each awarded score explicit.
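A minimal sketch of how a structured `think`/`mark` judgement can be requested from GPT-4o via the OpenAI Python SDK and a Pydantic schema. The class and function names and the prompt handling are assumptions for illustration, not the benchmark's actual code:

```python
from openai import OpenAI
from pydantic import BaseModel, Field


class TrickyQuestionJudgement(BaseModel):
    think: str                     # reasoning comparing the evaluated answer to the reference
    mark: int = Field(ge=0, le=5)  # 0-5 score, as described above


client = OpenAI()


def judge_tricky_answer(judge_prompt: str) -> TrickyQuestionJudgement:
    """Ask the judge model for a structured evaluation of one tricky-question answer."""
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content": judge_prompt}],
        response_format=TrickyQuestionJudgement,
    )
    return completion.choices[0].message.parsed
```

In the benchmark, a call like this would presumably be made once per tricky question, with the judge prompt carrying the three examples described above.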
  """, unsafe_allow_html=True)

@@ -427,6 +462,8 @@ st.markdown("<hr style='border: 1px solid #A85E00;'>", unsafe_allow_html=True)
  st.markdown("""
  ### Authors:
  - [Jan Sowa](https://www.linkedin.com/in/janpiotrsowa) - leadership, writing texts, benchmark code
+ - [Natalia Nadolna](https://www.linkedin.com/in/natalia-nadolna) - benchmark code, dataset cleaning & analysis
+ - [Anna Zielińska](https://www.linkedin.com/in/zieli%C5%84ska-anna/) - benchmark code, dataset cleaning & analysis
  - [Agnieszka Kosiak](https://www.linkedin.com/in/agn-kosiak/) - writing texts
  - [Magdalena Krawczyk](https://www.linkedin.com/in/magdalena-krawczyk-7810942ab/) - writing texts, labeling
  - [Marta Matylda Kania](https://www.linkedin.com/in/martamatyldakania/) - prompt engineering
@@ -435,8 +472,10 @@ st.markdown("""
  - [Szymon Baczyński](https://www.linkedin.com/in/szymon-baczynski/) - front-end / streamlit assistant
  - [Artur Słomowski](https://www.linkedin.com/in/arturslomowski/) - front-end / streamlit assistant
  - [Maria Filipkowska](https://www.linkedin.com/in/maria-filipkowska/) - writing text, linguistic support
+ - [Magda Król](https://www.linkedin.com/in/magda-król/) - writing text
+ - [Artur Gogol](https://www.linkedin.com/in/arturgogol) - writing text
  """)

  st.divider()

+ # Run the app with `streamlit run your_script.py`