# F1 Scores

In [None]:
library(ggpubr)
library(readr)
library(ggplot2)
library(tidyverse)
library(ARTool)
library(emmeans)
library(multcomp)
library(car)
library(rstatix)

In [10]:
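# Read the wide table (one column per retriever_reader pair) and reshape it to long format
f1_scores <- read_csv("f1_scores.csv") %>%
    rename(question = `...1`) %>%  # the unnamed first column holds the question index
    pivot_longer(!question, names_to = c("retriever", "reader"), names_sep = "_", values_to = "f1")

# Treat both design factors as categorical
f1_scores$retriever <- as.factor(f1_scores$retriever)
f1_scores$reader <- as.factor(f1_scores$reader)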

head(f1_scores)

New names:
* `` -> ...1

Rows: 59 Columns: 5
-- Column specification --------------------------------------------------------
Delimiter: ","
dbl (5): ...1, faiss_dpr, faiss_longformer, es_dpr, es_longformer

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.


question,retriever,reader,f1
<dbl>,<fct>,<fct>,<dbl>
0,faiss,dpr,0.0
0,faiss,longformer,0.0
0,es,dpr,0.1300813
0,es,longformer,0.7692308
1,faiss,dpr,0.0
1,faiss,longformer,0.0


To decide which statistical tests are appropriate, we first check whether the F1 scores are normally distributed. For this, we apply a Shapiro-Wilk test of normality separately to each retriever (FAISS, Elasticsearch) and each reader (DPR, Longformer). As the results below show, all $p$-values are below 0.001, so we reject the null hypothesis of normality for every group: none of the F1 score distributions can be treated as normal.

In [21]:
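# Shapiro-Wilk test for the FAISS retriever (pooled over readers)
f1_scores %>%
    filter(retriever == "faiss") %>%
    shapiro_test(f1)

# Shapiro-Wilk test for the Elasticsearch retriever (pooled over readers)
f1_scores %>%
    filter(retriever == "es") %>%
    shapiro_test(f1)

# Shapiro-Wilk test for the DPR reader (pooled over retrievers)
f1_scores %>%
    filter(reader == "dpr") %>%
    shapiro_test(f1)

# Shapiro-Wilk test for the Longformer reader (pooled over retrievers)
f1_scores %>%
    filter(reader == "longformer") %>%
    shapiro_test(f1)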


variable,statistic,p
<chr>,<dbl>,<dbl>
f1,0.5086706,3.999447e-18


variable,statistic,p
<chr>,<dbl>,<dbl>
f1,0.7704567,2.671656e-12


variable,statistic,p
<chr>,<dbl>,<dbl>
f1,0.6741031,7.912632e-15


variable,statistic,p
<chr>,<dbl>,<dbl>
f1,0.6558935,3.037616e-15
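
The four tests above look at each factor level pooled over the other factor. As a minimal sketch, the same check can be run once per retriever/reader combination using only base R's `shapiro.test`. The `toy` data frame below is a synthetic stand-in (hypothetical values, not the data from `f1_scores.csv`) so that the block runs on its own; with the real data, `f1_scores` would take its place.

```r
# Synthetic stand-in for f1_scores (hypothetical values, NOT the notebook's data)
set.seed(42)
toy <- expand.grid(question = 0:58,
                   retriever = c("faiss", "es"),
                   reader = c("dpr", "longformer"))
# Zero-inflated scores in [0, 1], loosely mimicking F1 values with many exact zeros
toy$f1 <- ifelse(runif(nrow(toy)) < 0.5, 0, runif(nrow(toy)))

# Split the scores into the four retriever x reader cells
cells <- split(toy$f1, list(toy$retriever, toy$reader))

# Run one Shapiro-Wilk test per cell and collect statistic and p-value
sw <- t(sapply(cells, function(x) {
    res <- shapiro.test(x)
    c(statistic = unname(res$statistic), p = res$p.value)
}))
sw
```

Each row of `sw` corresponds to one cell of the design, which makes it easier to see whether a single combination is responsible for the departure from normality.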


Since the F1 scores are not normally distributed, a standard ANOVA is not appropriate for comparing our results. Instead, we use the aligned rank transform (ART), a non-parametric alternative to a factorial repeated-measures ANOVA.
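
To make the idea behind the aligned rank transform more concrete, here is a rough base-R sketch of the alignment and ranking steps for a single main effect (the retriever). It illustrates the general procedure on a balanced design and is not ARTool's actual implementation; it also ignores any per-question grouping. The `toy` data frame and all variable names are hypothetical.

```r
# Synthetic, balanced stand-in data (hypothetical values, NOT the notebook's data)
set.seed(1)
toy <- expand.grid(question = 1:20,
                   retriever = c("faiss", "es"),
                   reader = c("dpr", "longformer"))
toy$f1 <- runif(nrow(toy))

# Means needed for the alignment
grand_mean <- mean(toy$f1)
retriever_mean <- ave(toy$f1, toy$retriever)           # mean per retriever level
cell_mean <- ave(toy$f1, toy$retriever, toy$reader)    # mean per retriever x reader cell

# Step 1: strip all effects by subtracting the cell mean
residual <- toy$f1 - cell_mean
# Step 2: estimated effect of interest (here: the retriever main effect)
retriever_effect <- retriever_mean - grand_mean
# Step 3: aligned response = residual + effect of interest
aligned <- residual + retriever_effect
# Step 4: rank the aligned responses (ties receive average ranks)
toy$art_rank <- rank(aligned)

# Step 5: full factorial ANOVA on the ranks; only the retriever row is interpreted
art_fit <- anova(lm(art_rank ~ retriever * reader, data = toy))
art_fit["retriever", ]
```

The same steps are repeated with a different estimated effect for the reader and for the interaction, which is why each effect gets its own aligned and ranked response.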

In [22]:
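# Fit the aligned rank transform model with both factors and their interaction
model.acc <- art(f1 ~ retriever * reader, data = f1_scores)
# Omnibus tests for the two main effects and the interaction
anova(model.acc)
# Post-hoc pairwise contrasts for each main effect
art.con(model.acc, ~ retriever)
art.con(model.acc, ~ reader)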

Term,Df,Df.res,Sum Sq,Sum Sq.res,F value,Pr(>F)
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
retriever,1,232,200452.9,793168.0,58.63206,5.105423e-13
reader,1,232,66045.36,944311.6,16.22613,7.620176e-05
retriever:reader,1,232,158290.44,843714.0,43.52587,2.804257e-10


NOTE: Results may be misleading due to involvement in interactions



 contrast   estimate   SE  df t.ratio p.value
 es - faiss     58.3 7.61 232   7.657  <.0001

Results are averaged over the levels of: reader 

NOTE: Results may be misleading due to involvement in interactions



 contrast         estimate   SE  df t.ratio p.value
 dpr - longformer    -33.5 8.31 232  -4.028  0.0001

Results are averaged over the levels of: retriever 

From these results, we can see that both the retriever and the reader have a significant main effect on the F1 score ($F(1, 232) = 58.63$ and $F(1, 232) = 16.23$ respectively, $p < 0.0001$ for both). However, there is also a significant interaction between the retriever and the reader ($F(1, 232) = 43.53$, $p < 0.0001$), so, as the notes in the output warn, the main-effect contrasts should be interpreted with caution. Averaged over the levels of the other factor, the post-hoc contrasts show that Elasticsearch performs better than FAISS ($p < 0.0001$) and Longformer performs better than DPR ($p = 0.0001$).
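
Because the interaction is significant, one possible follow-up is to compare the four retriever/reader combinations directly. The sketch below first summarises each cell with base R and then, only if the ARTool package is installed, requests pairwise contrasts across the combined factor via `art.con(m, "retriever:reader")`. The `toy` data frame is a synthetic stand-in (hypothetical values, not the data from `f1_scores.csv`), and the Holm adjustment is an assumption made for illustration, not a choice made in the analysis above.

```r
# Synthetic stand-in for f1_scores (hypothetical values, NOT the notebook's data)
set.seed(7)
toy <- expand.grid(question = factor(0:58),
                   retriever = c("faiss", "es"),
                   reader = c("dpr", "longformer"))
toy$f1 <- runif(nrow(toy))

# Mean and median F1 per retriever x reader cell
cell_summary <- aggregate(f1 ~ retriever + reader, data = toy,
                          FUN = function(x) c(mean = mean(x), median = median(x)))
print(cell_summary)

# Pairwise contrasts between all four cells (skipped if ARTool is not installed)
if (requireNamespace("ARTool", quietly = TRUE)) {
    library(ARTool)
    m <- art(f1 ~ retriever * reader, data = toy)
    # "retriever:reader" compares the combinations of both factors;
    # adjust = "holm" is an illustrative choice of multiple-comparison correction
    print(art.con(m, "retriever:reader", adjust = "holm"))
}
```

With the real data, replacing `toy` by `f1_scores` would show which combination drives the interaction, for example whether the reader difference holds for both retrievers.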