Commit cf7c394 by mtyrrell
1 parent: 64cd537

Update README.md

  1. README.md +8 -5
README.md CHANGED
@@ -38,7 +38,7 @@ The model is a binary text classifier based on [sentence-transformers/all-mpnet-
38
 
39
  ## Intended uses & limitations
40
 
41
- The classifier assigns a class of 'Unconditional' or 'Conditional' to denote the strength of commitments as portrayed in extracted passages from the documents. The intended use is for climate policy researchers and analysts seeking to automate the process of reviewing lengthy, non-standardized PDF documents to produce summaries and reports.
42
 
43
  Due to inconsistencies in the training data, the classifier performance leaves room for improvement. The classifier exhibits reasonably good training metrics (F1 ~ 0.85), balanced between precise identification of true positive classifications (precision ~ 0.85) and a wide net to capture as many true positives as possible (recall ~ 0.85). When tested on real-world unseen test data, the performance was suboptimal for a binary classifier (F1 ~ 0.5). However, testing was based on a small out-of-sample dataset containing its own inconsistencies. Therefore, classification may prove more robust in practice.
44
 
@@ -56,10 +56,13 @@ The pre-processing operations used to produce the final training dataset were as
56
  1. Dataset is filtered based on 'medium' value in 'strategy' column (sequence length = 85).
57
  2. For IKITracs, labels are assigned based on the presence of certain substrings ('_unc' or '_c') in the 'parameter' values, which correspond to assessments of 'unconditional' or 'conditional' commitments by human annotators.
58
  3. For ClimateWatch, the 'QuestionText' field is searched for the terms 'unconditional' or 'conditional', and labels assigned accordingly.
59
- 3. If 'context_translated' is available and the 'language' is not English, 'context' is replaced with 'context_translated'.
60
- 4. The dataset is "exploded" - i.e., the text samples in the 'context' column, which are lists, are converted into separate rows - and labels are merged to align with the associated samples.
61
- 5. The 'match_onanswer' and 'answerWordcount' are used conditionally to select high quality samples (prefers high % of word matches in 'match_onanswer', but will take lower if there is a high 'answerWordcount')
62
- 6. Data is then augmented using sentence shuffle from the ```albumentations``` library (NLP methods insertion and substitution were also tried, but lowered the performance of the model and were therefore not included in the final training data)
 
 
 
63
 
64
 
65
  ## Training procedure
 
38
 
39
  ## Intended uses & limitations
40
 
41
+ The classifier assigns a class of **'Unconditional' or 'Conditional' to denote the strength of commitments** as portrayed in extracted passages from the documents. The intended use is for climate policy researchers and analysts seeking to automate the process of reviewing lengthy, non-standardized PDF documents to produce summaries and reports.
42
 
43
  Due to inconsistencies in the training data, the classifier performance leaves room for improvement. The classifier exhibits reasonably good training metrics (F1 ~ 0.85), balanced between precise identification of true positive classifications (precision ~ 0.85) and a wide net to capture as many true positives as possible (recall ~ 0.85). When tested on real-world unseen test data, the performance was suboptimal for a binary classifier (F1 ~ 0.5). However, testing was based on a small out-of-sample dataset containing its own inconsistencies. Therefore, classification may prove more robust in practice.
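As a quick check on the reported figures, F1 is the harmonic mean of precision and recall, so equal precision and recall of about 0.85 imply an F1 of about 0.85. The helper below is a minimal sketch for illustration; the function name is not from the original README.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention when no positives are predicted or found
    return 2 * precision * recall / (precision + recall)


# Training metrics reported in the paragraph above (approximate values).
train_f1 = f1_from_precision_recall(0.85, 0.85)
print(round(train_f1, 2))
```

The out-of-sample F1 of roughly 0.5 cannot be decomposed the same way, because the README does not report the precision and recall behind it.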
44
 
 
56
  1. Dataset is filtered based on 'medium' value in 'strategy' column (sequence length = 85).
57
  2. For IKITracs, labels are assigned based on the presence of certain substrings ('_unc' or '_c') in the 'parameter' values, which correspond to assessments of 'unconditional' or 'conditional' commitments by human annotators.
58
  3. For ClimateWatch, the 'QuestionText' field is searched for the terms 'unconditional' or 'conditional', and labels assigned accordingly.
59
+ 4. If 'context_translated' is available and the 'language' is not English, 'context' is replaced with 'context_translated'.
60
+ 5. The dataset is "exploded" - i.e., the text samples in the 'context' column, which are lists, are converted into separate rows - and labels are merged to align with the associated samples.
61
+ 6. The 'match_onanswer' and 'answerWordcount' fields are used conditionally to select high-quality samples (a high % of word matches in 'match_onanswer' is preferred, but a lower match is accepted if 'answerWordcount' is high).
62
+ 7. Data is then augmented using sentence shuffle from the ```albumentations``` library (the NLP methods insertion and substitution were also tried, but they lowered model performance and were therefore not included in the final training data). This increases the number of training samples available for the Unconditional class from 774 to 1163. The result is an equal number of samples per class:
63
+ > -UNCONDITIONAL: 1163
64
+ > -CONDITIONAL: 1163
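Steps 2 to 6 above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' code: the function names, the example 'parameter' values, the 'en' language code, the list-valued 'context_translated' field, and all three quality thresholds are hypothetical. Only the column names and the substrings come from the README.

```python
from typing import Dict, List, Optional


def label_ikitracs(parameter: str) -> Optional[str]:
    """Label an IKITracs row from substrings of its 'parameter' value."""
    # The README names the substrings only; plain containment is an assumption.
    if "_unc" in parameter:
        return "Unconditional"
    if "_c" in parameter:
        return "Conditional"
    return None


def label_climatewatch(question_text: str) -> Optional[str]:
    """Label a ClimateWatch row from its 'QuestionText' field."""
    text = question_text.lower()
    # 'unconditional' contains 'conditional', so it has to be tested first.
    if "unconditional" in text:
        return "Unconditional"
    if "conditional" in text:
        return "Conditional"
    return None


def prefer_translation(row: Dict) -> Dict:
    """Use 'context_translated' when present and the language is not English."""
    out = dict(row)
    # Assumes English is stored as the code 'en' (not stated in the README).
    if row.get("context_translated") and row.get("language") != "en":
        out["context"] = row["context_translated"]
    return out


def explode(rows: List[Dict]) -> List[Dict]:
    """Emit one row per text sample; other fields, including the label, are copied."""
    return [{**row, "context": text} for row in rows for text in row["context"]]


def is_high_quality(row: Dict, high_match: float = 0.8,
                    low_match: float = 0.5, min_words: int = 50) -> bool:
    """Accept a high match, or a lower match backed by a long answer."""
    # All three thresholds are hypothetical; the README gives no values.
    if row["match_onanswer"] >= high_match:
        return True
    return row["match_onanswer"] >= low_match and row["answerWordcount"] >= min_words


# Made-up rows for illustration only.
rows = [
    {"parameter": "target_unc", "language": "fr",
     "context": ["texte un", "texte deux"],
     "context_translated": ["text one", "text two"],
     "match_onanswer": 0.9, "answerWordcount": 12},
    {"parameter": "target_c", "language": "en",
     "context": ["text three"], "context_translated": None,
     "match_onanswer": 0.6, "answerWordcount": 80},
    {"parameter": "target_c", "language": "en",
     "context": ["text four"], "context_translated": None,
     "match_onanswer": 0.3, "answerWordcount": 10},
]

labelled = [{**prefer_translation(r), "label": label_ikitracs(r["parameter"])}
            for r in rows]
samples = [r for r in explode(labelled) if is_high_quality(r)]
for s in samples:
    print(s["label"], "|", s["context"])
```

With a DataFrame, the same "explode" step is usually done with pandas `DataFrame.explode` on the 'context' column. The list comprehension here keeps the sketch free of dependencies.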
65
+
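The sentence-shuffle augmentation in step 7 can be sketched as below. This is a standard-library stand-in written for illustration; it does not use or reproduce the ```albumentations``` transform named in the README, and the function names, the regex-based sentence splitter, and the seed are assumptions.

```python
import random
import re
from typing import List


def shuffle_sentences(text: str, rng: random.Random) -> str:
    """Return the text with its sentences in a random order."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    rng.shuffle(sentences)
    return " ".join(sentences)


def oversample(texts: List[str], target: int, seed: int = 0) -> List[str]:
    """Append shuffled copies of random originals until `target` samples exist."""
    if not texts:
        raise ValueError("need at least one text to augment")
    rng = random.Random(seed)
    augmented = list(texts)
    while len(augmented) < target:
        augmented.append(shuffle_sentences(rng.choice(texts), rng))
    return augmented


# Placeholder texts standing in for the 774 Unconditional samples.
unconditional = [f"First point {i}. Second point {i}. Third point {i}."
                 for i in range(774)]
balanced = oversample(unconditional, target=1163)
print(len(balanced))
```

A shuffle can return the original sentence order, and single-sentence passages never change, so some augmented samples may duplicate their source. Deduplicating after augmentation would be one way to handle that.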
66
 
67
 
68
  ## Training procedure