Update README.md
README.md
CHANGED
@@ -8,22 +8,8 @@ language:
 
 **Model Summary**
 
-Recently, IBM introduced GneissWeb, a large dataset of around 10 trillion tokens that meets the data quality and quantity requirements of LLM training. Models trained on GneissWeb outperform those trained on FineWeb 1.1.0 by 2.14 percentage points in average score across a set of 11 commonly used benchmarks.
-
 To make GneissWeb reproducible, we provide GneissWeb.Sci_classifier, a science-category fastText classifier. This fastText model is part of the GneissWeb ensemble filter, where it detects documents with science content.
 
-
-**Intended Use**
-
-The fastText model takes text as input and classifies it as either "science" (labeled `__label__hq`) or any other category, "cc" (labeled `__label__cc`).
-The model can be used with Python (see the [fasttext documentation](https://fasttext.cc/docs/en/python-module.html) for details on using fastText classifiers)
-or with [IBM Data Prep Kit](https://github.com/IBM/data-prep-kit/) (DPK) (see the [example notebook](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/gneissweb_classification/gneissweb_classification.ipynb) for using a fastText model with DPK).
-
-The GneissWeb ensemble filter uses the confidence score assigned to `__label__hq` to filter documents against an appropriately chosen threshold.
-The fastText model is used together with [GneissWeb.Edu_classifier](https://huggingface.co/ibm-granite/GneissWeb.Edu_classifier), [GneissWeb.Tech_classifier](https://huggingface.co/ibm-granite/GneissWeb.Tech_classifier), [GneissWeb.Med_classifier](https://huggingface.co/ibm-granite/GneissWeb.Med_classifier), and other quality annotators.
-
-
-
 **Developers**: IBM Research
 
 **Release Date**: Feb 10th, 2025
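The removed Intended Use text says the ensemble filter thresholds the confidence given to `__label__hq`. A minimal Python sketch of that scoring step might look like the following. It assumes the `fasttext` Python package; the helper names, the model path argument, and the default threshold of 0.5 are hypothetical placeholders for illustration, not values taken from GneissWeb.

```python
from typing import Sequence

HQ_LABEL = "__label__hq"  # label the README assigns to science content


def hq_confidence(labels: Sequence[str], probs: Sequence[float]) -> float:
    """Return the confidence attached to __label__hq, or 0.0 if it was not returned."""
    for label, prob in zip(labels, probs):
        if label == HQ_LABEL:
            return float(prob)
    return 0.0


def load_classifier(path: str):
    """Load a fastText model from a local file (path is supplied by the caller)."""
    import fasttext  # imported lazily so the helpers above work without it

    return fasttext.load_model(path)


def score_text(model, text: str) -> float:
    """Score one document with a loaded fastText model."""
    # fastText's predict() handles one line at a time, so collapse newlines
    # and other whitespace runs into single spaces first.
    single_line = " ".join(text.split())
    # k=-1 asks for every label, so __label__hq is present even when it ranks last.
    labels, probs = model.predict(single_line, k=-1)
    return hq_confidence(labels, probs)


def keep_as_science(model, text: str, threshold: float = 0.5) -> bool:
    """Apply an illustrative threshold; 0.5 is a placeholder, not GneissWeb's value."""
    return score_text(model, text) >= threshold
```

With a downloaded model file, usage would be along the lines of `model = load_classifier("path/to/model.bin")` followed by `keep_as_science(model, document_text, threshold=...)`, where the path and the threshold are chosen by the user.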