Update README.md #1
Opened by davanstrien (HF Staff)

README.md CHANGED
@@ -7,92 +7,61 @@ sdk: static
 pinned: false
 ---

-


-
-We are continuing to work on making more datasets available via the Hugging Face hub to help make these datasets more discoverable, open them up to new audiences, and help ensure that machine-learning datasets more closely reflect the richness of human culture.


-
-
-An overview of datasets currently made available via BigLam organised by task type.
-
-
-
-<details>
-<summary>image-classification</summary>
-
-- [19th Century United States Newspaper Advert images with 'illustrated' or 'non illustrated' labels](https://huggingface.co/datasets/biglam/illustrated_ads)
-- [Brill Iconclass AI Test Set ](https://huggingface.co/datasets/biglam/brill_iconclass)
-- [National Library of Scotland Chapbook Illustrations](https://huggingface.co/datasets/biglam/nls_chapbook_illustrations)
-- [Encyclopaedia Britannica Illustrated](https://huggingface.co/datasets/biglam/encyclopaedia_britannica_illustrated)
-- [V4Design Europeana style dataset](https://huggingface.co/datasets/biglam/v4design_europeana_style_dataset)
-- [Early Printed Books Font Detection Dataset](https://huggingface.co/datasets/biglam/early_printed_books_font_detection)
-- [DEArt: Dataset of European Art](https://huggingface.co/datasets/biglam/european_art)
-
-</details>


-
-<summary>text-classification</summary>
-
-- [Annotated dataset to assess the accuracy of the textual description of cultural heritage records](https://huggingface.co/datasets/biglam/biglam/cultural_heritage_metadata_accuracy)
-- [Atypical Animacy](https://huggingface.co/datasets/biglam/atypical_animacy)
-- [Old Bailey Proceedings](https://huggingface.co/datasets/biglam/old_bailey_proceedings)
-- [Lampeter Corpus](https://huggingface.co/datasets/biglam/lampeter_corpus)
-- [Hansard Speeches](https://huggingface.co/datasets/biglam/hansard_speech)
-- [Contentious Contexts Corpus](https://huggingface.co/datasets/biglam/contentious_contexts)

-


-
-<summary>image-to-text</summary>
-
-- [Brill Iconclass AI Test Set ](https://huggingface.co/datasets/biglam/biglam/brill_iconclass)

-


-
-
-
-- [Old Bailey Proceedings](https://huggingface.co/datasets/biglam/old_bailey_proceedings)
-- [Hansard Speeches](https://huggingface.co/datasets/biglam/hansard_speech)
-- [Berlin State Library OCR](https://huggingface.co/datasets/biglam/berlin_state_library_ocr)
-- [Literary fictions of Gallica](https://huggingface.co/datasets/biglam/gallica_literary_fictions)
-- [Europeana Newspapers ](https://huggingface.co/datasets/biglam/europeana_newspapers)
-- [Gutenberg Poetry Corpus](https://huggingface.co/datasets/biglam/gutenberg-poetry-corpus)
-- [BnL Newspapers 1841-1879](https://huggingface.co/datasets/biglam/bnl_newspapers1841-1879)

-


-
-<summary>object-detection</summary>
-
-- [National Library of Scotland Chapbook Illustrations](https://huggingface.co/datasets/biglam/nls_chapbook_illustrations)
-- [YALTAi Tabular Dataset](https://huggingface.co/datasets/biglam/yalta_ai_tabular_dataset)
-- [YALTAi Tabular Dataset](https://huggingface.co/datasets/biglam/yalta_ai_segmonto_manuscript_dataset)
-- [Beyond Words](https://huggingface.co/datasets/biglam/loc_beyond_words)
-- [DEArt: Dataset of European Art](https://huggingface.co/datasets/biglam/european_art)

-


-
-
-
-
-- [BnL Newspapers 1841-1879](https://huggingface.co/datasets/biglam/bnl_newspapers1841-1879)

-


-
-<summary>token-classification</summary>
-
-- [Unsilencing Colonial Archives via Automated Entity Recognition](https://huggingface.co/datasets/biglam/unsilence_voc)

-
+# 📚 BigLAM: Machine Learning for Libraries, Archives, and Museums

+**BigLAM** is a community-driven effort to build an open ecosystem of machine learning models, datasets, and tools for **Libraries, Archives, and Museums (LAMs)**.

+We aim to make cultural heritage data more accessible and usable for machine learning by:

+- 🗃️ **Curating and sharing LAM datasets** with potential for ML applications, hosted openly on the [Hugging Face Hub](https://huggingface.co/biglam).
+- 🤖 **Training and releasing open-source models** tailored to LAM-relevant tasks, including classification, generation, and object detection.

+---

+## ✨ Origins and Purpose

+BigLAM began as a [datasets hackathon](https://github.com/bigscience-workshop/lam) within the [BigScience 🌸](https://bigscience.huggingface.co/) project—an open scientific collaboration involving over 600 researchers from 50 countries and 250 institutions.

+Our initial goal was to make LAM data more discoverable and usable on the Hugging Face Hub. We're continuing this work with the broader aim of:

+- Helping LAM data reach new audiences.
+- Supporting researchers and practitioners working at the intersection of AI and cultural heritage.
+- Ensuring that machine learning datasets reflect the diversity and richness of human culture.

+---

+## 📂 What You'll Find Here

+The [BigLAM organization on Hugging Face](https://huggingface.co/biglam) hosts:

+- 🧠 **Datasets** from and about libraries, archives, and museums, including image, text, and tabular formats.
+- ⚙️ **Models** fine-tuned for LAM tasks, such as:

+  - Art and historical image classification
+  - OCR and document understanding
+  - Metadata quality assessment

+- 🧪 **Spaces and tools** for exploring datasets and running models interactively.

+---

+## 🧩 Get Involved

+We welcome contributions and collaborations!
+You can:

+- Explore our [datasets and models](https://huggingface.co/biglam).
+- Join the conversation by opening a [New Discussion](https://huggingface.co/spaces/biglam/README/discussions/new) on the BigLAM space.
+- Submit datasets, models, or tools that support AI for cultural heritage.
+- Use our datasets in your own research or projects—and share what you build!

+---

+## 🌍 Why It Matters

+Cultural heritage data is too often underrepresented in machine learning. By making LAM data more visible and usable:

+- We support the responsible and inclusive development of AI.
+- We help cultural institutions explore new forms of access and interpretation.
+- We ensure that machine learning models learn from the full range of human knowledge—not just what's convenient to crawl.
+- We develop tools and approaches that are tailored to the specific formats, challenges, and goals of libraries, archives, and museums—supporting long-term reuse and alignment with professional practices.