---
license: cc-by-nc-4.0
language:
- ru
- en
pipeline_tag: document-question-answering
tags:
- DocumentQA
- QuestionAnswering
- NLP
- DeepLearning
- Transformers
- Multimodal
- HuggingFace
- ruBert
- MachineLearning
- DeepQA
- AIForDocs
- Docs
- NeuralNetworks
- torch
- pytorch
- large
- text-generation-inference
library_name: transformers
metrics:
- accuracy
- f1
- recall
- exact_match
- precision
base_model:
- ai-forever/ruBert-large
---

![Official Kaleidoscope Logo](https://huggingface.co/LaciaStudio/Kaleidoscope_large_v1/resolve/main/Kaleidoscope.png)

# Document Question Answering Model - Kaleidoscope_large_v1
This model is a fine-tuned version of ai-forever/ruBert-large (formerly published as sberbank-ai/ruBert-large) for document question answering. It is adapted specifically for extracting answers from a provided document context and was fine-tuned on a custom JSON dataset of context, question, and answer triples.

# Key Features
* Objective: Extract answers from documents based on user questions.
* Base Model: ai-forever/ruBert-large.
* Dataset: A custom JSON file with the fields context, question, and answer.
* Preprocessing: The input is formed by concatenating the question and the document context, guiding the model to focus on the relevant segments (see the sketch after this list).
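
As an illustration of this preprocessing step, here is a minimal sketch (not the project's actual pipeline): passing the question and the context to the tokenizer as a text pair concatenates them into a single `[CLS] question [SEP] context [SEP]` sequence.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LaciaStudio/Kaleidoscope_large_v1")

question = "Кто разработал теорию относительности?"
context = "Альберт Эйнштейн разработал теорию относительности."

# Passing two strings produces one combined sequence:
#   [CLS] question tokens [SEP] context tokens [SEP]
# token_type_ids mark which tokens come from the question vs. the context.
encoding = tokenizer(question, context, truncation=True, max_length=384)
print(tokenizer.decode(encoding["input_ids"]))
```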

# Training Settings
* Number of epochs: 20.
* Batch size: 4 per device.
* Warmup: 10% of total training steps.
* FP16 training enabled (if CUDA is available).
* Hardware: Training was performed on a single RTX 3070.
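
The original training script is not published in this repository; the sketch below reconstructs a plausible Trainer configuration from the settings above. The output directory, evaluation schedule, and early-stopping patience are assumptions.

```python
import torch
from transformers import EarlyStoppingCallback, TrainingArguments

# Hypothetical reconstruction of the listed hyperparameters; the actual
# training script may differ.
training_args = TrainingArguments(
    output_dir="kaleidoscope_large_v1",  # assumed path
    num_train_epochs=20,                 # Number of epochs: 20
    per_device_train_batch_size=4,       # Batch size: 4 per device
    warmup_ratio=0.1,                    # warmup over 10% of total steps
    fp16=torch.cuda.is_available(),      # FP16 training if CUDA is available
    eval_strategy="epoch",               # assumed; "evaluation_strategy" in transformers < 4.41
    save_strategy="epoch",               # assumed
    load_best_model_at_end=True,         # pairs with early stopping (see Description)
    metric_for_best_model="eval_loss",
)

# Early stopping on validation loss, as described below; patience is assumed.
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
```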

# Description
The model was fine-tuned with the Transformers library using a custom training pipeline. Key aspects of the training process include:

* Custom Dataset: A loader reads a JSON file containing context, question, and answer triples.
* Feature Preparation: The script tokenizes the document and question with a sliding-window approach to handle long texts (a sketch follows this list).
* Training Process: Mixed-precision training with the AdamW optimizer.
* Evaluation and Checkpointing: The training script evaluates model performance on a validation set, saves checkpoints, and applies early stopping based on validation loss.
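
The preprocessing code itself is not published; this minimal sketch shows sliding-window encoding as the Transformers tokenizer exposes it. The stride of 128 overlapping tokens and the stand-in document are assumptions.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LaciaStudio/Kaleidoscope_large_v1")

question = "Кто разработал теорию относительности?"
# Stand-in long document; in practice this is the real document text.
long_context = " ".join(["Альберт Эйнштейн разработал теорию относительности."] * 200)

# truncation="only_second" truncates only the context, never the question;
# return_overflowing_tokens=True emits one overlapping window per chunk.
features = tokenizer(
    question,
    long_context,
    truncation="only_second",
    max_length=384,
    stride=128,                      # assumed overlap between windows
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)
print(len(features["input_ids"]))    # number of windows covering the document
```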

This makes the model well suited to interactive document question answering, for applications such as customer support, document search, and automated Q&A systems.

While primarily focused on Russian texts, the model also accepts English-language inputs. **Note, however, that English support has not been tested.**

# Example Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("LaciaStudio/Kaleidoscope_large_v1")
model = AutoModelForQuestionAnswering.from_pretrained("LaciaStudio/Kaleidoscope_large_v1")
model.to(device)
model.eval()

# Load the document that questions will be answered against.
file_path = input("Enter document path: ")
with open(file_path, "r", encoding="utf-8") as f:
    context = f.read()

while True:
    question = input("Enter question (or 'exit' to quit): ")
    if question.lower() == "exit":
        break
    # Encode the question-context pair; the context is truncated to fit 384 tokens.
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=384)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    # Take the most likely start and end positions of the answer span.
    start_index = torch.argmax(outputs.start_logits)
    end_index = torch.argmax(outputs.end_logits)
    answer_tokens = inputs["input_ids"][0][start_index:end_index + 1]
    answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
    print("Answer:", answer)
```
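
Note that with `truncation=True` the loop above only sees the first 384 tokens of the document; for longer documents, the sliding-window encoding sketched in the Description section can be combined with this loop so that every part of the context is considered.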

# Example of answering
**RU**
*Context:*

```
Альберт Эйнштейн разработал теорию относительности.
```

*Question:*

```
Кто разработал теорию относительности?
```

*Answer:*

```
альберт эинштеин
```

*(English translation of the example: Context: "Albert Einstein developed the theory of relativity." Question: "Who developed the theory of relativity?" Answer: "albert einstein".)*

**EN**
*Context:*

```
I had a red car.
```

*Question:*

```
What kind of car did I have?
```

*Answer:*

```
a red car
```

**Fine-tuned by LaciaStudio | LaciaAI**