# Approach
## VectorDB
Several aspects of choosing a vector DB may be unique to your situation. Think through your hardware, utilization, latency requirements, scale, and so on before choosing.

I've been hearing a lot about LanceDB and wanted to check it out. It's newer and may or may not be good for **your** use-case. I'm attracted by its fast ingestion, CUDA-assisted indexing, and portability. It has some drawbacks: it doesn't support HNSW yet, and it could change significantly given how early it is.


You will be blown away by how fast ingestion and indexing are with LanceDB.

## Ingestion Strategy
I upload the `.ndjson` files (~100k documents each) in sequence. Once every file is uploaded, I build the index.

## Indexing
The algorithm used is `IVF_PQ`. I keep the `PQ` compression as light as possible because I want better recall. Recall is important since Jais only has a 2k context window, so I can't put my top 10 documents for RAG in my prompt. It will be my top 3 (512\*3 + query + instructions ~ 2k). For many use-cases heavier `PQ` compression is worth the trade-off, as you get much faster retrieval without much loss in retrieval quality.
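The budget arithmetic behind "top 3" can be sketched as follows. This is only an illustration: it assumes "2k" means 2048 tokens, and the 512 tokens reserved for the query and instructions is a hypothetical figure, not one measured in this notebook.

```python
# Token budget for a RAG prompt.
CONTEXT_WINDOW = 2048   # assumption: "2k" context window means 2048 tokens
TOKENS_PER_DOC = 512    # size of each retrieved document (from the text)
RESERVED = 512          # hypothetical: tokens set aside for query + instructions


def max_docs(context_window, tokens_per_doc, reserved):
    """Largest number of documents that fit once `reserved` tokens
    are set aside for the query and instructions."""
    return max(0, (context_window - reserved) // tokens_per_doc)


top_k = max_docs(CONTEXT_WINDOW, TOKENS_PER_DOC, RESERVED)
print(top_k)                              # 3
print(top_k * TOKENS_PER_DOC + RESERVED)  # 2048
```

With so few slots available, a relevant document that the index misses cannot be compensated for by retrieving more candidates, which is why recall matters here.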

More partitions mean faster retrieval but slower indexing. I set `num_sub_vectors` to 384, equal to my embedding dimension, so each sub-vector covers a single dimension.

```tbl.create_index(num_partitions=1024, num_sub_vectors=384, accelerator="cuda")```

Read more about it [here](https://lancedb.github.io/lancedb/ann_indexes/).
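To make these settings a bit more concrete, here is a back-of-the-envelope sketch. The dimension, partition count, and sub-vector count come from the call above; the 8-bit code size, float32 storage, and the 2M row count are assumptions for illustration, so check them against the LanceDB docs and your own data.

```python
# Rough numbers for the IVF_PQ settings used above.
DIM = 384               # embedding dimension (from the text)
NUM_SUB_VECTORS = 384   # from the create_index call
NUM_PARTITIONS = 1024   # from the create_index call

# Assumptions, not taken from the original:
BITS_PER_CODE = 8       # one byte per sub-vector code
BYTES_PER_FLOAT = 4     # float32 embeddings
NUM_ROWS = 2_000_000    # illustrative corpus size


def pq_summary(dim, num_sub_vectors, bits_per_code=BITS_PER_CODE):
    """Return (dims per sub-vector, raw bytes, PQ bytes) for one vector."""
    if dim % num_sub_vectors != 0:
        raise ValueError("dim must be divisible by num_sub_vectors")
    raw_bytes = dim * BYTES_PER_FLOAT
    pq_bytes = num_sub_vectors * bits_per_code // 8
    return dim // num_sub_vectors, raw_bytes, pq_bytes


dims_per_sub, raw_bytes, pq_bytes = pq_summary(DIM, NUM_SUB_VECTORS)
print(dims_per_sub)                      # 1
print(raw_bytes, pq_bytes)               # 1536 384
print(raw_bytes / pq_bytes)              # 4.0

# For comparison: fewer sub-vectors compress harder
print(pq_summary(DIM, 96))               # (4, 1536, 96)

# Average rows that one partition holds
print(round(NUM_ROWS / NUM_PARTITIONS))  # 1953
```

Under these assumptions, 384 sub-vectors give only 4x compression, while 96 sub-vectors would give 16x at the cost of coarser codes. That is the recall versus speed trade-off described above.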

# Imports

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [2]:
from pathlib import Path
import json

from tqdm.notebook import tqdm
import lancedb

In [3]:
proj_dir = Path.cwd().parent
print(proj_dir)

/home/ec2-user/arabic-wiki


# Config

In [4]:
files_in = list((proj_dir / 'data/embedded/').glob('*.ndjson'))

# Setup
To work with LanceDB we want to create the table before ingesting the first batch. The simplest way is to create it from at least one document, so the schema can be inferred from the data.

In [5]:
with open(files_in[0], 'r') as f:
    first_line = f.readline().strip()  # read only the first line
    document = json.loads(first_line)
    document['vector'] = document.pop('embedding')

In [6]:
doc = document.copy()
doc['vector'] = doc['vector'][:5] + ['...']
doc

{'content': 'الماء مادةٌ شفافةٌ عديمة اللون والرائحة، وهو المكوّن الأساسي للجداول والبحيرات والبحار والمحيطات وكذلك للسوائل في جميع الكائنات الحيّة، وهو أكثر المركّبات الكيميائيّة انتشاراً على سطح الأرض. يتألّف جزيء الماء من ذرّة أكسجين مركزية ترتبط بها ذرّتا هيدروجين على طرفيها برابطة تساهميّة بحيث تكون صيغته الكيميائية H2O. عند الظروف القياسية من الضغط ودرجة الحرارة يكون الماء سائلاً؛ أمّا الحالة الصلبة فتتشكّل عند نقطة التجمّد، وتدعى بالجليد؛ أمّا الحالة الغازية فتتشكّل عند نقطة الغليان، وتسمّى بخار الماء.\nإنّ الماء هو أساس وجود الحياة على كوكب الأرض، وهو يغطّي 71% من سطحها، وتمثّل مياه البحار والمحيطات أكبر نسبة للماء على الأرض، حيث تبلغ حوالي 96.5%. وتتوزّع النسب الباقية بين المياه الجوفيّة وبين جليد المناطق القطبيّة (1.7% لكليهما)، مع وجود نسبة صغيرة على شكل بخار ماء معلّق في الهواء على هيئة سحاب (غيوم)، وأحياناً أخرى على هيئة ضباب أو ندى، بالإضافة إلى الزخات المطريّة أو الثلجيّة. تبلغ نسبة الماء العذب حوالي 2.5% فقط من الماء الموجود على الأرض، وأغلب هذه الكمّيّة (حوالي 99%) موج

Here we create the db and the table.

In [7]:
# Connect to (or create) a local LanceDB database inside the project directory
db = lancedb.connect(proj_dir/".lancedb")
# Create the table from a single document so the schema is inferred.
# Note: this first document is added again when its file is ingested below.
tbl = db.create_table('arabic-wiki', [document])

For each file we:
- Read the `ndjson` into a list of documents
- Replace 'embedding' with 'vector' to be compatible with LanceDB
- Write the docs to the table

After that we index with the CUDA accelerator.
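The per-file steps listed above load each whole file into memory. As a variation, they can be sketched as a generator that streams the file in batches. This is a minimal sketch, not the code used in this notebook: `iter_batches` and its `batch_size` default are hypothetical names and values chosen for illustration.

```python
import json
from itertools import islice


def iter_batches(path, batch_size=10_000):
    """Yield lists of documents from an ndjson file, renaming
    'embedding' to 'vector' along the way."""
    with open(path, 'r', encoding='utf-8') as f:
        # Parse lazily and skip blank lines
        docs = (json.loads(line) for line in f if line.strip())
        while True:
            batch = list(islice(docs, batch_size))
            if not batch:
                break
            for doc in batch:
                if 'embedding' in doc:
                    doc['vector'] = doc.pop('embedding')
            yield batch


# Usage with the table from this notebook would look like:
# for batch in iter_batches(file_in):
#     tbl.add(batch)
```

Whether batching helps depends on how much memory you have. The cell below simply reads each file in full.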

In [8]:
%%time
for file_in in tqdm(files_in, desc='Wiki Files: '):

    tqdm.write(f"Reading documents {file_in}")
    with open(file_in, 'r') as f:
        documents = [json.loads(line) for line in f]
    tqdm.write("Read documents")

    # LanceDB looks for the embedding in a column named 'vector' by default
    for doc in tqdm(documents):
        if 'embedding' in doc:
            doc['vector'] = doc.pop('embedding')

    tqdm.write(f"Adding documents {file_in}")
    tbl.add(documents)
    tqdm.write("Added documents")

# Build the index once, after every file has been added
tbl.create_index(
    num_partitions=1024,
    num_sub_vectors=384,
    accelerator="cuda"
)

Wiki Files:   0%|          | 0/23 [00:00<?, ?it/s]

Reading documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_1.ndjson
Read documents


  0%|          | 0/243068 [00:00<?, ?it/s]

Adding documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_1.ndjson
Added documents
Reading documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_2.ndjson
Read documents


  0%|          | 0/104065 [00:00<?, ?it/s]

Adding documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_2.ndjson
Added documents
Reading documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_3.ndjson
Read documents


  0%|          | 0/123154 [00:00<?, ?it/s]

Adding documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_3.ndjson
Added documents
Reading documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_4.ndjson
Read documents


  0%|          | 0/135965 [00:00<?, ?it/s]

Adding documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_4.ndjson
Added documents
Reading documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_5.ndjson
Read documents


  0%|          | 0/99138 [00:00<?, ?it/s]

Adding documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_5.ndjson
Added documents
Reading documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_6.ndjson
Read documents


  0%|          | 0/83678 [00:00<?, ?it/s]

Adding documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_6.ndjson
Added documents
Reading documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_7.ndjson
Read documents


  0%|          | 0/30573 [00:00<?, ?it/s]

Adding documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_7.ndjson
Added documents
Reading documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_8.ndjson
Read documents


  0%|          | 0/78957 [00:00<?, ?it/s]

Adding documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_8.ndjson
Added documents
Reading documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_9.ndjson
Read documents


  0%|          | 0/86327 [00:00<?, ?it/s]

Adding documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_9.ndjson
Added documents
Reading documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_10.ndjson
Read documents


  0%|          | 0/83111 [00:00<?, ?it/s]

Adding documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_10.ndjson
Added documents
Reading documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_11.ndjson
Read documents


  0%|          | 0/92664 [00:00<?, ?it/s]

Adding documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_11.ndjson
Added documents
Reading documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_12.ndjson
Read documents


  0%|          | 0/66404 [00:00<?, ?it/s]

Adding documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_12.ndjson
Added documents
Reading documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_13.ndjson
Read documents


  0%|          | 0/62844 [00:00<?, ?it/s]

Adding documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_13.ndjson
Added documents
Reading documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_14.ndjson
Read documents


  0%|          | 0/59349 [00:00<?, ?it/s]

Adding documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_14.ndjson
Added documents
Reading documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_15.ndjson
Read documents


  0%|          | 0/52554 [00:00<?, ?it/s]

Adding documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_15.ndjson
Added documents
Reading documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_16.ndjson
Read documents


  0%|          | 0/34240 [00:00<?, ?it/s]

Adding documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_16.ndjson
Added documents
Reading documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_17.ndjson
Read documents


  0%|          | 0/35933 [00:00<?, ?it/s]

Adding documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_17.ndjson
Added documents
Reading documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_18.ndjson
Read documents


  0%|          | 0/64575 [00:00<?, ?it/s]

Adding documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_18.ndjson
Added documents
Reading documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_19.ndjson
Read documents


  0%|          | 0/94244 [00:00<?, ?it/s]

Adding documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_19.ndjson
Added documents
Reading documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_20.ndjson
Read documents


  0%|          | 0/124472 [00:00<?, ?it/s]

Adding documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_20.ndjson
Added documents
Reading documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_21.ndjson
Read documents


  0%|          | 0/121849 [00:00<?, ?it/s]

Adding documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_21.ndjson
Added documents
Reading documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_22.ndjson
Read documents


  0%|          | 0/147110 [00:00<?, ?it/s]

Adding documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_22.ndjson
Added documents
Reading documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_23.ndjson
Read documents


  0%|          | 0/70322 [00:00<?, ?it/s]

Adding documents /home/ec2-user/arabic-wiki/data/embedded/ar_wiki_23.ndjson
Added documents


  new_centroids.index_reduce_(
 34%|███████████████████████████████████                                                                    | 17/50 [01:26<02:47,  5.06s/it]
[2023-10-31T18:47:43Z WARN  lance_linalg::kmeans] KMeans: cluster 108 is empty
[2023-10-31T18:52:24Z WARN  lance_linalg::kmeans] KMeans: cluster 227 is empty
[2023-10-31T18:57:24Z WARN  lance_linalg::kmeans] KMeans: cluster 167 is empty
[2023-10-31T19:14:19Z WARN  lance_linalg::kmeans] KMeans: cluster 160 is empty


CPU times: user 2h 2min 51s, sys: 1min 38s, total: 2h 4min 30s
Wall time: 42min 56s


It's crazy how fast it was: roughly 43 minutes of wall time to ingest and index >2M documents. Let's run a test to make sure it worked!

In [9]:
from sentence_transformers import SentenceTransformer

name = "sentence-transformers/paraphrase-multilingual-minilm-l12-v2"
model = SentenceTransformer(name)

# the same embedding function is used for the documents and for queries
def embed_func(batch):
    # encode() accepts a list of sentences and embeds them in one call
    return list(model.encode(batch))

In [11]:
query = "What is the capital of China? I think it's Singapore."
query_vector = embed_func([query])[0]
[doc['meta']['title'] for doc in tbl.search(query_vector).limit(10).to_list()]

['بكين',
 'كونمينغ',
 'نينغشيا',
 'تاي يوان',
 'تشنغتشو',
 'شانغهاي',
 'سنغافورة',
 'دلتا نهر يانغتسي',
 'تشانغتشون',
 'بكين']
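Since recall is the reason for the index settings chosen earlier, one way to go beyond eyeballing the titles is to compare the indexed results against an exact (brute-force) ranking for the same query. Below is a minimal sketch of the metric only. `recall_at_k` is a hypothetical helper, and the ID lists are made up for illustration rather than taken from this table.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k that also appears in the approximate top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    approx_top = set(approx_ids[:k])
    return len(exact_top & approx_top) / len(exact_top)


# Hypothetical result IDs, for illustration only
exact = ["a", "b", "c", "d"]
approx = ["a", "c", "e", "b"]

# Exact top-3 is {a, b, c}; approximate top-3 is {a, c, e}, so 2 of 3 match
print(round(recall_at_k(approx, exact, k=3), 3))  # 0.667
```

With only three documents going into the prompt, `k=3` is the value that matters for this use-case.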