**Notes**


---
**Goal -** Patentability of a patent document (Binary classification - Accepted / Rejected)

If marked as Rejected, it means it was given both final and non-finel rejection and was ultimately abandoned by the applicant.

**Dataset :**
Harvard USPTO Patent Dataset (34 fields)

Why this dataset?


1.   It has info about patent text at the time of original filing along with metadata.
2.   Larger dataset and can be used for a wide range of ML applications.








# Loading Dataset

In [1]:
# Install required packages
!pip install datasets
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m474.6/474.6 kB[0m [31m8.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting xxhash
  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m212.5/212.5 kB[0m [31m11.8 MB/s[0m eta [36m0:00:00[0m
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting huggingface-hub<1.0.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m224.5/224.5 kB[0m [31m17.7 MB/s[0m eta [36m0:00:00[0m
Collecting aiohttp
  Downloading aiohttp-3.8.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [2]:
# Import libraries
from datasets import load_dataset

In [3]:
# Loading a small subset of patent filings during the month of January in 2016
dataset_dict = load_dataset('HUPD/hupd',
    name='sample',
    data_files="https://huggingface.co/datasets/HUPD/hupd/blob/main/hupd_metadata_2022-02-22.feather", 
    icpr_label=None,
    train_filing_start_date='2016-01-01',
    train_filing_end_date='2016-01-21',
    val_filing_start_date='2016-01-22',
    val_filing_end_date='2016-01-31',
)

Downloading builder script:   0%|          | 0.00/14.7k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/10.9k [00:00<?, ?B/s]

Downloading and preparing dataset hupd/sample to /root/.cache/huggingface/datasets/HUPD___hupd/sample-23bcfec45c886e8c/0.0.0/6920d2def8fd7767046c0470603357f76866e5a09c97e19571896bfdca521142...
Loading dataset with config: PatentsConfig(name='sample', version=0.0.0, data_dir='sample', data_files={'train': ['https://huggingface.co/datasets/HUPD/hupd/blob/main/hupd_metadata_2022-02-22.feather']}, description='Patent data from January 2016, for debugging')


Downloading data:   0%|          | 0.00/6.67M [00:00<?, ?B/s]

Using metadata file: /root/.cache/huggingface/datasets/downloads/bac34b767c2799633010fa78ecd401d2eeffd62eff58abdb4db75829f8932710


Downloading data:   0%|          | 0.00/388M [00:00<?, ?B/s]

Reading metadata file: /root/.cache/huggingface/datasets/downloads/bac34b767c2799633010fa78ecd401d2eeffd62eff58abdb4db75829f8932710
Filtering train dataset by filing start date: 2016-01-01
Filtering train dataset by filing end date: 2016-01-21
Filtering val dataset by filing start date: 2016-01-22
Filtering val dataset by filing end date: 2016-01-31


Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Dataset hupd downloaded and prepared to /root/.cache/huggingface/datasets/HUPD___hupd/sample-23bcfec45c886e8c/0.0.0/6920d2def8fd7767046c0470603357f76866e5a09c97e19571896bfdca521142. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [4]:
# Print info about the sizes of the train and validation sets
print(f'Train dataset size: {dataset_dict["train"].shape}')
print(f'Validation dataset size: {dataset_dict["validation"].shape}')

Train dataset size: (16153, 14)
Validation dataset size: (9094, 14)


## Processing the target variable (Patent decisions)

There are 6 decisions present in the USPTO dataset which consist of pending, cont- decisions, accepted and rejected. Since here, our goal is to do binary classification, we want to remove patents with those other decisions and also convert them to numbers for prediction.

In [5]:
# Label-to-index mapping for the decision status field
decision_to_str = {'REJECTED': 0, 'ACCEPTED': 1, 'PENDING': 2, 'CONT-REJECTED': 3, 'CONT-ACCEPTED': 4, 'CONT-PENDING': 5}

# Helper function
def map_decision_to_string(example):
    return {'decision': decision_to_str[example['decision']]}

In [6]:
# Re-labeling/mapping in both train and validation set
train_set = dataset_dict['train'].map(map_decision_to_string)
val_set = dataset_dict['validation'].map(map_decision_to_string)

Map:   0%|          | 0/16153 [00:00<?, ? examples/s]

Map:   0%|          | 0/9094 [00:00<?, ? examples/s]

In [7]:
# Filtering only those patents that have decisions as accepted/rejected
train_set = train_set.filter(lambda e: e['decision'] <= 1)
val_set = val_set.filter(lambda e: e['decision'] <= 1)

Filter:   0%|          | 0/16153 [00:00<?, ? examples/s]

Filter:   0%|          | 0/9094 [00:00<?, ? examples/s]

# Creating a dataloader, a model and a tokenizer

Getting the model and tokenizer from transformers

In [8]:
# Import libraries
import random
import numpy as np
import collections
from tqdm import tqdm

from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoConfig
# Torch and torch dataloader
import torch
from torch.utils.data import DataLoader

In [9]:
# Fixing the random seed
RANDOM_SEED = 1729
torch.manual_seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)

In [22]:
# Define model_name from the list ['bert-base-uncased', 'distilbert-base-uncased', 'roberta-base', 'gpt2', 'allenai/longformer-base-4096']
# Define number of classes and max length of the sequence
model_name = 'distilbert-base-uncased'
CLASSES = 2
max_length = 512

In [23]:
config = AutoConfig.from_pretrained(model_name, num_labels=CLASSES, output_hidden_states=False)
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
if model_name == 'gpt2':
  tokenizer.pad_token = tokenizer.eos_token
tokenizer.max_length = max_length
tokenizer.model_max_length = max_length
# Model
model = AutoModelForSequenceClassification.from_config(config=config)

In [24]:
print(f'Model name: {model_name} \nModel params: {model.num_parameters()}')

Model name: distilbert-base-uncased 
Model params: 66955010


Creating a dataset containing of tokenized text of a section

In [25]:
# Define the section to tokenize
_SECTION_ = 'claims'

In [26]:
# Training set
train_set = train_set.map(
    lambda e: tokenizer((e[_SECTION_]), truncation=True, padding='max_length'),
    batched=True)
# Validation set
val_set = val_set.map(
    lambda e: tokenizer((e[_SECTION_]), truncation=True, padding='max_length'),
    batched=True)

Map:   0%|          | 0/8719 [00:00<?, ? examples/s]

Map:   0%|          | 0/4888 [00:00<?, ? examples/s]

In [27]:
# Set the dataset format
train_set.set_format(type='torch', 
    columns=['input_ids', 'attention_mask', 'decision'])

val_set.set_format(type='torch', 
    columns=['input_ids', 'attention_mask', 'decision'])

Creating a Pytorch Dataloader to pass to the torch model for training

In [28]:
# train_dataloader and val_data_loader
train_dataloader = DataLoader(train_set, batch_size=16, shuffle = True)
val_dataloader = DataLoader(val_set, batch_size=16)

In [29]:
# Deleting unnecessary variables to save space
del dataset_dict

# Training

In [30]:
# Connect to Google Drive
from google.colab import drive

drive.mount('/content/drive/', force_remount=True)

Mounted at /content/drive/


In [31]:
# Import libraries
# Confusion matrix
from sklearn.metrics import confusion_matrix

# For scheduling 
from transformers import get_linear_schedule_with_warmup

In [32]:
# Define number of epochs to train
epoch_n = 5
# Optimizer and scheduler
optim = torch.optim.AdamW(params=model.parameters(), lr=2e-5, eps=1e-8)
total_steps = len(train_dataloader) * epoch_n 
scheduler = get_linear_schedule_with_warmup(optim, num_warmup_steps = 0, num_training_steps = total_steps)
# Loss function 
criterion = torch.nn.CrossEntropyLoss()

In [33]:
# To use cuda
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [34]:
CLASS_NAMES = [i for i in range(CLASSES)]
# Calculate TOP1 accuracy
def measure_accuracy(outputs, labels):
    preds = np.argmax(outputs, axis=1).flatten()
    labels = labels.flatten()
    correct = np.sum(preds == labels)
    c_matrix = confusion_matrix(labels, preds, labels=CLASS_NAMES)
    return correct, len(labels), c_matrix

In [35]:
def validation(model, data_loader):
  # Inference
  model.eval()
  total_loss = 0.
  total_correct = 0
  total_correct_class_level = 0
  total_sample = 0
  total_confusion = np.zeros((CLASSES, CLASSES))
  # Loop over the examples in the evaluation set
  for i, batch in enumerate(tqdm(data_loader)):
    inputs, decisions = batch['input_ids'], batch['decision']
    inputs = inputs.to(device)
    decisions = decisions.to(device)
    with torch.no_grad():
      outputs = model(input_ids=inputs, labels=decisions).logits
    loss = criterion(outputs, decisions) 
    logits = outputs 
    total_loss += loss.cpu().item()
    correct_n, sample_n, c_matrix = measure_accuracy(logits.cpu().numpy(), decisions.cpu().numpy())
    total_confusion += c_matrix
    total_correct += correct_n
    total_sample += sample_n
    
  # Print the performance of the model on the validation set 
  print(f'*** Accuracy on the validation set: {total_correct/total_sample}')
  print(f'*** Confusion matrix:\n{total_confusion}')

  return total_loss, float(total_correct/total_sample) * 100.

In [36]:
def training():
  # Training mode is on
  model.to(device)
  model.train()
  # Best validation set accuracy so far.
  best_val_acc = 0
  for epoch in range(epoch_n):
    total_train_loss = 0.
    # Loop over the examples in the training set.
    for i, batch in enumerate(tqdm(train_dataloader)):
      inputs, decisions = batch['input_ids'], batch['decision']
      inputs = inputs.to(device, non_blocking=True)
      decisions = decisions.to(device, non_blocking=True)
    
    # Forward pass
    outputs = model(input_ids=inputs, labels=decisions).logits
    loss = criterion(outputs, decisions) #outputs.logits
    total_train_loss += loss.cpu().item()
    
    # Backward pass
    loss.backward()
    optim.step()
    if scheduler:
      scheduler.step()
    optim.zero_grad()

    # Print the loss every 500 steps
    if i % 500 == 0:
       print(f'*** Loss: {loss}')
       
       # Get the performance of the model on the validation set
       _, val_acc = validation(model, val_dataloader)
       model.train()

       # Save model if val_acc improves
       if best_val_acc < val_acc:
         best_val_acc = val_acc
         model.save_pretrained('/content/drive/MyDrive/Finetune_models/')
         tokenizer.save_pretrained('/content/drive/MyDrive/Finetune_models/tokenizers/', legacy_format=False)
  
  # Training is complete!
  print(f'\n ~ The End ~')

  # Final evaluation on the validation set
  _, val_acc = validation(model, val_dataloader)
  if best_val_acc < val_acc:
    best_val_acc = val_acc
    model.save_pretrained('/content/drive/MyDrive/Finetune_models/')

  # Additionally, print the performance of the model on the training set
  _, train_val_acc = validation(model, train_dataloader)
  print(f'*** Accuracy on the training set: {train_val_acc}.')



In [37]:
training()

100%|██████████| 545/545 [00:01<00:00, 325.60it/s]
100%|██████████| 545/545 [00:01<00:00, 464.90it/s]
100%|██████████| 545/545 [00:01<00:00, 429.16it/s]
100%|██████████| 545/545 [00:01<00:00, 432.43it/s]
100%|██████████| 545/545 [00:01<00:00, 462.72it/s]



 ~ The End ~


100%|██████████| 306/306 [01:15<00:00,  4.06it/s]


*** Accuracy on the validation set: 0.7964402618657938
*** Confusion matrix:
[[   0.  995.]
 [   0. 3893.]]


100%|██████████| 545/545 [02:25<00:00,  3.76it/s]

*** Accuracy on the validation set: 0.7965363000344076
*** Confusion matrix:
[[   0. 1774.]
 [   0. 6945.]]
*** Accuracy on the training set: 79.65363000344075.





# Inference

Getting the patentability score from saved fine tuned model.

In [38]:
# Import Libraries
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

In [39]:
# Model/tokenizer name or path
model_name_or_path = '/content/drive/MyDrive/Finetune_models/'
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# Model
model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path)

Using eos_token, but it is not set yet.


### 1. With already tokenized dataset

In [40]:
# Getting one sample from validation dataset to test
# Get the next batch
batch = next(iter(val_dataloader))
# Print the ids
print(batch['input_ids'])
# Print the labels
print(batch['decision'])

tensor([[  101,  1015,  1012,  ..., 16503,  2063,   102],
        [  101,  1015,  1012,  ...,  3341,  2012,   102],
        [  101,  1015,  1011,  ...,  1012,  2861,   102],
        ...,
        [  101,  1015,  1011,  ...,  2003,  2030,   102],
        [  101,  1015,  1011,  ...,  2003,  2019,   102],
        [  101,  1015,  1012,  ...,     0,     0,     0]])
tensor([0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1])


In [41]:
# A helper function that converts ids into tokens
def convert_ids_to_string(tokenizer, input):
    return ' '.join(tokenizer.convert_ids_to_tokens(input))

In [42]:
# Print the example
print(convert_ids_to_string(tokenizer,batch['input_ids'][1]))

[CLS] 1 . a cl ##amp arrangement for supporting a fractured portion of a wooden post wherein at least part of the cl ##amp arrangement needs to be driven into the ground about the wooden post , said arrangement including : a pair of brackets adapted to be clamped on opposing sides of the wooden post to en ##cap ##sul ##ate said fractured portion ; each bracket made from an integral sheet of metal cut and pressed to define a longitudinal hem ##is ##pher ##ical chamber of compatible diameter dimensions to the wooden post ; a pair of longitudinal fl ##ange ##s extended out from each side at a ci ##rc ##um ##fer ##ential ci ##rc ##um ##ference edge at an open end of the longitudinal hem ##is ##pher ##ical chamber , wherein each fl ##ange includes a plurality of holes so that when each bracket is brought together about the post the plurality of holes on the respective longitudinal fl ##ange ##s line up so that a fast ##ening arrangement can pass through the holes to cl ##amp each bracket to

In [43]:
inputs = (batch['input_ids'][1])
decisions = (batch['decision'][1])
# If new text given then -
# inputs = tokenizer(TEXT, return_tensors="pt").to(device)

with torch.no_grad():
  outputs = model(input_ids=inputs, labels=decisions).logits

print(np.argmax(outputs, axis=-1)) # prediction


tensor([1])


In [46]:
print(outputs)

tensor([[-0.7122,  0.8950]])


In [47]:
print(inputs.shape)

torch.Size([512])


### 2. With getting the dataset from hugging face each time

In [91]:
# Loading a small subset of patent filings during the month of January in 2016
dataset_dict = load_dataset('HUPD/hupd',
    name='sample',
    data_files="https://huggingface.co/datasets/HUPD/hupd/blob/main/hupd_metadata_2022-02-22.feather", 
    icpr_label=None,
    train_filing_start_date='2016-01-01',
    train_filing_end_date='2016-01-21',
    val_filing_start_date='2016-01-22',
    val_filing_end_date='2016-01-31',
)



  0%|          | 0/2 [00:00<?, ?it/s]

In [92]:
# Label-to-index mapping for the decision status field
decision_to_str = {'REJECTED': 0, 'ACCEPTED': 1, 'PENDING': 2, 'CONT-REJECTED': 3, 'CONT-ACCEPTED': 4, 'CONT-PENDING': 5}

# Helper function
def map_decision_to_string(example):
    return {'decision': decision_to_str[example['decision']]}

In [93]:
# Re-labeling/mapping in validation set
val_set = dataset_dict['validation'].map(map_decision_to_string)
# Filtering only those patents that have decisions as accepted/rejected
val_set = val_set.filter(lambda e: e['decision'] <= 1)



In [94]:
# Select dict values with patent number
val_set = val_set.filter(lambda e: e['patent_number'] == '14908791')



In [95]:
# Display abstract and claims
val_set['abstract']

['An expression vector for production of a recombinant protein in a host cell is provided. The expression vector includes a nucleotide sequence of Sequence ID No 2 encoding for a leader peptide of sequence ID No 3.']

In [96]:
# Model/tokenizer name or path
model_name_or_path = '/content/drive/MyDrive/Finetune_models/'
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Model
model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path)

In [97]:
# Tokenize the validation dataset
val_set = val_set.map(
    lambda e: tokenizer((e[_SECTION_]), truncation=True, padding='max_length'),
    batched=True)
val_set.set_format(type='torch', 
    columns=['input_ids', 'attention_mask', 'decision'])
val_dataloader = DataLoader(val_set, batch_size=16)
batch = next(iter(val_dataloader))
inputs = (batch['input_ids'][0])
decisions = (batch['decision'][0])



In [98]:
with torch.no_grad():
  outputs = model(input_ids=inputs, labels=decisions).logits

In [109]:
prediction = np.argmax(outputs, axis=-1).stride()[0] # prediction
print(prediction)
value = {i for i in decision_to_str if decision_to_str[i]==prediction}
str(value)

1


"{'ACCEPTED'}"

In [100]:
# ground-truth
decisions

tensor(1)

In [133]:
# Patentability score
outputs[0][1].item() * 100

88.5982096195221