Commit · a9ee52d
0 Parent(s)
feat: first iteration
Browse files
- .gitignore +10 -0
- .gradio/certificate.pem +31 -0
- README.md +35 -0
- app.py +465 -0
- pyproject.toml +13 -0
- requirements.txt +0 -0
- sample_spectrum.mgf +33 -0
- uv.lock +0 -0
.gitignore
ADDED
@@ -0,0 +1,10 @@
+# Python-generated files
+__pycache__/
+*.py[oc]
+build/
+dist/
+wheels/
+*.egg-info
+
+# Virtual environments
+.venv
.gradio/certificate.pem
ADDED
@@ -0,0 +1,31 @@
+-----BEGIN CERTIFICATE-----
+MIIFazCCA1OgAwIBAgIRAIIQz7DSQONZRGPgu2OCiwAwDQYJKoZIhvcNAQELBQAw
+TzELMAkGA1UEBhMCVVMxKTAnBgNVBAoTIEludGVybmV0IFNlY3VyaXR5IFJlc2Vh
+cmNoIEdyb3VwMRUwEwYDVQQDEwxJU1JHIFJvb3QgWDEwHhcNMTUwNjA0MTEwNDM4
+WhcNMzUwNjA0MTEwNDM4WjBPMQswCQYDVQQGEwJVUzEpMCcGA1UEChMgSW50ZXJu
+ZXQgU2VjdXJpdHkgUmVzZWFyY2ggR3JvdXAxFTATBgNVBAMTDElTUkcgUm9vdCBY
+MTCCAiIwDQYJKoZIhvcNAQEBBQADggIPADCCAgoCggIBAK3oJHP0FDfzm54rVygc
+h77ct984kIxuPOZXoHj3dcKi/vVqbvYATyjb3miGbESTtrFj/RQSa78f0uoxmyF+
+0TM8ukj13Xnfs7j/EvEhmkvBioZxaUpmZmyPfjxwv60pIgbz5MDmgK7iS4+3mX6U
+A5/TR5d8mUgjU+g4rk8Kb4Mu0UlXjIB0ttov0DiNewNwIRt18jA8+o+u3dpjq+sW
+T8KOEUt+zwvo/7V3LvSye0rgTBIlDHCNAymg4VMk7BPZ7hm/ELNKjD+Jo2FR3qyH
+B5T0Y3HsLuJvW5iB4YlcNHlsdu87kGJ55tukmi8mxdAQ4Q7e2RCOFvu396j3x+UC
+B5iPNgiV5+I3lg02dZ77DnKxHZu8A/lJBdiB3QW0KtZB6awBdpUKD9jf1b0SHzUv
+KBds0pjBqAlkd25HN7rOrFleaJ1/ctaJxQZBKT5ZPt0m9STJEadao0xAH0ahmbWn
+OlFuhjuefXKnEgV4We0+UXgVCwOPjdAvBbI+e0ocS3MFEvzG6uBQE3xDk3SzynTn
+jh8BCNAw1FtxNrQHusEwMFxIt4I7mKZ9YIqioymCzLq9gwQbooMDQaHWBfEbwrbw
+qHyGO0aoSCqI3Haadr8faqU9GY/rOPNk3sgrDQoo//fb4hVC1CLQJ13hef4Y53CI
+rU7m2Ys6xt0nUW7/vGT1M0NPAgMBAAGjQjBAMA4GA1UdDwEB/wQEAwIBBjAPBgNV
+HRMBAf8EBTADAQH/MB0GA1UdDgQWBBR5tFnme7bl5AFzgAiIyBpY9umbbjANBgkq
+hkiG9w0BAQsFAAOCAgEAVR9YqbyyqFDQDLHYGmkgJykIrGF1XIpu+ILlaS/V9lZL
+ubhzEFnTIZd+50xx+7LSYK05qAvqFyFWhfFQDlnrzuBZ6brJFe+GnY+EgPbk6ZGQ
+3BebYhtF8GaV0nxvwuo77x/Py9auJ/GpsMiu/X1+mvoiBOv/2X/qkSsisRcOj/KK
+NFtY2PwByVS5uCbMiogziUwthDyC3+6WVwW6LLv3xLfHTjuCvjHIInNzktHCgKQ5
+ORAzI4JMPJ+GslWYHb4phowim57iaztXOoJwTdwJx4nLCgdNbOhdjsnvzqvHu7Ur
+TkXWStAmzOVyyghqpZXjFaH3pO3JLF+l+/+sKAIuvtd7u+Nxe5AW0wdeRlN8NwdC
+jNPElpzVmbUq4JUagEiuTDkHzsxHpFKVK7q4+63SM1N95R1NbdWhscdCb+ZAJzVc
+oyi3B43njTOQ5yOf+1CceWxG1bQVs5ZufpsMljq4Ui0/1lvh+wjChP4kqKOJ2qxq
+4RgqsahDYVvTH9w7jXbyLeiNdd8XM2w9U/t7y0Ff/9yi0GE44Za4rF2LN9d11TPA
+mRGunUHBcnWEvgJBQl9nJEiU0Zsnvgc/ubhPgXRR4Xq37Z0j4r7g1SgEEzwxA57d
+emyPxgcYxn/eR44/KJ4EBs+lVDR3veyJm+kXQ99b21/+jh5Xos1AnX5iItreGCc=
+-----END CERTIFICATE-----
README.md
ADDED
@@ -0,0 +1,35 @@
+---
+title: InstaNovo De Novo Sequencing
+emoji: 🚀🧪
+colorFrom: blue
+colorTo: green
+sdk: gradio
+sdk_version: 4.16.0 # Check your gradio version
+app_file: app.py
+pinned: false
+license: apache-2.0
+---
+
+# InstaNovo _De Novo_ Peptide Sequencing Demo
+
+This Space provides a web interface for the [InstaNovo](https://github.com/instadeepai/InstaNovo) model for _de novo_ peptide sequencing from mass spectrometry data.
+
+**Features:**
+
+* Upload MS/MS data in common formats (`.mgf`, `.mzml`, `.mzxml`).
+* Choose between fast Greedy Search or more accurate but slower Knapsack Beam Search.
+* View predictions directly in the interface.
+* Download full results as a CSV file.
+
+**How to Use:**
+
+1. Upload your mass spectrometry data file.
+2. Select the desired decoding method.
+3. Click "Predict Sequences".
+4. View the results table and download the CSV if needed.
+
+**Model:**
+
+This demo uses the `instanovo-v1.1.0` pretrained model checkpoint.
+
+**Note:** Processing large files can take time, depending on the file size and the chosen decoding method. Knapsack generation (if needed on the first run) can also add to the initial startup time.
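Review note on the README: it says knapsack generation can add to startup time on the first run. The sketch below illustrates one plausible reason, assuming the knapsack is essentially a dynamic-programming table over integer-scaled masses that records which total masses can be built from residue masses. The function `reachable_masses`, the toy residues, and their masses are hypothetical and are not InstaNovo's `Knapsack` implementation.

```python
def reachable_masses(residue_masses, max_mass, mass_scale):
    """Mark which integer-scaled masses can be formed by summing residue masses."""
    limit = round(max_mass * mass_scale)
    # Scale to integers so the table can be indexed by mass.
    steps = sorted({round(m * mass_scale) for m in residue_masses.values()})
    reachable = [False] * (limit + 1)
    reachable[0] = True  # the empty sequence has mass 0
    for total in range(1, limit + 1):
        # A mass is reachable if removing one residue leaves a reachable mass.
        reachable[total] = any(
            step <= total and reachable[total - step] for step in steps
        )
    return reachable


# Toy residues with made-up masses, chosen only for illustration.
toy_residues = {"A": 3.0, "B": 5.0}
table = reachable_masses(toy_residues, max_mass=12.0, mass_scale=1)
print([mass for mass, ok in enumerate(table) if ok])
# prints [0, 3, 5, 6, 8, 9, 10, 11, 12]
```

The table size grows with `max_mass * mass_scale`, and each entry checks every residue, which is why building it once and caching it on disk (as `app.py` does in `KNAPSACK_DIR`) is worthwhile.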
app.py
ADDED
@@ -0,0 +1,465 @@
+import gradio as gr
+import torch
+import os
+import tempfile
+import time
+import polars as pl
+import numpy as np
+from pathlib import Path
+from omegaconf import OmegaConf, DictConfig
+
+# --- InstaNovo Imports ---
+# It's good practice to handle potential import issues
+try:
+    from instanovo.transformer.model import InstaNovo
+    from instanovo.utils import SpectrumDataFrame, ResidueSet, Metrics
+    from instanovo.transformer.dataset import SpectrumDataset, collate_batch
+    from instanovo.inference import (
+        GreedyDecoder,
+        KnapsackBeamSearchDecoder,
+        Knapsack,
+        ScoredSequence,
+        Decoder,
+    )
+    from instanovo.constants import MASS_SCALE, MAX_MASS
+    from torch.utils.data import DataLoader
+except ImportError as e:
+    print(f"Error importing InstaNovo components: {e}")
+    print("Please ensure InstaNovo is installed correctly.")
+    # InstaNovo is required: the annotations and calls below would otherwise fail with NameError
+    raise
+
+# --- Configuration ---
+MODEL_ID = "instanovo-v1.1.0" # Use the desired pretrained model ID
+KNAPSACK_DIR = Path("./knapsack_cache")
+DEFAULT_CONFIG_PATH = Path("./configs/inference/default.yaml") # Assuming instanovo installs configs locally relative to execution
+
+# Determine device
+DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+FP16 = DEVICE == "cuda" # Enable FP16 only on CUDA
+
+# --- Global Variables (Load Model and Knapsack Once) ---
+MODEL: InstaNovo | None = None
+KNAPSACK: Knapsack | None = None
+MODEL_CONFIG: DictConfig | None = None
+RESIDUE_SET: ResidueSet | None = None
+
+def load_model_and_knapsack():
+    """Loads the InstaNovo model and generates/loads the knapsack."""
+    global MODEL, KNAPSACK, MODEL_CONFIG, RESIDUE_SET
+    if MODEL is not None:
+        print("Model already loaded.")
+        return
+
+    print(f"Loading InstaNovo model: {MODEL_ID} to {DEVICE}...")
+    try:
+        MODEL, MODEL_CONFIG = InstaNovo.from_pretrained(MODEL_ID)
+        MODEL.to(DEVICE)
+        MODEL.eval()
+        RESIDUE_SET = MODEL.residue_set
+        print("Model loaded successfully.")
+    except Exception as e:
+        print(f"Error loading model: {e}")
+        raise gr.Error(f"Failed to load InstaNovo model: {MODEL_ID}. Error: {e}")
+
+    # --- Knapsack Handling ---
+    KNAPSACK_DIR.mkdir(parents=True, exist_ok=True)
+    knapsack_exists = (
+        (KNAPSACK_DIR / "parameters.pkl").exists() and
+        (KNAPSACK_DIR / "masses.npy").exists() and
+        (KNAPSACK_DIR / "chart.npy").exists()
+    )
+
+    if knapsack_exists:
+        print(f"Loading pre-generated knapsack from {KNAPSACK_DIR}...")
+        try:
+            KNAPSACK = Knapsack.from_file(str(KNAPSACK_DIR))
+            print("Knapsack loaded successfully.")
+        except Exception as e:
+            print(f"Error loading knapsack: {e}. Will attempt to regenerate.")
+            KNAPSACK = None # Force regeneration
+            knapsack_exists = False # Ensure generation happens
+
+    if not knapsack_exists:
+        print("Knapsack not found or failed to load. Generating knapsack...")
+        if RESIDUE_SET is None:
+            raise gr.Error("Cannot generate knapsack because ResidueSet failed to load.")
+        try:
+            # Prepare residue masses for knapsack generation (handle negative/zero masses)
+            residue_masses_knapsack = dict(RESIDUE_SET.residue_masses.copy())
+            negative_residues = [k for k, v in residue_masses_knapsack.items() if v <= 0]
+            if negative_residues:
+                print(f"Warning: Non-positive masses found in residues: {negative_residues}. "
+                      "Excluding from knapsack generation.")
+                for res in negative_residues:
+                    del residue_masses_knapsack[res]
+            # Remove special tokens explicitly if they somehow got mass
+            for special_token in RESIDUE_SET.special_tokens:
+                if special_token in residue_masses_knapsack:
+                    del residue_masses_knapsack[special_token]
+
+            # Ensure residue indices used match those without special/negative masses
+            valid_residue_indices = {
+                res: idx for res, idx in RESIDUE_SET.residue_to_index.items()
+                if res in residue_masses_knapsack
+            }
+
+
+            KNAPSACK = Knapsack.construct_knapsack(
+                residue_masses=residue_masses_knapsack,
+                residue_indices=valid_residue_indices, # Use only valid indices
+                max_mass=MAX_MASS,
+                mass_scale=MASS_SCALE,
+            )
+            print(f"Knapsack generated. Saving to {KNAPSACK_DIR}...")
+            KNAPSACK.save(str(KNAPSACK_DIR)) # Save for future runs
+            print("Knapsack saved.")
+        except Exception as e:
+            print(f"Error generating or saving knapsack: {e}")
+            gr.Warning("Failed to generate Knapsack. Knapsack Beam Search will not be available.")
+            KNAPSACK = None # Ensure it's None if generation failed
+
+# Load the model and knapsack when the script starts
+load_model_and_knapsack()
+
+def create_inference_config(
+    input_path: str,
+    output_path: str,
+    decoding_method: str,
+) -> DictConfig:
+    """Creates the OmegaConf DictConfig needed for prediction."""
+    # Load default config if available, otherwise create from scratch
+    if DEFAULT_CONFIG_PATH.exists():
+        base_cfg = OmegaConf.load(DEFAULT_CONFIG_PATH)
+    else:
+        print(f"Warning: Default config not found at {DEFAULT_CONFIG_PATH}. Using minimal config.")
+        # Create a minimal config if default is missing
+        base_cfg = OmegaConf.create({
+            "data_path": None,
+            "instanovo_model": MODEL_ID,
+            "output_path": None,
+            "knapsack_path": str(KNAPSACK_DIR),
+            "denovo": True,
+            "refine": False, # Not doing refinement here
+            "num_beams": 1,
+            "max_length": 40,
+            "max_charge": 10,
+            "isotope_error_range": [0, 1],
+            "subset": 1.0,
+            "use_knapsack": False,
+            "save_beams": False,
+            "batch_size": 64, # Adjust as needed
+            "device": DEVICE,
+            "fp16": FP16,
+            "log_interval": 500, # Less relevant for Gradio app
+            "use_basic_logging": True,
+            "filter_precursor_ppm": 20,
+            "filter_confidence": 1e-4,
+            "filter_fdr_threshold": 0.05,
+            "residue_remapping": { # Add default mappings
+                "M(ox)": "M[UNIMOD:35]", "M(+15.99)": "M[UNIMOD:35]",
+                "S(p)": "S[UNIMOD:21]", "T(p)": "T[UNIMOD:21]", "Y(p)": "Y[UNIMOD:21]",
+                "S(+79.97)": "S[UNIMOD:21]", "T(+79.97)": "T[UNIMOD:21]", "Y(+79.97)": "Y[UNIMOD:21]",
+                "Q(+0.98)": "Q[UNIMOD:7]", "N(+0.98)": "N[UNIMOD:7]",
+                "Q(+.98)": "Q[UNIMOD:7]", "N(+.98)": "N[UNIMOD:7]",
+                "C(+57.02)": "C[UNIMOD:4]",
+                "(+42.01)": "[UNIMOD:1]", "(+43.01)": "[UNIMOD:5]", "(-17.03)": "[UNIMOD:385]",
+            },
+            "column_map": { # Add default mappings
+                "Modified sequence": "modified_sequence", "MS/MS m/z": "precursor_mz",
+                "Mass": "precursor_mass", "Charge": "precursor_charge",
+                "Mass values": "mz_array", "Mass spectrum": "mz_array",
+                "Intensity": "intensity_array", "Raw intensity spectrum": "intensity_array",
+                "Scan number": "scan_number"
+            },
+            "index_columns": [
+                "scan_number", "precursor_mz", "precursor_charge",
+            ],
+            # Add other defaults if needed based on errors
+        })
+
+    # Override specific parameters
+    cfg_overrides = {
+        "data_path": input_path,
+        "output_path": output_path,
+        "device": DEVICE,
+        "fp16": FP16,
+        "denovo": True,
+        "refine": False,
+    }
+
+    if "Greedy" in decoding_method:
+        cfg_overrides["num_beams"] = 1
+        cfg_overrides["use_knapsack"] = False
+    elif "Knapsack" in decoding_method:
+        if KNAPSACK is None:
+            raise gr.Error("Knapsack is not available. Cannot use Knapsack Beam Search.")
+        cfg_overrides["num_beams"] = 5
+        cfg_overrides["use_knapsack"] = True
+        cfg_overrides["knapsack_path"] = str(KNAPSACK_DIR)
+    else:
+        raise ValueError(f"Unknown decoding method: {decoding_method}")
+
+    # Merge base config with overrides
+    final_cfg = OmegaConf.merge(base_cfg, cfg_overrides)
+    return final_cfg
+
+
+def predict_peptides(input_file, decoding_method):
+    """
+    Main function to load data, run prediction, and return results.
+    """
+    if MODEL is None or RESIDUE_SET is None or MODEL_CONFIG is None:
+        load_model_and_knapsack() # Attempt to reload if None (e.g., after space restart)
+        if MODEL is None:
+            raise gr.Error("InstaNovo model is not loaded. Cannot perform prediction.")
+
+    if input_file is None:
+        raise gr.Error("Please upload a mass spectrometry file.")
+
+    input_path = input_file.name # Gradio provides the path in .name
+    print(f"Processing file: {input_path}")
+    print(f"Using decoding method: {decoding_method}")
+
+    # Create a temporary file for the output CSV
+    with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as temp_out:
+        output_csv_path = temp_out.name
+
+    try:
+        # 1. Create Config
+        config = create_inference_config(input_path, output_csv_path, decoding_method)
+        print("Inference Config:\n", OmegaConf.to_yaml(config))
+
+        # 2. Load Data using SpectrumDataFrame
+        print("Loading spectrum data...")
+        try:
+            sdf = SpectrumDataFrame.load(
+                config.data_path,
+                lazy=False, # Load eagerly for Gradio simplicity
+                is_annotated=False, # De novo mode
+                column_mapping=config.get("column_map", None),
+                shuffle=False,
+                verbose=True # Print loading logs
+            )
+            # Apply charge filter like in CLI
+            original_size = len(sdf)
+            max_charge = config.get("max_charge", 10)
+            sdf.filter_rows(
+                lambda row: (row["precursor_charge"] <= max_charge) and (row["precursor_charge"] > 0)
+            )
+            if len(sdf) < original_size:
+                print(f"Warning: Filtered {original_size - len(sdf)} spectra with charge > {max_charge} or <= 0.")
+
+            if len(sdf) == 0:
+                raise gr.Error("No valid spectra found in the uploaded file after filtering.")
+            print(f"Data loaded: {len(sdf)} spectra.")
+        except Exception as e:
+            print(f"Error loading data: {e}")
+            raise gr.Error(f"Failed to load or process the spectrum file. Error: {e}")
+
+        # 3. Prepare Dataset and DataLoader
+        ds = SpectrumDataset(
+            sdf,
+            RESIDUE_SET,
+            MODEL_CONFIG.get("n_peaks", 200),
+            return_str=True, # Needed for greedy/beam search targets later (though not used here)
+            annotated=False,
+            pad_spectrum_max_length=config.get("compile_model", False) or config.get("use_flash_attention", False),
+            bin_spectra=config.get("conv_peak_encoder", False),
+        )
+        dl = DataLoader(
+            ds,
+            batch_size=config.batch_size,
+            num_workers=0, # Required by SpectrumDataFrame
+            shuffle=False, # Required by SpectrumDataFrame
+            collate_fn=collate_batch,
+        )
+
+        # 4. Select Decoder
+        print("Initializing decoder...")
+        decoder: Decoder
+        if config.use_knapsack:
+            if KNAPSACK is None:
+                # This check should ideally be earlier, but double-check
+                raise gr.Error("Knapsack is required for Knapsack Beam Search but is not available.")
+            # KnapsackBeamSearchDecoder doesn't directly load from path in this version?
+            # We load Knapsack globally, so just pass it.
+            # If it needed path: decoder = KnapsackBeamSearchDecoder.from_file(model=MODEL, path=config.knapsack_path)
+            decoder = KnapsackBeamSearchDecoder(model=MODEL, knapsack=KNAPSACK)
+        elif config.num_beams > 1:
+            # BeamSearchDecoder is available but not explicitly requested, use Greedy for num_beams=1
+            print(f"Warning: num_beams={config.num_beams} > 1 but only Greedy and Knapsack Beam Search are implemented in this app. Defaulting to Greedy.")
+            decoder = GreedyDecoder(model=MODEL, mass_scale=MASS_SCALE)
+        else:
+            decoder = GreedyDecoder(
+                model=MODEL,
+                mass_scale=MASS_SCALE,
+                # Add suppression options if needed from config
+                suppressed_residues=config.get("suppressed_residues", None),
+                disable_terminal_residues_anywhere=config.get("disable_terminal_residues_anywhere", True),
+            )
+        print(f"Using decoder: {type(decoder).__name__}")
+
+        # 5. Run Prediction Loop (Adapted from instanovo/transformer/predict.py)
+        print("Starting prediction...")
+        start_time = time.time()
+        results_list: list[ScoredSequence | list] = [] # Store ScoredSequence or empty list
+
+        for i, batch in enumerate(dl):
+            spectra, precursors, spectra_mask, _, _ = batch # Ignore peptides/masks for de novo
+            spectra = spectra.to(DEVICE)
+            precursors = precursors.to(DEVICE)
+            spectra_mask = spectra_mask.to(DEVICE)
+
+            with torch.no_grad(), torch.amp.autocast(DEVICE, dtype=torch.float16, enabled=FP16):
+                # Beam search decoder might return list[list[ScoredSequence]] if return_beam=True
+                # Greedy decoder returns list[ScoredSequence]
+                # KnapsackBeamSearchDecoder returns list[ScoredSequence] or list[list[ScoredSequence]]
+                batch_predictions = decoder.decode(
+                    spectra=spectra,
+                    precursors=precursors,
+                    beam_size=config.num_beams,
+                    max_length=config.max_length,
+                    # Knapsack/Beam Search specific params if needed
+                    mass_tolerance=config.get("filter_precursor_ppm", 20) * 1e-6, # Convert ppm to relative
+                    max_isotope=config.isotope_error_range[1] if config.isotope_error_range else 1,
+                    return_beam=False # Only get the top prediction for simplicity
+                )
+            results_list.extend(batch_predictions) # Should be list[ScoredSequence] or list[list]
+            print(f"Processed batch {i+1}/{len(dl)}")
+
+        end_time = time.time()
+        print(f"Prediction finished in {end_time - start_time:.2f} seconds.")
+
+        # 6. Format Results
+        print("Formatting results...")
+        output_data = []
+        # Use sdf index columns + prediction results
+        index_cols = [col for col in config.index_columns if col in sdf.df.columns]
+        base_df_pd = sdf.df.select(index_cols).to_pandas() # Get base info
+
+        metrics_calc = Metrics(RESIDUE_SET, config.isotope_error_range)
+
+        for i, res in enumerate(results_list):
+            row_data = base_df_pd.iloc[i].to_dict() # Get corresponding input data
+            if isinstance(res, ScoredSequence) and res.sequence:
+                sequence_str = "".join(res.sequence)
+                row_data["prediction"] = sequence_str
+                row_data["log_probability"] = f"{res.sequence_log_probability:.4f}"
+                # Use metrics to calculate delta mass ppm for the top prediction
+                try:
+                    _, delta_mass_list = metrics_calc.matches_precursor(
+                        res.sequence,
+                        row_data["precursor_mz"],
+                        row_data["precursor_charge"]
+                    )
+                    # Find the smallest absolute ppm error across isotopes
+                    min_abs_ppm = min(abs(p) for p in delta_mass_list) if delta_mass_list else float('nan')
+                    row_data["delta_mass_ppm"] = f"{min_abs_ppm:.2f}"
+                except Exception as e:
+                    print(f"Warning: Could not calculate delta mass for prediction {i}: {e}")
+                    row_data["delta_mass_ppm"] = "N/A"
+
+            else:
+                row_data["prediction"] = ""
+                row_data["log_probability"] = "N/A"
+                row_data["delta_mass_ppm"] = "N/A"
+            output_data.append(row_data)
+
+        output_df = pl.DataFrame(output_data)
+
+        # Ensure specific columns are present and ordered
+        display_cols = ["scan_number", "precursor_mz", "precursor_charge", "prediction", "log_probability", "delta_mass_ppm"]
+        final_display_cols = []
+        for col in display_cols:
+            if col in output_df.columns:
+                final_display_cols.append(col)
+            else:
+                print(f"Warning: Expected display column '{col}' not found in results.")
+
+        # Add any remaining index columns that weren't in display_cols
+        for col in index_cols:
+            if col not in final_display_cols and col in output_df.columns:
+                final_display_cols.append(col)
+
+        output_df_display = output_df.select(final_display_cols)
+
+
+        # 7. Save full results to CSV
+        print(f"Saving results to {output_csv_path}...")
+        output_df.write_csv(output_csv_path)
+
+        # Return DataFrame for display and path for download
+        return output_df_display.to_pandas(), output_csv_path
+
+    except Exception as e:
+        print(f"An error occurred during prediction: {e}")
+        # Clean up the temporary output file if it exists
+        if os.path.exists(output_csv_path):
+            os.remove(output_csv_path)
+        # Re-raise as Gradio error
+        raise gr.Error(f"Prediction failed: {e}")
+
+# --- Gradio Interface ---
+css = """
+.gradio-container { font-family: sans-serif; }
+.gr-button { color: white; border-color: black; background: black; }
+footer { display: none !important; }
+"""
+
+with gr.Blocks(css=css, theme=gr.themes.Default(primary_hue="blue", secondary_hue="blue")) as demo:
+    gr.Markdown(
+        """
+        # 🚀 InstaNovo _De Novo_ Peptide Sequencing
+        Upload your mass spectrometry data file (.mgf, .mzml, or .mzxml) and get peptide sequence predictions using InstaNovo.
+        Choose between fast Greedy Search or more accurate but slower Knapsack Beam Search.
+        """
+    )
+    with gr.Row():
+        with gr.Column(scale=1):
+            input_file = gr.File(
+                label="Upload Mass Spectrometry File (.mgf, .mzml, .mzxml)",
+                file_types=[".mgf", ".mzml", ".mzxml"]
+            )
+            decoding_method = gr.Radio(
+                ["Greedy Search (Fast)", "Knapsack Beam Search (More accurate, but slower)"],
+                label="Decoding Method",
+                value="Greedy Search (Fast)" # Default to fast method
+            )
+            submit_btn = gr.Button("Predict Sequences", variant="primary")
+        with gr.Column(scale=2):
+            output_df = gr.DataFrame(label="Prediction Results", wrap=True)
+            output_file = gr.File(label="Download Full Results (CSV)")
+
+    submit_btn.click(
+        predict_peptides,
+        inputs=[input_file, decoding_method],
+        outputs=[output_df, output_file]
+    )
+
+    gr.Examples(
+        [["./sample_spectrum.mgf", "Knapsack Beam Search (More accurate, but slower)"]], # File name and label must match the committed sample and the Radio choices
+        inputs=[input_file, decoding_method],
+        outputs=[output_df, output_file],
+        fn=predict_peptides,
+        cache_examples=False, # Re-run examples if needed
+        label="Example Usage"
+    )
+
+    gr.Markdown(
+        """
+        **Notes:**
+        * Predictions are based on the [InstaNovo](https://github.com/instadeepai/InstaNovo) model ({MODEL_ID}).
+        * Knapsack Beam Search uses pre-calculated mass constraints and yields better results but takes longer.
+        * 'delta_mass_ppm' shows the lowest absolute precursor mass error (in ppm) across potential isotopes (0-1 neutron).
+        * Ensure your input file format is correctly specified. Large files may take time to process.
+        """.format(MODEL_ID=MODEL_ID)
+    )
+
+# --- Launch the App ---
+if __name__ == "__main__":
+    # Set share=True for temporary public link if running locally
+    # Set server_name="0.0.0.0" to allow access from network if needed
+    # demo.launch(server_name="0.0.0.0", server_port=7860)
+    # For Hugging Face Spaces, just demo.launch() is usually sufficient
+    demo.launch(share=True) # For local testing with public URL
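Review note on `app.py`: the `delta_mass_ppm` column is described as the lowest absolute precursor mass error across isotope offsets 0 and 1. The sketch below shows one way such a value can be computed by hand, and it is not the code behind `Metrics.matches_precursor`, whose exact formula may differ. The helpers `peptide_mass` and `min_abs_ppm` are hypothetical. The constants are standard monoisotopic masses, and the isotope spacing of 1.00335 Da is the usual carbon-13 approximation. The inputs come from `sample_spectrum.mgf` in this commit (`SEQ=IAHYNKR`, `PEPMASS=451.25348`, `CHARGE=2+`).

```python
# Standard monoisotopic residue masses (Da), limited to the residues in IAHYNKR.
RESIDUE_MASS = {
    "I": 113.08406, "A": 71.03711, "H": 137.05891, "Y": 163.06333,
    "N": 114.04293, "K": 128.09496, "R": 156.10111,
}
WATER = 18.01056          # added once per peptide (N- and C-terminus)
PROTON = 1.007276         # mass of the charge carrier
ISOTOPE_SPACING = 1.00335 # approximate 13C - 12C mass difference


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def min_abs_ppm(sequence, precursor_mz, charge, isotope_range=(0, 1)):
    """Smallest absolute ppm error over the allowed isotope offsets."""
    theoretical = peptide_mass(sequence)
    observed = (precursor_mz - PROTON) * charge  # neutral observed mass
    errors = [
        (observed - k * ISOTOPE_SPACING - theoretical) / theoretical * 1e6
        for k in range(isotope_range[0], isotope_range[1] + 1)
    ]
    return min(abs(e) for e in errors)


ppm = min_abs_ppm("IAHYNKR", precursor_mz=451.25348, charge=2)
print(f"{ppm:.2f} ppm")
```

For the sample spectrum the error comes out well below the 20 ppm value used for `filter_precursor_ppm` in the minimal config.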
pyproject.toml
ADDED
@@ -0,0 +1,13 @@
+[project]
+name = "instanovo-gradio"
+version = "0.1.0"
+description = "Add your description here"
+readme = "README.md"
+requires-python = ">=3.12"
+dependencies = [
+    "gradio>=5.23.1",
+    "instanovo",
+]
+
+[tool.uv.sources]
+instanovo = { path = "../dtu-denovo-sequencing/dist/instanovo-1.1.0-py3-none-any.whl" }
requirements.txt
ADDED
The diff for this file is too large to render.
sample_spectrum.mgf
ADDED
@@ -0,0 +1,33 @@
+BEGIN IONS
+TITLE=0
+PEPMASS=451.25348
+CHARGE=2+
+SCANS=F1:2478
+RTINSECONDS=824.574
+SEQ=IAHYNKR
+63.994834899902344 0.0611930787563324
+70.06543731689453 0.06860413402318954
+84.081298828125 0.22455614805221558
+85.08439636230469 0.06763620674610138
+86.09666442871094 0.22344912588596344
+110.07109069824219 0.3034861385822296
+129.1020050048828 0.0932231917977333
+138.06597900390625 0.07667151838541031
+157.13291931152344 0.14716865122318268
+175.1185302734375 0.19198034703731537
+185.1283721923828 0.09717456996440887
+209.10263061523438 0.13139843940734863
+273.1337890625 0.09324286878108978
+301.1282958984375 0.08515828102827072
+303.21221923828125 0.07235292345285416
+304.17529296875 0.07120858132839203
+322.1859130859375 0.15834060311317444
+350.6787414550781 0.07397215068340302
+417.2552185058594 0.14982180297374725
+580.3185424804688 0.31572264432907104
+630.36572265625 0.06255878508090973
+717.376708984375 0.5990896821022034
+753.3748779296875 0.09976936876773834
+788.4207763671875 0.35858696699142456
+866.4544677734375 0.12016354501247406
+END IONS
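Review note on `sample_spectrum.mgf`: the file follows the usual MGF layout, with `KEY=value` header lines followed by `m/z intensity` peak pairs between `BEGIN IONS` and `END IONS`. The app itself loads files through `SpectrumDataFrame.load`. The sketch below is only a minimal stand-alone reader for checking a sample file by hand; the function `parse_mgf` and its output layout are hypothetical.

```python
def parse_mgf(text):
    """Parse MGF text into a list of dicts with header params and peak lists."""
    spectra, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            current = {"params": {}, "mz": [], "intensity": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                # Header line such as PEPMASS=451.25348
                key, value = line.split("=", 1)
                current["params"][key] = value
            else:
                # Peak line: m/z followed by intensity
                mz, intensity = line.split()[:2]
                current["mz"].append(float(mz))
                current["intensity"].append(float(intensity))
    return spectra


# Shortened copy of the committed sample: same header values, two of its peaks.
SAMPLE = """BEGIN IONS
TITLE=0
PEPMASS=451.25348
CHARGE=2+
SCANS=F1:2478
84.081298828125 0.22455614805221558
717.376708984375 0.5990896821022034
END IONS
"""

spectra = parse_mgf(SAMPLE)
first = spectra[0]
charge = int(first["params"]["CHARGE"].rstrip("+"))
print(first["params"]["PEPMASS"], charge, len(first["mz"]))
# prints 451.25348 2 2
```

Header values are kept as strings so that fields like `CHARGE=2+` or `SCANS=F1:2478` can be interpreted by the caller.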
uv.lock
ADDED
The diff for this file is too large to render.