BioGeek committed on
Commit
a9ee52d
·
0 Parent(s):

feat: first iteration

Files changed (8)
  1. .gitignore +10 -0
  2. .gradio/certificate.pem +31 -0
  3. README.md +35 -0
  4. app.py +465 -0
  5. pyproject.toml +13 -0
  6. requirements.txt +0 -0
  7. sample_spectrum.mgf +33 -0
  8. uv.lock +0 -0
.gitignore ADDED
@@ -0,0 +1,10 @@
+ # Python-generated files
+ __pycache__/
+ *.py[oc]
+ build/
+ dist/
+ wheels/
+ *.egg-info
+
+ # Virtual environments
+ .venv
.gradio/certificate.pem ADDED
@@ -0,0 +1,31 @@
+ -----BEGIN CERTIFICATE-----
+ MIIFazCCA1OgAwIBAgIRAIIQz7DSQONZRGPgu2OCiwAwDQYJKoZIhvcNAQELBQAw
+ TzELMAkGA1UEBhMCVVMxKTAnBgNVBAoTIEludGVybmV0IFNlY3VyaXR5IFJlc2Vh
+ cmNoIEdyb3VwMRUwEwYDVQQDEwxJU1JHIFJvb3QgWDEwHhcNMTUwNjA0MTEwNDM4
+ WhcNMzUwNjA0MTEwNDM4WjBPMQswCQYDVQQGEwJVUzEpMCcGA1UEChMgSW50ZXJu
+ ZXQgU2VjdXJpdHkgUmVzZWFyY2ggR3JvdXAxFTATBgNVBAMTDElTUkcgUm9vdCBY
+ MTCCAiIwDQYJKoZIhvcNAQEBBQADggIPADCCAgoCggIBAK3oJHP0FDfzm54rVygc
+ h77ct984kIxuPOZXoHj3dcKi/vVqbvYATyjb3miGbESTtrFj/RQSa78f0uoxmyF+
+ 0TM8ukj13Xnfs7j/EvEhmkvBioZxaUpmZmyPfjxwv60pIgbz5MDmgK7iS4+3mX6U
+ A5/TR5d8mUgjU+g4rk8Kb4Mu0UlXjIB0ttov0DiNewNwIRt18jA8+o+u3dpjq+sW
+ T8KOEUt+zwvo/7V3LvSye0rgTBIlDHCNAymg4VMk7BPZ7hm/ELNKjD+Jo2FR3qyH
+ B5T0Y3HsLuJvW5iB4YlcNHlsdu87kGJ55tukmi8mxdAQ4Q7e2RCOFvu396j3x+UC
+ B5iPNgiV5+I3lg02dZ77DnKxHZu8A/lJBdiB3QW0KtZB6awBdpUKD9jf1b0SHzUv
+ KBds0pjBqAlkd25HN7rOrFleaJ1/ctaJxQZBKT5ZPt0m9STJEadao0xAH0ahmbWn
+ OlFuhjuefXKnEgV4We0+UXgVCwOPjdAvBbI+e0ocS3MFEvzG6uBQE3xDk3SzynTn
+ jh8BCNAw1FtxNrQHusEwMFxIt4I7mKZ9YIqioymCzLq9gwQbooMDQaHWBfEbwrbw
+ qHyGO0aoSCqI3Haadr8faqU9GY/rOPNk3sgrDQoo//fb4hVC1CLQJ13hef4Y53CI
+ rU7m2Ys6xt0nUW7/vGT1M0NPAgMBAAGjQjBAMA4GA1UdDwEB/wQEAwIBBjAPBgNV
+ HRMBAf8EBTADAQH/MB0GA1UdDgQWBBR5tFnme7bl5AFzgAiIyBpY9umbbjANBgkq
+ hkiG9w0BAQsFAAOCAgEAVR9YqbyyqFDQDLHYGmkgJykIrGF1XIpu+ILlaS/V9lZL
+ ubhzEFnTIZd+50xx+7LSYK05qAvqFyFWhfFQDlnrzuBZ6brJFe+GnY+EgPbk6ZGQ
+ 3BebYhtF8GaV0nxvwuo77x/Py9auJ/GpsMiu/X1+mvoiBOv/2X/qkSsisRcOj/KK
+ NFtY2PwByVS5uCbMiogziUwthDyC3+6WVwW6LLv3xLfHTjuCvjHIInNzktHCgKQ5
+ ORAzI4JMPJ+GslWYHb4phowim57iaztXOoJwTdwJx4nLCgdNbOhdjsnvzqvHu7Ur
+ TkXWStAmzOVyyghqpZXjFaH3pO3JLF+l+/+sKAIuvtd7u+Nxe5AW0wdeRlN8NwdC
+ jNPElpzVmbUq4JUagEiuTDkHzsxHpFKVK7q4+63SM1N95R1NbdWhscdCb+ZAJzVc
+ oyi3B43njTOQ5yOf+1CceWxG1bQVs5ZufpsMljq4Ui0/1lvh+wjChP4kqKOJ2qxq
+ 4RgqsahDYVvTH9w7jXbyLeiNdd8XM2w9U/t7y0Ff/9yi0GE44Za4rF2LN9d11TPA
+ mRGunUHBcnWEvgJBQl9nJEiU0Zsnvgc/ubhPgXRR4Xq37Z0j4r7g1SgEEzwxA57d
+ emyPxgcYxn/eR44/KJ4EBs+lVDR3veyJm+kXQ99b21/+jh5Xos1AnX5iItreGCc=
+ -----END CERTIFICATE-----
README.md ADDED
@@ -0,0 +1,35 @@
+ ---
+ title: InstaNovo De Novo Sequencing
+ emoji: 🚀🧪
+ colorFrom: blue
+ colorTo: green
+ sdk: gradio
+ sdk_version: 4.16.0 # Check your gradio version
+ app_file: app.py
+ pinned: false
+ license: apache-2.0
+ ---
+
+ # InstaNovo _De Novo_ Peptide Sequencing Demo
+
+ This Space provides a web interface for the [InstaNovo](https://github.com/instadeepai/InstaNovo) model for _de novo_ peptide sequencing from mass spectrometry data.
+
+ **Features:**
+
+ * Upload MS/MS data in common formats (`.mgf`, `.mzml`, `.mzxml`).
+ * Choose between fast Greedy Search or more accurate but slower Knapsack Beam Search.
+ * View predictions directly in the interface.
+ * Download full results as a CSV file.
+
+ **How to Use:**
+
+ 1. Upload your mass spectrometry data file.
+ 2. Select the desired decoding method.
+ 3. Click "Predict Sequences".
+ 4. View the results table and download the CSV if needed.
+
+ **Model:**
+
+ This demo uses the `instanovo-v1.1.0` pretrained model checkpoint.
+
+ **Note:** Processing large files can take time, depending on the file size and the chosen decoding method. Knapsack generation (if needed on the first run) can also add to the initial startup time.
app.py ADDED
@@ -0,0 +1,465 @@
+ import gradio as gr
+ import torch
+ import os
+ import tempfile
+ import time
+ import polars as pl
+ import numpy as np
+ from pathlib import Path
+ from omegaconf import OmegaConf, DictConfig
+
+ # --- InstaNovo Imports ---
+ # It's good practice to handle potential import issues
+ try:
+     from instanovo.transformer.model import InstaNovo
+     from instanovo.utils import SpectrumDataFrame, ResidueSet, Metrics
+     from instanovo.transformer.dataset import SpectrumDataset, collate_batch
+     from instanovo.inference import (
+         GreedyDecoder,
+         KnapsackBeamSearchDecoder,
+         Knapsack,
+         ScoredSequence,
+         Decoder,
+     )
+     from instanovo.constants import MASS_SCALE, MAX_MASS
+     from torch.utils.data import DataLoader
+ except ImportError as e:
+     print(f"Error importing InstaNovo components: {e}")
+     print("Please ensure InstaNovo is installed correctly.")
+     # Re-raise: the annotations and model loading below need these names
+     raise
+
+ # --- Configuration ---
+ MODEL_ID = "instanovo-v1.1.0" # Use the desired pretrained model ID
+ KNAPSACK_DIR = Path("./knapsack_cache")
+ DEFAULT_CONFIG_PATH = Path("./configs/inference/default.yaml") # Assuming instanovo installs configs locally relative to execution
+
+ # Determine device
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+ FP16 = DEVICE == "cuda" # Enable FP16 only on CUDA
+
+ # --- Global Variables (Load Model and Knapsack Once) ---
+ MODEL: InstaNovo | None = None
+ KNAPSACK: Knapsack | None = None
+ MODEL_CONFIG: DictConfig | None = None
+ RESIDUE_SET: ResidueSet | None = None
+
+ def load_model_and_knapsack():
+     """Loads the InstaNovo model and generates/loads the knapsack."""
+     global MODEL, KNAPSACK, MODEL_CONFIG, RESIDUE_SET
+     if MODEL is not None:
+         print("Model already loaded.")
+         return
+
+     print(f"Loading InstaNovo model: {MODEL_ID} to {DEVICE}...")
+     try:
+         MODEL, MODEL_CONFIG = InstaNovo.from_pretrained(MODEL_ID)
+         MODEL.to(DEVICE)
+         MODEL.eval()
+         RESIDUE_SET = MODEL.residue_set
+         print("Model loaded successfully.")
+     except Exception as e:
+         print(f"Error loading model: {e}")
+         raise gr.Error(f"Failed to load InstaNovo model: {MODEL_ID}. Error: {e}")
+
+     # --- Knapsack Handling ---
+     KNAPSACK_DIR.mkdir(parents=True, exist_ok=True)
+     knapsack_exists = (
+         (KNAPSACK_DIR / "parameters.pkl").exists() and
+         (KNAPSACK_DIR / "masses.npy").exists() and
+         (KNAPSACK_DIR / "chart.npy").exists()
+     )
+
+     if knapsack_exists:
+         print(f"Loading pre-generated knapsack from {KNAPSACK_DIR}...")
+         try:
+             KNAPSACK = Knapsack.from_file(str(KNAPSACK_DIR))
+             print("Knapsack loaded successfully.")
+         except Exception as e:
+             print(f"Error loading knapsack: {e}. Will attempt to regenerate.")
+             KNAPSACK = None # Force regeneration
+             knapsack_exists = False # Ensure generation happens
+
+     if not knapsack_exists:
+         print("Knapsack not found or failed to load. Generating knapsack...")
+         if RESIDUE_SET is None:
+             raise gr.Error("Cannot generate knapsack because ResidueSet failed to load.")
+         try:
+             # Prepare residue masses for knapsack generation (handle negative/zero masses)
+             residue_masses_knapsack = dict(RESIDUE_SET.residue_masses.copy())
+             negative_residues = [k for k, v in residue_masses_knapsack.items() if v <= 0]
+             if negative_residues:
+                 print(f"Warning: Non-positive masses found in residues: {negative_residues}. "
+                       "Excluding from knapsack generation.")
+                 for res in negative_residues:
+                     del residue_masses_knapsack[res]
+             # Remove special tokens explicitly if they somehow got mass
+             for special_token in RESIDUE_SET.special_tokens:
+                 if special_token in residue_masses_knapsack:
+                     del residue_masses_knapsack[special_token]
+
+             # Ensure residue indices used match those without special/negative masses
+             valid_residue_indices = {
+                 res: idx for res, idx in RESIDUE_SET.residue_to_index.items()
+                 if res in residue_masses_knapsack
+             }
+
+
+             KNAPSACK = Knapsack.construct_knapsack(
+                 residue_masses=residue_masses_knapsack,
+                 residue_indices=valid_residue_indices, # Use only valid indices
+                 max_mass=MAX_MASS,
+                 mass_scale=MASS_SCALE,
+             )
+             print(f"Knapsack generated. Saving to {KNAPSACK_DIR}...")
+             KNAPSACK.save(str(KNAPSACK_DIR)) # Save for future runs
+             print("Knapsack saved.")
+         except Exception as e:
+             print(f"Error generating or saving knapsack: {e}")
+             gr.Warning("Failed to generate Knapsack. Knapsack Beam Search will not be available.")
+             KNAPSACK = None # Ensure it's None if generation failed
+
+ # Load the model and knapsack when the script starts
+ load_model_and_knapsack()
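Review note: the knapsack built above is, conceptually, a lookup of which total masses can be formed from the allowed residues. The sketch below is a toy illustration of that idea only, not InstaNovo's `Knapsack` implementation; `build_reachable` and the two-residue mass table are hypothetical, and the resolution (`mass_scale=100`) is chosen just to keep the example small.

```python
def build_reachable(
    residue_masses: dict[str, float], max_mass: float, mass_scale: int
) -> list[bool]:
    """Return a table where table[m] is True if integer mass m is a sum of residues."""
    limit = round(max_mass * mass_scale)
    # Scale to integers so masses can index a table; drop non-positive masses,
    # mirroring the filtering step in load_model_and_knapsack().
    int_masses = sorted(
        {round(m * mass_scale) for m in residue_masses.values() if m > 0}
    )
    reachable = [False] * (limit + 1)
    reachable[0] = True  # the empty sequence has mass zero
    for total in range(1, limit + 1):
        # A mass is reachable if removing one residue leaves a reachable mass.
        reachable[total] = any(
            total >= m and reachable[total - m] for m in int_masses
        )
    return reachable


# Toy residue set (assumption): glycine, alanine, and a zero-mass padding token.
toy_masses = {"G": 57.02146, "A": 71.03711, "[PAD]": 0.0}
table = build_reachable(toy_masses, max_mass=200.0, mass_scale=100)
print(table[5702 + 7104])  # is the mass of "GA" reachable?
```

A beam search can consult a table like this to discard partial sequences whose remaining mass cannot be completed by any combination of residues, which is why generating it once and caching it on disk is worthwhile.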
+
+ def create_inference_config(
+     input_path: str,
+     output_path: str,
+     decoding_method: str,
+ ) -> DictConfig:
+     """Creates the OmegaConf DictConfig needed for prediction."""
+     # Load default config if available, otherwise create from scratch
+     if DEFAULT_CONFIG_PATH.exists():
+         base_cfg = OmegaConf.load(DEFAULT_CONFIG_PATH)
+     else:
+         print(f"Warning: Default config not found at {DEFAULT_CONFIG_PATH}. Using minimal config.")
+         # Create a minimal config if default is missing
+         base_cfg = OmegaConf.create({
+             "data_path": None,
+             "instanovo_model": MODEL_ID,
+             "output_path": None,
+             "knapsack_path": str(KNAPSACK_DIR),
+             "denovo": True,
+             "refine": False, # Not doing refinement here
+             "num_beams": 1,
+             "max_length": 40,
+             "max_charge": 10,
+             "isotope_error_range": [0, 1],
+             "subset": 1.0,
+             "use_knapsack": False,
+             "save_beams": False,
+             "batch_size": 64, # Adjust as needed
+             "device": DEVICE,
+             "fp16": FP16,
+             "log_interval": 500, # Less relevant for Gradio app
+             "use_basic_logging": True,
+             "filter_precursor_ppm": 20,
+             "filter_confidence": 1e-4,
+             "filter_fdr_threshold": 0.05,
+             "residue_remapping": { # Add default mappings
+                 "M(ox)": "M[UNIMOD:35]", "M(+15.99)": "M[UNIMOD:35]",
+                 "S(p)": "S[UNIMOD:21]", "T(p)": "T[UNIMOD:21]", "Y(p)": "Y[UNIMOD:21]",
+                 "S(+79.97)": "S[UNIMOD:21]", "T(+79.97)": "T[UNIMOD:21]", "Y(+79.97)": "Y[UNIMOD:21]",
+                 "Q(+0.98)": "Q[UNIMOD:7]", "N(+0.98)": "N[UNIMOD:7]",
+                 "Q(+.98)": "Q[UNIMOD:7]", "N(+.98)": "N[UNIMOD:7]",
+                 "C(+57.02)": "C[UNIMOD:4]",
+                 "(+42.01)": "[UNIMOD:1]", "(+43.01)": "[UNIMOD:5]", "(-17.03)": "[UNIMOD:385]",
+             },
+             "column_map": { # Add default mappings
+                 "Modified sequence": "modified_sequence", "MS/MS m/z": "precursor_mz",
+                 "Mass": "precursor_mass", "Charge": "precursor_charge",
+                 "Mass values": "mz_array", "Mass spectrum": "mz_array",
+                 "Intensity": "intensity_array", "Raw intensity spectrum": "intensity_array",
+                 "Scan number": "scan_number"
+             },
+             "index_columns": [
+                 "scan_number", "precursor_mz", "precursor_charge",
+             ],
+             # Add other defaults if needed based on errors
+         })
+
+     # Override specific parameters
+     cfg_overrides = {
+         "data_path": input_path,
+         "output_path": output_path,
+         "device": DEVICE,
+         "fp16": FP16,
+         "denovo": True,
+         "refine": False,
+     }
+
+     if "Greedy" in decoding_method:
+         cfg_overrides["num_beams"] = 1
+         cfg_overrides["use_knapsack"] = False
+     elif "Knapsack" in decoding_method:
+         if KNAPSACK is None:
+             raise gr.Error("Knapsack is not available. Cannot use Knapsack Beam Search.")
+         cfg_overrides["num_beams"] = 5
+         cfg_overrides["use_knapsack"] = True
+         cfg_overrides["knapsack_path"] = str(KNAPSACK_DIR)
+     else:
+         raise ValueError(f"Unknown decoding method: {decoding_method}")
+
+     # Merge base config with overrides
+     final_cfg = OmegaConf.merge(base_cfg, cfg_overrides)
+     return final_cfg
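Review note: `create_inference_config` picks the decoder by substring-matching the radio label, so any label passed in (including the one used in `gr.Examples`) has to contain "Greedy" or "Knapsack". The sketch below isolates that mapping so it can be checked without Gradio or InstaNovo. The function name `decoding_overrides` and the `knapsack_available` flag are hypothetical, and plain dicts stand in for the OmegaConf objects.

```python
GREEDY_LABEL = "Greedy Search (Fast)"
KNAPSACK_LABEL = "Knapsack Beam Search (More accurate, but slower)"


def decoding_overrides(decoding_method: str, knapsack_available: bool) -> dict:
    """Map a UI label to the config keys that differ between decoders."""
    if "Greedy" in decoding_method:
        return {"num_beams": 1, "use_knapsack": False}
    if "Knapsack" in decoding_method:
        if not knapsack_available:
            raise RuntimeError("Knapsack is not available.")
        return {"num_beams": 5, "use_knapsack": True}
    raise ValueError(f"Unknown decoding method: {decoding_method}")


base = {"num_beams": 1, "use_knapsack": False, "max_length": 40}
# Later keys win, which is the behaviour the merge step relies on.
merged = {**base, **decoding_overrides(KNAPSACK_LABEL, knapsack_available=True)}
print(merged)
```

Keeping the labels as named constants shared by the UI and the mapping would avoid the two drifting apart.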
+
+
+ def predict_peptides(input_file, decoding_method):
+     """
+     Main function to load data, run prediction, and return results.
+     """
+     if MODEL is None or RESIDUE_SET is None or MODEL_CONFIG is None:
+         load_model_and_knapsack() # Attempt to reload if None (e.g., after space restart)
+         if MODEL is None:
+             raise gr.Error("InstaNovo model is not loaded. Cannot perform prediction.")
+
+     if input_file is None:
+         raise gr.Error("Please upload a mass spectrometry file.")
+
+     input_path = input_file.name # Gradio provides the path in .name
+     print(f"Processing file: {input_path}")
+     print(f"Using decoding method: {decoding_method}")
+
+     # Create a temporary file for the output CSV
+     with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as temp_out:
+         output_csv_path = temp_out.name
+
+     try:
+         # 1. Create Config
+         config = create_inference_config(input_path, output_csv_path, decoding_method)
+         print("Inference Config:\n", OmegaConf.to_yaml(config))
+
+         # 2. Load Data using SpectrumDataFrame
+         print("Loading spectrum data...")
+         try:
+             sdf = SpectrumDataFrame.load(
+                 config.data_path,
+                 lazy=False, # Load eagerly for Gradio simplicity
+                 is_annotated=False, # De novo mode
+                 column_mapping=config.get("column_map", None),
+                 shuffle=False,
+                 verbose=True # Print loading logs
+             )
+             # Apply charge filter like in CLI
+             original_size = len(sdf)
+             max_charge = config.get("max_charge", 10)
+             sdf.filter_rows(
+                 lambda row: (row["precursor_charge"] <= max_charge) and (row["precursor_charge"] > 0)
+             )
+             if len(sdf) < original_size:
+                 print(f"Warning: Filtered {original_size - len(sdf)} spectra with charge > {max_charge} or <= 0.")
+
+             if len(sdf) == 0:
+                 raise gr.Error("No valid spectra found in the uploaded file after filtering.")
+             print(f"Data loaded: {len(sdf)} spectra.")
+         except Exception as e:
+             print(f"Error loading data: {e}")
+             raise gr.Error(f"Failed to load or process the spectrum file. Error: {e}")
+
+         # 3. Prepare Dataset and DataLoader
+         ds = SpectrumDataset(
+             sdf,
+             RESIDUE_SET,
+             MODEL_CONFIG.get("n_peaks", 200),
+             return_str=True, # Needed for greedy/beam search targets later (though not used here)
+             annotated=False,
+             pad_spectrum_max_length=config.get("compile_model", False) or config.get("use_flash_attention", False),
+             bin_spectra=config.get("conv_peak_encoder", False),
+         )
+         dl = DataLoader(
+             ds,
+             batch_size=config.batch_size,
+             num_workers=0, # Required by SpectrumDataFrame
+             shuffle=False, # Required by SpectrumDataFrame
+             collate_fn=collate_batch,
+         )
+
+         # 4. Select Decoder
+         print("Initializing decoder...")
+         decoder: Decoder
+         if config.use_knapsack:
+             if KNAPSACK is None:
+                 # This check should ideally be earlier, but double-check
+                 raise gr.Error("Knapsack is required for Knapsack Beam Search but is not available.")
+             # KnapsackBeamSearchDecoder doesn't directly load from path in this version?
+             # We load Knapsack globally, so just pass it.
+             # If it needed path: decoder = KnapsackBeamSearchDecoder.from_file(model=MODEL, path=config.knapsack_path)
+             decoder = KnapsackBeamSearchDecoder(model=MODEL, knapsack=KNAPSACK)
+         elif config.num_beams > 1:
+             # BeamSearchDecoder is available but not explicitly requested, use Greedy for num_beams=1
+             print(f"Warning: num_beams={config.num_beams} > 1 but only Greedy and Knapsack Beam Search are implemented in this app. Defaulting to Greedy.")
+             decoder = GreedyDecoder(model=MODEL, mass_scale=MASS_SCALE)
+         else:
+             decoder = GreedyDecoder(
+                 model=MODEL,
+                 mass_scale=MASS_SCALE,
+                 # Add suppression options if needed from config
+                 suppressed_residues=config.get("suppressed_residues", None),
+                 disable_terminal_residues_anywhere=config.get("disable_terminal_residues_anywhere", True),
+             )
+         print(f"Using decoder: {type(decoder).__name__}")
+
+         # 5. Run Prediction Loop (Adapted from instanovo/transformer/predict.py)
+         print("Starting prediction...")
+         start_time = time.time()
+         results_list: list[ScoredSequence | list] = [] # Store ScoredSequence or empty list
+
+         for i, batch in enumerate(dl):
+             spectra, precursors, spectra_mask, _, _ = batch # Ignore peptides/masks for de novo
+             spectra = spectra.to(DEVICE)
+             precursors = precursors.to(DEVICE)
+             spectra_mask = spectra_mask.to(DEVICE)
+
+             with torch.no_grad(), torch.amp.autocast(DEVICE, dtype=torch.float16, enabled=FP16):
+                 # Beam search decoder might return list[list[ScoredSequence]] if return_beam=True
+                 # Greedy decoder returns list[ScoredSequence]
+                 # KnapsackBeamSearchDecoder returns list[ScoredSequence] or list[list[ScoredSequence]]
+                 batch_predictions = decoder.decode(
+                     spectra=spectra,
+                     precursors=precursors,
+                     beam_size=config.num_beams,
+                     max_length=config.max_length,
+                     # Knapsack/Beam Search specific params if needed
+                     mass_tolerance=config.get("filter_precursor_ppm", 20) * 1e-6, # Convert ppm to relative
+                     max_isotope=config.isotope_error_range[1] if config.isotope_error_range else 1,
+                     return_beam=False # Only get the top prediction for simplicity
+                 )
+             results_list.extend(batch_predictions) # Should be list[ScoredSequence] or list[list]
+             print(f"Processed batch {i+1}/{len(dl)}")
+
+         end_time = time.time()
+         print(f"Prediction finished in {end_time - start_time:.2f} seconds.")
+
+         # 6. Format Results
+         print("Formatting results...")
+         output_data = []
+         # Use sdf index columns + prediction results
+         index_cols = [col for col in config.index_columns if col in sdf.df.columns]
+         base_df_pd = sdf.df.select(index_cols).to_pandas() # Get base info
+
+         metrics_calc = Metrics(RESIDUE_SET, config.isotope_error_range)
+
+         for i, res in enumerate(results_list):
+             row_data = base_df_pd.iloc[i].to_dict() # Get corresponding input data
+             if isinstance(res, ScoredSequence) and res.sequence:
+                 sequence_str = "".join(res.sequence)
+                 row_data["prediction"] = sequence_str
+                 row_data["log_probability"] = f"{res.sequence_log_probability:.4f}"
+                 # Use metrics to calculate delta mass ppm for the top prediction
+                 try:
+                     _, delta_mass_list = metrics_calc.matches_precursor(
+                         res.sequence,
+                         row_data["precursor_mz"],
+                         row_data["precursor_charge"]
+                     )
+                     # Find the smallest absolute ppm error across isotopes
+                     min_abs_ppm = min(abs(p) for p in delta_mass_list) if delta_mass_list else float('nan')
+                     row_data["delta_mass_ppm"] = f"{min_abs_ppm:.2f}"
+                 except Exception as e:
+                     print(f"Warning: Could not calculate delta mass for prediction {i}: {e}")
+                     row_data["delta_mass_ppm"] = "N/A"
+
+             else:
+                 row_data["prediction"] = ""
+                 row_data["log_probability"] = "N/A"
+                 row_data["delta_mass_ppm"] = "N/A"
+             output_data.append(row_data)
+
+         output_df = pl.DataFrame(output_data)
+
+         # Ensure specific columns are present and ordered
+         display_cols = ["scan_number", "precursor_mz", "precursor_charge", "prediction", "log_probability", "delta_mass_ppm"]
+         final_display_cols = []
+         for col in display_cols:
+             if col in output_df.columns:
+                 final_display_cols.append(col)
+             else:
+                 print(f"Warning: Expected display column '{col}' not found in results.")
+
+         # Add any remaining index columns that weren't in display_cols
+         for col in index_cols:
+             if col not in final_display_cols and col in output_df.columns:
+                 final_display_cols.append(col)
+
+         output_df_display = output_df.select(final_display_cols)
+
+
+         # 7. Save full results to CSV
+         print(f"Saving results to {output_csv_path}...")
+         output_df.write_csv(output_csv_path)
+
+         # Return DataFrame for display and path for download
+         return output_df_display.to_pandas(), output_csv_path
+
+     except Exception as e:
+         print(f"An error occurred during prediction: {e}")
+         # Clean up the temporary output file if it exists
+         if os.path.exists(output_csv_path):
+             os.remove(output_csv_path)
+         # Re-raise as Gradio error
+         raise gr.Error(f"Prediction failed: {e}")
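Review note: the `delta_mass_ppm` column reports the smallest absolute precursor error across the allowed isotope offsets. The sketch below shows one common way to compute such a value by hand. It is an assumption about the general formula, not a copy of `Metrics.matches_precursor`; the helper name `min_abs_ppm` is hypothetical, and the residue table covers only the peptide annotated in `sample_spectrum.mgf`.

```python
PROTON_MASS = 1.007276  # Da
C13_SHIFT = 1.00335  # Da, mass difference between 13C and 12C
WATER_MASS = 18.010565  # Da

# Monoisotopic residue masses (Da) for the residues in the sample peptide only.
RESIDUE_MASSES = {
    "I": 113.08406, "A": 71.03711, "H": 137.05891, "Y": 163.06333,
    "N": 114.04293, "K": 128.09496, "R": 156.10111,
}


def min_abs_ppm(sequence, precursor_mz, charge, isotope_range=(0, 1)):
    """Smallest absolute precursor m/z error in ppm over the isotope offsets."""
    neutral = sum(RESIDUE_MASSES[r] for r in sequence) + WATER_MASS
    errors = []
    for isotope in range(isotope_range[0], isotope_range[1] + 1):
        # Theoretical m/z of the peptide carrying `isotope` extra neutrons.
        theoretical_mz = (neutral + isotope * C13_SHIFT) / charge + PROTON_MASS
        errors.append((precursor_mz - theoretical_mz) / theoretical_mz * 1e6)
    return min(abs(e) for e in errors)


# Values taken from sample_spectrum.mgf: SEQ=IAHYNKR, PEPMASS=451.25348, CHARGE=2+
print(f"{min_abs_ppm('IAHYNKR', 451.25348, 2):.2f} ppm")
```

For the sample spectrum the monoisotopic match is well within the 20 ppm `filter_precursor_ppm` setting, while the one-neutron offset is off by roughly a thousand ppm, which is why taking the minimum over isotopes matters.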
+
+ # --- Gradio Interface ---
+ css = """
+ .gradio-container { font-family: sans-serif; }
+ .gr-button { color: white; border-color: black; background: black; }
+ footer { display: none !important; }
+ """
+
+ with gr.Blocks(css=css, theme=gr.themes.Default(primary_hue="blue", secondary_hue="blue")) as demo:
+     gr.Markdown(
+         """
+         # 🚀 InstaNovo _De Novo_ Peptide Sequencing
+         Upload your mass spectrometry data file (.mgf, .mzml, or .mzxml) and get peptide sequence predictions using InstaNovo.
+         Choose between fast Greedy Search or more accurate but slower Knapsack Beam Search.
+         """
+     )
+     with gr.Row():
+         with gr.Column(scale=1):
+             input_file = gr.File(
+                 label="Upload Mass Spectrometry File (.mgf, .mzml, .mzxml)",
+                 file_types=[".mgf", ".mzml", ".mzxml"]
+             )
+             decoding_method = gr.Radio(
+                 ["Greedy Search (Fast)", "Knapsack Beam Search (More accurate, but slower)"],
+                 label="Decoding Method",
+                 value="Greedy Search (Fast)" # Default to fast method
+             )
+             submit_btn = gr.Button("Predict Sequences", variant="primary")
+         with gr.Column(scale=2):
+             output_df = gr.DataFrame(label="Prediction Results", wrap=True)
+             output_file = gr.File(label="Download Full Results (CSV)")
+
+     submit_btn.click(
+         predict_peptides,
+         inputs=[input_file, decoding_method],
+         outputs=[output_df, output_file]
+     )
+
+     gr.Examples(
+         [["./sample_spectrum.mgf", "Knapsack Beam Search (More accurate, but slower)"]], # Must match the committed sample file and a Radio choice
+         inputs=[input_file, decoding_method],
+         outputs=[output_df, output_file],
+         fn=predict_peptides,
+         cache_examples=False, # Re-run examples if needed
+         label="Example Usage"
+     )
+
+     gr.Markdown(
+         """
+         **Notes:**
+         * Predictions are based on the [InstaNovo](https://github.com/instadeepai/InstaNovo) model ({MODEL_ID}).
+         * Knapsack Beam Search uses pre-calculated mass constraints and yields better results but takes longer.
+         * 'delta_mass_ppm' shows the lowest absolute precursor mass error (in ppm) across potential isotopes (0-1 neutron).
+         * Ensure your input file format is correctly specified. Large files may take time to process.
+         """.format(MODEL_ID=MODEL_ID)
+     )
+
+ # --- Launch the App ---
+ if __name__ == "__main__":
+     # Set share=True for temporary public link if running locally
+     # Set server_name="0.0.0.0" to allow access from network if needed
+     # demo.launch(server_name="0.0.0.0", server_port=7860)
+     # For Hugging Face Spaces, just demo.launch() is usually sufficient
+     demo.launch(share=True) # For local testing with public URL
pyproject.toml ADDED
@@ -0,0 +1,13 @@
+ [project]
+ name = "instanovo-gradio"
+ version = "0.1.0"
+ description = "Add your description here"
+ readme = "README.md"
+ requires-python = ">=3.12"
+ dependencies = [
+     "gradio>=5.23.1",
+     "instanovo",
+ ]
+
+ [tool.uv.sources]
+ instanovo = { path = "../dtu-denovo-sequencing/dist/instanovo-1.1.0-py3-none-any.whl" }
requirements.txt ADDED
The diff for this file is too large to render. See raw diff
 
sample_spectrum.mgf ADDED
@@ -0,0 +1,33 @@
+ BEGIN IONS
+ TITLE=0
+ PEPMASS=451.25348
+ CHARGE=2+
+ SCANS=F1:2478
+ RTINSECONDS=824.574
+ SEQ=IAHYNKR
+ 63.994834899902344 0.0611930787563324
+ 70.06543731689453 0.06860413402318954
+ 84.081298828125 0.22455614805221558
+ 85.08439636230469 0.06763620674610138
+ 86.09666442871094 0.22344912588596344
+ 110.07109069824219 0.3034861385822296
+ 129.1020050048828 0.0932231917977333
+ 138.06597900390625 0.07667151838541031
+ 157.13291931152344 0.14716865122318268
+ 175.1185302734375 0.19198034703731537
+ 185.1283721923828 0.09717456996440887
+ 209.10263061523438 0.13139843940734863
+ 273.1337890625 0.09324286878108978
+ 301.1282958984375 0.08515828102827072
+ 303.21221923828125 0.07235292345285416
+ 304.17529296875 0.07120858132839203
+ 322.1859130859375 0.15834060311317444
+ 350.6787414550781 0.07397215068340302
+ 417.2552185058594 0.14982180297374725
+ 580.3185424804688 0.31572264432907104
+ 630.36572265625 0.06255878508090973
+ 717.376708984375 0.5990896821022034
+ 753.3748779296875 0.09976936876773834
+ 788.4207763671875 0.35858696699142456
+ 866.4544677734375 0.12016354501247406
+ END IONS
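Review note: for readers unfamiliar with MGF, the sample above is one `BEGIN IONS` / `END IONS` block containing `KEY=VALUE` headers followed by `m/z intensity` pairs. The sketch below is a minimal, hypothetical reader for that layout (`parse_mgf` is not part of InstaNovo, which loads files through `SpectrumDataFrame.load`); it is meant only to make the file structure concrete.

```python
def parse_mgf(text: str) -> list[dict]:
    """Parse MGF text into a list of spectra (headers plus peak lists)."""
    spectra, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            current = {"params": {}, "mz": [], "intensity": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line:
                # Header line such as PEPMASS=451.25348
                key, value = line.split("=", 1)
                current["params"][key] = value
            else:
                # Peak line: m/z followed by intensity
                mz, intensity = line.split()[:2]
                current["mz"].append(float(mz))
                current["intensity"].append(float(intensity))
    return spectra


# First lines of sample_spectrum.mgf, shortened to two peaks.
sample = """BEGIN IONS
TITLE=0
PEPMASS=451.25348
CHARGE=2+
63.994834899902344 0.0611930787563324
70.06543731689453 0.06860413402318954
END IONS
"""
spectrum = parse_mgf(sample)[0]
charge = int(spectrum["params"]["CHARGE"].rstrip("+"))
print(spectrum["params"]["PEPMASS"], charge, len(spectrum["mz"]))
```

Header values are kept as strings here because fields such as `CHARGE=2+` and `SCANS=F1:2478` are not plain numbers.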
uv.lock ADDED
The diff for this file is too large to render. See raw diff