bernardo-de-almeida committed on
Commit de4c019 · verified · 1 Parent(s): bdfac0e

Update README.md

Files changed (1)
  1. README.md +46 -2
README.md CHANGED
@@ -4,11 +4,55 @@ tags:
  - pytorch_model_hub_mixin
  ---
 
- # segment-enformer
+ # SegmentEnformer
 
  SegmentEnformer is a segmentation model leveraging [Enformer](https://www.nature.com/articles/s41592-021-01252-x) to predict the location of several types of genomic
  elements in a sequence at single-nucleotide resolution. It was trained on 14 different classes, including gene (protein-coding genes, lncRNAs, 5’UTR, 3’UTR, exon, intron, splice acceptor and donor sites) and regulatory (polyA signal, tissue-invariant and
  tissue-specific promoters and enhancers, and CTCF-bound sites) elements.
 
 
- **Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)
+ **Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)
+
+ ### Model Sources
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [Nucleotide Transformer](https://github.com/instadeepai/nucleotide-transformer)
+ - **Paper:** [Segmenting the genome at single-nucleotide resolution with DNA foundation models](https://www.biorxiv.org/content/biorxiv/early/2024/03/15/2024.03.14.584712.full.pdf)
+
+ ### How to use
+
+ To Be Done
+
+
+
+ ## Training data
+
+ The **SegmentEnformer** model was trained on all human chromosomes except chromosomes 20 and 21, which are kept as the test set, and chromosome 22, which is used as the validation set.
+ During training, sequences are randomly sampled from the genome together with their annotations. The sequences in the validation and test sets, however, are kept fixed by
+ using a sliding window of length 196 kb (the original Enformer input length) over chromosomes 20 and 21. The validation set was used to monitor training and for early stopping.
+
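The chromosome split and the fixed evaluation windows described under "Training data" can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `chr`-prefixed chromosome names, the helper names `split_of` and `fixed_windows`, the non-overlapping stride, and the toy chromosome length are all assumptions. The window size uses Enformer's input length of 196,608 bp.

```python
from typing import Iterator, Tuple

# Enformer input length in base pairs (the "196 kb" mentioned in the README).
WINDOW = 196_608

# Held-out chromosomes as described in the README; the "chr" naming is an assumption.
TEST_CHROMS = {"chr20", "chr21"}
VALID_CHROMS = {"chr22"}


def split_of(chrom: str) -> str:
    """Return the data split a chromosome belongs to."""
    if chrom in TEST_CHROMS:
        return "test"
    if chrom in VALID_CHROMS:
        return "validation"
    return "train"


def fixed_windows(
    chrom_length: int, window: int = WINDOW, stride: int = WINDOW
) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) intervals that slide over a chromosome.

    The stride is an assumption (non-overlapping windows); the README only
    states that a sliding window of this length is used.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    # Only full-length windows are produced; a trailing remainder is dropped.
    for start in range(0, chrom_length - window + 1, stride):
        yield start, start + window


if __name__ == "__main__":
    # Hypothetical chromosome length, used only to show the windowing.
    for start, end in fixed_windows(1_000_000):
        print(split_of("chr20"), start, end)
```

Because the windows depend only on the chromosome length, window size and stride, the same evaluation sequences are produced on every run, which is what keeps the held-out sets fixed while training sequences are sampled at random.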
+ ## Training procedure
+
+ ### Preprocessing
+
+ The DNA sequences are tokenized using one-hot encoding, as in the Enformer model.
+
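As an illustration of the one-hot tokenization mentioned under "Preprocessing", the sketch below maps each nucleotide to a 4-channel indicator vector. The `A, C, G, T` channel order, the all-zero encoding for unknown bases such as `N`, and the function name `one_hot_encode` are assumptions made for the example; the README does not specify them.

```python
from typing import List

# Channel order is an assumption; the README only says "one-hot encoding".
ALPHABET = "ACGT"


def one_hot_encode(sequence: str) -> List[List[float]]:
    """Encode a DNA string as a list of 4-dimensional one-hot vectors.

    Bases outside ALPHABET (for example "N") are encoded as all zeros,
    which is an assumption made for this sketch.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = []
    for base in sequence.upper():
        row = [0.0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1.0
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
        print(base, row)
```

With this encoding a sequence of length L becomes an L x 4 matrix, so a full 196 kb input window has one row per nucleotide.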
+ ### Architecture
+
+ The model is composed of the Enformer backbone, from which we removed the heads and replaced them with a 1-dimensional U-Net segmentation head made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each of these
+ blocks is made of 2 convolutional layers with 1,024 and 2,048 kernels respectively.
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @article{de2024segmentnt,
+ title={SegmentNT: annotating the genome at single-nucleotide resolution with DNA foundation models},
+ author={de Almeida, Bernardo P and Dalla-Torre, Hugo and Richard, Guillaume and Blum, Christopher and Hexemer, Lorenz and Gelard, Maxence and Pandey, Priyanka and Laurent, Stefan and Laterre, Alexandre and Lang, Maren and others},
+ journal={bioRxiv},
+ pages={2024--03},
+ year={2024},
+ publisher={Cold Spring Harbor Laboratory}
+ }
+
+ ```