GFF file from protein or CDS and genome assembly FASTA files
4 months ago
resug ▴ 30

Hi,

How can I accurately generate a GFF file from a protein (or CDS) FASTA file and a genome assembly FASTA file? I'm wondering if this is even possible. The genome has around 33,000 genes.

My protein fasta file:

>Lup000001.1  locus=Scaffold_1:18368:20288:- [translate_table: standard]
METEERNQRGLKGSEPELFLQWGNRKRLRCVRLKDPRISSRLNGGIRKKL
TVAPSGVTVLEKEGSHLHHQQQPNRFTRNSDGSVHRSAAVDNRKSTSPEK
EDRYYTTRGSSVVADESHSKLTGDREERALVWPKLYITLSSKEKEEDFLA
MKGCKLPHRPKKRAKIIQRSLLLVSPGAWLTDMCQERYEVREKKSNKKRP
RGLKAMGSMESDSE
>Lup000002.1  locus=Scaffold_1:58782:58961:+ [translate_table: standard]
MEALNMKVFLALMVAMLVMAATSVSAAEAPAPSPTSDATTLFIPTAFASL
IALAFGLLF

My CDS fasta file:

>Lup000001.1  locus=Scaffold_1:18368:20288:-
ATGGAAACAGAAGAGAGGAACCAGAGAGGGTTAAAAGGCTCAGAGCCAGA
GCTTTTCTTGCAGTGGGGAAACAGAAAGAGACTGAGATGTGTGAGGCTTA
AGGACCCTCGGATTTCATCAAGACTCAACGGTGGGATCAGAAAAAAGCTC
ACTGTTGCTCCTTCTGGAGTTACTGTTTTGGAGAAAGAAGGTTCTCATCT
TCACCATCAACAACAACCTAATCGTTTCACAAGGAATTCTGATGGTTCTG
TTCACCGGTCAGCCGCTGTAGATAATCGGAAATCAACTTCACCGGAGAAG
GAAGACCGGTACTACACCACAAGGGGATCGTCGGTGGTAGCGGATGAGAG
CCACAGCAAACTCACTGGTGACAGAGAAGAAAGAGCGCTTGTGTGGCCAA
AGCTTTACATCACCCTTTCAAGCAAGGAGAAAGAAGAAGATTTTCTTGCC
ATGAAAGGTTGCAAGCTTCCCCATAGACCCAAAAAGAGGGCCAAAATTAT
CCAAAGAAGCTTACTTTTGGTGAGTCCTGGAGCATGGTTAACTGATATGT
GCCAAGAGAGATATGAAGTTAGGGAGAAGAAAAGTAACAAGAAGAGGCCA
AGAGGATTGAAGGCAATGGGGAGTATGGAAAGTGATTCTGAATGA
>Lup000002.1  locus=Scaffold_1:58782:58961:+
ATGGAGGCATTGAACATGAAGGTTTTCTTGGCTTTGATGGTAGCCATGTT
GGTGATGGCAGCAACAAGTGTGTCAGCTGCTGAGGCACCAGCTCCAAGCC
CTACATCTGATGCTACCACTCTTTTCATTCCAACTGCTTTTGCTTCTCTC
ATTGCTCTTGCATTTGGGCTTCTCTTTTGA

My genome assembly fasta file:

>NLL-01
TACTGGTCCGAAAGGGCATGGGTTCGAATCCCATTCTTGACATTACATTTTATTTTCTAAATCAAAAACATTGCTATCCATGTTACATTGACTTGTTTGA
CAAAATTGTCAGTTGCTCTATTTCAAAATAATTTTCTAGTTAGACAATAAAAAATTGTTTGAAAATATTTTCGTTACAATATAAAATGATAAACTTTTAA
ATTTTAATAATTTCTTATGAAGATAATAAATTGTGGAACGTGCATGGAAAAGTGAAATGGATGGATGAGGATTTAATGTTTTATTAAATGCATGGAAGGA
GGGCGTTGAATTCATAATCTGACACAGTACTGTGAAAACTGAAAAGCCTATGCAAGTTAGTAGATTCGACGCATTTATGACAAATAAATTCCTTCAACTT
TACCAAGTACATGGAATAAAAAAAGTAATAAAAGTAAATAATGAATAATATAATATATAATATAAATGTTTATAGTAAAAAAATAGGGAATGGATAGAAT

Thank you,

Rom

4 months ago

The tool miniprot should work here. From its documentation:

Miniprot aligns a protein sequence against a genome with affine gap penalty, splicing and frameshift. It is primarily intended for annotating protein-coding genes in a new species using known genes from other species. Miniprot is similar to GeneWise and Exonerate in functionality but it can map proteins to whole genomes and is much faster at the residue alignment step.

Code:

Example for GFF

# general command line: index and align in one go
# -I sets the max intron size based on genome size, -u also reports unmapped proteins, -t16 uses 16 threads
./miniprot -Iut16 --gff genome.fna protein.faa > aln.gff
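
If the same assembly will be aligned against more than once, the index can be built once and reused. The following is a minimal sketch, assuming `miniprot` is on your `PATH`; the function name `align_proteins`, the `.mpi` index file name and the default thread count are illustrative choices, not part of miniprot itself.

```shell
# Hypothetical helper: build a miniprot index once, then align proteins to it.
# Usage: align_proteins genome.fna protein.faa aln.gff [threads]
align_proteins() {
    local genome=$1 proteins=$2 out=$3 threads=${4:-16}
    # Index file name is an assumption: genome.fna -> genome.mpi
    local index="${genome%.*}.mpi"

    # Build the index only if it does not exist yet (-d writes the index).
    [ -s "$index" ] || miniprot -t"$threads" -d "$index" "$genome"

    # Align against the prebuilt index and write GFF3, as in the one-step command.
    miniprot -Iu -t"$threads" --gff "$index" "$proteins" > "$out"
}
```

For example, `align_proteins genome.fna protein.faa aln.gff 16` should produce the same kind of output as the one-step command above, while skipping the indexing step on later runs.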

That's wonderful! I am wondering how accurate miniprot would be on a plant genome with many duplicated regions, where around 16% of genes have been duplicated into copies with similar but not identical sequences. Thanks for your help.
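
One way to check placements among near-identical copies: the FASTA headers in the question already carry a `locus=` field, so a gene-level GFF3 can be built from them and compared with where miniprot puts each protein. The sketch below makes several assumptions: that the field is `sequence:start:end:strand` with 1-based inclusive coordinates (the 180-bp CDS of Lup000002.1 fits its 58782 to 58961 span), and that the sequence names match those in the assembly, which the `NLL-01` header shown does not confirm. The function name `headers_to_gff` and the `fasta_header` source label are made up for illustration. The output gives gene spans only, not exon structure, so it does not replace a spliced alignment.

```shell
# Hypothetical helper: print one GFF3 "gene" line per FASTA record that has a
# locus=SEQ:START:END:STRAND field in its header.
# Usage: headers_to_gff protein.faa > genes_from_headers.gff
headers_to_gff() {
    awk 'BEGIN { OFS = "\t"; print "##gff-version 3" }
    /^>/ {
        id = substr($1, 2)                      # record ID without the ">"
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^locus=/) {
                # Strip "locus=" and split SEQ:START:END:STRAND
                split(substr($i, 7), loc, ":")
                print loc[1], "fasta_header", "gene", loc[2], loc[3], ".", loc[4], ".", "ID=" id
            }
        }
    }' "$1"
}
```

Proteins that miniprot places outside their recorded span are then candidates for having been assigned to a paralogous copy.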
