Question: Gffread_How to obtain fasta files of gene features (ORF, UTR) by transcripts based on gff?
0
gravatar for tianshenbio
6 months ago by
tianshenbio70
tianshenbio70 wrote:

I have a gff like this:

Bany_Scaf24 maker   gene    41357   46444   .   +   .   ID=Bany_09696;Name=Bany_09696;Alias=maker-Bany_Scaf24-exonerate_est2genome-gene-0.0;
Bany_Scaf24 maker   mRNA    41357   46444   .   +   .   ID=Bany_09696-RA;Parent=Bany_09696;Name=Bany_09696-RA;Alias=maker-Bany_Scaf24-exonerate_est2genome-gene-0.0-mRNA-1;_AED=0.00;_QI=277|1|1|1|0|0|2|35|76;_eAED=0.00;score=1829;
Bany_Scaf24 maker   exon    46189   46444   .   +   .   ID=Bany_09696-RA:1;Parent=Bany_09696-RA;
Bany_Scaf24 maker   exon    41357   41643   .   +   .   ID=Bany_09696-RA:2;Parent=Bany_09696-RA;
Bany_Scaf24 maker   five_prime_UTR  41357   41633   .   +   .   ID=Bany_09696-RA:five_prime_utr;Parent=Bany_09696-RA;
Bany_Scaf24 maker   CDS 41634   41643   .   +   0   ID=Bany_09696-RA:cds;Parent=Bany_09696-RA;
Bany_Scaf24 maker   CDS 46189   46409   .   +   2   ID=Bany_09696-RA:cds;Parent=Bany_09696-RA;
Bany_Scaf24 maker   three_prime_UTR 46410   46444   .   +   .   ID=Bany_09696-RA:three_prime_utr;Parent=Bany_09696-RA;
B

I know I could generate fasta files for each transcript using gffread. Here is how the result looks like:

>Bany_18948-RA CDS=213-977
TGTAGGTACCTCGATAAAAATACAATATTTAAAATGACAATACAAAAAATTGTCTATTTATAGTAGGTAT
AGATAGGTAGGTACGTTGAAAATCCAAAGTAAAATTGTACTTAGTACTACGTATTGATAATGAAACAATA
GATTATCATTTGATTAGGTAAATTACAAT

I assume each sequence represents the mature transcript (5'UTR, ORF(CDS),3'UTR), where 1-n represent the whole mature transcript, 1-212 represents the 5'UTR (excluding introns), 213-977 represents ORF (CDS) (excluding introns), and 978-n represents 3'UTR (excluding introns).

Am I right? I am not sure if introns within UTRs are indeed included in this sequence...I do have UTRs fragmented by introns.

If that is the case, how can I extract sequences of those lower level features by transcripts? For example, I hope to get a fasta file like this:

    >Bany_18948-RA 
TGTAGGTACCTCGATAAAAATACAATATTTAAAATGACAATACAAAAAATTGTCTATTTATAGTAGGTAT AGATAGGTAGGTACGTTGAAAATCCAAAGTAAAATTGTACTTAGTACTACGTATTGATAATGAAACAATA GATTATCATTTGATTAGGTAAATTACAAT

where the sequence only represents the 3'UTR region (or ORF region, 5'UTR) of mature transcript Bany_18948-RA, excluding introns.

If it cannot be done using gffread, is there any other tools for this? Thank you.

ADD COMMENTlink modified 6 months ago • written 6 months ago by tianshenbio70

I don’t know for gffread but same answer as before you can do it with agat_sp_extract_sequences.pl from AGAT
For cds:
agat_sp_extract_sequences.pl --gff input.gff --fasta input.fasta -o cds.fa

for five_prime_UTR features:
agat_sp_extract_sequences.pl --gff input.gff --fasta input.fasta -t five_prime_UTR -o result_five_prime.fa

for three_prime_UTR features:
agat_sp_extract_sequences.pl --gff input.gff --fasta input.fasta -t three_prime_UTR -o result_three_prime.fa

ADD REPLYlink modified 6 months ago • written 6 months ago by Juke344.9k

Thank you for your answer Juke34, will give it a try

ADD REPLYlink written 6 months ago by tianshenbio70
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1757 users visited in the last hour