Get sequences for all genes in a GTF file
3.1 years ago

I have a gtf file like:

scaffold_0      maker  exon    9496    9623    .       +       .       transcript_id "RHOH"; transcript_id_full "RHOH"
scaffold_0      StringTie       exon    11728   11971   .       +       .       transcript_id "RHOH"; transcript_id_full "RHOH"
scaffold_0      maker   exon    12077   12144   .       +       .       transcript_id "RHOH"; transcript_id_full "RHOH"
scaffold_0      StringTie       exon    20708   23579   .       +       .       transcript_id "RHOH"; transcript_id_full "RHOH"
scaffold_0      maker   exon    39534   40131   .       -       .       transcript_id "gene17"; transcript_id_full "gene17"
scaffold_0      maker   exon    43071   43701   .       +       .       transcript_id "gene1"; transcript_id_full "gene1"
scaffold_0      maker   exon    57526   57640   .       +       .       transcript_id "CHRNA9"; transcript_id_full "CHRNA9"
scaffold_0      maker   exon    58475   58630   .       +       .       transcript_id "CHRNA9"; transcript_id_full "CHRNA9"
scaffold_0      maker   exon    59298   59831   .       +       .       transcript_id "CHRNA9"; transcript_id_full "CHRNA9"
scaffold_0      maker   exon    60967   61512   .       +       .       transcript_id "CHRNA9"; transcript_id_full "CHRNA9"

and a .fa scaffold file:

>scaffold_0
agcagggctggagcaggagcagggctggagctggagcaaggctggagcag
gagcagggctggggctggagcagggctggagcaggagcaggagcagggct
ggagcagggctggagctagagcaggggctggagcagggctggggctggag
cagggctggagctggagcagggctggagctagagcaggggctggagcagg
agcagggctggagcaggagcagggctggagctagagcacgggctggagca
gggctggggctggagcagggctggagcaggagcaggggctggggctggag
caggggctggagcagggactggagcagggctggagcagggctggagcagg
gctggagcagggccggggctggagcGGGGTGCCGGCTCCCTCGTGGCTGG
CAGGCGGTGTGTGCTCGGCGAGcCCCCGGAGCCGGAGCCCCGGGGCGGGG

and would like to output the sequence for each individual transcript, something like this (or similar):

>scaffold_0      maker  exon    9496    9623    .       +       .       transcript_id "RHOH"; transcript_id_full "RHOH"
GGTGCCGGCTCCCTCGTGGCTGGCAGGCGGTGTGTGCTCGGCGAGcCCCCGGAGCCGGAGCCCCGGG ...
>scaffold_0      StringTie    exon    11728   11971   .       +       .       transcript_id "RHOH"; transcript_id_full "RHOH"
GAGAGAGAAAACGGCAAAAGTCAGAGTTTAGAGAAACAGATGTGGGTTTGCACGTTCTGCACGTTCTCCCTTTG ...

and so on.

Do I have to code this myself, or is there a tool that would do it for me? If I do have to code it, do you have any suggestions for an easy way to do it?

genome assembly • 1.5k views

Take a look at the gffread utility.


Indeed, there is a working example in this answer: "Cufflinks gffread utility".
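
For the tool route, gffread can write spliced transcript sequences with something like `gffread -w transcripts.fa -g scaffolds.fa annotation.gtf` (the file names are placeholders). If you would rather code it up, below is a minimal Python sketch of the same idea. It is not what gffread does internally, just one plain way to do it. The function names are made up for illustration, and it assumes that the scaffold file fits in memory, that GTF columns contain no internal spaces before the attribute column, and that all exons of a transcript sit on one scaffold and one strand.

```python
import re
from collections import defaultdict

# Translation table for complementing bases; case is preserved (soft-masking).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(lines):
    """Return {sequence_name: sequence} from an iterable of FASTA lines."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            # Use the first word of the header as the name, e.g. "scaffold_0".
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def transcript_sequences(gtf_lines, genome):
    """Return {transcript_id: concatenated exon sequence}.

    Exons are joined in ascending coordinate order, and the result is
    reverse-complemented for transcripts on the minus strand.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        # Split on any whitespace so tab- and space-separated files both work.
        fields = line.rstrip("\n").split(None, 8)
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        start, end = int(fields[3]), int(fields[4])
        exons[match.group(1)].append((start, end, fields[0], fields[6]))

    result = {}
    for tid, parts in exons.items():
        parts.sort()
        seqname, strand = parts[0][2], parts[0][3]
        # GTF coordinates are 1-based and inclusive; Python slices are 0-based.
        seq = "".join(genome[seqname][s - 1:e] for s, e, _, _ in parts)
        result[tid] = reverse_complement(seq) if strand == "-" else seq
    return result
```

To use it, pass open file handles, for example `transcript_sequences(open("annotation.gtf"), read_fasta(open("scaffolds.fa")))`, and write each entry out as `>transcript_id` followed by its sequence. If you want one record per exon instead, as in the example output in the question, emit each slice separately rather than joining them.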
