Question

How can I produce a fasta file from a GTF file containing my isoforms?

8.1 years ago

russell.stewart.j ▴ 20

We performed deep-seq on several samples and have been provided with the resulting analysis. I'm attempting to mine the data for specific transcript isoforms so I can do further wet lab analysis (e.g., make primers and quantify each isoform). To do this, I need to use the merged.gtf file to "make" the sequences in fasta format. What is the best way to go about this? I've become familiar with the command line and with Galaxy. Below is a snippet of the data. It's obvious that each line is an exon, and I'm assuming that each potentially novel isoform is just the concatenation of these exons, but that seems too simple, and I still don't know how to automate this to get a fasta for any particular transcript. Any help or direction to tutorials would be great. Thanks!

1 Cufflinks exon 1189228 1189283 . + . gene_id "XLOC_000009"; transcript_id "TCONS_00000010"; exon_number "1"; gene_name "CRYZL1"; oId "CUFF.179.5"; nearest_ref "ENSBTAT00000049681"; class_code "j"; tss_id "TSS10";

1 Cufflinks exon 1197951 1198022 . + . gene_id "XLOC_000009"; transcript_id "TCONS_00000010"; exon_number "2"; gene_name "CRYZL1"; oId "CUFF.179.5"; nearest_ref "ENSBTAT00000049681"; class_code "j"; tss_id "TSS10";

1 Cufflinks exon 1199592 1199669 . + . gene_id "XLOC_000009"; transcript_id "TCONS_00000010"; exon_number "3"; gene_name "CRYZL1"; oId "CUFF.179.5"; nearest_ref "ENSBTAT00000049681"; class_code "j"; tss_id "TSS10";

The list of exons continues, and restarts at exon 1 multiple times for any one gene isoform.

rna-seq next-gen • 8.4k views

updated 8.1 years ago by duxan ▴ 70 • written 8.1 years ago by russell.stewart.j ▴ 20

score 2 · Answer 1 · 2016-03-18

What about bedtools getfasta (previously called fastaFromBed)?

bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>

By the way, judging from the error you posted (actually a warning), the chromosome naming in the genome reference may not match the naming in the GTF. For example, your GTF uses chromosome names 1, 2, 3, ... while the genome uses chr1, chr2, ...
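To check whether the warning really comes from a naming mismatch, one option is to compare the sequence names in column 1 of the GTF with the FASTA headers. The sketch below uses only standard Unix tools. The demo_* files are made-up stand-ins for your merged.gtf and genome FASTA, and stripping a "chr" prefix is only one possible fix: it assumes the prefix is the only difference, which may not hold for mitochondrial or unplaced contigs.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical toy inputs; substitute your real GTF and genome FASTA.
# The GTF uses Ensembl-style names ("1", "15").
printf '1\tCufflinks\texon\t3\t6\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' > demo.gtf
printf '15\tCufflinks\texon\t2\t5\t.\t+\t.\tgene_id "G2"; transcript_id "T2";\n' >> demo.gtf

# The genome uses UCSC-style names ("chr1", "chr15").
printf '>chr1 toy sequence\nACGTACGTAC\n>chr15\nTTGGCCAATT\n' > demo_genome.fa

# Sequence names used in column 1 of the GTF (comment lines skipped).
grep -v '^#' demo.gtf | cut -f1 | sort -u > gtf_names.txt

# Sequence names in the FASTA headers: drop ">" and anything after the first whitespace.
grep '^>' demo_genome.fa | sed 's/^>//; s/[[:space:]].*$//' | sort -u > fasta_names.txt

# Names that appear in the GTF but have no FASTA record.
comm -23 gtf_names.txt fasta_names.txt > missing_names.txt
echo "GTF names missing from the FASTA:"
cat missing_names.txt

# One possible fix, assuming the only difference is a "chr" prefix on the headers.
sed 's/^>chr/>/' demo_genome.fa > demo_genome.renamed.fa

# Re-check against the renamed FASTA.
grep '^>' demo_genome.renamed.fa | sed 's/^>//; s/[[:space:]].*$//' | sort -u > renamed_names.txt
comm -23 gtf_names.txt renamed_names.txt > still_missing.txt
echo "Names still missing after renaming: $(wc -l < still_missing.txt)"
```

If missing_names.txt is empty for your real files, the names already match and the warning has some other cause.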

score 0 · Answer 2 · 2016-03-17

Note: I've tried Galaxy's gffread and received the error:

No fasta index found for /galaxy/data/bosTau6/seq/bosTau6.fa. Rebuilding, please wait..
Fasta index rebuilt.
Warning: couldn't find fasta record for '15'!
Warning: getSpliced(NULL,.. ) called!
Warning: couldn't find fasta record for '17'!
Warning: getSpl

score 0 · Answer 3 · 2016-03-21

Hi, I have tried Cufflinks gffread and TopHat gtf_to_fasta. TopHat returns some strange headers, but Cufflinks gffread works great:

gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf

Note that genome.fa needs to have contigs with the same names as in transcripts.gtf.

The file genome.fa in this example would be a multi-FASTA file with the genomic sequences of the target genome. This also requires that every contig or chromosome name found in the first column of the input GFF file (transcripts.gtf in this example) has a corresponding sequence entry in genome.fa.
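Once transcripts.fa exists, pulling out a single isoform for primer design can be done with a short awk filter. This is only a sketch: demo_transcripts.fa is a made-up stand-in for the gffread output, the header layout shown (ID followed by extra fields) is an assumption, and the target ID is taken from the GTF snippet in the question.

```shell
# Hypothetical toy stand-in for the transcripts.fa written by gffread -w.
printf '>TCONS_00000010 gene=CRYZL1\nACGTAC\nGTACGT\n>TCONS_00000011 gene=CRYZL1\nTTTTGG\n' > demo_transcripts.fa

# Transcript ID to extract.
target="TCONS_00000010"

# On each header line, decide whether its first word (minus ">") equals the target ID.
# Every line is printed while "keep" is true, so the header and all of its
# sequence lines are kept, and printing stops at the next header.
awk -v id="$target" '
  /^>/ { keep = (substr($1, 2) == id) }
  keep
' demo_transcripts.fa > "demo_${target}.fa"

cat "demo_${target}.fa"
```

The exact comparison on the first header word avoids accidentally matching longer IDs that merely start with the same characters, which a plain grep on the ID could do.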