Question: How can I produce a fasta file from a GTF file containing my isoforms?
3.7 years ago, russell.stewart.j wrote:

We performed deep sequencing on several samples and have been provided with the resulting analysis. I'm attempting to mine the data for specific transcript isoforms so I can do further wet lab analysis (e.g. design primers and quantify each isoform). To do this, I need to use the merged.gtf file to "make" the sequences in fasta format. What is the best way to go about this? I've become familiar with the command line and Galaxy functions. Below is a snippet of the data. It's obvious that each line is an exon, and I'm assuming that each potentially novel isoform is just the concatenation of these exons, but that seems too simple, and I still don't know how to automate this to get a fasta for any particular transcript. Any help or direction to tutorials would be great. Thanks!

1 Cufflinks exon 1189228 1189283 . + . gene_id "XLOC_000009"; transcript_id "TCONS_00000010"; exon_number "1"; gene_name "CRYZL1"; oId "CUFF.179.5"; nearest_ref "ENSBTAT00000049681"; class_code "j"; tss_id "TSS10";

1 Cufflinks exon 1197951 1198022 . + . gene_id "XLOC_000009"; transcript_id "TCONS_00000010"; exon_number "2"; gene_name "CRYZL1"; oId "CUFF.179.5"; nearest_ref "ENSBTAT00000049681"; class_code "j"; tss_id "TSS10";

1 Cufflinks exon 1199592 1199669 . + . gene_id "XLOC_000009"; transcript_id "TCONS_00000010"; exon_number "3"; gene_name "CRYZL1"; oId "CUFF.179.5"; nearest_ref "ENSBTAT00000049681"; class_code "j"; tss_id "TSS10";

The list of exons continues, and restarts at exon 1 multiple times for any one gene isoform.

Tags: rna-seq, next-gen
3.7 years ago, dariober (WCIP | Glasgow | UK) wrote:

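What about bedtools getfasta (formerly fastaFromBed)? Note that it extracts one sequence per line of the GTF (here, one per exon), so the exons of a transcript still have to be joined afterwards: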

bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>

By the way, the message you posted is actually a warning rather than an error, and it suggests that the chromosome names in the genome reference do not match the ones in the GTF. For example, your GTF may name chromosomes 1, 2, 3, ... while the genome names them chr1, chr2, chr3, ...
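A quick way to confirm a naming mismatch is to compare the first column of the GTF with the record names in the genome FASTA. The following is a minimal sketch in Python; the helper names (`gtf_seqnames`, `fasta_seqnames`, `missing_from_genome`) are made up for illustration, and the in-memory data only stands in for the real merged.gtf and genome files. It assumes the GTF is tab-delimited, as the format requires (the snippet in the question shows spaces only because of the forum formatting).

```python
import io

def gtf_seqnames(handle):
    """Collect the sequence names from column 1 of a tab-delimited GTF/GFF stream."""
    names = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        names.add(line.split("\t", 1)[0])
    return names

def fasta_seqnames(handle):
    """Collect record names (first word after '>') from a FASTA stream."""
    return {
        line[1:].split()[0]
        for line in handle
        if line.startswith(">") and line[1:].strip()
    }

def missing_from_genome(gtf_handle, fasta_handle):
    """Return GTF sequence names that have no FASTA record, sorted."""
    return sorted(gtf_seqnames(gtf_handle) - fasta_seqnames(fasta_handle))

# Illustrative stand-ins for merged.gtf and the genome FASTA (not real data).
gtf = io.StringIO(
    '15\tCufflinks\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
    '17\tCufflinks\texon\t300\t400\t.\t-\t.\tgene_id "g2"; transcript_id "t2";\n'
)
fasta = io.StringIO(">chr15 some description\nACGT\n>chr17\nACGT\n")

missing = missing_from_genome(gtf, fasta)
print(missing)
```

With real files, the two `io.StringIO` objects would be replaced by `open(...)` handles. If the only difference turns out to be a `chr` prefix, the names in one of the two files can be rewritten so that they agree.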

3.7 years ago, russell.stewart.j wrote:

Note: I've tried Galaxy's gffread and received the error:

No fasta index found for /galaxy/data/bosTau6/seq/bosTau6.fa. Rebuilding, please wait.. Fasta index rebuilt. Warning: couldn't find fasta record for '15'! Warning: getSpliced(NULL,.. ) called! Warning: couldn't find fasta record for '17'! Warning: getSpl

3.7 years ago, duxan (Serbia/Novi Sad) wrote:

Hi, I have tried Cufflinks gffread and TopHat gtf_to_fasta. TopHat returns some strange headers, but Cufflinks gffread works great:

gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf

Note that genome.fa needs to have contigs with the same names as in transcripts.gtf.

The file genome.fa in this example is a multi-FASTA file with the genomic sequences of the target genome. Every contig or chromosome name found in the first column of the input GFF/GTF file (transcripts.gtf in this example) must have a corresponding sequence entry in genome.fa.
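To answer the "is it really just concatenation?" part of the question: conceptually, yes. A spliced transcript is its exons joined in genomic order, then reverse-complemented if the transcript is on the minus strand. The Python sketch below illustrates that idea only; it is not how gffread is implemented. The function names are hypothetical, and it assumes a tab-delimited GTF and a genome small enough to hold in a dict of name to sequence.

```python
import re
from collections import defaultdict

# Translation table for complementing DNA (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def spliced_transcripts(gtf_lines, genome):
    """Return {transcript_id: spliced sequence} built from 'exon' lines.

    genome is a dict mapping sequence name -> sequence string
    (an assumption of this sketch; real genomes are usually indexed on disk).
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        # GTF coordinates are 1-based and inclusive.
        exons[match.group(1)].append(
            (fields[0], int(fields[3]), int(fields[4]), fields[6])
        )

    result = {}
    for tid, parts in exons.items():
        parts.sort(key=lambda e: e[1])  # genomic order by start coordinate
        chrom, strand = parts[0][0], parts[0][3]
        # Convert 1-based inclusive coordinates to Python slices.
        seq = "".join(genome[chrom][start - 1:end] for _, start, end, _ in parts)
        result[tid] = reverse_complement(seq) if strand == "-" else seq
    return result

# Toy genome and annotation, for illustration only.
genome = {"1": "AAACCCGGGTTTACGT"}
gtf_lines = [
    '1\tCufflinks\texon\t1\t3\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n',
    '1\tCufflinks\texon\t7\t9\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n',
    '1\tCufflinks\texon\t4\t6\t.\t-\t.\tgene_id "g2"; transcript_id "t2";\n',
    '1\tCufflinks\texon\t13\t16\t.\t-\t.\tgene_id "g2"; transcript_id "t2";\n',
]

seqs = spliced_transcripts(gtf_lines, genome)
for tid, seq in seqs.items():
    print(f">{tid}\n{seq}")
```

Grouping by transcript_id is what handles the exon numbering restarting at 1: each transcript_id is one isoform with its own list of exons. For real data, gffread as shown above is the more practical choice.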
