Question: Problem trying to convert from .gtf to .fasta file
nattzy94 wrote (7 months ago):

I have a master.list.gtf (generated by Cufflinks from RNA-seq data) that I wish to convert to .fasta. So far I have tried the gffread utility that ships with Cufflinks and the getfasta command from bedtools:

# gffread command
 ./gffread path/to/master.list.gtf -g /path/to/GRCh38.p13.fna -w ./master.list.fasta

# bedtools command
bedtools getfasta -fi path/to/GRCh38.p13.fna -bed /path/to/master.list.gtf

However, when I run these commands I get this warning:

WARNING. chromosome (chr1) was not found in the FASTA file. Skipping.

Presumably this is because the chromosome IDs in the .fna file and the .gtf file don't match. The .fasta reference file I am using begins like this:

>CM000663.2 Homo sapiens chromosome 1, GRCh38 reference primary assembly
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

whereas my gtf file is formatted like so:

#!genome-build GRCh38.p2
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.17
#!genebuild-last-updated 2015-01
chr1    havana  gene    11869   14409   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
chr1    havana  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";

How can I edit my gtf file so that its chromosome names match those in the reference FASTA?
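For anyone who does want to rename the sequence names in the GTF, here is a minimal sketch using awk and a two-column lookup table. The file names chrom_map.tsv, example.gtf and example.renamed.gtf are made up for illustration, and the only mapping taken from the post is chr1 to CM000663.2; the rest of the table would have to be filled in from the assembly report for the reference in use. Note that renaming only fixes the names, it does not guarantee that the coordinates in the GTF refer to the same assembly patch.

```shell
# Hypothetical two-column mapping: GTF name <TAB> FASTA name.
# Only chr1 -> CM000663.2 comes from the files shown above; extend as needed.
printf 'chr1\tCM000663.2\n' > chrom_map.tsv

# Tiny stand-in for master.list.gtf (attribute column shortened).
printf '#!genome-version GRCh38\nchr1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972";\n' > example.gtf

# Load the mapping first, then rewrite column 1 of every non-comment line.
# Lines whose sequence name has no mapping are passed through unchanged.
awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { map[$1] = $2; next }
     /^#/ { print; next }
     ($1 in map) { $1 = map[$1] }
     { print }' chrom_map.tsv example.gtf > example.renamed.gtf

cat example.renamed.gtf
```

The renamed file can then be passed to gffread or bedtools getfasta in place of the original GTF.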

Tags: bash, rna-seq
written 7 months ago by nattzy94

Or you could get the reference from Ensembl so it matches the GTF?

written 7 months ago by genomax

I managed to get the reference genome that was used to generate the gtf file and converted the gtf to fasta successfully. Thanks!

written 7 months ago by nattzy94

For the benefit of other users, please paste the exact files and commands that you used. I will then move this to an answer. Thanks.

written 7 months ago by Kevin Blighe

I got the reference genome from the graduate student who generated the gtf file. However, I believe you can also download the fasta reference (release 79) from here: ftp://ftp.ensembl.org/pub/release-79/fasta/homo_sapiens/dna. I could not connect to the FTP server myself because of issues with my school's Wi-Fi.

The command I used was ./gffread path/to/master.list.gtf -g /path/to/GRCh38.p13.fna -w ./master.list.fasta

gffread will create an index of the reference genome if there isn't one already.
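A quick sanity check after the conversion, as a rough sketch: count the distinct transcript_id values in the GTF and the records in the FASTA that gffread wrote. The two small files below are stand-ins created only so the snippet runs on its own; with real data you would point the same commands at master.list.gtf and master.list.fasta. The counts should normally agree, and a shortfall in the FASTA suggests that some transcripts were skipped, for example because their chromosome was not found in the reference.

```shell
# Tiny stand-ins for master.list.gtf and the gffread output (illustrative only).
printf 'chr1\thavana\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972"; transcript_id "ENST00000456328";\n' > example.gtf
printf '>ENST00000456328\nACGT\n' > example.fasta

# Distinct transcript_id values in the GTF.
n_gtf=$(grep -v '^#' example.gtf | grep -o 'transcript_id "[^"]*"' | sort -u | wc -l)

# FASTA records (one header line per transcript).
n_fa=$(grep -c '^>' example.fasta)

echo "transcripts in GTF: $n_gtf, records in FASTA: $n_fa"
```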

written 7 months ago by nattzy94

Why don't you simply download the FASTA file that matches your GTF? It should be this one from GENCODE:

ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_22/GRCh38.primary_assembly.genome.fa.gz
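Whichever reference you end up with, it may be worth confirming that its sequence names match the GTF before running gffread. A minimal sketch follows; example.fna and example.gtf are tiny stand-ins built from the snippets in the question so that the commands run as-is, and the output file names are arbitrary.

```shell
# Tiny stand-ins for the real files (contents are illustrative).
printf '>CM000663.2 Homo sapiens chromosome 1\nNNNN\n' > example.fna
printf '#!genome-version GRCh38\nchr1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972";\n' > example.gtf

# Sequence names in the FASTA: text after ">" up to the first whitespace.
grep '^>' example.fna | sed 's/^>//' | awk '{ print $1 }' | sort -u > fasta_names.txt

# Sequence names in the GTF: column 1 of every non-comment line.
grep -v '^#' example.gtf | cut -f1 | sort -u > gtf_names.txt

# Names used by the GTF that the FASTA does not contain (empty means they match).
comm -23 gtf_names.txt fasta_names.txt > missing_in_fasta.txt
cat missing_in_fasta.txt
```

With the stand-in files this lists chr1, which is the same mismatch reported in the warning from the original post.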

written 7 months ago by ATpoint

The .gtf file was generated by a PhD student who ran Cufflinks on RNA-seq data from my lab. He has told me that the .gtf contains novel transcripts (transcripts with no 'transcript biotype' annotation in Ensembl). I assume that the sequences for these novel transcripts cannot be found in the matching fasta file from GENCODE?

written 7 months ago by nattzy94
saadbadday wrote (7 months ago):

I have not seen a command like yours before (./gffread path/to/master.list.gtf -g /path/to/GRCh38.p13.fna -w ./master.list.fasta). I suggest reading http://ccb.jhu.edu/software/stringtie/gff.shtml, which explains the GTF format and gffread. You can take the transcripts.gtf output file from Cufflinks and run this command on each file:

gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf

Alternatively, you can merge the transcripts.gtf files from all samples with cuffmerge (included in Cufflinks), which produces cuffmerge-OUT/merged.gtf. Then use merged.gtf in place of transcripts.gtf in the command above and you will get the transcripts.fa file. I hope this helps solve your problem. Thanks.

written 7 months ago by saadbadday

Hi, is this an answer to the original question by the user nattzy94? Thank you!

written 7 months ago by Kevin Blighe

Hi, yes, that is my suggestion, if I am not mistaken. Thanks a lot.

written 6 months ago by saadbadday