Question

Extract transcript fasta using gff

0

4.4 years ago

boczniak767 ▴ 880

Hi,

Is there a simple way to extract FASTA sequences of transcripts using a genome FASTA and a GFF file?

I came across gffread, which suggests the command gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf. I just don't know what this GFF file has to contain. Exons, all features,...

The second possibility is mapToTranscripts from the GenomicFeatures R package. However, the manual is quite complicated and I don't know the steps needed to retrieve the transcripts.

Does anybody have experience with such a procedure?

fasta gff • 5.2k views

updated 3 months ago by cmdcolin ★ 4.4k • written 4.4 years ago by boczniak767 ▴ 880

1

3 months ago

Bourumir ▴ 10

gffread is generally the right tool, but it does not produce the gene map (the transcript-to-gene mapping) required by downstream tools such as Salmon. With my specific dataset (Zunla 3.0 genome/GFF3), gffread also crashed with a segmentation fault. In addition, the feature to extract varies from case to case: it can be "exon", "CDS", or something else.

To overcome these limitations, I developed thaf, a Rust-based application (I chose Rust specifically for robustness). thaf performs extensive internal consistency checks, lets you set the list of features to consider from the command line, and also generates the required gene map.

The program is open source. Our primary goal is that it performs its task reliably and effectively. In this context, having more users and open code is always beneficial.
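For readers who only need the gene map, a rough sketch with awk is shown here. It is not how thaf or gffread work internally. It assumes a GFF3 file in which transcript features are typed "mRNA" or "transcript" and carry ID= and Parent= attributes, and it assumes the downstream tool accepts a plain two-column, tab-separated file (transcript ID, gene ID). The file names demo_genes.gff3 and tx2gene.tsv and all records are made up for illustration.

```shell
# Hypothetical toy annotation (tab-separated GFF3); substitute your own file.
printf '##gff-version 3\n'                                          >  demo_genes.gff3
printf 'chr1\tdemo\tgene\t1\t100\t.\t+\t.\tID=gene1\n'              >> demo_genes.gff3
printf 'chr1\tdemo\tmRNA\t1\t100\t.\t+\t.\tID=tx1;Parent=gene1\n'   >> demo_genes.gff3
printf 'chr1\tdemo\tmRNA\t1\t80\t.\t+\t.\tID=tx2;Parent=gene1\n'    >> demo_genes.gff3
printf 'chr1\tdemo\tgene\t200\t300\t.\t-\t.\tID=gene2\n'            >> demo_genes.gff3
printf 'chr1\tdemo\tmRNA\t200\t300\t.\t-\t.\tID=tx3;Parent=gene2\n' >> demo_genes.gff3

# For every transcript-level feature, pull ID= and Parent= out of column 9
# and print them as "transcript<TAB>gene".
awk -F'\t' '
  !/^#/ && ($3 == "mRNA" || $3 == "transcript") {
    id = ""; parent = ""
    n = split($9, attrs, ";")
    for (i = 1; i <= n; i++) {
      if (attrs[i] ~ /^ID=/)     id     = substr(attrs[i], 4)   # strip "ID="
      if (attrs[i] ~ /^Parent=/) parent = substr(attrs[i], 8)   # strip "Parent="
    }
    if (id != "" && parent != "") print id "\t" parent
  }' demo_genes.gff3 > tx2gene.tsv

cat tx2gene.tsv
```

Annotations that use other feature types or attribute names (GTF files, for example, use transcript_id and gene_id) would need the pattern adjusted.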

0

3 months ago

cmdcolin ★ 4.4k

Since this thread has been revived: there is now also minigff, https://github.com/lh3/minigff

For gffread, the documentation is for some reason very hard to find (https://ccb.jhu.edu/software/stringtie/gff.shtml#gffread), but the commands to extract "cDNA", "CDS", and "pep" sequences are as follows.

To get "cDNA" (all the exon sequences stitched together), use this command:

gffread -w outputted_cdna.fa yourfile.gff -g yourgenome.fa

To get the coding sequence "CDS" (all the CDS feature sequences stitched together), use this command:

gffread -x outputted_cds.fa yourfile.gff -g yourgenome.fa

To get protein translation "pep" sequences, use this command:

gffread -y outputted_pep.fa yourfile.gff -g yourgenome.fa
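To make "stitched together" concrete, here is a toy illustration of the idea, not gffread's actual implementation. It assumes a single plus-strand transcript on a made-up 10-base sequence with two made-up exons at 1-based, inclusive coordinates 1-3 and 7-10 (the GFF convention). A minus-strand transcript would additionally need to be reverse-complemented, which this sketch does not do.

```shell
# Hypothetical 10-base "genome" and exon coordinates, for illustration only.
genome_seq="ACGTACGTAC"

# Each input line is "start<TAB>end" of one exon, in transcript order.
# substr() is 1-based, which matches GFF coordinates.
printf '1\t3\n7\t10\n' |
  awk -F'\t' -v seq="$genome_seq" '
    { spliced = spliced substr(seq, $1, $2 - $1 + 1) }   # append this exon
    END { print spliced }
  ' > stitched.txt

cat stitched.txt   # prints ACGGTAC
```

The intron (bases 4-6) is dropped and the two exons are joined end to end, which is the sense of "cDNA" used for the -w output above.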

score 1 · Accepted Answer · 2021-06-07

1

4.4 years ago

Juke34 9.3k

gffread is quite straightforward. The GFF file has to contain exon features.

You might also consider agat_sp_extract_sequences.pl from AGAT.

Look here for examples: Extracting genomic feature sequences from GTF/GFF files with AGAT
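To see which feature types a given annotation actually contains, one option is to tally column 3 of the file. The sketch below is a generic shell check, not a gffread feature; the file name demo_features.gff3 and its records are made up for illustration.

```shell
# Hypothetical toy annotation (tab-separated GFF3); substitute your own file.
printf '##gff-version 3\n'                                         >  demo_features.gff3
printf 'chr1\tdemo\tgene\t1\t100\t.\t+\t.\tID=gene1\n'             >> demo_features.gff3
printf 'chr1\tdemo\tmRNA\t1\t100\t.\t+\t.\tID=tx1;Parent=gene1\n'  >> demo_features.gff3
printf 'chr1\tdemo\texon\t1\t40\t.\t+\t.\tParent=tx1\n'            >> demo_features.gff3
printf 'chr1\tdemo\texon\t61\t100\t.\t+\t.\tParent=tx1\n'          >> demo_features.gff3

# Count how often each feature type (column 3) occurs, skipping comment lines.
# The tally is printed and also kept in feature_counts.txt.
grep -v '^#' demo_features.gff3 | cut -f3 | sort | uniq -c | tee feature_counts.txt
```

If "exon" does not show up in the tally, that is worth sorting out before running the -w extraction.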

0

Thanks, gffread does indeed seem to work perfectly. I'm just checking the results. Thanks also for the link to AGAT.

4.4 years ago by boczniak767 ▴ 880