Question

Extracting transcript sequences with gene name (gffread)

2

Entering edit mode

5.7 years ago

cgias ▴ 20

Hello,

I've extracted the transcripts using gffread from Cufflinks package using

gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf

It gave me the sequences with the transcript_id but I'd like to have it with the gene_name. Is there any way to extract it straight with gene_name using gffread?

If not, how can I change transcript_id to gene_name using the gtf file? I know it's possible using the shell but I'm not sure how to tell it to search for the right gene and then change its identification.

The extracted sequences look like this

>rna30121
GGCGGATGTAGCCAAGTGGATCAAGGCAGTGGATTGTGAATCCACCATGCGCGGGTTCAATTCCCGTCAT
TCGCC
>gene20202 CDS=1-1062
ATGACTGCAATTTTAGAGAGACGCGAAAGCGAAAGCCTATGGGGTCGCTTCTGTAACTGGATAACCAGCA
CTGAAAACCGTCTTTACATTGGATGGTTTGGTGTTTTGATGATCCCTACCTTATTGACCGCAACTTCTGT
ATTTATTATCGCATTCATTGCTGCTCCTCCAGTAGATATTGATGGTATTCGTGAACCTGTTTCTGGATCT
C
>gene20203
GGGTTGCTAACTCAATGGTAGAGTACTCGGCTTTTATCCGACTAGTTCCGGGTTCGAGTCCCGGGCAACC
CA
>rna30122
GGGTTGCTAACTCAATGGTAGAGTACTCGGCTTTTAACCGACTAGTTCCGGGTTCGAGTCCCGGGCAACC
CA
>gene20204 CDS=1-1521
ATGGAGGAATTTCAAGTATATTTAGAACTAGATAGATTTCGGCAACACGACTTCCTATACCCACTTATTT
TTCGGGAGTATATTTATGCACTTGCTCATGATCATAGTTTAAATATAAATAATAGATCCGGTTTGTTGGA
A

And the gtf

NC_010323.1 RefSeq  exon    63  137 .   -   .   transcript_id "rna30121"; gene_id "gene20201"; gene_name "trnH-GUG";
NC_010323.1 RefSeq  CDS 592 1653    .   -   0   transcript_id "gene20202"; gene_id "gene20202"; gene_name "psbA";
NC_010323.1 RefSeq  exon    1924    1959    .   -   .   transcript_id "gene20203"; gene_id "gene20203"; gene_name "trnK-UUU";
NC_010323.1 RefSeq  exon    4534    4569    .   -   .   transcript_id "gene20203"; gene_id "gene20203"; gene_name "trnK-UUU";
NC_010323.1 RefSeq  exon    1924    1958    .   -   .   transcript_id "rna30122"; gene_id "gene20203"; gene_name "trnK-UUU";
NC_010323.1 RefSeq  exon    4533    4569    .   -   .   transcript_id "rna30122"; gene_id "gene20203"; gene_name "trnK-UUU";
NC_010323.1 RefSeq  CDS 2266    3786    .   -   0   transcript_id "gene20204"; gene_id "gene20204"; gene_name "matK";

Thanks in advance

RNA-Seq • 10k views

ADD COMMENT • link updated 5.7 years ago by lakhujanivijay 5.8k • written 5.7 years ago by cgias ▴ 20

1

Entering edit mode

A simple hack is to try following on gtf (take a backup of existing gtf) and use it to annotate. Issue is that a gene can have multiple transcripts and one cannot simply replace a transcript with gene name as this would result in duplicate headers. Instead append gene name to transcript ID.

sed 's/"; gene_id "/_/g' transcripts.gtf> transcripts_new.gtf

Output fasta will have transcriptID_geneID as headers. If OP gtf has same transcript ID and gene ID then, try following on gtf:

sed 's/"\w\+[0-9]\+"; gene_id\s//g' transcripts.gtf> transcripts_new.gtf

ADD REPLY • link 5.7 years ago by cpad0112 21k

score 2 · Answer 1 · 2018-07-31

Hello cgias,

I haven't seen an option to do it in one run. What you can do is take the fasta and replace the transcript_id by the gene_name. This can be done with awk like this:

$ awk -v FS="\t" 'NR==FNR {match($9, /transcript_id "([^"]+)"/ , t); match($9, /gene_name "([^"]+)"/, g); transcript[t[1]]=g[1]; next} {match($0, />([^ ]+)/, t); gsub(t[1], transcript[t[1]]); print}' transcripts.gtf input.fasta > output.fasta

What awk is doing here is to first go through the gtf file and creates a list where the transcript_id is the index and the gene_name the value. Next it go through the fasta file extract the transcript_id and replaces it with the corresponding value (=gene_name) from our list we' created before.

fin swimmer

score 2 · Answer 2 · 2018-07-31

2

Entering edit mode

5.7 years ago

lakhujanivijay 5.8k

Try that

gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf -F

where

-F  full GFF attribute preservation (all attributes are shown)

ADD COMMENT • link 5.7 years ago by lakhujanivijay 5.8k