Question: Extracting transcript sequences with gene name (gffread)
0
gravatar for cgias
7 months ago by
cgias0
cgias0 wrote:

Hello,

I've extracted the transcripts using gffread from Cufflinks package using

gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf

It gave me the sequences with the transcript_id but I'd like to have it with the gene_name. Is there any way to extract it straight with gene_name using gffread?

If not, how can I change transcript_id to gene_name using the gtf file? I know it's possible using the shell but I'm not sure how to tell it to search for the right gene and then change its identification.

The extracted sequences look like this

>rna30121
GGCGGATGTAGCCAAGTGGATCAAGGCAGTGGATTGTGAATCCACCATGCGCGGGTTCAATTCCCGTCAT
TCGCC
>gene20202 CDS=1-1062
ATGACTGCAATTTTAGAGAGACGCGAAAGCGAAAGCCTATGGGGTCGCTTCTGTAACTGGATAACCAGCA
CTGAAAACCGTCTTTACATTGGATGGTTTGGTGTTTTGATGATCCCTACCTTATTGACCGCAACTTCTGT
ATTTATTATCGCATTCATTGCTGCTCCTCCAGTAGATATTGATGGTATTCGTGAACCTGTTTCTGGATCT
C
>gene20203
GGGTTGCTAACTCAATGGTAGAGTACTCGGCTTTTATCCGACTAGTTCCGGGTTCGAGTCCCGGGCAACC
CA
>rna30122
GGGTTGCTAACTCAATGGTAGAGTACTCGGCTTTTAACCGACTAGTTCCGGGTTCGAGTCCCGGGCAACC
CA
>gene20204 CDS=1-1521
ATGGAGGAATTTCAAGTATATTTAGAACTAGATAGATTTCGGCAACACGACTTCCTATACCCACTTATTT
TTCGGGAGTATATTTATGCACTTGCTCATGATCATAGTTTAAATATAAATAATAGATCCGGTTTGTTGGA
A

And the gtf

NC_010323.1 RefSeq  exon    63  137 .   -   .   transcript_id "rna30121"; gene_id "gene20201"; gene_name "trnH-GUG";
NC_010323.1 RefSeq  CDS 592 1653    .   -   0   transcript_id "gene20202"; gene_id "gene20202"; gene_name "psbA";
NC_010323.1 RefSeq  exon    1924    1959    .   -   .   transcript_id "gene20203"; gene_id "gene20203"; gene_name "trnK-UUU";
NC_010323.1 RefSeq  exon    4534    4569    .   -   .   transcript_id "gene20203"; gene_id "gene20203"; gene_name "trnK-UUU";
NC_010323.1 RefSeq  exon    1924    1958    .   -   .   transcript_id "rna30122"; gene_id "gene20203"; gene_name "trnK-UUU";
NC_010323.1 RefSeq  exon    4533    4569    .   -   .   transcript_id "rna30122"; gene_id "gene20203"; gene_name "trnK-UUU";
NC_010323.1 RefSeq  CDS 2266    3786    .   -   0   transcript_id "gene20204"; gene_id "gene20204"; gene_name "matK";

Thanks in advance

rna-seq • 519 views
ADD COMMENTlink modified 7 months ago by bioExplorer3.7k • written 7 months ago by cgias0

A simple hack is to try following on gtf (take a backup of existing gtf) and use it to annotate. Issue is that a gene can have multiple transcripts and one cannot simply replace a transcript with gene name as this would result in duplicate headers. Instead append gene name to transcript ID.

sed 's/"; gene_id "/_/g' transcripts.gtf> transcripts_new.gtf

Output fasta will have transcriptID_geneID as headers. If OP gtf has same transcript ID and gene ID then, try following on gtf:

sed 's/"\w\+[0-9]\+"; gene_id\s//g' transcripts.gtf> transcripts_new.gtf
ADD REPLYlink modified 7 months ago • written 7 months ago by cpad011211k
0
gravatar for finswimmer
7 months ago by
finswimmer11k
Germany
finswimmer11k wrote:

Hello cgias,

I haven't seen an option to do it in one run. What you can do is take the fasta and replace the transcript_id by the gene_name. This can be done with awk like this:

$ awk -v FS="\t" 'NR==FNR {match($9, /transcript_id "([^"]+)"/ , t); match($9, /gene_name "([^"]+)"/, g); transcript[t[1]]=g[1]; next} {match($0, />([^ ]+)/, t); gsub(t[1], transcript[t[1]]); print}' transcripts.gtf input.fasta > output.fasta

What awk is doing here is to first go through the gtf file and creates a list where the transcript_id is the index and the gene_name the value. Next it go through the fasta file extract the transcript_id and replaces it with the corresponding value (=gene_name) from our list we' created before.

fin swimmer

ADD COMMENTlink written 7 months ago by finswimmer11k
0
gravatar for bioExplorer
7 months ago by
bioExplorer3.7k
bioExplorer3.7k wrote:

Try that

gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf -F

where

-F  full GFF attribute preservation (all attributes are shown)
ADD COMMENTlink written 7 months ago by bioExplorer3.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1825 users visited in the last hour