Question: How can I extract FPKM from a GTF file created by StringTie?
9 months ago by
Zeason0
Zeason0 wrote:

I have been practicing with Ballgown and StringTie, and I now have some GTF files and the Ballgown input files. However, I can't find a way, using R or anything else, to produce an Excel file containing the FPKM values. The file I want would look something like this:

gene_id      FPKM
    A        124
    B        541   
    C        122
  

Please help me. Thanks a lot :)

R • 831 views
ADD COMMENTlink modified 9 months ago by EagleEye6.4k • written 9 months ago by Zeason0

Can you paste a few sample lines from the GTF file you are working with?

ADD REPLYlink written 9 months ago by vkkodali1.1k

These are the first lines of the file:

1 StringTie transcript 337772 338047 1000 + . gene_id "Zm00001d027250"; transcript_id "Zm00001d027250_T001"; cov "66.076088"; FPKM "8.407302"; TPM "14.141658";
1 StringTie exon 337772 338047 1000 + . gene_id "Zm00001d027250"; transcript_id "Zm00001d027250_T001"; exon_number "1"; cov "66.076088";
1 StringTie transcript 426764 432130 1000 + . gene_id "Zm00001d027254"; transcript_id "Zm00001d027254_T001"; cov "2.043083"; FPKM "0.259955"; TPM "0.437262";
1 StringTie exon 426764 426798 1000 + . gene_id "Zm00001d027254"; transcript_id "Zm00001d027254_T001"; exon_number "1"; cov "0.000000";
1 StringTie exon 426869 426970 1000 + . gene_id "Zm00001d027254"; transcript_id "Zm00001d027254_T001"

I think a Python script could extract the FPKM values, but I don't know how to write a complex Python script, so I would like to find some software that can do this. Thanks a lot.

ADD REPLYlink modified 9 months ago • written 9 months ago by Zeason0
9 months ago by
EagleEye6.4k
Sweden
EagleEye6.4k wrote:

From the StringTie output, you can use the 'sample1_gene_abund.tab' file (written when StringTie is run with the -A option) to extract this information:

FPKM:

cat {sample1}_gene_abund.tab | cut -d$'\t' -f1,8 | sed "s/FPKM/sample1_FPKM/" | sed 's/Gene ID/gene_id/' > {sample1}_gene_abund.fpkm.txt

TPM:

cat {sample1}_gene_abund.tab | cut -d$'\t' -f1,9 | sed "s/TPM/sample1_TPM/" | sed 's/Gene ID/gene_id/' > {sample1}_gene_abund.tpm.txt
ADD COMMENTlink modified 9 months ago • written 9 months ago by EagleEye6.4k
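Since the question mentions Python and an Excel-readable file, here is a minimal sketch of the same idea in Python. It assumes the abundance table is tab-separated with header names "Gene ID", "FPKM" and "TPM", as the sed commands above imply; the function names and the two-gene example table are made up for illustration.

```python
import csv
import io


def gene_table(tab_text, value_column="FPKM"):
    """Return [(gene_id, value), ...] from the text of a gene abundance table.

    Assumes a tab-separated table whose header contains "Gene ID" and the
    requested value column ("FPKM" or "TPM").
    """
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    return [(row["Gene ID"], float(row[value_column])) for row in reader]


def write_csv(rows, path, value_column="FPKM"):
    """Write the rows to a CSV file that Excel can open directly."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["gene_id", value_column])
        writer.writerows(rows)


# Made-up table in the assumed layout (not real data).
example = (
    "Gene ID\tGene Name\tReference\tStrand\tStart\tEnd\tCoverage\tFPKM\tTPM\n"
    "geneA\t-\t1\t+\t100\t900\t66.0\t8.4\t14.1\n"
    "geneB\t-\t1\t-\t2000\t5000\t2.0\t0.26\t0.44\n"
)
rows = gene_table(example)
```

For a real file you would read its contents with open(...).read(), pass the text to gene_table, and then call write_csv(rows, "sample1_fpkm.csv").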

Thank you, I will try it. I have another question: I have always assumed that a gene's FPKM is the sum of the FPKMs of all its transcripts. Is that right? Thanks a lot.

ADD REPLYlink modified 9 months ago • written 9 months ago by Zeason0

That is not always the case. It depends on the quantification approach, as in cases like this:

[Image not preserved]

Image publication ref

ADD REPLYlink modified 9 months ago • written 9 months ago by EagleEye6.4k
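One way to check this on your own data is to add up the transcript-level FPKM values per gene and compare the totals with the gene-level table. A minimal sketch of the summing step follows; the function name and the numbers are hypothetical, and whether the totals agree with the gene-level values depends on the quantification approach, as noted above.

```python
from collections import defaultdict


def sum_fpkm_by_gene(records):
    """Sum transcript-level FPKM per gene.

    records is an iterable of (gene_id, transcript_fpkm) pairs.
    """
    totals = defaultdict(float)
    for gene_id, fpkm in records:
        totals[gene_id] += fpkm
    return dict(totals)


# Hypothetical values: geneA has two transcripts, geneB has one.
records = [("geneA", 1.5), ("geneA", 2.5), ("geneB", 0.25)]
totals = sum_fpkm_by_gene(records)
```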

Thank you very much, I get it now.

ADD REPLYlink written 9 months ago by Zeason0
9 months ago by
vkkodali1.1k
United States
vkkodali1.1k wrote:

You can use standard unix commands for this as follows:

$ cat stringtie.txt
1  StringTie  transcript  337772  338047  1000  +  .  gene_id "Zm00001d027250"; transcript_id "Zm00001d027250_T001"; cov "66.076088"; FPKM "8.407302"; TPM "14.141658";
1  StringTie  exon        337772  338047  1000  +  .  gene_id "Zm00001d027250"; transcript_id "Zm00001d027250_T001"; exon_number "1"; cov "66.076088";
1  StringTie  transcript  426764  432130  1000  +  .  gene_id "Zm00001d027254"; transcript_id "Zm00001d027254_T001"; cov "2.043083"; FPKM "0.259955"; TPM "0.437262";
1  StringTie  exon        426764  426798  1000  +  .  gene_id "Zm00001d027254"; transcript_id "Zm00001d027254_T001"; exon_number "1"; cov "0.000000";
1  StringTie  exon        426869  426970  1000  +  .  gene_id "Zm00001d027254"; transcript_id "Zm00001d027254_T001"
$ grep 'FPKM' stringtie.txt | cut -f9 | sed -r 's/gene_id "([^"]*).*FPKM "([^"]*).*/\1\t\2/g' | sed '1i#gene_id\tFPKM'
#gene_id        FPKM
Zm00001d027250  8.407302
Zm00001d027254  0.259955
ADD COMMENTlink written 9 months ago by vkkodali1.1k
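For anyone who prefers Python, the same extraction can be sketched as below. This is only a sketch under some assumptions: the GTF is tab-separated with the attributes in the 9th column, and every attribute has the form key "value". The function name transcript_values and the regular expression are my own choices, not part of StringTie.

```python
import re

# Matches one attribute of the form: key "value"
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def transcript_values(gtf_lines, id_key="gene_id", value_key="FPKM"):
    """Return [(id, value), ...] for every transcript line in a GTF."""
    rows = []
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # exon lines carry no FPKM/TPM attribute
        attributes = dict(ATTRIBUTE.findall(fields[8]))
        if id_key in attributes and value_key in attributes:
            rows.append((attributes[id_key], float(attributes[value_key])))
    return rows


# The first two lines from the question, with tabs between the columns.
sample = [
    '1\tStringTie\ttranscript\t337772\t338047\t1000\t+\t.\t'
    'gene_id "Zm00001d027250"; transcript_id "Zm00001d027250_T001"; '
    'cov "66.076088"; FPKM "8.407302"; TPM "14.141658";',
    '1\tStringTie\texon\t337772\t338047\t1000\t+\t.\t'
    'gene_id "Zm00001d027250"; transcript_id "Zm00001d027250_T001"; '
    'exon_number "1"; cov "66.076088";',
]
rows = transcript_values(sample)
```

Passing id_key="transcript_id" and value_key="TPM" gives transcript-level TPM instead. With a real file, pass the open file handle as gtf_lines.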

Thank you very much, I will try it.

ADD REPLYlink written 9 months ago by Zeason0

How can we get transcript-level TPM values instead of gene-level TPM values? I tried replacing gene_id with transcript_id, but it didn't work for me.

ADD REPLYlink written 4 months ago by waqaskhokhar99960

Assuming that your GTF file is the same as above, you can do the following:

$ grep 'TPM' stringtie.txt | cut -f9 | sed -r 's/.*transcript_id "([^"]*).*TPM "([^"]*).*/\1\t\2/g' | sed '1i#transcript_id\tTPM'
ADD REPLYlink written 4 months ago by vkkodali1.1k

Many thanks for your response. It works fine for most lines, as long as the pattern looks like this:

gene_id "MSTRG.26629"; transcript_id "AT5G53360.2"; cov "14.228090"; FPKM "5.268032"; TPM "8.616198";

But it generates an error when the ref_gene_name in the 9th column contains the letters TPM in the gene name (ATPMEPCRF):

gene_id "MSTRG.26631"; transcript_id "AT5G53370.1"; ref_gene_name "ATPMEPCRF"; cov "279.969208"; FPKM "103.660202"; TPM "169.542801";

Can you please check this issue?

ADD REPLYlink written 4 months ago by waqaskhokhar99960

A simple solution is to make grep match TPM as a whole word, or as the attribute name, rather than as a substring:

grep -w "TPM"

OR

grep " TPM "

OR

grep "; TPM "
ADD REPLYlink modified 4 months ago • written 4 months ago by EagleEye6.4k
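To illustrate why the plain substring match misfires and the whole-word match does not, here is a small Python sketch. The first attribute string is the line quoted above; the second is a hypothetical exon line that I made up, assuming exon lines can carry ref_gene_name but no TPM attribute. The helper name tpm_value is also mine.

```python
import re

# \b requires a word boundary before TPM, so "ATPMEPCRF" does not match.
TPM_ATTRIBUTE = re.compile(r'\bTPM "([^"]*)"')

with_tpm = (
    'gene_id "MSTRG.26631"; transcript_id "AT5G53370.1"; '
    'ref_gene_name "ATPMEPCRF"; cov "279.969208"; '
    'FPKM "103.660202"; TPM "169.542801";'
)
# Hypothetical exon line: gene name contains "TPM", but there is no TPM attribute.
without_tpm = (
    'gene_id "MSTRG.26631"; transcript_id "AT5G53370.1"; '
    'exon_number "1"; ref_gene_name "ATPMEPCRF"; cov "279.969208";'
)


def tpm_value(attributes):
    """Return the TPM attribute as a float, or None if the line has none."""
    match = TPM_ATTRIBUTE.search(attributes)
    return float(match.group(1)) if match else None


# A plain substring test (like grep 'TPM') accepts both lines.
substring_hits = ["TPM" in with_tpm, "TPM" in without_tpm]
# The anchored pattern (like grep -w "TPM") accepts only the first.
values = [tpm_value(with_tpm), tpm_value(without_tpm)]
```

Lines like the second one pass through grep 'TPM', and then sed prints them unchanged because its pattern does not match.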