Question: Generating FPKM matrix across all samples after stringtie
8 days ago by
ever_wudi10 wrote:


I finished the "stringtie --merge" step, used "stringtie -e -B -G" to re-estimate FPKM for each sample, and generated the ballgown folder. Now I am trying to get FPKM values for all genes across all samples into one matrix. I know there is a script, but it generates gene/transcript counts instead of FPKM values, and I really need an FPKM matrix. Is there a good way to output such a matrix in ballgown, stringtie, or other software that can work directly on the stringtie output?

Thanks! Di

rna-seq • 112 views
8 days ago by
ATpoint5.6k wrote:

If you ran stringtie properly, there should now be a folder "ballgown" that contains one subdirectory per sample, and these subdirectories should share some common pattern in their names, like "Sample1_ballgown", "Sample2_ballgown". If so, load everything into R and get the FPKM values, which is as easy as:


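## Create ballgown object ("bg" is a placeholder name):
library(ballgown)
bg = ballgown(dataDir="/Path/To/ballgown_Folder/", samplePattern='*_ballgown', meas='all')

## get FPKM (gene-level; "gene_fpkm" is a placeholder name):
gene_fpkm = gexpr(bg)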

There are many more one-liner commands for extracting common things from the data, such as exon coverage, which can easily be looked up in the BioC tutorial. A tip for the future: really try to find these kinds of answers yourself, including proper googling and a lot of trial and error if you have the time. You learn so much by trying things out, and you get good practice in finding solutions. A provided solution may help you now, but it does not really challenge you to improve yourself.
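
Before loading anything into R, it may be worth confirming that the folder really has the layout this answer assumes. The sketch below is only an illustration: the function name check_ballgown_dirs is made up, the "*_ballgown" suffix follows the example names above and may not match your own directories, and the five file names are the usual ballgown input tables.

```sh
#!/bin/sh
# check_ballgown_dirs ROOT
# Hypothetical helper: succeeds only if ROOT contains at least one
# *_ballgown subdirectory and each one holds the five ballgown tables.
check_ballgown_dirs() {
    root=$1
    status=0
    count=0
    for d in "$root"/*_ballgown; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -d "$d" ] || continue
        count=$((count + 1))
        for f in e_data.ctab i_data.ctab t_data.ctab e2t.ctab i2t.ctab; do
            if [ ! -f "$d/$f" ]; then
                echo "missing: $d/$f" >&2
                status=1
            fi
        done
    done
    if [ "$count" -eq 0 ]; then
        echo "no *_ballgown directories under $root" >&2
        return 1
    fi
    return "$status"
}
```

Calling check_ballgown_dirs /Path/To/ballgown_Folder prints any missing files to stderr and returns a non-zero status if something is off.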


Thanks ATpoint, the commands worked. I am very new to R and RNA-seq, did try to search for solutions before but didn't find a satisfactory way. Thanks for the help.

written 12 hours ago by ever_wudi10
8 days ago by
manuel.belmadani40 wrote:

Edit: The other answer is probably a more proper way to get the gene-level matrix, using gexpr and should probably be marked as accepted. I'll leave mine here in case the info is useful, but I wrote my answer thinking of transcript per sample matrices. texpr seems to do that as well.

I think you can get your FPKMs with Tablemaker per sample with your ballgown output, but you might have to assemble your matrix yourself (should be very few lines of Bash and/or R).


Under Running Tablemaker

tablemaker -p 4 -q -W -G merged.gtf -o sample01_output read_alignments.bam

The output is 5 files, written to the specified output directory:

t_data.ctab: transcript-level expression measurements. One row per transcript. Columns are:

  • ...
  • FPKM: Cufflinks-estimated FPKM for the transcript (available for each sample)

Here's an example t_data.ctab:

Note that each sample should have its own t_data.ctab file, so all you need to do is grab the FPKM column for each sample and merge them together.

You can get the relevant info from the ctab with cut

$ cut -f6,12 sample1.ctab
t_name  FPKM
TCONS_00000010  800.706
TCONS_00000017  715.775
TCONS_00000020  579.569
TCONS_00000598  2205.3
TCONS_00000024  304.873
TCONS_00000613  101.848
TCONS_00000029  87.7438
TCONS_00000032  0.0249079
TCONS_00000637  323.543

And join with join

$ join -j1 <(cut -f6,12 sample1.ctab | head | sort -k1) <(cut -f6,12 sample2.ctab | head | sort -k1) 
TCONS_00000010 800.706 800.706
TCONS_00000017 715.775 715.775
TCONS_00000020 579.569 579.569
TCONS_00000024 304.873 304.873
TCONS_00000029 87.7438 87.7438
TCONS_00000032 0.0249079 0.0249079
TCONS_00000598 2205.3 2205.3
TCONS_00000613 101.848 101.848
TCONS_00000637 323.543 323.543

For more than 2 files, you can join the first 2, then join the joined output with the next file etc. Or this thread had more elaborate solutions:

Or R could do something similar if you load them as dataframes and used merge(all.x=T, all.y=T) iteratively.
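
The "join the first two, then join the result with the next file" idea can be sketched as a small shell loop. This is a rough illustration, not a tested pipeline: the function names extract_fpkm and join_fpkm are made up, it assumes every file has a header line with t_name in column 6 and FPKM in column 12 (as in the cut example above), and it labels each column with the file path it was given. Note that join keeps only transcripts present in every file.

```sh
#!/bin/sh
TAB=$(printf '\t')

# extract_fpkm FILE: print "t_name<TAB>FPKM" rows, header dropped, sorted by name.
extract_fpkm() {
    tail -n +2 "$1" | cut -f6,12 | LC_ALL=C sort -t "$TAB" -k1,1
}

# join_fpkm FILE...: fold all files into one tab-separated table on stdout.
join_fpkm() {
    tmp=$(mktemp -d) || return 1
    header="t_name"
    first=1
    for f in "$@"; do
        header=$(printf '%s\t%s' "$header" "$f")
        extract_fpkm "$f" > "$tmp/next"
        if [ "$first" -eq 1 ]; then
            # The first file seeds the accumulated table.
            mv "$tmp/next" "$tmp/acc"
            first=0
        else
            # Join the accumulated table with the next sample on t_name.
            LC_ALL=C join -t "$TAB" "$tmp/acc" "$tmp/next" > "$tmp/joined"
            mv "$tmp/joined" "$tmp/acc"
        fi
    done
    printf '%s\n' "$header"
    cat "$tmp/acc"
    rm -rf "$tmp"
}
```

A call such as join_fpkm sample1.ctab sample2.ctab sample3.ctab > fpkm_matrix.tsv would write the combined table; sorting and joining under the same LC_ALL=C setting avoids join complaining about unsorted input.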

Whatever you end up doing, make sure that you test your results. Check how many rows you end up with. I'm not familiar with the stringtie ctab files, so I'm not sure if they would all have the same number of rows or different; that's something to check too. Spot check some transcripts, ideally a random sample for top, bottom, middle of your files; make sure they match with the appropriate sample values.
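
The row-count and spot checks suggested above could look roughly like this in shell. It is a sketch under the same assumptions as the cut example (header line present, t_name in column 6, FPKM in column 12), and the function names count_rows, same_row_counts and lookup_fpkm are made up for illustration.

```sh
#!/bin/sh
# count_rows FILE: number of data rows, header excluded.
count_rows() {
    tail -n +2 "$1" | wc -l | tr -d ' '
}

# same_row_counts FILE...: print "count<TAB>file" per file and
# fail as soon as a file's count differs from the first one.
same_row_counts() {
    expected=""
    for f in "$@"; do
        n=$(count_rows "$f")
        printf '%s\t%s\n' "$n" "$f"
        if [ -z "$expected" ]; then
            expected=$n
        elif [ "$n" -ne "$expected" ]; then
            return 1
        fi
    done
}

# lookup_fpkm FILE T_NAME: print the FPKM stored for one transcript,
# for comparing against the same cell in the merged matrix.
lookup_fpkm() {
    awk -F '\t' -v id="$2" '$6 == id { print $12 }' "$1"
}
```

For a spot check, pick a few transcript names from the top, middle and bottom of a file, run lookup_fpkm on the per-sample file, and compare the value with the corresponding cell in the merged table.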

Hope it helps!


Thanks for the detailed steps, manuel.belmadani! I really appreciate your help.

written 12 hours ago by ever_wudi10