Generating FPKM matrix accross all samples after stringtie
2
0
Entering edit mode
3.5 years ago
ever_wudi ▴ 10

Hi,

I finished the "stringtie --merge" step and used "stringtie -e -B -G" to reestimate fpkm for each sample and generated the ballgown folder. Now I am trying to get FPKM for all genes and across all samples into a matrix. I know there is a prepDE.py script, but it generates gene/transcript counts instead of FPKM values, and I really need an FPKM matrix. Is there a good way to output such matrix in ballgown, stringtie, or a software that can be directly connected to the outputs of stringtie?

Thanks! Di

RNA-Seq • 6.1k views
0
Entering edit mode

Why do you need FPKM units? These are derived from an outdated form of normalisation for RNA-seq, and effectively renders your samples incomparable in terms of differential expression analysis.

The idea of using prepDE is to generate 'raw' counts for the purpose of using other [more robust] differential expression analysis tools, like EdgeR and DESeq2.

0
Entering edit mode

Thanks for reminding. I didn't intend to feed the count matrix to EdgeR or DESeq2, so I just needed the matrix in the FPKM form. Is there something wrong with the FPKM normalization method?

0
Entering edit mode

Yes, it renders the samples non-comparable in terms of differential expression analysis (you can compare them, obviously, but the false-positives / -negatives will be higher than other methods). Not a life or death situation, for now, but just be aware that there are proven better methods.

0
Entering edit mode

Thanks for the tips. What are the better metrics used now? TPM or some others?

2
Entering edit mode

TPM is better but still not ideal. You should use those methods employed by either DESeq2 or EdgeR. Here is the manuscript that I had wanted to share with you (it is a very well cited manuscript):

An update (6th October 2018):

You should abandon RPKM / FPKM. They are not ideal where cross-sample differential expression analysis is your aim; indeed, they render samples incomparable via differential expression analysis:

The Total Count and RPKM [FPKM] normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.

Also, by Harold Pimental: What the FPKM? A review of RNA-Seq expression units

The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.

0
Entering edit mode

I see, thank you very much!

7
Entering edit mode
3.5 years ago
ATpoint 57k

If you did use stringtie properly, there should now be a folder "ballgown" that contains subdirectories for each sample, and each of these subdirectories should have any kind of common pattern in its name, like "Sample1_ballgown", "Sample2_ballgown". If so, load everything into R and get FPKM, which is as easy as:

require(ballgown)

## Create ballgown object:

## get FPKM:
gexpr(my.data)


There are a lot more one-liner commands to extract common things from the data, such as exon coverage and stuff, which can easily be looked up in the BioC tutorial. A tip for the future: Really try to find these kind of answers yourself, including proper googling and a lot of try and error if you have the time. You learn so much by trying-out things and get good practice in finding solutions. A provided solution may help you know, but does not really challenge you to improve yourself.

0
Entering edit mode

Thanks ATpoint, the commands worked. I am very new to R and RNA-seq, did try to search for solutions before but didn't find a satisfactory way. Thanks for the help.

0
Entering edit mode

You're welcome. As you'll see when you get more experience with R and Bioconductor, most commonly there are simple commands like this for standard tasks. You only have to find them ;-D Please consider to mark the answer as accepted to help others in the future.

0
Entering edit mode

I just saw this message, and just realized there is a "tick" sign for marking an accepted answer. I already marked it accepted.

0
Entering edit mode

Hi ATpoint, I have another question. When I use gexpr() to output the matrix, it generates a matrix with MSTRG.x gene ids as identifiers. However, I really need gene_name (or gene symbol) as my gene identifiers for the next step. I tried really hard to fiddle it around and search online but still can't do it. Would you have any suggestions for it? Thanks.

5
Entering edit mode
3.5 years ago

Edit: The other answer is probably a more proper way to get the gene-level matrix, using gexpr https://www.rdocumentation.org/packages/ballgown/versions/2.4.2/topics/gexpr and should probably be marked as accepted. I'll leave mine here in case the info is useful, but I wrote my answer thinking of transcript per sample matrices. texpr seems to do that as well.

I think you can get your FPKMs with Tablemaker per sample with your ballgown output, but you might have to assemble your matrix yourself (should be very few lines of Bash and/or R).

Under Running Tablemaker

tablemaker -p 4 -q -W -G merged.gtf -o sample01_output read_alignments.bam


The output is 5 files, written to the specified output directory:

t_data.ctab: transcript-level expression measurements. One row per transcript. Columns are:

• ...
• FPKM: Cufflinks-estimated FPKM for the transcript (available for each sample)

Here's an example t_data.ctab: https://github.com/alyssafrazee/ballgown/blob/master/inst/extdata/sample02/t_data.ctab

Note that each sample should have it's own t_data.ctab file, so all you need to do it grab the FPKM column for each sample and merge them together.

You can get the relevant info from the ctab with cut

$cut -f6,12 sample1.ctab t_name FPKM TCONS_00000010 800.706 TCONS_00000017 715.775 TCONS_00000020 579.569 TCONS_00000598 2205.3 TCONS_00000024 304.873 TCONS_00000613 101.848 TCONS_00000029 87.7438 TCONS_00000032 0.0249079 TCONS_00000637 323.543  And join with join $ join -j1 <(cut -f6,12 sample1.ctab | head | sort -k1) <(cut -f6,12 sample2.ctab | head | sort -k1)
TCONS_00000010 800.706 800.706
TCONS_00000017 715.775 715.775
TCONS_00000020 579.569 579.569
TCONS_00000024 304.873 304.873
TCONS_00000029 87.7438 87.7438
TCONS_00000032 0.0249079 0.0249079
TCONS_00000598 2205.3 2205.3
TCONS_00000613 101.848 101.848
TCONS_00000637 323.543 323.543


For more than 2 files, you can join the first 2, then join the joined output with the next file etc. Or this thread had more elaborate solutions: https://stackoverflow.com/questions/10726471/join-multiple-files

Or R could do something similar if you load them as dataframes and used merge(all.x=T, all.y=T) iteratively.

Whatever you end up doing, make sure that you test your results. Check how many rows you end up with. I'm not familiar with the stringtie ctab files, so I'm not sure if they would all have the same number of rows or different; that's something to check too. Spot check some transcripts, ideally a random sample for top, bottom, middle of your files; make sure they match with the appropriate sample values.

Hope it helps!

1
Entering edit mode

0
Entering edit mode

Hello, apologies if I may have missed a post but would it be possible to obtain a isoform x sample matrix of TPM values from the output of StringTie to use for Ballgown? I would like to pass it as the 'gowntable' argument to use instead of FPKMs but would really appreciate any help in how to start. Thanks very much.

0
Entering edit mode

You should open a new question and ask there, more likely that people will see and answer, and easier to find for people who have the same question later.

But also, why would you want to do that if you have FPKM and ballgown accepts it? Does ballgown even work with TPM? I'm not sure it does, and doing a differential expression analysis with FPKM would be different than TPM, so it might not be even supported. If you have a good reason to want to work with TPM, you should explain in your question, otherwise just use FPKM.