Question: Bowtie2/HISAT using single-end to generate FPKM???
16 months ago, jamieson.pierce wrote:

Hi all,

My understanding is that paired-end reads are used to generate FPKM and single-end reads are used to generate RPKM. I recently received a sample report from a third-party RNA-seq service which provided me with FPKM-normalized read counts for each transcript and sample in a spreadsheet.

That was fine until I examined the FASTQ files they gave me and found the following format for each read:

@XX100011323L1C001R014_28
GAAAAACTCAAATCGCCTCTAAGAAAAGACGAAGTCGAAGAAAGAGACAA
+
eeeeeeeeeeeeeeeeee\eeeeeeeeedeeeeeefeeZeeeeeeeefZc

Given that there is no /1 or /2 at the end of the @ identifier, and that all the @IDs in the FASTQ file contain an "R" (suggesting reverse?), I am wondering how in the hell they generated FPKM and, most importantly, whether these people have just given me the runaround. Is this an interleaved FASTQ?

In their report they provided the parameters they would use for mapping both PE and SE reads in Bowtie2 and HISAT, which seems odd, since they only seem to provide FPKM to everyone. Here are the arguments:

Bowtie2 parameters for PE reads: -q --phred64 --sensitive --dpad 0 --gbar 99999999 --mp 1,1 --np 1 --score-min L,0,-0.1 -I 1 -X 1000 --no-mixed --no-discordant -p 16 -k 200
Bowtie2 parameters for SE reads: -q --phred64 --sensitive --dpad 0 --gbar 99999999 --mp 1,1 --np 1 --score-min L,0,-0.1 -p 16 -k 200
HISAT parameters for PE reads: -p 8 --phred64 --sensitive --no-discordant --no-mixed -I 1 -X 1000
HISAT parameters for SE reads: -p 8 --phred64 --sensitive -I 1 -X 1000
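
As a side check, the --phred64 flag in these parameter lists can be compared against the quality line of the read shown earlier. The sketch below is only a heuristic, not how Bowtie2 or HISAT decide anything: the function name and the cut-offs (characters below ASCII 59 only occur with an offset of 33, characters above ASCII 74 are typical of an offset of 64) are assumptions chosen for illustration.

```python
def guess_phred_offset(quality: str) -> int:
    """Guess the Phred ASCII offset (33 or 64) from one quality string.

    Heuristic only: the thresholds are assumptions for this sketch.
    """
    codes = [ord(ch) for ch in quality]
    if min(codes) < 59:
        # Characters this low cannot occur in Phred+64 data.
        return 33
    if max(codes) > 74:
        # Phred+33 data from Illumina instruments rarely exceeds 'J' (74).
        return 64
    raise ValueError("quality string is compatible with both offsets")


# Quality line copied from the read in the question (raw string keeps the backslash).
QUAL = r"eeeeeeeeeeeeeeeeee\eeeeeeeeedeeeeeefeeZeeeeeeeefZc"
print(guess_phred_offset(QUAL))
```

The lowest character in that line is "Z" (ASCII 90), so the line looks like Phred+64, which fits the --phred64 setting. A single read is weak evidence, so running this over many reads would be safer.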

After that, they said they used RSEM to calculate FPKM.

Then, laughably, they said,

The FPKM method is able to eliminate the influence of different gene length and sequencing discrepancy on the calculation of gene expression. Therefore, the calculated gene expression can be directly used for comparing the difference of gene expression among samples.

Which I think we all know isn't quite true unless you also apply something like a trimmed mean of M-values (TMM) adjustment, because the total FPKM per sample is always a little different.
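
To make the point about per-sample totals concrete, here is a minimal sketch using the textbook FPKM formula (fragments × 10^9 / (gene length × total fragments)). The gene names, lengths, and counts are made-up toy values, and RSEM itself works from expected counts and effective lengths, so this is an illustration of the formula rather than what the service actually ran.

```python
def fpkm(counts, lengths):
    """Return FPKM per gene for one sample.

    counts  -- fragments assigned to each gene (toy numbers)
    lengths -- gene length in bases (toy numbers)
    """
    total = sum(counts.values())  # total assigned fragments in this sample
    return {g: counts[g] * 1e9 / (lengths[g] * total) for g in counts}


lengths = {"geneA": 1000, "geneB": 2000}

# Two hypothetical samples with the same sequencing depth (400 fragments each)
# but a different split of fragments between a short and a long gene.
sample_1 = fpkm({"geneA": 100, "geneB": 300}, lengths)
sample_2 = fpkm({"geneA": 300, "geneB": 100}, lengths)

total_1 = sum(sample_1.values())
total_2 = sum(sample_2.values())
print(total_1, total_2)
```

Even at identical depth the FPKM totals of the two toy samples differ, because the total depends on how fragments are distributed across genes of different lengths.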

If anyone can give me some insight on this one, I'd be much obliged.

Tags: rpkm, rna-seq, hisat, fpkm, bowtie2
modified 16 months ago by Brian Bushnell • written 16 months ago by jamieson.pierce

For single-end reads, RPKM==FPKM...

written 16 months ago by Devon Ryan

Did you contract both sequencing and bioinformatics? Have you been given the raw sequencing data? I am guessing they used some software to process the raw reads (adapter and quality trimming, etc), which resulted in PE and SE files. Did you check all fastq files, at least 5 lines from each? The file names should follow some sort of sane naming convention.

written 16 months ago by h.mon

My PIs contracted both sequencing and bioinformatics. Yes, according to the documentation they removed adapters and quality-trimmed the reads (not sure what software was used).

How would this create SE and PE files? I also checked 10,000 reads in the fastQ and found them all named with the same convention. No F or RF or FR or /1 or /2 or anything that could feasibly be interpreted as paired end.

Based on everyone's answer here I'm guessing they took SE aligned reads and did "FPKM" calculations using RSEM which I guess are mathematically indistinguishable from RPKM in this case?

written 16 months ago by jamieson.pierce

How would this create SE and PE files?

If one of the reads in a pair gets removed during trimming that would leave a single-end read in the other file. For the sake of sanity, I generally prefer to discard both reads when that happens. It keeps the PE reads in proper order in R1/R2 files.
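
A minimal sketch of the "discard both reads" approach, assuming the reads have already been trimmed and are held in two equally ordered lists. The function name and the 20-base minimum length are hypothetical; real trimmers expose this behaviour through their own options.

```python
def keep_intact_pairs(r1_reads, r2_reads, min_len=20):
    """Keep a pair only if BOTH trimmed mates are at least min_len bases long.

    r1_reads, r2_reads -- trimmed sequences, in the same order in both lists
    min_len            -- hypothetical length cut-off for this sketch
    """
    if len(r1_reads) != len(r2_reads):
        raise ValueError("R1 and R2 must contain the same number of reads")
    kept_r1, kept_r2 = [], []
    for mate1, mate2 in zip(r1_reads, r2_reads):
        if len(mate1) >= min_len and len(mate2) >= min_len:
            kept_r1.append(mate1)
            kept_r2.append(mate2)
        # Otherwise both mates are dropped, so no orphan single-end read remains.
    return kept_r1, kept_r2


r1 = ["A" * 50, "C" * 10, "G" * 40]
r2 = ["T" * 50, "G" * 50, "A" * 5]
kept_r1, kept_r2 = keep_intact_pairs(r1, r2)
print(len(kept_r1), len(kept_r2))
```

Because both lists are filtered together, R1 and R2 stay the same length and in the same order.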

modified 16 months ago • written 16 months ago by genomax

Someone is not providing you all the information you need to do a proper job, so you should:

1) get the raw sequencing data
2) get a complete description of the analyses performed - software and commands used, if possible

Either your PI or the center should have those.

written 16 months ago by h.mon

Well, I hope they do a decent job sequencing, because their bioinformatics service looks doubtful. It seems you are better off getting access to the rawest data you can find and doing things yourself.

written 16 months ago by WouterDeCoster

For paired-end reads, FPKM is preferable to RPKM because it counts each fragment (read pair) once instead of counting both mates separately. For single-end reads, each read is one fragment, so FPKM and RPKM give the same result.

written 16 months ago by Ben

16 months ago, Brian Bushnell (Walnut Creek, USA) wrote:

FPKM does not require paired-end reads. For single-ended reads, it makes the calculation much simpler since you never have the problem of the two ends mapping to different genes :)
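
A rough sketch of why reads and fragments coincide for single-end data, assuming alignments are given as (read name, gene) tuples and that the two mates of a pair share one read name. The helper names and the toy alignments are made up for illustration.

```python
from collections import Counter


def reads_per_gene(alignments):
    """Count every aligned read separately (the R in RPKM)."""
    return Counter(gene for _name, gene in alignments)


def fragments_per_gene(alignments):
    """Count each read name once per gene (the F in FPKM).

    Mates that share a name and hit the same gene collapse into one fragment.
    Mates that hit different genes stay as two entries, which is the awkward
    paired-end case mentioned above.
    """
    return Counter(gene for _name, gene in set(alignments))


# Single-end: every read has its own name, so reads == fragments.
single_end = [("r1", "geneA"), ("r2", "geneA"), ("r3", "geneB")]

# Paired-end: both mates of a pair carry the same name.
paired_end = [("p1", "geneA"), ("p1", "geneA"), ("p2", "geneB"), ("p2", "geneB")]

print(reads_per_gene(single_end) == fragments_per_gene(single_end))
print(reads_per_gene(paired_end)["geneA"], fragments_per_gene(paired_end)["geneA"])
```

With single-end input the two counters are identical, so the same normalization yields the same number whether it is called RPKM or FPKM.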

It would be helpful if you could post the first... 16 lines of the file rather than the first 4 lines, to see if maybe the names are the same, or something, which might indicate the reads are interleaved. But if I were you, I would demand the raw data from the 3rd-party source rather than the weird renamed stuff they gave you.
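
The name check can also be done over the whole file rather than by eye. This is a minimal sketch assuming plain 4-line FASTQ records (no wrapped sequences) and that mates, if present, sit next to each other with identical names apart from an optional /1 or /2 suffix. The function names and the toy records are hypothetical.

```python
def read_names(lines):
    """Yield the read name from each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 0:  # header line of each record
            fields = line.strip()[1:].split()
            yield fields[0] if fields else ""


def strip_mate_suffix(name):
    """Drop a trailing /1 or /2 so that mates compare equal."""
    if name.endswith("/1") or name.endswith("/2"):
        return name[:-2]
    return name


def looks_interleaved(lines):
    """True if every consecutive pair of records shares the same base name."""
    names = [strip_mate_suffix(n) for n in read_names(lines)]
    if len(names) < 2 or len(names) % 2 != 0:
        return False
    return all(names[i] == names[i + 1] for i in range(0, len(names), 2))


# Toy records: the first name is from the question, the second is invented.
single = ["@XX100011323L1C001R014_28", "ACGT", "+", "IIII",
          "@hypothetical_other_read", "ACGT", "+", "IIII"]
paired = ["@pair1/1", "ACGT", "+", "IIII",
          "@pair1/2", "TTGA", "+", "IIII"]

print(looks_interleaved(single), looks_interleaved(paired))
```

To run it on a real file, pass an open file handle (for example `open("reads.fastq")`, a placeholder path) instead of the toy lists.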

And yes, their comment is completely wrong, which makes me wonder how competent they are. There is interplay between library insert size and gene length that affects relative gene coverage and is not modeled by FPKM.

written 16 months ago by Brian Bushnell