I have a BAM file of read alignments to the transcripts, generated by RSEM (deweylab.biostat.wisc.edu/rsem/). Different genes have different FPKM values.
Does anybody know a straightforward way to downsample the reads so that all genes have exactly 1 FPKM as per RSEM?
I guess you need to state the purpose of doing so. Alternatively, you could just simulate a dataset in which all the genes of interest have 1 FPKM.
The reason for downsampling is to check how well RSEM estimates percent isoform usage at different expression levels. Would the estimated percent isoform usage of a gene be the same at 1 FPKM as at 20 FPKM?
Excuse me if I am misunderstanding your approach completely, but it makes no sense to me. Why would you want all measurements to be the same? Why measure at all, then? If the genes have different measurements, no random downsampling can make all measurements identical afterwards, even allowing for rounding error. With FPKM, the values should in fact be mostly unchanged by downsampling, because FPKM divides by the total number of reads, which shrinks by the same proportion as the per-gene counts.
Say we have gene a of length 1 kb, and you sequenced 1 million reads, of which 1000 map to a. Gene b is also 1 kb but has 500 reads. So gene a has 1000 FPKM and gene b has 500 FPKM. Now say you downsample, truly randomly, to 10% of all reads (100,000 reads). What is the expected number of reads for a and b? It is 100 and 50 respectively, but you also have only 100k reads in total, so with the expected downsampled counts the FPKMs for a and b are still 1000 and 500.
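The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the function name `fpkm` is made up for illustration, it uses the plain FPKM definition (reads per kilobase per million reads) rather than RSEM's effective-length model, and it plugs in the expected downsampled counts instead of drawing a random sample.

```python
def fpkm(gene_reads, gene_length_bp, total_reads):
    """Plain FPKM: reads / (gene length in kb * total reads in millions)."""
    return gene_reads / ((gene_length_bp / 1e3) * (total_reads / 1e6))

# Numbers from the example: two 1 kb genes, 1 million reads in total.
gene_length_bp = 1000
full_counts = {"a": 1000, "b": 500}
full_total = 1_000_000

# Expected counts after uniform downsampling to 10% of all reads.
down_counts = {gene: n // 10 for gene, n in full_counts.items()}
down_total = full_total // 10

fpkm_full = {g: fpkm(n, gene_length_bp, full_total) for g, n in full_counts.items()}
fpkm_down = {g: fpkm(n, gene_length_bp, down_total) for g, n in down_counts.items()}

print("full library:", fpkm_full)
print("10% library: ", fpkm_down)
```

Both dictionaries should show the same values for a and b (up to floating-point rounding), because the numerator and the library-size denominator shrink by the same factor.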
The reason for downsampling is to check how well RSEM estimates percent isoform usage at different expression levels. The downsampling need not be truly random across the whole library. It just needs to be random within a gene (random positions within the gene).
Maybe I should have used a better term than downsampling. What I mean is gene-specific sampling of reads, so that all genes have equal power to detect isoform usage.
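One way to picture gene-specific sampling is to keep a different fraction of reads for each gene, chosen as target FPKM divided by the gene's current FPKM. The sketch below is not an RSEM feature, and all names in it (`keep_fraction`, `subsample_by_gene`, the read IDs) are hypothetical. It assumes each read ID stands for one fragment assigned to a single gene, and that the library-size denominator stays fixed at the original total. If RSEM were re-run on the subsampled reads alone, it would normalize by the new, smaller total, so the reported FPKMs would be equal across genes but not necessarily equal to the target.

```python
import random

def keep_fraction(current_fpkm, target_fpkm=1.0):
    """Fraction of a gene's reads to keep so its expected FPKM matches the target.

    Assumes the original library size is kept as the FPKM denominator.
    Genes at or below the target cannot be sampled up, so the fraction is capped at 1.
    """
    if current_fpkm <= 0:
        return 0.0
    return min(1.0, target_fpkm / current_fpkm)

def subsample_by_gene(reads_by_gene, fpkm_by_gene, target_fpkm=1.0, seed=0):
    """Randomly keep reads within each gene, independently of other genes."""
    rng = random.Random(seed)
    kept = {}
    for gene, reads in reads_by_gene.items():
        p = keep_fraction(fpkm_by_gene[gene], target_fpkm)
        n_keep = round(p * len(reads))
        kept[gene] = rng.sample(reads, n_keep)
    return kept

# Toy input reusing genes a and b from the earlier example (hypothetical read IDs).
reads_by_gene = {
    "a": [f"a_read_{i}" for i in range(1000)],
    "b": [f"b_read_{i}" for i in range(500)],
}
fpkm_by_gene = {"a": 1000.0, "b": 500.0}

kept = subsample_by_gene(reads_by_gene, fpkm_by_gene, target_fpkm=100.0)
print({gene: len(reads) for gene, reads in kept.items()})
```

With a real BAM file, paired-end mates and multi-mapping reads would need to be kept or dropped together, which this toy version does not handle.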