If the library preps are the same, and the only difference is in the sequencing, then your cDNA and flowcell clusters should be the same. That is, the only difference is that Batch2 has longer reads, which you could trim back to 50bp to make the datasets equivalent.
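As a rough sketch of that trimming step, assuming plain four-line FASTQ records and single-end reads: the function name `trim_fastq` and the default length of 50 are my own choices for illustration, and an established read trimmer would do the same job.

```python
import io

def trim_fastq(in_handle, out_handle, length=50):
    """Truncate every read and its quality string to `length` bases.

    Assumes strict four-line FASTQ records (header, sequence, '+', quality).
    """
    while True:
        header = in_handle.readline()
        if not header:
            break  # end of file
        seq = in_handle.readline().rstrip("\n")
        plus = in_handle.readline()
        qual = in_handle.readline().rstrip("\n")
        out_handle.write(header)
        out_handle.write(seq[:length] + "\n")
        out_handle.write(plus)
        out_handle.write(qual[:length] + "\n")

# Small in-memory example: one 75bp read trimmed to 50bp.
example_in = io.StringIO("@read1\n" + "ACGTA" * 15 + "\n+\n" + "I" * 75 + "\n")
example_out = io.StringIO()
trim_fastq(example_in, example_out, length=50)
```

With real files you would pass open file handles instead of the `StringIO` objects. Reads that are already 50bp or shorter pass through unchanged.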
Longer reads map more accurately, so if you have the money, redoing Batch1 at 75bp would be beneficial (or do everything at 100bp, for that matter).
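Alternatively, your idea of just calculating FPKM for each batch should give equivalent results, apart from the shorter reads that get mismapped. In essence, the 50bp data has some biases that the 75bp data cures, so trimming the 75bp reads back does not make the data better; it only makes the two batches equally biased.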
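For reference, here is a minimal sketch of the usual FPKM definition (fragments per kilobase of transcript per million mapped fragments). The function and argument names are hypothetical, and real tools typically use an effective transcript length rather than the raw length. Note that read length does not appear in the formula, which is why the two batches should be comparable as long as the reads land in the right place.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)

# Example: 100 fragments on a 1 kb transcript in a library of 1 million
# mapped fragments.
example_value = fpkm(100, 1000, 1_000_000)
```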
If Batch1 and Batch2 cover the same conditions, then pooling the results should be fine. But if, for example, one is the control and the other the test, then you need the biases to be equivalent, and trimming the Batch2 reads is a good idea. Perhaps you should compute differential expression between the batches and see which genes come out (hopefully very few, and none after p-value adjustment).
This all hinges on the assumption that the library preps are equivalent. If one was size-selected for 100bp fragments and the other for 150bp fragments, then of course you will have selected different mRNAs.
You might want to try a tool called RSEM, which calculates things like FPKM by mapping the reads to a transcriptome reference in ungapped fashion. This could further reduce errors caused by the two read batches aligning differently. A tool like TopHat works on the genome, and will find more places to put a 50bp read than a 75bp read.