For a few days I have been trying to convert .bam to .mtx. Which on its own is not that hard. The real problem is reproducing the results. I am using a bamfile which already has a .mtx file, so what I found was this.
The bam file has nearly 245 million reads with a MAPQ value of 255 where the matrix.mtx only has a mere 30 million (sum of col 3, skip first 3 entries).
Also, GeneRanger does output the bamfile with their reads annotated, but not all of them. Of the 245 million reads, only 100 million actually have a gene in GN:Z: while the other 145 million reads do not. (I see similar cases from BAMs that I download from SRA's).
Furthermore, if I use all 100 million annotated reads and load this in scanpy, I only get about 2/3rd of the genes that the original matrix.mtx has when I load it.
When I annotated the other 145 million reads, the amount of total genes I find is pretty similar to what the matrix.mtx has. The problem is that the counts per cell do not match at all. I get cells with 50 thousand reads if I convert the entire bam file to .mtx. The amount of cells are pretty similar though.
Now the first thing I thought was, they must be removing duplicates. However, if I look at all NH:i:1 entries, it exceeds way more than 30 million reads and the cells per gene do totally not scale when loading it into scanpy, some genes even have less cells eventhough there are more reads. Also looking at their xf:i: it does not make sense. This bam file has xf:i:0, xf:i:17, xf:i:19 and xf:i:25. And when statting the amount of reads per xf:i entry, I don't get any entry close to 30 million reads. Furthermore, other bam files I loaded online do not even have xf:i:.
I also do not think this is a phred score issue, as it should have been filtered out before alignment.
What could this be? How can I identify the 30 million reads that CellRanger used to construct the matrix.mtx I must be missing something.
I was thinking of trying FeatureCounts, but there still remains a problem if you want to get data such as SNPs from the bam file, if you don't know which reads to use and which not.
Edit: I think my problem is that I am not correcting for UMIs.