CellRanger scRNA .mtx only uses 15% of reads in .bam?
2
1
Entering edit mode
14 months ago
Diedes ▴ 20

Hi,

For a few days I have been trying to convert .bam to .mtx. Which on its own is not that hard. The real problem is reproducing the results. I am using a bamfile which already has a .mtx file, so what I found was this.

The bam file has nearly 245 million reads with a MAPQ value of 255 where the matrix.mtx only has a mere 30 million (sum of col 3, skip first 3 entries).

Also, GeneRanger does output the bamfile with their reads annotated, but not all of them. Of the 245 million reads, only 100 million actually have a gene in GN:Z: while the other 145 million reads do not. (I see similar cases from BAMs that I download from SRA's).

Furthermore, if I use all 100 million annotated reads and load this in scanpy, I only get about 2/3rd of the genes that the original matrix.mtx has when I load it.

When I annotated the other 145 million reads, the amount of total genes I find is pretty similar to what the matrix.mtx has. The problem is that the counts per cell do not match at all. I get cells with 50 thousand reads if I convert the entire bam file to .mtx. The amount of cells are pretty similar though.

Now the first thing I thought was, they must be removing duplicates. However, if I look at all NH:i:1 entries, it exceeds way more than 30 million reads and the cells per gene do totally not scale when loading it into scanpy, some genes even have less cells eventhough there are more reads. Also looking at their xf:i: it does not make sense. This bam file has xf:i:0, xf:i:17, xf:i:19 and xf:i:25. And when statting the amount of reads per xf:i entry, I don't get any entry close to 30 million reads. Furthermore, other bam files I loaded online do not even have xf:i:.

I also do not think this is a phred score issue, as it should have been filtered out before alignment.

What could this be? How can I identify the 30 million reads that CellRanger used to construct the matrix.mtx I must be missing something.

I was thinking of trying FeatureCounts, but there still remains a problem if you want to get data such as SNPs from the bam file, if you don't know which reads to use and which not.

Thanks all!

Edit: I think my problem is that I am not correcting for UMIs.

scRNA .mtx .bam cellranger single cell • 952 views
1
Entering edit mode

There are whitelists of UMI indexes that 10x uses. Have you looked into that angle? Their tech support is also pretty responsive so I suggest that you also contact them. Please come back and post their explanation when you hear back.

0
Entering edit mode

Hey, wouldn't those UMIs be corrected in the BAM? Althought I did just find out that my problem is probably that I did not correct for UMIs, so gonna rerun my script and give an update later.

0
Entering edit mode

There are UMI white lists? In addition to cell barcode white lists?

0
Entering edit mode

I was thinking out aloud. Sorry. Wrong feature type.

1
Entering edit mode
14 months ago

It is almost certainly to do with UMIs: if two reads have the UMI and map to the same gene one will be removed. I think CellRanger also removes any read where the Phred score of the UMI or cell barcode is below a threshold, althuogh I can't remember what that threshold is.

The reason you have lots of genes without a gene annotated in the "GN:Z" tag is probably that they don't map to an annotated gene.

Furthermore, if I use all 100 million annotated reads and load this in scanpy, I only get about 2/3rd of the genes that the original matrix.mtx has when I load it.

This, I have to admit, is weird though.

If you want to try and reproduce something similar to CellRanger, try processing with UMI-tools count with the method set to unique.

0
Entering edit mode

Hi,

Thank you and all others for answering.

It turns out that those 1/3rd that I did not get in my recreated matrix, are actually empty entries in the original CellRanger matrix. Basically the genes are not removed from features.tsv and the .mtx file is not adjusted. I think this happens after the initial filter step. This creates a bigger matrix than necessary and uses more RAM when loaded in Python or R. If I ever write a script to kick out empty gene entries, I will post it on Biostars.

Yes, a big part of my problem was indeed that I counted all reads and not the UMIs, after doing that, I was able to get down to a similar number of reads, but still 20% higher for some reason. So now I am stuck at that point.

Actually, it puzzles me why some reads do not have GN:Z:, I think they are filtered out initially, at the same step where those 1/3rd of the genes are kicked out. As when I do annotate those genes myself, the missing 1/3rd of those genes are suddenly found. Also when I annotate those reads, I do find a lot the reads aligning to the genes in the other 2/3rd of the genes, so I really think they are filtered out for whatever reason. Just really a fraction, less than 1% do not align at all. (Probably they are those LOCs that you might find in UCSC genomebrowser, or totally unknown long non coding RNAs.)

This whole recreation of the matrix was a learning experience for me, as now I understand the scRNA data a lot better and started to understand what UMIs really are. I want to thank everyone for helping me and if someone has any clue why I am still getting 20% more reads please let me know.

Best

0
Entering edit mode

I also checked if the bam file contained many reads below a Q score of 30 and all reads I checked actually have higher than 30. (I believe 30 was the cutoff). Didnt check all due my code being slow.

samtools view -h file.bam | awk '{if (substr($1,1,1) != "@" &&$0 ~ /GN:Z:/){print}}' | awk '{g=0;ab=split(\$10,charz,"");for(i=1;i<ab;i++){for(n=0;n<256;n++)ord[sprintf("%c",n)]=n;g=g+ord[charz[i]]-33;} if (g/ab < 30){print}}'

0
Entering edit mode

I believe the read filtering is a bit more complex, this this from the STAR-solo instructions:

Annotations

CellRanger uses its own "filtered" version of annotations (GTF file) which is a subset of ENSEMBL annotations, with several gene biotypes removed (mostly small non-coding RNA). Annotations affect the counts, and to match CellRanger counts CellRanger annotations have to be used.

10X provides several versions of the CellRanger annotations: https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest For the best match, the annotations in CellRanger run and STARsolo run should be exactly the same.

So CellRanger uses its own custom annotations in which some genes are removed from the annotations. I'd also expect some reads to be DNA contamination and to map to genomic regions not in any gene/transcript.

UMI and barcode filtering

STARsolo has to match CellRanger 3 (I don't know if you are using 2 or 3) bu must use: To get the best agreement between STARsolo and CellRanger 3.x.x, add these parameters:

--soloUMIfiltering MultiGeneUMI --soloCBmatchWLtype 1MM_multi_pseudocounts

soloCBmatchWLtype is about cell barcodes, not what we are interested in here, but soloUMIfiltering MultiGeneUMI is about UMI filttering, and if you look what its means it says....

MultiGeneUMI ... remove lower-count UMIs that map to more than one gene (introduced in CellRanger 3.x.x)

By default CellRanger also does error aware UMI collapsing by combining all UMIs within an edit distance of each other (similar to UMI-tools cluster mode).

0
Entering edit mode

I'd also expect some reads to be DNA contamination and to map to genomic regions not in any gene/transcript.

Ah good point!

So what I have been doing, is to write a script that would create a .mtx file out of a bam file, as I do not have the RAM capacity to go back to fastq and run celranger and I would like to use bam files from SRAs. I found out too late about featurecounts, but then I would not have learned this much about scRNA as I did now, so kind of happy I did not use featurecounts haha.

remove lower-count UMIs that map to more than one gene

Ahh, I did see a few thousand UMIs that map to multiple genes. It is pretty weird in my opinion that this can happen in non-cancer data where fusions are not that likely. If I understand it correct, the UMIs is attached before amplification and since different genes should have different transcripts, in theory one UMI should not be linked to different genes, though I do see this is happening in very few cases. Maybe the UMI (as barcode) is introduced twice by accident in a droplet and therefore 2 transcripts have the same UMI?

Anyway, I am wondering if filtering multi gene UMIs out is not creating a bias? If I understand it correctly, if Gene A and Gene B share the same UMI, one of the genes, for example Gene A would lose an UMI count where Gene B would keep its count for that UMI? What do they actually mean with lower-count UMIs, the one that is covered by just a few reads? So if a UMI has high read support, both genes would be kept?

In the end after filtering for the normal stuff like MT fraction and Umi cut off per cell, I was able to reduce the read discrepancy to just about 2%, which should be pretty representable.

Thank you a lot for the interesting exchange for information, very helpful!

0
Entering edit mode
14 months ago

Do the number of primary alignments match the alignments 10X reports? If I remember right, that's how Cellranger does it; it makes one read with one UMI sequence on one gene primary, and all the rest of the reads sharing that cell barcode, UMI and gene ID are secondary.