4.3 years ago
banbanana
Hello,
I am analyzing a single-cell sequencing dataset from the 10x Genomics website, with 2000 cells. It is a BAM file, and I am trying to identify the individual cells in each sample. I used the command
samtools view mtdnaAsorted.bam | grep CB:Z: | sed 's/.*CB:Z:\([ACGT]*\).*/\1/' | sort | uniq -c > reads_per_barcode_mtdnaA
to get a list of the individual barcodes and their read counts, but there are 100,000 unique barcodes even though there are only 2000 cells. Does anyone know what the problem could be?
I know that there could be noise, but this is a problem for me because I'm trying to analyze coverage per cell, and if that many barcodes contribute to the total number of reads, the net coverage per cell is going to be really low.
Hello banbanana!
It appears that your post has been cross-posted to another site: https://bioinformatics.stackexchange.com/questions/11087/single-cell-sequencing-dataset-has-too-many-barcodes
This is typically not recommended as it runs the risk of annoying people in both communities.
It's unclear exactly what you are trying to achieve. Are you just trying to extract the number of cells? Or count the number of reads per cell? Or extract the reads associated with each cell?
There are two types of noise in CBs:
Sequencing/PCR errors, where the barcode is attached to a real cell but its sequence has changed. These can often be removed by filtering against a list of known correct barcodes. The CB tag from a cellranger-processed BAM file is supposedly already filtered against this list, with CBs not on the list either removed or corrected to one that is.
Barcodes associated with ambient RNA or RNA from non-viable cells. Here the barcode is "real" but the RNA it is associated with is not. Filtering out these CBs is harder, but most scRNA-seq analysis pipelines include a step that attempts to do this, using anything from simply taking the top X most frequent barcodes to complex machine learning algorithms.
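As a rough sketch of the two simplest filters mentioned above, the functions below work on the `uniq -c` output from the question (one line per barcode: read count, then barcode). The function names and the barcode list file are hypothetical. The suffix stripping assumes the list carries a trailing "-1" style suffix, as cellranger barcode files usually do, while the sed pattern in the question keeps only the ACGT part. Taking the top N barcodes by read count is only a crude stand-in for the cell-calling step of a real pipeline.

```shell
#!/bin/sh
# Both functions read `uniq -c` style input: "<read count> <barcode>" per line.

# Keep only the lines whose barcode appears in a list file (one barcode per line).
# Assumption: any trailing "-<digits>" suffix in the list is stripped so that
# it matches the ACGT-only barcodes produced by the sed pattern in the question.
keep_listed_barcodes() {
    list_file=$1
    counts_file=$2
    awk 'NR == FNR { sub(/-[0-9]+$/, "", $1); keep[$1] = 1; next }
         ($2 in keep)' "$list_file" "$counts_file"
}

# Print the N barcodes with the most reads, highest count first.
top_barcodes() {
    counts_file=$1
    n=$2
    sort -k1,1nr "$counts_file" | head -n "$n" | awk '{ print $2 }'
}
```

For the dataset in the question, something like `top_barcodes reads_per_barcode_mtdnaA 2000 > top2000_barcodes.txt` would give a first approximation of the cell barcodes. If a filtered barcode list from the original cellranger run is available (decompressed first if it is gzipped), `keep_listed_barcodes` is likely the better starting point.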