I am analyzing a single-cell sequencing dataset from the 10x Genomics website, with 2000 cells. It is a BAM file, and I am trying to identify the individual cells in the sample. I used the command
samtools view mtdnaAsorted.bam | grep CB:Z: | sed 's/.*CB:Z:\([ACGT]*\).*/\1/' | sort | uniq -c > reads_per_barcode_mtdnaA
to get a list of the individual barcodes and their read counts, but there are 100,000 unique barcodes even though there are only 2000 cells. Does anyone know what could be the problem?
I know that some of this could be noise, but it is a problem for me because I'm trying to analyze coverage per cell: if the reads are spread across that many barcodes, the coverage per cell is going to be really low.
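
To show what I mean by the read distribution, here is a minimal sketch of how the counts file could be inspected. It assumes the two-column `uniq -c` output from the command above (read count, then barcode). The function names `count_barcodes_above` and `rank_barcodes` are made up for illustration, and any cutoff passed in is an arbitrary placeholder, not a recommended threshold.

```shell
# Hypothetical helper: count how many barcodes have at least a given number of reads.
# $1 = counts file in "uniq -c" format (count, then barcode)
# $2 = minimum reads per barcode (arbitrary cutoff, chosen by the user)
count_barcodes_above() {
    awk -v min="$2" '$1 >= min { n++ } END { print n + 0 }' "$1"
}

# Hypothetical helper: list barcodes from most to fewest reads,
# so the shape of the distribution can be inspected.
# $1 = counts file in "uniq -c" format
rank_barcodes() {
    sort -k1,1nr "$1"
}
```

For example, `count_barcodes_above reads_per_barcode_mtdnaA 1000` would print the number of barcodes with 1000 or more reads, and `rank_barcodes reads_per_barcode_mtdnaA | head -n 20` would show the 20 barcodes with the most reads.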