I have single-cell RNA-seq reads (from 10x Chromium) that have already been pre-processed. The cell barcode and UMI were moved from the read sequence into the read header (via umi-tools), and low-quality reads were removed. Next, I mapped the reads (with STAR) and isolated the reads from a gene of interest (via samtools). Ultimately, I want to genotype each cell for a specific gene, using the cell-barcode and UMI pair, and call variants.
How do I split the BAM file into separate BAM files based on the cell-barcode and UMI pair? In other words, I want one BAM file per cell-barcode and UMI pair, containing the aligned reads that share that pair.
Thank you!
You really want your data split into hundreds of thousands of files?
There are only about 50 cell barcodes in my gene/region of interest. So it would be about 50-70 files.
I should clarify that I used samtools to grab only the portion of the BAM file with alignments to one gene.
Okay, so 50 cell barcodes times 20 UMIs per sample? That is a thousand files; are you sure this is helpful? The 10x Genomics software will tag every read with cell barcode and gene, so why can't you make use of that?
Sorry for the late reply! I am using the 10x cell barcode tags; they are now placed in the read headers. I cut the tags out of the read sequence because I do not want misalignments caused by the cell barcode and UMI bases.
Are you sure that your UMIs are in the read, and not in read 2? Why isn't the software that 10x Genomics provides appropriate for what you are doing?
Did you find a solution for this? I have also generated R2.fastq files tagged with the cell barcode and UMI (using umi-tools) and mapped the reads (with STAR). The resulting BAM files contain the cell barcode in the alignments, and I would like to split the alignments by cell to perform variant calling. Thanks!
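One possible approach, offered as a minimal sketch rather than a tested pipeline: stream the alignments as SAM text and group them by the tags in the read name. It assumes the read names end in `_<cell barcode>_<UMI>`, which is the form umi-tools extract normally produces; if your headers look different, `parse_tags` needs adjusting. The function names (`parse_tags`, `group_sam_records`, `write_groups`) and the `split` output prefix are made up for illustration. It uses only the standard library and works on SAM text, not on BAM directly.

```python
from collections import defaultdict


def parse_tags(read_name):
    """Return (cell_barcode, umi) from a name like 'READ_CELLBARCODE_UMI'.

    Assumption: the last two underscore-separated fields are the cell
    barcode and the UMI. A name with fewer fields raises ValueError.
    """
    _base, cell_barcode, umi = read_name.rsplit("_", 2)
    return cell_barcode, umi


def group_sam_records(lines, by_umi=False):
    """Split SAM text lines into header lines and per-key alignment lines.

    Keys are (cell_barcode,) by default, or (cell_barcode, umi) if by_umi.
    """
    header = []
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            # SAM header lines start with '@'; keep them for every output.
            header.append(line)
            continue
        read_name = line.split("\t", 1)[0]
        cell_barcode, umi = parse_tags(read_name)
        key = (cell_barcode, umi) if by_umi else (cell_barcode,)
        groups[key].append(line)
    return header, dict(groups)


def write_groups(header, groups, prefix="split"):
    """Write one SAM file per group and return the list of file names."""
    paths = []
    for key, records in sorted(groups.items()):
        path = f"{prefix}.{'_'.join(key)}.sam"
        with open(path, "w") as handle:
            handle.write("\n".join(header + records) + "\n")
        paths.append(path)
    return paths
```

To use it, pipe `samtools view -h` output for the gene-level BAM into a small script that calls `group_sam_records(sys.stdin, by_umi=True)` and then `write_groups(...)`; each resulting SAM file can be converted back with `samtools view -b`. `len(groups)` tells you how many files you would get before writing anything, which is worth checking first. Everything is held in memory, so this only makes sense for a small region like a single gene.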