Entering edit mode
9 months ago
zbidav
▴
30
Dear Biostar community,
I hope the question is not duplicated. I search for a way to collapse molecules from the same UMI (UB tag) together. I tried to use the fgbio tool, but.... sadly, I received reads with a constant length - in my case, 101bp, identical to the original sequencing length.
Is there a tool that can combine aligned bam reads by UMI tag?
I attached my code just in case (I extracted manually only one cell from my sample, so the cell barcode is identical to all the reads):
in_bam=CCACACTTCAAGCGTT-1.bam
grouped_bam=CCACACTTCAAGCGTT-1_grouped.bam
out_bam=CCACACTTCAAGCGTT-1_out.bam
fgbio GroupReadsByUmi -i $in_bam -o $grouped_bam --raw-tag "UR" --strategy "edit" --min-map-q 255 --edits 1
fgbio CallMolecularConsensusReads -i $grouped_bam -o $out_bam --min-reads 4 --read-name-prefix "Consensus"
Thank you kindly in advance!
Can you clarify why receiving reads of 101bp is a problem? If reads are PCR duplicates of each other, then they will all have length 101bp, and when they are collapsed to a consensus, that will leave a read of 101bp.
Definitely, I apologize for the delay. In my case it reads from 10x protocol. I observed few reads that mapped adjacently, but not fully overlap. I assumed it due to the random cleavage of the original RNA molecule in the 10x protocol - that may return reads that mapped adjacently, but might be a bit different.
Yes, the random cleavage in the 10x protocol means that reads that PCR duplicates of each other won't necessarily map to the same location as each other (in fact, generally don't). In 10x we need to look for reads with the same UMI mapping to the same gene, not mapping to the same location.
But I'm still not sure what you are trying to do - there are several UMI deduplication tools, but all return reads of the same length as the input sequencing. UMI-tools (which in this case you'd run with --per-gene) will select one read to represent all PCR duplicates, and fgbio will take reads with identical mapping coordinates (thus not suitable for 10x data), and from, say 100 101bp reads return a 101bp read where each base is most common amongst the 100 input reads.