7 months ago
ww22runner
Hello everyone.
- I am currently running a UMI-based NGS workflow and notice that BAM size drops almost 18-fold after consensus calling.
- Moreover, downstream tools such as Pindel report almost 50 times fewer variants than they do when the BAMs go through a non-UMI pipeline.
- When the BAMs are viewed in IGV, we see that multiple unique reads share the same UMI barcode and that, after the consensus call, only 1 read is chosen to represent all the reads with that barcode. I wonder whether this has led to loss of data from the BAM.
Does anyone have experience with this issue?
Thanks.
If your library has been overamplified and/or oversequenced then this seems logical.
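As a rough sanity check on that point, the fold reduction can be read as an average number of raw reads per consensus read. The sketch below is only an illustration: the function name and the read counts are hypothetical, and it assumes read counts (for example from `samtools view -c`) rather than BAM file sizes, which also depend on compression and tags.

```python
def mean_reads_per_consensus(raw_reads: int, consensus_reads: int) -> float:
    """Average number of input reads collapsed into each consensus read."""
    if consensus_reads <= 0:
        raise ValueError("consensus_reads must be positive")
    return raw_reads / consensus_reads


# Hypothetical counts chosen to mirror an 18-fold reduction.
raw = 18_000_000
consensus = 1_000_000
fold = mean_reads_per_consensus(raw, consensus)
print(f"about {fold:.1f} raw reads per consensus read")
```

A value this high would be consistent with a heavily amplified or deeply sequenced library, although reads dropped by consensus filters would inflate it too.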
What do you mean by unique? Do you mean they have different mapping coordinates, or different sequences? What sort of pipeline is being used? How was deduplication performed?
By unique, I meant that they have different mapping coordinates (at least 10 kb apart). In the pipeline, I remove adapter sequences using cutadapt, convert the FASTQs to unaligned BAMs, extract UMIs from these BAMs into RX tags, convert the BAMs back to FASTQs and align them using bwa mem, group reads by UMI and call consensus reads, then convert the BAMs back to FASTQs and re-align using bwa mem.
What tool are you using to group reads by UMI and call consensus reads? Reads 10 kb apart shouldn't be collapsed together by any tool that uses mapped reads, I don't think.
Is this DNA-seq or RNA-seq?
I am using fgbio's GroupReadsByUMI to group reads by UMI, and fgbio's CallDuplexConsensusReads to call consensus reads. This is for DNA sequencing. I think reads that are so far apart are being collapsed together because they share the same UMI barcode in their RX tag.
As far as I am aware, fgbio should only collapse reads that have both the same UMI and the same mapping coordinates. I'm pretty sure it shouldn't collapse reads with the same UMI that map 10 kb away from each other.
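To make the distinction concrete, here is a minimal Python sketch of the two grouping behaviours being discussed. It is a simplified model and not fgbio's actual implementation: the `group_reads` function, the read tuples, and the coordinates are all hypothetical, and real tools also consider mate position and allow UMI mismatches.

```python
from collections import defaultdict


def group_reads(reads, use_position=True):
    """Group read names by (chrom, pos, strand, UMI), or by UMI alone."""
    groups = defaultdict(list)
    for name, chrom, pos, strand, umi in reads:
        key = (chrom, pos, strand, umi) if use_position else (umi,)
        groups[key].append(name)
    return dict(groups)


# Hypothetical reads: r1 and r2 are true duplicates; r3 carries the same
# UMI but maps 10 kb away, so it comes from a different molecule.
reads = [
    ("r1", "chr1", 100_000, "+", "ACGTAC"),
    ("r2", "chr1", 100_000, "+", "ACGTAC"),
    ("r3", "chr1", 110_000, "+", "ACGTAC"),
]

by_pos_and_umi = group_reads(reads)
by_umi_only = group_reads(reads, use_position=False)

print(len(by_pos_and_umi))  # 2 groups: r3 stays separate
print(len(by_umi_only))     # 1 group: r3 is wrongly merged with r1 and r2
```

If the real output looks like the UMI-only case, it may be worth checking whether the BAM passed to the grouping step actually contains the alignments, since position-aware grouping cannot work on unmapped reads.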