Question

Multiple unique reads sharing same UMI barcode

8 months ago

ww22runner ▴ 60

Hello everyone.

I am currently running a UMI-based NGS workflow and have noticed that the BAM files shrink almost 18-fold after consensus calling.
Moreover, downstream tools such as Pindel report almost 50 times fewer variants than they do on BAMs produced by a non-UMI pipeline.
When the BAMs are inspected in IGV, we see that multiple unique reads share the same UMI barcode and that, after consensus calling, only one read is kept to represent all the reads with that barcode. I wonder whether this has caused data to be lost from the BAM.

Does anyone have experience with this issue?

Thanks.

umi ngs • 877 views

ADD COMMENT • link updated 8 months ago by i.sudbery 20k • written 8 months ago by ww22runner ▴ 60

If your library has been overamplified and/or oversequenced then this seems logical.
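To make the arithmetic behind this concrete, here is a minimal Python sketch. It assumes that every read has already been assigned to a family (one family per source molecule) and simply compares raw read count with family count. The function name `fold_reduction` and the toy labels are hypothetical, and BAM file size only roughly tracks read count.

```python
from collections import Counter

def fold_reduction(family_ids):
    """Return (raw_reads, families, raw_reads / families).

    family_ids holds one label per read; reads with the same label are
    assumed to derive from the same source molecule.
    """
    sizes = Counter(family_ids)      # family label -> number of reads
    raw = sum(sizes.values())        # reads before consensus calling
    families = len(sizes)            # reads after (one consensus per family)
    return raw, families, raw / families

# Hypothetical toy data: 3 source molecules sequenced 20, 18 and 16 times.
labels = ["mol_a"] * 20 + ["mol_b"] * 18 + ["mol_c"] * 16
raw, families, fold = fold_reduction(labels)
print(raw, families, fold)
```

If the average family holds about 18 reads, an 18-fold drop in read count after consensus calling is what you would expect, not a sign of lost molecules.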

ADD REPLY • link 8 months ago by GenoMax 147k

What do you mean by unique? Do you mean they have different mapping coordinates, or different sequences? What sort of pipeline is being used? How was deduplication performed?

ADD REPLY • link 8 months ago by i.sudbery 20k

By unique, I meant that they have different mapping coordinates (at least 10 kb apart). In the pipeline, I remove adapter sequences using cutadapt, convert the FASTQs to unaligned BAMs, extract the UMIs from these BAMs into RX tags, convert the BAMs back to FASTQs and align them using bwa mem, group reads by UMI and call consensus reads, then convert the consensus BAMs back to FASTQs and re-align using bwa mem.

ADD REPLY • link 8 months ago by ww22runner ▴ 60

What tool are you using to group reads by UMI and call consensus reads? I don't think reads 10 kb apart should be collapsed together by any tool that works on mapped reads.

Is this DNA-seq or RNA-seq?

ADD REPLY • link 8 months ago by i.sudbery 20k

I am using fgbio's GroupReadsByUMI to group reads by UMI, and fgbio's CallDuplexConsensusReads to call consensus reads. This is DNA sequencing. I think reads that are so far apart are being collapsed together because they share the same UMI barcode in their RX tag.
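One way to test this suspicion is to look at the grouped BAM rather than the raw UMIs. The Python sketch below assumes you have already extracted one `(group_id, chrom, pos)` tuple per read from the grouped output (as far as I know, fgbio's grouping step writes the group into the MI tag, but treat that as an assumption and check your own files). The function name, the 10 kb threshold, and the toy records are all hypothetical.

```python
from collections import defaultdict

def groups_spanning_distance(records, min_distance=10_000):
    """Return sorted group IDs whose reads map at least min_distance apart
    on one chromosome, or to more than one chromosome."""
    positions = defaultdict(lambda: defaultdict(list))
    for group_id, chrom, pos in records:
        positions[group_id][chrom].append(pos)

    flagged = []
    for group_id, by_chrom in positions.items():
        multi_chrom = len(by_chrom) > 1
        wide = any(max(p) - min(p) >= min_distance for p in by_chrom.values())
        if multi_chrom or wide:
            flagged.append(group_id)
    return sorted(flagged)

# Hypothetical toy records: group "2" contains reads 15 kb apart.
records = [
    ("1", "chr1", 100000),
    ("1", "chr1", 100002),
    ("2", "chr1", 100000),
    ("2", "chr1", 115000),
]
print(groups_spanning_distance(records))
```

An empty result would suggest that distant reads are not being placed in the same group, even if they carry the same UMI sequence.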

ADD REPLY • link 8 months ago by ww22runner ▴ 60

As far as I am aware, fgbio should only collapse reads that have both the same UMI and the same mapping coordinates. I'm pretty sure it shouldn't collapse reads with the same UMI that map 10 kb away from each other.
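The idea can be illustrated with a toy Python sketch. This is not fgbio's implementation (real tools also take mate positions into account and can tolerate mismatches in the UMI); it only shows what grouping on position plus UMI means. The function name `group_reads` and the example reads are hypothetical.

```python
from collections import defaultdict

def group_reads(reads):
    """Group read names by (chrom, pos, strand, umi).

    Each read is a tuple (name, chrom, pos, strand, umi). Two reads land in
    the same group only if the position AND the UMI both match.
    """
    groups = defaultdict(list)
    for name, chrom, pos, strand, umi in reads:
        groups[(chrom, pos, strand, umi)].append(name)
    return dict(groups)

# Hypothetical reads: r1 and r2 are duplicates; r3 shares the UMI
# but maps 10.5 kb away.
reads = [
    ("r1", "chr1", 100000, "+", "ACGTAC"),
    ("r2", "chr1", 100000, "+", "ACGTAC"),
    ("r3", "chr1", 110500, "+", "ACGTAC"),
]
groups = group_reads(reads)
for key, names in groups.items():
    print(key, names)
```

Under this scheme r3 forms its own group despite sharing the UMI, so seeing the same barcode at distant loci in IGV is not by itself evidence that those reads were merged into one consensus.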

ADD REPLY • link 8 months ago by i.sudbery 20k