Question

fgbio: Calling consensus reads using UMI

0

Entering edit mode

7.6 years ago

Nicolas Rosewick 11k

Hi,

I'm struggling a little bit with some data containing UMIs. It's enriched DNA-seq on specific genomic regions. To remove PCR duplicates and potential sequencing errors I would like to merge the reads with the same UMI and the same alignment position.

I'm following a workflow proposed by Nils Homer on his blog : http://nilshomer.com/2017/07/05/single-strand-umi-somatic-variant-calling/ The workflow used fgbio toolset.

I used fgbio DemuxFastqs on my raw R1 and R2 files to produce an unmapped bam file containing for each read the associated UMI (encoded in the RX tag). Example for a pair of reads :

R1

NS500186:291:HTYKYAFXX:1:11101:10000:18413  77  *   0   0   *   *   0   0   GCGGATTTTGAAAAGGCAGCAAAGAGCTTCATGCAAGAGACTTTAAAATTGGGAAAGTTACTTCGGCCAAAACG  WbbbbPffffbWfffbfbfffffffffbbfffffPffffPbP]fPffffffffffWffffffPffffffffPff  BC:Z:CGCTCTTCCGATTT RG:Z:unmatched  RX:Z:GGAGACGGCGGT

R2

NS500186:291:HTYKYAFXX:1:11101:10000:18413  141 *   0   0   *   *   0   0   CTTCTTTTCTCTACATCAAAGCAACTTCCATTGTAACTAGGTTGGTTATA  ffbffbfffbWffPfPfffPfffPfffbff]ffbPffbfPffbfbf]Pff  BC:Z:CGCTCTTCCGATTT RG:Z:unmatched  RX:Z:GGAGACGGCGGT

Ok great I have now a bam file containing the UMI for each par of read. Now I would like to perform his point 1) suggestion :

Map the raw reads (with bwa-mem), duplicate mark them (optional, with Picard’s MarkDuplicates) , and group them by UMI (fgbio’s GroupReadsByUmi).

But how to go from an unmapped bam file (containing the RX tag with the UMI information) to a mapped bam file (also containing the RX tag) in order to apply fgbio GroupReadsByUmi ? bwa mem do not take bam fil in input. If I use SamToFastq from Picard I will lose the RX tag no ?

Thanks

fgbio consensus • 10k views

ADD COMMENT • link updated 7.5 years ago by nilshomer ▴ 70 • written 7.6 years ago by Nicolas Rosewick 11k

score 3 · Answer 1 · 2017-12-19

The answer is in the source code of the SingleStrandUmiSomaticVariantCalling pipeline, which references the following task in dagr, which was linked in the original blog post.

The main idea is to create the unmapped BAM, feed the unmapped BAM into SamToFastq, stream the output of SamToFastq into bwa, then stream the output of bwa into MergeBamAlignment, which also takes the unmapped BAM as a named input, to output a mapped BAM. You can coordinate sort the final BAM by setting the SO=coordinate option in MergeBamAlignment. The key thing is that SamToFastq produces the FASTQ that bwa needs, and then MergeBamAlignment will merge the output of bwa back into the unmapped BAM so you don't lose any read attributes like the RX tag.

Also, when your run MarkDuplicates, don't forget to use the BARCODE_TAG=RX option.