Question: fgbio: Calling consensus reads using UMI
gravatar for Nicolas Rosewick
16 months ago by
Belgium, Brussels
Nicolas Rosewick7.5k wrote:


I'm struggling a little bit with some data containing UMIs. It's enriched DNA-seq on specific genomic regions. To remove PCR duplicates and potential sequencing errors I would like to merge the reads with the same UMI and the same alignment position.

I'm following a workflow proposed by Nils Homer on his blog : The workflow used fgbio toolset.

I used fgbio DemuxFastqs on my raw R1 and R2 files to produce an unmapped bam file containing for each read the associated UMI (encoded in the RX tag). Example for a pair of reads :


NS500186:291:HTYKYAFXX:1:11101:10000:18413  77  *   0   0   *   *   0   0   GCGGATTTTGAAAAGGCAGCAAAGAGCTTCATGCAAGAGACTTTAAAATTGGGAAAGTTACTTCGGCCAAAACG  WbbbbPffffbWfffbfbfffffffffbbfffffPffffPbP]fPffffffffffWffffffPffffffffPff  BC:Z:CGCTCTTCCGATTT RG:Z:unmatched  RX:Z:GGAGACGGCGGT


NS500186:291:HTYKYAFXX:1:11101:10000:18413  141 *   0   0   *   *   0   0   CTTCTTTTCTCTACATCAAAGCAACTTCCATTGTAACTAGGTTGGTTATA  ffbffbfffbWffPfPfffPfffPfffbff]ffbPffbfPffbfbf]Pff  BC:Z:CGCTCTTCCGATTT RG:Z:unmatched  RX:Z:GGAGACGGCGGT

Ok great I have now a bam file containing the UMI for each par of read. Now I would like to perform his point 1) suggestion :

Map the raw reads (with bwa-mem), duplicate mark them (optional, with Picard’s MarkDuplicates) , and group them by UMI (fgbio’s GroupReadsByUmi).

But how to go from an unmapped bam file (containing the RX tag with the UMI information) to a mapped bam file (also containing the RX tag) in order to apply fgbio GroupReadsByUmi ? bwa mem do not take bam fil in input. If I use SamToFastq from Picard I will lose the RX tag no ?


fgbio consensus • 2.1k views
ADD COMMENTlink modified 16 months ago by nilshomer60 • written 16 months ago by Nicolas Rosewick7.5k
gravatar for nilshomer
16 months ago by
nilshomer60 wrote:

The answer is in the source code of the SingleStrandUmiSomaticVariantCalling pipeline, which references the following task in dagr, which was linked in the original blog post.

The main idea is to create the unmapped BAM, feed the unmapped BAM into SamToFastq, stream the output of SamToFastq into bwa, then stream the output of bwa into MergeBamAlignment, which also takes the unmapped BAM as a named input, to output a mapped BAM. You can coordinate sort the final BAM by setting the SO=coordinate option in MergeBamAlignment. The key thing is that SamToFastq produces the FASTQ that bwa needs, and then MergeBamAlignment will merge the output of bwa back into the unmapped BAM so you don't lose any read attributes like the RX tag.

Also, when your run MarkDuplicates, don't forget to use the BARCODE_TAG=RX option.

ADD COMMENTlink modified 16 months ago • written 16 months ago by nilshomer60

thanks I'll give a try !

ADD REPLYlink written 16 months ago by Nicolas Rosewick7.5k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1587 users visited in the last hour