I'm struggling a little bit with some data containing UMIs. It's enriched DNA-seq on specific genomic regions. To remove PCR duplicates and potential sequencing errors I would like to merge the reads with the same UMI and the same alignment position.
I'm following a workflow proposed by Nils Homer on his blog : http://nilshomer.com/2017/07/05/single-strand-umi-somatic-variant-calling/ The workflow used fgbio toolset.
I used fgbio DemuxFastqs on my raw R1 and R2 files to produce an unmapped bam file containing for each read the associated UMI (encoded in the RX tag). Example for a pair of reads :
NS500186:291:HTYKYAFXX:1:11101:10000:18413 77 * 0 0 * * 0 0 GCGGATTTTGAAAAGGCAGCAAAGAGCTTCATGCAAGAGACTTTAAAATTGGGAAAGTTACTTCGGCCAAAACG WbbbbPffffbWfffbfbfffffffffbbfffffPffffPbP]fPffffffffffWffffffPffffffffPff BC:Z:CGCTCTTCCGATTT RG:Z:unmatched RX:Z:GGAGACGGCGGT
NS500186:291:HTYKYAFXX:1:11101:10000:18413 141 * 0 0 * * 0 0 CTTCTTTTCTCTACATCAAAGCAACTTCCATTGTAACTAGGTTGGTTATA ffbffbfffbWffPfPfffPfffPfffbff]ffbPffbfPffbfbf]Pff BC:Z:CGCTCTTCCGATTT RG:Z:unmatched RX:Z:GGAGACGGCGGT
Ok great I have now a bam file containing the UMI for each par of read. Now I would like to perform his point 1) suggestion :
Map the raw reads (with bwa-mem), duplicate mark them (optional, with Picard’s MarkDuplicates) , and group them by UMI (fgbio’s GroupReadsByUmi).
But how to go from an unmapped bam file (containing the RX tag with the UMI information) to a mapped bam file (also containing the RX tag) in order to apply fgbio GroupReadsByUmi ? bwa mem do not take bam fil in input. If I use SamToFastq from Picard I will lose the RX tag no ?