Question

Realigning BAM files to new reference

1

Entering edit mode

13 months ago

Christian ▴ 30

Hi,

I am looking to create a panel of normals for a somatic variant caller. For normals, I have been provided with a set of WES bam files that have been preprocessed according to GATK best practices. However, they have been aligned to another reference genome than my case samples - and I cannot find out which.

My thought process in the code below was as follows: i) sort bam file; ii) convert back to fastq; iii) realign to a new reference (the same as my case samples); iv) subset to the intervals of my case samples (panels).

Here is the code:

#!/bin/bash
###STANDARD_SLURM_HEADER

module load samtools
module load bedtools
module load bwa/0.7.17

#############################
### PATHS AND VARIABLES #####
#############################

reference="/SOME_PATH/human_ref/hg19.fa"
regions="/SOME_PATH/gene_panel_intervals.chr.bed.gz"
bam_loc="/SOME_PATH/control_bams"
bam_file=$(ls ${bam_loc}/*.recalibrated.bam | sed -n ${SLURM_ARRAY_TASK_ID}p)
sample=$(echo ${bam_file} | sed 's@.*/@@' | sed "s/\..*//" )

#############################
### SORT BY READ NAME #######
#############################

samtools sort -n -l 1 -@ 4 -o ${bam_loc}/${sample}_s.bam ${bam_file}

#############################
### CONVERT BACK TO FASTQ ###
#############################

bedtools bamtofastq -i ${bam_loc}/${sample}_s.bam \
                    -fq ${bam_loc}/${sample}_e1.fq \
                    -fq2 ${bam_loc}/${sample}_e2.fq

#############################
### REALIGN TO REFERENCE ####
#############################

bwa mem \
        -t 4 \
        -M \
        ${reference} \
        ${bam_loc}/${sample}_e1.fq \
        ${bam_loc}/${sample}_e2.fq \
        | \
        samtools sort -@ 4 -o ${bam_loc}/${sample}_a.bam -

#############################
### SUBSET BAM TO BED FILE ##
#############################

samtools view -L ${regions} -b ${bam_loc}/${sample}_a.bam > ${bam_loc}/${sample}_pon.bam

#############################
### FINAL CLEANUP ###########
#############################
rm ${bam_loc}/${sample}_s.bam
rm ${bam_loc}/${sample}_a.bam
rm ${bam_loc}/${sample}_e1.fq
rm ${bam_loc}/${sample}_e2.fq

This runs without an error. However, I then proceeded to validate my files as such:

# validate bam file
java -jar ${PICARD} ValidateSamFile \
      I=122722_pon.bam \
      MODE=SUMMARY

## HISTOGRAM    java.lang.String
Error Type  Count
ERROR:INVALID_VERSION_NUMBER    1
ERROR:MATE_NOT_FOUND    261784
ERROR:MISSING_READ_GROUP    1
WARNING:RECORD_MISSING_READ_GROUP   2653476

These represent two different errors: missing read groups, and "MATE_NOT_FOUND". I am unsure on how to proceed from here.

i) The read group error should be relatively straightforward, by using the header from the fastq files to build the RG line, right? Like so:

R1_file=$(ls ${fastq_loc}/*R1* | sed -n ${SLURM_ARRAY_TASK_ID}p)
R2_file=$(echo ${R1_file} | sed s/R1/R2/)
sample=$(echo ${R1_file} | sed s/_S.*// | sed s/.*A00.*_// )
lib=$(echo $sample"_lib")
rgid=$(gzip ${R1_file} -cd | head -1 | sed 's/\s.*$//')
rg=$(echo "@RG\tID:$rgid\tSM:$sample\tLB:$lib\tPL:ILLUMINA")

ii) The "MATE_NOT_FOUND" error is stumping me. From what I can see online, this happens if I run bwa mem with paired reads, but either the file does not contain paired reads, or they are not recognized as such. My PI is telling me that the WES BAMs were created with paired-end reads. How can I check this assumption?

Any help or guidance would be greatly appreciated. Thanks.

Christian

bam samtools bedtools bwa • 1.4k views

ADD COMMENT • link 13 months ago by Christian ▴ 30

1

Entering edit mode

13 months ago

Pierre Lindenbaum 161k

i) The read group error should be relatively straightforward, by using the header from the fastq files to build the RG line,

use the output of

samtools samples *.recalibrated.bam

ii) The "MATE_NOT_FOUND" error is stumping me.

when doing this:

samtools view -L ${regions} -b ${bam_loc}/${sample}_a.bam

there 's a great chance that a read falls within a BED region but not it's mate.

#!/bin/bash
###STANDARD_SLURM_HEADER

module load samtools
module load bedtools
module load bwa/0.7.17

you'd better use a workflow manager like nextflow or snakemake instead of this bash script.

ADD COMMENT • link 13 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thank you, much appreciated - I will get around to trying this on Monday.

use the output of

samtools samples *.recalibrated.bam

I was under the impression that read group fields should be filled according to GATK requirements (https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups), while samtools samples outputs other information (i.e. sample name, path to alignment file, and path to reference genome). This would not include information that is otherwise contained in read groups (i.e. flowcell barcode, lane, library, platform). Is that correct?

when doing this:

samtools view -L ${regions} -b ${bam_loc}/${sample}_a.bam there 's a great chance that a read falls within a BED region but not it's mate.

That's a great point. On second thought, I may not need to do this step to use the bam file as normals for my tumor samples. No real advantage (other than file size).

ADD REPLY • link 13 months ago by Christian ▴ 30

score 4 · Accepted Answer · 2023-03-11

4

Entering edit mode

13 months ago

cmdcolin ★ 3.8k

here is a post on this from heng li https://lh3.github.io/2021/07/06/remapping-an-aligned-bam

it shows you can do this with less steps, and streaming

paste from blog

samtools collate -Oun128 in.bam | samtools fastq -OT RG,BC - \
  | bwa mem -pt8 -CH <(samtools view -H in.bam|grep ^@RG) ref.fa - \
  | samtools sort -@4 -m4g -o out.bam -