Dear all,
I am trying to filter out reads by re-aligning a BAM file. The filtered reads then need to be analyzed with things like GATK or Picard, hence I think I need a read group. The same sample was sequenced in two different runs, so it is split between file1 and file2, paired. Since I understand that these files cannot be combined before alignment, I am running them in parallel. I am selecting the reads that mapped to a first reference genome, then I want to select the reads that map to a second reference genome.
This is the procedure I followed:
# align
bwa mem -R <read_group1> <ref.fa> <file1_1.fq> <file1_2.fq> -o <file1_aln.sam>
bwa mem -R <read_group2> <ref.fa> <file2_1.fq> <file2_2.fq> -o <file2_aln.sam>
# convert
samtools view -Sb <file1_aln.sam> > <file1_aln.bam>
samtools view -Sb <file2_aln.sam> > <file2_aln.bam>
# select mapped reads
rm 501N-D3_alnHUM.sam
samtools view -h -F 4 <file1_aln.bam> -o <file1_map.bam>
samtools view -h -F 4 <file2_aln.bam> -o <file2_map.bam>
# sort
samtools sort <file1_map.bam> -o <file1_mapSRT.bam>
samtools sort <file2_map.bam> -o <file2_mapSRT.bam>
# reconvert to fastq
samtools fastq -1 <file1_1map.fq> -2 <file1_2map.fq> <file2_mapSRT.bam>
samtools fastq -1 <file2_1map.fq> -2 <file2_2map.fq> <file2_mapSRT.bam>
# re-align
bwa mem -R <read_group1> <ref.fa> <file1_1map.fq> <file1_2map.fq> -o <file1map_aln.sam>
However, when running the first re-alignment I get an error due to the read group:
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 105748 sequences (10000048 bp)...
[M::process] read 104770 sequences (10000147 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (413, 343, 0, 304)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (4, 13, 25)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 67)
[M::mem_pestat] mean and std.dev: (15.77, 13.79)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 88)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (42, 53, 70)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 126)
[M::mem_pestat] mean and std.dev: (55.71, 16.50)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 154)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (7, 16, 28)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 70)
[M::mem_pestat] mean and std.dev: (18.78, 14.24)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 91)
[mem_sam_pe] paired reads have different names: "HWI-ST1437:64:C3UM1ACXX:4:1214:3833:62731", "HWI-ST1437:64:C3UM1ACXX:4:1202:18519:45906"
It looks like the pairs are no longer synchronized (different x-y coordinates). I had to sort the reads in order to re-create the fastq files (that's the way I know, mpileup I understand is more for creating a consenus fasta).
How can I keep synchronization to realign a BAM file?
thank you