Question: Realign BAM files to other reference file
0
gravatar for marongiu.luigi
3 months ago by
Germany, Mannheim, UMM
marongiu.luigi330 wrote:

Dear all,

I completed an alignment of two groups of fastq files to a reference. To re-align the bam files to another reference I did the following:

# align
bwa mem -R <read_group1> <ref.fa> <file1_1.fq> <file1_2.fq> -o <file1_aln.sam>
bwa mem -R <read_group2> <ref.fa> <file2_1.fq> <file2_2.fq> -o <file2_aln.sam>
# convert
samtools view -Sb <file1_aln.sam>  > <file1_aln.bam>
samtools view -Sb <file2_aln.sam>  > <file2_aln.bam>
# sort
samtools sort <file1_aln.bam> -o <file1_alnSRT.bam>
samtools sort <file2_aln.bam> -o <file2_alnSRT.bam>
# merge
samtools merge <file_alnSRT.bam> <file1_aln.bam> <file2_aln.bam>
# select mapped
samtools view -h -F 4 <file_alnSRT.bam> -o <file_alnMAP.bam>
# convert to fastq
samtools fastq -1 <filemap_1.fq.gz> -2 <filemap_1.fq.gz> <file_alnMAP.bam>
# re-align
bwa mem -R <read_group> <ref2.fa> <filemap_1.fq.gz> <filemap_1.fq.gz> | samtools sort -o <file_realign.bam>

but I got this error at the last step:

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 831790 sequences (80000139 bp)...
[M::process] read 872512 sequences (80000021 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (1981, 3107, 95, 1551)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (7, 14, 27)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 67)
[M::mem_pestat] mean and std.dev: (16.00, 12.21)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 87)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (42, 47, 60)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (6, 96)
[M::mem_pestat] mean and std.dev: (52.98, 15.79)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 116)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (5897, 5897, 5897)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (5897, 5897)
[M::mem_pestat] mean and std.dev: (5897.00, 0.00)
[M::mem_pestat] low and high boundaries for proper pairs: (5897, 5897)
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (6, 17, 31)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 81)
[M::mem_pestat] mean and std.dev: (19.94, 15.35)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 106)
[M::mem_pestat] skip orientation RF
[mem_sam_pe] paired reads have different names: "HISEQ:90:C3UNJACXX:1:1107:16388:61141", "HISEQ:90:C3UNJACXX:3:1303:2008:58095"
[mem_sam_pe] [mem_sam_pe] paired reads have different names: "HWI-ST1437:64:C3UM1ACXX:4:1214:3833:62731", "HWI-ST1437:64:C3UM1ACXX:4:1202:18519:45906"

[mem_sam_pe] paired reads have different names: "HISEQ:90:C3UNJACXX:1:2111:17172:73588", "DHGDKXP1:247:C3MF6ACXX:2:1304:11299:10212"
[mem_sam_pe] @HD    VN:1.3  SO:coordinate
@SQ SN:V    LN:348678459
@PG ID:bwa  PN:bwa  VN:0.7.15-r1140 CL:bwa mem -t 8 [...]

Apparently, the two read groups have given the problem.

How can I solve it?

Thnak you.

ADD COMMENTlink modified 3 months ago • written 3 months ago by marongiu.luigi330
1

While I've never used samtools fastq before, my impression is that it will simply output reads in the same order as they are found in the bam file. The error suggests you need to sort your bam file by name before converting it to fastq.

ADD REPLYlink written 3 months ago by Eric Lim1.1k

are you using a latest version?

ADD REPLYlink written 3 months ago by naive_user70

i am using Version: 1.7

ADD REPLYlink written 3 months ago by marongiu.luigi330

Newer versions of samtools sort will sort sam files, so you can skip the samtools view step.

Also, why are you filtering away unmapped reads if you are aligning to a different reference? How can you be sure those reads won't align to your new reference?

Your alignment to the new reference is obliterating read groups from the old files. If you don't care about read groups, you could just cat the initial files together, so you don't have to run everything twice and then merge.

ADD REPLYlink modified 3 months ago • written 3 months ago by swbarnes24.2k

dear drkennetz, see comment below. the goal i would like to achieve is filtering stuff out...

ADD REPLYlink written 3 months ago by marongiu.luigi330
3
gravatar for marongiu.luigi
3 months ago by
Germany, Mannheim, UMM
marongiu.luigi330 wrote:

Looks like I solved it by using PICARD: extracting the aligned reads but without sorting, I used

java -jar picard.jar SamToFastq I=<file_alnMAP.bam> FASTQ=<filemap_1.fq> SECOND_END_FASTQ=<filemap_2.fq>
bwa mem -R <read_group> <ref2.fa> <filemap_1.fq> <filemap_1.fq>

From samtools flagstat the output looks fine. I'd say case closed.

ADD COMMENTlink modified 3 months ago • written 3 months ago by marongiu.luigi330

okay. you have converted them (bam) back to fastqs and then realigned. Thanks.

ADD REPLYlink written 3 months ago by cpad01129.3k

How long does picard.jar SamToFastq run? samtools fastq?

ADD REPLYlink written 3 months ago by Eric Lim1.1k

I haven't count it, but it was for a few minutes. I can update next time I'll run it...

ADD REPLYlink written 3 months ago by marongiu.luigi330

Why is this superior to just using the original fastqs?

ADD REPLYlink written 3 months ago by swbarnes24.2k

The fact that I have removed hundreds of thousands of reads that I did not want, which were a noise that increased the computational demand. Now, instead of hours, I can re-align in minutes.

ADD REPLYlink written 3 months ago by marongiu.luigi330

You are saying that you filtered away huge numbers of reads with -F 4 without disturbing the read 1 and read 2 pairing? -F 12 would probably work, I'm really surprised that Picard took all those orphan reads in stride.

ADD REPLYlink written 3 months ago by swbarnes24.2k

How would I check whether there is a difference between the files produced by -F4 and -F12? Specifically, how to check that the mates are still in sync?

ADD REPLYlink written 10 weeks ago by marongiu.luigi330

I think the problem originally was due to the wrong flag: the mapped reads are selected with -f, not -F. When I re-run the thing with -f, the filtered file could be easily converted in fastq with Picard:

java -jar picard.jar SamToFastq I=<file_sorted.bam> FASTQ=file_1.fq SECOND_END_FASTQ=file_2.fq

even if the original bam file was sorted. There is truly no need to keep intermediate unsorted bam.

ADD REPLYlink written 23 days ago by marongiu.luigi330

No, samtools -f 4 gets unmapped reads, samtools -F 4 gets mapped ones.

-f INT only include reads with all of the FLAGs in INT present [0] -F INT only include reads with none of the FLAGS in INT present [0]

ADD REPLYlink written 23 days ago by swbarnes24.2k

This was a bit confusing. I went back to the BIOSTAR's manual and I understand this:

f – selects reads that match the flag’s number
F – filters the reads that match the flag’s number

So I got not the f/F flag wrong, but actually the value: 4 does not mean mapped but UNMAPPED. So with -F 4 I am kind of doing a double negation. My bad.

ADD REPLYlink modified 22 days ago • written 22 days ago by marongiu.luigi330
1
gravatar for drkennetz
3 months ago by
drkennetz350
drkennetz350 wrote:

You don't want to realign aligned.bams, just start over from the file_1.fq and file_2.fq and repeat the same alignment process.The original fastqs wouldn't have been modified in any of these steps so just start from the beginning with the second reference.

ADD COMMENTlink written 3 months ago by drkennetz350

but what I want to do is to get what aligns to the first reference and from that filtering out what aligns to the second reference. I reckon the problem is with the reading group. If I have one sample that is run on two different occasions (hence the two reading groups) can I simply merge them? or at least use the same reading group? The reading group is required for the downstream analysis, for instance with GATK.

ADD REPLYlink written 3 months ago by marongiu.luigi330

You can merge the reads if they are from the same sample and if they are the same prep (IE WES, WGS, RNASeq). Merge them before alignment, so you don't have to do it after and re-sort them.

For your specific analysis, we will call it a metagenomic comparison because you are trying to see if reads are aligning to two genomes.

1) Create a new reference genome by just merging the two reference genomes of interest, and then using bwa index:

$ bwa index merged_genome.fasta

2) align multimapped reads to new merged_genome.fasta (the merged.fastqs are just the merged fastqs of the same sample from two sequencing runs):

bwa mem merged_R1.fastq merged_R2.fastq merged_genome.fasta > merged_aln_pe.sam

3) convert to bam:

samtools view -h -b -S merged_aln_pe.sam > merged_aln_pe.bam

4) If you want to extract reads only aligned to the merged_genome.fasta (if not just skip this step):

samtools view -b -F 4 merged_aln_pe.bam > merged_aln_pe.mapped.bam

5) extract uniquely mapped reads (reads that only mapped to one place, so if a read mapped to both genomes it will get tossed out):

samtools view reads.bam | grep 'XT:A:U' | samtools view -bS -T referenceSequence.fa - > merged_reads.uniqueMap.bam

The XT:A:U flag is a bwa flag meaning the read was unambiguously uniquely mapped (mapped to only one place).

I hope this gives you what you are looking for!

ADD REPLYlink modified 3 months ago • written 3 months ago by drkennetz350

Thank you, but in this way, the computational demands increase (the reference file is bigger than the original one); the filtering step I am seeking is actually carried out in some publications to filter out, for instance, the human genome and speed up the analysis...

ADD REPLYlink written 3 months ago by marongiu.luigi330

The read group is not the problem. The problem is that bwa expects the first read of file one to be the mate of the first read of file 2. That assumption obviously does not hold throughout the whole file, because sorting by coordinates and throwing away reads ruins that.

ADD REPLYlink written 3 months ago by swbarnes24.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 767 users visited in the last hour