Question: Realign BAM files to other reference file
0
gravatar for marongiu.luigi
13 months ago by
Germany, Mannheim, UMM
marongiu.luigi380 wrote:

Dear all,

I completed an alignment of two groups of fastq files to a reference. To re-align the bam files to another reference I did the following:

# align
bwa mem -R <read_group1> <ref.fa> <file1_1.fq> <file1_2.fq> -o <file1_aln.sam>
bwa mem -R <read_group2> <ref.fa> <file2_1.fq> <file2_2.fq> -o <file2_aln.sam>
# convert
samtools view -Sb <file1_aln.sam>  > <file1_aln.bam>
samtools view -Sb <file2_aln.sam>  > <file2_aln.bam>
# sort
samtools sort <file1_aln.bam> -o <file1_alnSRT.bam>
samtools sort <file2_aln.bam> -o <file2_alnSRT.bam>
# merge
samtools merge <file_alnSRT.bam> <file1_aln.bam> <file2_aln.bam>
# select mapped
samtools view -h -F 4 <file_alnSRT.bam> -o <file_alnMAP.bam>
# convert to fastq
samtools fastq -1 <filemap_1.fq.gz> -2 <filemap_1.fq.gz> <file_alnMAP.bam>
# re-align
bwa mem -R <read_group> <ref2.fa> <filemap_1.fq.gz> <filemap_1.fq.gz> | samtools sort -o <file_realign.bam>

but I got this error at the last step:

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 831790 sequences (80000139 bp)...
[M::process] read 872512 sequences (80000021 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (1981, 3107, 95, 1551)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (7, 14, 27)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 67)
[M::mem_pestat] mean and std.dev: (16.00, 12.21)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 87)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (42, 47, 60)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (6, 96)
[M::mem_pestat] mean and std.dev: (52.98, 15.79)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 116)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (5897, 5897, 5897)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (5897, 5897)
[M::mem_pestat] mean and std.dev: (5897.00, 0.00)
[M::mem_pestat] low and high boundaries for proper pairs: (5897, 5897)
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (6, 17, 31)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 81)
[M::mem_pestat] mean and std.dev: (19.94, 15.35)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 106)
[M::mem_pestat] skip orientation RF
[mem_sam_pe] paired reads have different names: "HISEQ:90:C3UNJACXX:1:1107:16388:61141", "HISEQ:90:C3UNJACXX:3:1303:2008:58095"
[mem_sam_pe] [mem_sam_pe] paired reads have different names: "HWI-ST1437:64:C3UM1ACXX:4:1214:3833:62731", "HWI-ST1437:64:C3UM1ACXX:4:1202:18519:45906"

[mem_sam_pe] paired reads have different names: "HISEQ:90:C3UNJACXX:1:2111:17172:73588", "DHGDKXP1:247:C3MF6ACXX:2:1304:11299:10212"
[mem_sam_pe] @HD    VN:1.3  SO:coordinate
@SQ SN:V    LN:348678459
@PG ID:bwa  PN:bwa  VN:0.7.15-r1140 CL:bwa mem -t 8 [...]

Apparently, the two read groups have given the problem.

How can I solve it?

Thnak you.

ADD COMMENTlink modified 13 months ago • written 13 months ago by marongiu.luigi380
1

While I've never used samtools fastq before, my impression is that it will simply output reads in the same order as they are found in the bam file. The error suggests you need to sort your bam file by name before converting it to fastq.

ADD REPLYlink written 13 months ago by Eric Lim1.4k

are you using a latest version?

ADD REPLYlink written 13 months ago by naive_user70

i am using Version: 1.7

ADD REPLYlink written 13 months ago by marongiu.luigi380

Newer versions of samtools sort will sort sam files, so you can skip the samtools view step.

Also, why are you filtering away unmapped reads if you are aligning to a different reference? How can you be sure those reads won't align to your new reference?

Your alignment to the new reference is obliterating read groups from the old files. If you don't care about read groups, you could just cat the initial files together, so you don't have to run everything twice and then merge.

ADD REPLYlink modified 13 months ago • written 13 months ago by swbarnes26.2k

dear drkennetz, see comment below. the goal i would like to achieve is filtering stuff out...

ADD REPLYlink written 13 months ago by marongiu.luigi380
3
gravatar for marongiu.luigi
13 months ago by
Germany, Mannheim, UMM
marongiu.luigi380 wrote:

Looks like I solved it by using PICARD: extracting the aligned reads but without sorting, I used

java -jar picard.jar SamToFastq I=<file_alnMAP.bam> FASTQ=<filemap_1.fq> SECOND_END_FASTQ=<filemap_2.fq>
bwa mem -R <read_group> <ref2.fa> <filemap_1.fq> <filemap_1.fq>

From samtools flagstat the output looks fine. I'd say case closed.

ADD COMMENTlink modified 13 months ago • written 13 months ago by marongiu.luigi380

okay. you have converted them (bam) back to fastqs and then realigned. Thanks.

ADD REPLYlink written 13 months ago by cpad011211k

How long does picard.jar SamToFastq run? samtools fastq?

ADD REPLYlink written 13 months ago by Eric Lim1.4k

I haven't count it, but it was for a few minutes. I can update next time I'll run it...

ADD REPLYlink written 13 months ago by marongiu.luigi380

Why is this superior to just using the original fastqs?

ADD REPLYlink written 13 months ago by swbarnes26.2k

The fact that I have removed hundreds of thousands of reads that I did not want, which were a noise that increased the computational demand. Now, instead of hours, I can re-align in minutes.

ADD REPLYlink written 13 months ago by marongiu.luigi380

You are saying that you filtered away huge numbers of reads with -F 4 without disturbing the read 1 and read 2 pairing? -F 12 would probably work, I'm really surprised that Picard took all those orphan reads in stride.

ADD REPLYlink written 13 months ago by swbarnes26.2k

How would I check whether there is a difference between the files produced by -F4 and -F12? Specifically, how to check that the mates are still in sync?

ADD REPLYlink written 12 months ago by marongiu.luigi380

I think the problem originally was due to the wrong flag: the mapped reads are selected with -f, not -F. When I re-run the thing with -f, the filtered file could be easily converted in fastq with Picard:

java -jar picard.jar SamToFastq I=<file_sorted.bam> FASTQ=file_1.fq SECOND_END_FASTQ=file_2.fq

even if the original bam file was sorted. There is truly no need to keep intermediate unsorted bam.

ADD REPLYlink written 10 months ago by marongiu.luigi380

No, samtools -f 4 gets unmapped reads, samtools -F 4 gets mapped ones.

-f INT only include reads with all of the FLAGs in INT present [0] -F INT only include reads with none of the FLAGS in INT present [0]

ADD REPLYlink written 10 months ago by swbarnes26.2k

This was a bit confusing. I went back to the BIOSTAR's manual and I understand this:

f – selects reads that match the flag’s number
F – filters the reads that match the flag’s number

So I got not the f/F flag wrong, but actually the value: 4 does not mean mapped but UNMAPPED. So with -F 4 I am kind of doing a double negation. My bad.

ADD REPLYlink modified 10 months ago • written 10 months ago by marongiu.luigi380
1
gravatar for drkennetz
13 months ago by
drkennetz380
drkennetz380 wrote:

You don't want to realign aligned.bams, just start over from the file_1.fq and file_2.fq and repeat the same alignment process.The original fastqs wouldn't have been modified in any of these steps so just start from the beginning with the second reference.

ADD COMMENTlink written 13 months ago by drkennetz380

but what I want to do is to get what aligns to the first reference and from that filtering out what aligns to the second reference. I reckon the problem is with the reading group. If I have one sample that is run on two different occasions (hence the two reading groups) can I simply merge them? or at least use the same reading group? The reading group is required for the downstream analysis, for instance with GATK.

ADD REPLYlink written 13 months ago by marongiu.luigi380

You can merge the reads if they are from the same sample and if they are the same prep (IE WES, WGS, RNASeq). Merge them before alignment, so you don't have to do it after and re-sort them.

For your specific analysis, we will call it a metagenomic comparison because you are trying to see if reads are aligning to two genomes.

1) Create a new reference genome by just merging the two reference genomes of interest, and then using bwa index:

$ bwa index merged_genome.fasta

2) align multimapped reads to new merged_genome.fasta (the merged.fastqs are just the merged fastqs of the same sample from two sequencing runs):

bwa mem merged_R1.fastq merged_R2.fastq merged_genome.fasta > merged_aln_pe.sam

3) convert to bam:

samtools view -h -b -S merged_aln_pe.sam > merged_aln_pe.bam

4) If you want to extract reads only aligned to the merged_genome.fasta (if not just skip this step):

samtools view -b -F 4 merged_aln_pe.bam > merged_aln_pe.mapped.bam

5) extract uniquely mapped reads (reads that only mapped to one place, so if a read mapped to both genomes it will get tossed out):

samtools view reads.bam | grep 'XT:A:U' | samtools view -bS -T referenceSequence.fa - > merged_reads.uniqueMap.bam

The XT:A:U flag is a bwa flag meaning the read was unambiguously uniquely mapped (mapped to only one place).

I hope this gives you what you are looking for!

ADD REPLYlink modified 13 months ago • written 13 months ago by drkennetz380

Thank you, but in this way, the computational demands increase (the reference file is bigger than the original one); the filtering step I am seeking is actually carried out in some publications to filter out, for instance, the human genome and speed up the analysis...

ADD REPLYlink written 13 months ago by marongiu.luigi380

The read group is not the problem. The problem is that bwa expects the first read of file one to be the mate of the first read of file 2. That assumption obviously does not hold throughout the whole file, because sorting by coordinates and throwing away reads ruins that.

ADD REPLYlink written 13 months ago by swbarnes26.2k

Hi Swbarnes2,

So, your mean is to sort based on name instead of coordinate for bwa? I also have an issue with bwa and identifying paired read by it. Could you please take a look at my issue and let me know your idea?

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by seta1.2k

No, I realigned by coordinates as follows:

# keep only pairs not aligned
samtools view -u -f12 -F256 <AlnSrt.bam> > <R1uR2u.bam> 
# sort
samtools sort -n -O BAM <R1uR2u.bam> -o <UmpSrt.bam>
# convert
bamToFastq -i <UmpSrt.bam> -fq <Ump_1.fq> -fq2 <Ump_2.fq>

ADDENDUM Anyway, I eventually used the following approach:

(1) map to composite genome primary + secondary

(2) deduplicate sambamba markdup -r -t 8 --overflow-list-size 1000000 --hash-table-size 1000000 --tmpdir=SCRATCH_DIR <AlnSrt.bam> <AlnSrtDed.bam> (3) select reads that are read pair, read pair mapped, first in pair (-f 67) and that are not second in pair, secondary alignment, optical or PCR duplicate (-F 1408) on the chromosome of interest samtools view -h -q 10 -f67 -F1408 <AlnSrtDed.bam> "chr..." and work on the reads directly.

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by marongiu.luigi380

Thanks for the feedback. I think my problem is related to bwa mem, the previous steps sounds well work and I got correct PE reads during converting the extracted bam file to fastq reads. however, mapping these fastq reads with my own reference by bwa mem is not good. Please let me know if you have any suggestion

ADD REPLYlink written 10 weeks ago by seta1.2k

Hi, what kind of 'not good'? I also found that recreating the fastq generates issues so I chose the second approach, merge the genome of reference in the other one and take it from there.

ADD REPLYlink written 10 weeks ago by marongiu.luigi380

My mean is: after mapping the generated fastq reads to my own reference with bwa mem, the properly paired percentage is very low. I explained the issue in this post. Could you please take a look at it and let me know your idea.

Thanks

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by seta1.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1445 users visited in the last hour