Question: How to properly combine two bam files of a paired-end data
0
gravatar for izzy.yichao.cai
2.5 years ago by
izzy.yichao.cai150 wrote:

Hi all!

I am mapping a paired-end read separately using bowtie2.

After that, I want to combine the two bam file into one for downstream analysis.

How to properly do this combination?

I tried:

samtools sort -n R1.bam R1.sorted
samtools sort -n R1.bam R1.sorted
samtools merge R1_R2.bam R1.sorted.bam R2.sorted.bam

But it seems not working. The header of the R1_R2.bam is like following:

603889264 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 duplicates

590620124 + 0 mapped (97.80%:-nan%)

0 + 0 paired in sequencing

0 + 0 read1

0 + 0 read2

0 + 0 properly paired (-nan%:-nan%)

0 + 0 with itself and mate mapped

0 + 0 singletons (-nan%:-nan%)

0 + 0 with mate mapped to a different chr

0 + 0 with mate mapped to a different chr (mapQ>=5)

Which shows that the alignment of R1 and R2 are no properly paired. Also there is no @RG header line in the combined file.

What might be the problem?

Thanks a lot!!!

sequencing paired-end • 3.6k views
ADD COMMENTlink modified 2.5 years ago by i.sudbery4.4k • written 2.5 years ago by izzy.yichao.cai150
1

Why do you map paired-end reads separately with bowtie2? I think your problem is in here...

ADD REPLYlink written 2.5 years ago by Benn6.7k

Since my reads may contain PETs that span quite a long distance(even on different chromosomes), I would be better to map them separately. I was told that the ones with a long distance may be neglected by bowtie2.

ADD REPLYlink written 2.5 years ago by izzy.yichao.cai150
1

Please avoid using abbreviations like "PETs", not everyone is equally aware of what it means (and google is obviously not helpful for this). I don't think your solution for solving this is optimal, and merging bam files of separate reads will not magically generate read pairs.

ADD REPLYlink written 2.5 years ago by WouterDeCoster38k
1

PETs are animals in your house, right? :P

ADD REPLYlink written 2.5 years ago by Benn6.7k
2

I suspect that these are ChIA-PETs, which are technically plants...

ADD REPLYlink written 2.5 years ago by harold.smith.tarheel4.3k
1

Only if those animals are made of polyethylene terephthalate which regularly perform positron emission tomography.

ADD REPLYlink written 2.5 years ago by WouterDeCoster38k

Hi, have a look at http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#aligning-pairs.

ADD REPLYlink written 2.5 years ago by Charles Plessy2.7k

Yes, I read this before. But I think this is mapping paired-end to the genome in a single run. However, I need to map them separately to the genome, so this may not be helpful in this situation.

ADD REPLYlink written 2.5 years ago by izzy.yichao.cai150

However, I need to map them separately to the genome,

Would you mind elaborating why that is so?

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by genomax67k

The point is that the paired-end reads that I got might cross a very very long distance along the chromosome, and even the two tag of the paired-end reads are on different chromosomes.

If I subject the two fastq file and run bowtie2 in paired-edn mode, some reads might not be mapped to the genome when the distance of the reads are way too different from the majority of the reads.

Those are the information that I need, so I can not afford to loose them.

ADD REPLYlink written 2.5 years ago by izzy.yichao.cai150

If you use a splice aware software and set a long splicing distance then you should be able to achieve mapping without problems. For those reads that do not map in initial round, you could collect and map individually to genome again to see if you have real evidence of translocations by number of mapped reads that support an observation.

ADD REPLYlink written 2.5 years ago by genomax67k
2
gravatar for harold.smith.tarheel
2.5 years ago by
United States
harold.smith.tarheel4.3k wrote:

The bitwise FLAG (second field of BAM file) encodes information about read pairs. Since you aligned reads 1 and 2 individually, these flags will not contain that pairing information.

The solution is to align your paired-end data with an aligner like BWA MEM or BBMap that can map long distance/discordant read pairs.

ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by harold.smith.tarheel4.3k

Hi, can BWA map long distance/discordant read pairs?

The relative new paper on ChIA-PET data pipeline also aligns two ends of reads independently to the genome even using BWA.

ADD REPLYlink written 2.5 years ago by izzy.yichao.cai150

BWA can map discordent pairs accross different chromosomes etc.

We use BWA MEM to map our Capture-C data, which is similar to ChIA-PET, but without the immunoprecipitation step.

You can also map the two ends separately with BWA aln, then then merge the two alignments with sampe.

But note that you will not get the proper-pair flag set because when pairs are on different chromosomes, they ARE paired, but they are NOT properly paired.

ADD REPLYlink written 2.5 years ago by i.sudbery4.4k
1
gravatar for i.sudbery
2.5 years ago by
i.sudbery4.4k
Sheffield, UK
i.sudbery4.4k wrote:

Irrespective of why you might be trying to do this, have you tried samtools fixmate? You first merge the files and then sort by name with samtools sort and then run samtools fixmate -p whos purpose is, to quote the manual:

Fill in mate coordinates, ISIZE and mate related flags from a name-sorted alignment.

I don't know if it will work on entirely unpaired files, but its worth a try.

If not, BWA's sampe merges single end alignments from bwa aln into a single alignment, it will probably set your pairs to be "not proper pairs", but you'll just have to ignore that flag.

ADD COMMENTlink written 2.5 years ago by i.sudbery4.4k
1

Thinking some more, and reading harold.smith.tarheel's answer above, for the fixmate you'll need to set the read2 flag in the reads from your second alignment file. You can do this with the python pysam module:

import pysam
insam = pysam.AlignmentFile("input_read2s.bam")
outsam = pysam.AlignmentFile("outfile_read2s.bam", "wb", template=insam)

for read in insam.fetch(until_eof=True):
    read.is_read1 = False
    read.is_read2=True
    outsam.write(read)

outsam.close()
ADD REPLYlink written 2.5 years ago by i.sudbery4.4k

Hi I tried fixmate but it doesn't seem to working.

Here is what i did: 1) Sort the two bam files by name. 2) use samtools merge to merge two bam files. 3) sorted the merged bam files by name. 4) samtools fixmate

But still, there are no properly-paired reads. (0 + 0 paired in sequencing; 0 + 0 read1; 0 + 0 read2; 0 + 0 properly paired (-nan%:-nan%))

If I am correct, your python code tries to set the flag of the second bam file and let fixmate recognize it as "read 2". Do I also need to set the flag in the first bam file?

Also, is it correct that I first set the flag of the bam files before I merge them and then use fixmate to let them pair properly?

EDIT: Hi I tried the following script:

import pysam

R1_dir = "./test_L001_R1_F50_sorted.bam" 
R1_out = "./test_L001_R1_F50_sorted_flaged.bam"
R2_dir = "./test_L001_R2_F50_sorted.bam"
R2_out = "./test_L001_R2_F50_sorted_flaged.bam"

insam1 = pysam.AlignmentFile(R1_dir,"rb")
outsam1 = pysam.AlignmentFile(R1_out, "wb",template=insam1)
insam2 = pysam.AlignmentFile(R2_dir,"rb")
outsam2 = pysam.AlignmentFile(R2_out, "wb",template=insam2)


for readA in insam1.fetch(until_eof=True):
    readA.is_read1 = True
    readA.is_read2 = False
    outsam1.write(readA)
outsam1.close()
for readB in insam2.fetch(until_eof=True):
    readB.is_read1 = False
    readB.is_read2 = True
    outsam2.write(readB)
outsam2.close()

Then I use samtools merge and fixmate. Still, there is no paired-end reads.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by izzy.yichao.cai150

You are not going to get properly paired reads, because your reads do not meet anyone else's definition of properly paired. If you wanted to define some criteria for "properly-paired" that is suitable for your experiment, you'll have to test it yourself, and manually set the properly paired flag for the reads that meet it.

However, I did think the above would work for getting paired reads, if not properly paired reads. Do your read1 alignments and your read2 alignments have the same name?

ADD REPLYlink written 2.5 years ago by i.sudbery4.4k

Yes, they are of the same name.

Okay, I got what you mean. It seems that I am too particular on "properly paired end". :p

But then I found another problem, which is that the sorted bam file has a different order of lines. This result in the empty result of bedpe file when I subject the merged bam file to bedtools bamtobed

The total number of line is the same, while the lines are not corresponding to each other(Some reads in R2.bam is missing, while it is in the R1.bam. In the fastq file it did exist in bothe file). This should not happen. What I did is just sort the bam file converted from sam file.

Did set the wrong option that deletes the unmapped reads alignment?

bowtie2 --local -p 32 -x hg19 -U R1.fastq -S R1.sam
ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by izzy.yichao.cai150

This is to be expected: In some cases your read1 will map, while your read2 doesn't, or vice versa, you may also get more than one location for read1, but only one for read2 (or 10 for read1 and 12 for read2 etc). There are many reasons why a read might not map, such as it contains sequencing errors, or because it maps to an unsequenced part of the genome or any other of a number of reasons. Futher in something like ChiA-PET, it is possible that one of your reads crosses the digestion point, such that the sequence of the read never actually occurs in nature.

ADD REPLYlink written 2.5 years ago by i.sudbery4.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1145 users visited in the last hour