Hi everyone, I'm really confused by some observations in my current data set (only my second ChIP-seq data set ever). After paired-end sequencing, I got 32 fastq.gz files for 16 samples. I used Trimmomatic in paired-end mode to trim Illumina adapters. Then I ran BWA-MEM with the default parameters on the paired reads from the Trimmomatic output. I then filtered for mapping quality > 5 and "IsProperPair" using BamTools. Here's the problem: 5 out of 16 samples returned extremely small files. I ran samtools stats and found that while these files had lots of mapped reads, there were 0 proper pairs. I triple-checked the input files and they were all matching paired reads.
Since my experience with ChIP-seq analysis is very limited, it would be very helpful if someone could enlighten me on the cause of this problem and on whether I can still use the alignment files without filtering for proper pairs. Thank you very much!
It's important to find out why these weren't marked as proper pairs. Namely, was it due to having the wrong relative orientation, or wrong fragment size, or being on different chromosomes, or something else? You can likely discern this by quickly looking at a few of the alignments and guesstimating from that. If it's just a matter of insert size and the observed insert size isn't too out there, then I'd say forget worrying about proper pairs. If, however, there's a different underlying reason behind the metrics then you have cause for concern.
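If eyeballing alignments gets tedious, the check described above can be roughly automated. What follows is only a minimal Python sketch, not a standard tool: the function names `classify_pair` and `summarize` are made up for illustration, the 1000 bp `max_insert` cutoff is an arbitrary assumption you should adapt to your library, and the "outward_facing" test is a crude comparison of leftmost positions that can mislabel heavily overlapping mates. It only relies on the FLAG, RNAME, POS, RNEXT, PNEXT and TLEN columns defined in the SAM format.

```python
from collections import Counter


def classify_pair(fields, max_insert=1000):
    """Guess why one SAM record (tab-split fields) is or is not a proper pair.

    max_insert is an assumed cutoff, not a value taken from any aligner.
    """
    flag = int(fields[1])
    rname, pos = fields[2], int(fields[3])
    rnext, pnext = fields[6], int(fields[7])
    tlen = int(fields[8])

    if flag & 0x2:                      # aligner already marked it proper
        return "proper_pair"
    if flag & 0x4 or flag & 0x8:        # this read or its mate is unmapped
        return "read_or_mate_unmapped"
    if rnext not in ("=", rname):       # mate sits on another reference
        return "different_chromosomes"

    read_rev = bool(flag & 0x10)
    mate_rev = bool(flag & 0x20)
    if read_rev == mate_rev:            # both forward or both reverse
        return "same_strand"

    # Crude orientation check: the forward mate should start at or before
    # the reverse mate for the usual inward-facing (FR) layout.
    fwd_pos, rev_pos = (pnext, pos) if read_rev else (pos, pnext)
    if fwd_pos > rev_pos:
        return "outward_facing"
    if abs(tlen) > max_insert:
        return "insert_too_large"
    return "other"


def summarize(sam_text, max_insert=1000):
    """Count categories over first-in-pair primary records of SAM text."""
    counts = Counter()
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue                    # skip blank and header lines
        fields = line.split("\t")
        flag = int(fields[1])
        if not flag & 0x1 or not flag & 0x40:
            continue                    # keep paired reads, first mate only
        if flag & 0x900:
            continue                    # skip secondary/supplementary
        counts[classify_pair(fields, max_insert)] += 1
    return counts
```

To use it, pass the text output of `samtools view` for a subset of one of the problem BAM files to `summarize` and look at which category dominates. Counting only the first mate of each pair keeps each pair from being tallied twice.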
Hi! Thank you for the advice. I just looked at the samtools stats output and found that the reads did indeed have the wrong orientation, and that mates were mapped to different chromosomes.
So I guess the reads were no good? What might have caused this to happen to a few files out of the bunch?
I tried Bowtie 2 and got the same bad result.
When I visualized their bigWig files in IGV, the signal was actually exactly where it should be, consistent across the samples, and showed the expected pattern of enrichment.
The problem is that when I used MACS2, it complained that "No common chromosome names can be found from treatment and control! Check your input files! MACS will quit..."
The output of samtools idxstats showed that the chromosome names look identical to the ones in the files that are working.
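One thing worth keeping in mind here: a BAM header lists every reference sequence whether or not any reads landed on it, so matching names alone don't show that both files actually hold reads on the same chromosomes. Below is a small Python sketch that compares two `samtools idxstats` outputs by the chromosomes that have at least one mapped read. The function names are hypothetical, and it assumes the usual four tab-separated idxstats columns (name, length, mapped, unmapped). Note that idxstats counts all mapped reads regardless of pairing, so this only catches files that are empty for a chromosome, not files that merely lack proper pairs.

```python
def chroms_with_reads(idxstats_text):
    """Return reference names with at least one mapped read.

    Expects the text of `samtools idxstats` output:
    name, sequence length, mapped count, unmapped count (tab-separated).
    """
    names = set()
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue
        name, _length, mapped, _unmapped = line.split("\t")[:4]
        if name != "*" and int(mapped) > 0:   # "*" is the unplaced-reads row
            names.add(name)
    return names


def compare_chroms(treatment_text, control_text):
    """Report which chromosomes carry reads in both, or only one, input."""
    treat = chroms_with_reads(treatment_text)
    ctrl = chroms_with_reads(control_text)
    return {
        "common": sorted(treat & ctrl),
        "treatment_only": sorted(treat - ctrl),
        "control_only": sorted(ctrl - treat),
    }
```

If "common" comes back empty for a treatment/control pair, at least one of the two files has no usable reads on any shared chromosome, which would be worth ruling out before digging further into the MACS2 message.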
I guess I'll try aligning them as single-end reads and combining them later. I'm just very puzzled that 5 files in a row, in a group of 16 samples, are giving me this problem. :S