Error in samtools fixmate when dealing with published pair-end data
1
1
Entering edit mode
5.3 years ago
jamwest101 ▴ 10

Hello,

I am trying to analyse a published pair-end ChIP-seq data. However, most of reads info are missing after running samtools fixmate. For example: After sorting bam file by reads name, the report of samtools flagstat shows that:

13431858 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
12770824 + 0 mapped (95.08% : N/A)
13431858 + 0 paired in sequencing
6715929 + 0 read1
6715929 + 0 read2
8381710 + 0 properly paired (62.40% : N/A)
12560308 + 0 with itself and mate mapped
210516 + 0 singletons (1.57% : N/A)
269058 + 0 with mate mapped to a different chr
50930 + 0 with mate mapped to a different chr (mapQ>=5)

then, after running samtools fixmate -m in.bam out.bam and samtools flagstat out.bam. I got following info:

13431858 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
12770824 + 0 mapped (95.08% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

I have processed my own pair-end ChIP-seq data and all the reads info are same as before running fixmate. Is there any thing I need to process differently when dealing with published data? The published data is obtained by fastq-dump -I --split-files SRRXXXX and mapped to genome use Bowtie2 with default settings. My samtools version is 1.7

Thanks,

ChIP-Seq samtools • 2.9k views
ADD COMMENT
0
Entering edit mode

Please show the full command lines and the full SRR number. I assume something in the sort command went wrong, probably sorted by coordinate, rather than name.

ADD REPLY
0
Entering edit mode

The bam file is sorted by name for sure. samtools sort -n -T temp -o out.bam in.bam I have tried several published pair-end data and all have this problem. For example: SRR1848385 My total command lines list as follow:

./fastq-dump -I --split-files SRR1848385
bowtie2 -x tair10 -1 SRR1848385_1.fastq -2 SRR1848385_2.fastq -S SRR1848385.sam
samtools -bS SRR1848385.sam > SRR1848385.bam
samtools sort -n -T temp -o SRR1848385_namesort.bam SRR1848385.bam
samtools fixmate -m SRR1848385_namesort.bam SRR1848385_fixmate.bam
ADD REPLY
5
Entering edit mode
5.3 years ago
ATpoint 81k

The problem seems to be that using option -I in fastq-dump appends a .1 and .2 to the read name in the fastq files.

fastq-dump --split-files SRR1848385.sra && head -n 1 *.fastq

==> SRR1848385_1.fastq <==
@SRR1848385.1 HISEQ:115:C49JYACXX:1:1101:1708:2225 length=100

==> SRR1848385_2.fastq <==
@SRR1848385.1 HISEQ:115:C49JYACXX:1:1101:1708:2225 length=100

fastq-dump --split-files -I SRR1848385.sra && head -n 1 *.fastq

==> test_1.fq <==
@SRR1848385.1.1 HISEQ:115:C49JYACXX:1:1101:1708:2225 length=100

==> test_2.fq <==
@SRR1848385.1.2 HISEQ:115:C49JYACXX:1:1101:1708:2225 length=100

As a consequence, fixmate interpretes the read names as non-matching and therefore assumes they do not belong to one pair, rendering them unpaired. Try converting to fastq without -I. Also, there is no need to sort by name right after alignment, as paired-end reads are by default ordered by name in the fastq files (and therefore in the SAM file). You can also pipe everything to avoid intermediate files:

bowtie2 -x tair10 -1 SRR1848385_1.fastq -2 SRR1848385_2.fastq | \
  samtools fixmate -m - SRR1848385_fixmate.bam
ADD COMMENT
1
Entering edit mode

Using -F option with fastq-dump to recover original Illumina fastq headers may be preferable. That removes the @SRR1848385 id from read headers.

ADD REPLY
0
Entering edit mode

Thank you for the help!

ADD REPLY

Login before adding your answer.

Traffic: 2591 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6