Question: BWA-MEM - Improperly Paired Reads
0
gravatar for mglasena
21 days ago by
mglasena0
mglasena0 wrote:

After aligning paired-end 100bp reads to a reference genome, I am getting very low properly paired percentage:

369208441   0   total (QC-passed reads + QC-failed reads)

8985531 0   secondary

289733341   0   mapped

78.47%  N/A mapped %

360222910   0   paired in sequencing

180111455   0   read1

180111455   0   read2

1393338 0   properly paired

0.39%   N/A properly paired %

280747810   0   with itself and mate mapped

0   0   singletons

0.00%   N/A singletons %

39590468    0   with mate mapped to a different chr

0   0   with mate mapped to a different chr (mapQ>=5)

I followed GATK best practices to align paired-end short-read data to a reference genome. I downloaded the short-read data from NCBI SRA into fastq files using SRA toolkit's fastq-dump, converted the fastq files into unmapped bam using Picard FastqToSam, and marked adapters using Picard MarkIlluminaAdapters. I then piped Picard SamToFastq, bwa mem, and Picard MergeBamAlignment. To get stats on the alignment, I used samtools flagstat. For several of my samples, the alignment went great (90% mapped, 80% properly paired). However, for a couple of my samples, the properly paired percentage was well below 1%. I'm wondering how I could have a normal amount of reads mapping (~78%) but have only .39% of those reads properly paired.

I have double-checked that my fastq files from fastq-dump have identical read counts, and that they are properly interleaved after Picard FastqToSam. I additionally ran Picard ValidateSamFile to troubleshoot the file output by MergeBamAlignment and found no errors.

bwa alignment next-gen gatk genome • 177 views
ADD COMMENTlink modified 21 days ago by genomax91k • written 21 days ago by mglasena0

check some of your improperly paired reads. Usually they get marked like that only if either one of the reads does not map or if both get mapped and the distance between the outermost coordinates is either too large or too little.

it is possible that your reads overlap (a good deal) and get marked as improper pair for that reason.

ADD REPLYlink modified 21 days ago • written 21 days ago by Istvan Albert ♦♦ 85k

I found the source of my issue. The data I've downloaded from NCBI is supposed to be paired-end reads of 100 bp length. However, the forward and reverse reads are identical, which must have been a mistake during NBCI SRA submission. Am I just out of luck at this point? Or should I contact the researcher who submitted the data to NCBI SRA?

ADD REPLYlink written 10 days ago by mglasena0

yes, contact the original submitter and tell NCBI (via the info email address) as well, they may put more pressure on the submitter

ADD REPLYlink written 8 days ago by Istvan Albert ♦♦ 85k

The data I've downloaded from NCBI is supposed to be paired-end reads of 100 bp length. However, the forward and reverse reads are identical, which must have been a mistake during NBCI SRA submission.

Which accession number is this so we can take a look and let you know for sure. Did you use --split-files option when you dumped the reads from SRA using fastq-dump?

ADD REPLYlink modified 8 days ago • written 8 days ago by genomax91k

Also look at these lines:

39590468 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

seems to indicate that your mates map to different chromosomes

ADD REPLYlink modified 21 days ago • written 21 days ago by Istvan Albert ♦♦ 85k
0
gravatar for swbarnes2
21 days ago by
swbarnes28.8k
United States
swbarnes28.8k wrote:

converted the fastq files into unmapped bam using Picard FastqToSam, and marked adapters using Picard MarkIlluminaAdapters. I then piped Picard SamToFastq,

"Marked" adapters? Did you trim them away? Was this marking carried over into the fastq names? If not, didn't you lose it when you converted back to fastq?

You probably goofed the fastq ordering converting to and from fastq format.

ADD COMMENTlink written 21 days ago by swbarnes28.8k

Picard MarkIlluminaAdapters adds the XT tag to a read record to mark the 5' start position of the specified adapter sequence. Picard SamToFastq changes the quality scores of bases marked with an XT tag to two, which should effectively prevent the adapter portion from contributing to read alignment and metrics. Picard MergeBamAlignment restores the original base qualities. I've been following this gatk tutorial - "(How to) Map and clean up short short read sequence data efficiently."

I will double-check the fastq conversion steps.

ADD REPLYlink written 21 days ago by mglasena0

to be honest with you, that tutorial that you linked and seem to follow appears to advise an over-engineered approach that seems unnecessarily complicated and probably has no effect whatsoever. these back and forth conversions ... are silly

altering data to manufacture artificial quality scores to avoid alignment then restoring the qualities just feels outlandish. As far as I know, bwa would not take quality scores into account when performing the alignment in the first place, so what would even be the point of altering the qualities?

In fact bwa mem will align the original reads just fine, adapters and all, will soft clip the adapters none of the actions should be needed.

Regardless, this was not your original problem, but I do feel the need to point out that I think the approach described there appears to be unnecessarily complicated.

ADD REPLYlink modified 21 days ago • written 21 days ago by Istvan Albert ♦♦ 85k

I appreciate the tips. I'm just getting started with genome analysis.

ADD REPLYlink written 21 days ago by mglasena0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2061 users visited in the last hour