Paired-end reads reported without mates: how to play matchmaker?
2
0
Entering edit mode
4.0 years ago

Hi Everyone,

I am currently looking at Acute Myeloid Leukemia (AML) paired-end WGS samples from the TARGET data https://ocg.cancer.gov/programs/target/target-methods#3241.

A bioinformatician in our group remapped the samples from hg19 to hg38. Unfortunately, we do not have any copies of the hg19 version anymore.

However, when I try to run anything with the BAMs, I get an error that the mates of the paired-end sequences are not found. I have run Picard's ValidateSam and samblaster --AddMateTags to try to fix it, with no avail.

samblaster: Found   1033191258 of 1038680280 (99.472%) total read ids are marked paired yet are unmated.

This is just a subset of the data I am interested in (chrM reads), ValidateSam would not work for the WGS bam:

 HISTOGRAM    java.lang.String
    Error Type    Count
    ERROR:INVALID_VERSION_NUMBER    1
    ERROR:MATE_NOT_FOUND    304414

This is how the reads look like. As you can see, they miss the MC and MQ tags.

SRR1168035.161988987.1    83    MT    1178    40    100M    =    1035    -243        DDDDDEEDDDDBDDDDDDDDDDDDDDDDDDDDDDDDDDDEEEDCBFHJJIGEHFGIGBF@HD?IHCIIJJIIJJJJJJJJJGHJIIJHHHHHFDDA4#CC    NM:i:1    AS:i:98    XS:i:98    RG:Z:PAMYMABMNF_BCCAGSC_S1_L001_001


SRR1168035.161988987.2    163    MT    1035    40    100M    =    1178    243        @CCFFFFFHHHHHJGIIJJJEHIGIGIEGIJJGGIGEIIGIJJJD#####00<FHI#.;DEHGGHHHE@BC#############################    NM:i:14    AS:i:72    XS:i:72    RG:Z:PAMYMABMNF_BCCAGSC_S1_L001_001

My question basically is: How can I fix the mate information? I was wondering whether I could write a simple script that could go through the queryname-sorted file, and then take the lines with the same name (xxxx.1 and xxxx.2) and based on those lines create the MC and MQ columns, or is there a program / script out there that already does this?

Kind regards,

Jip

paired-end sequencing DNA WGS unpaired mates • 1.7k views
ADD COMMENT
1
Entering edit mode
4.0 years ago

I finally found the error after scouring the internetr for the manner how BAM files recognise paired reads. As it turns out, paired reads are recognised by the the same alignment name. So instead of XXX.1 and XXX.2, the paired reads should carry both carry the queryname XXX.

A simple pysam script did the job by loading each line and removing the last 2 characters from query_name.

ADD COMMENT
0
Entering edit mode

Yes, samblaster requires identical names and bam files sorted by queryname.

I personally have samblaster right after the alignment like:

aligner (...) | samblaster | samtools sort -o out.bam

In your case probably samtools fastq was used to convert aligned data back to fastq but without option::

  -n                   don't append /1 and /2 to the read name
ADD REPLY
0
Entering edit mode

I've run into the same problem and was hoping if you could maybe share your pysam script? I have no experience using pysam whatsoever. Thank you in advance!

ADD REPLY
1
Entering edit mode
4.0 years ago

Samtools fixmate may work for you:

Usage: samtools fixmate <in.nameSrt.bam> <out.nameSrt.bam>
Options:
  -r           Remove unmapped reads and secondary alignments
  -p           Disable FR proper pair check
  -c           Add template cigar ct tag
  -m           Add mate score tag
 ...

http://www.htslib.org/doc/samtools-fixmate.html

Note that for this to work your BAM file will need to be re-sorted by name.

ADD COMMENT
0
Entering edit mode

Hi Istvan,

Thanks for the heads-up on samtools, I apparently haven't checked their documentation too extensive enough. I will try what you suggested and come back if it did not work out.

ADD REPLY

Login before adding your answer.

Traffic: 1780 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6