Question: Creating pseudo-replicates from a paired-end BAM file
0
gravatar for James Ashmore
3.2 years ago by
James Ashmore3.0k
UK/Edinburgh/MRC Centre for Regenerative Medicine
James Ashmore3.0k wrote:

I want to create pseudo-replicates from a paired-end BAM file (Both mates from a pair should be placed in the same BAM file). I wrote the following script to achieve this, but wondered if there was software available which already provides this functionality, or avoids loading all the BAM file into memory (which my script does during the read shuffling section of the code). I thought it would be great if samtools implemented a pseudo command to deal with this.

#!/bin/bash

# Set header
set -e
set -u
set -o pipefail

# Command arguments
FILENAME=$1
THREADS=$2

# Global variables
FILEPATH="${PWD}/${FILENAME}"
BASENAME=${FILEPATH%.*}

# Count mapped reads
PAIRS=$(samtools view -c -f 2 "${FILEPATH}")
MATES=$((PAIRS / 2))

# If mates is odd, round down to even number
if [[ $((MATES % 2)) -eq 0 ]];
   then LINES=$MATES;
   else LINES=$((MATES - 1));
fi

# Shuffle and split file
samtools sort -n -T "${BASENAME}_NAMESORT" -@ "${THREADS}" "${FILEPATH}" | samtools view -f 2 - | awk '{printf("%s%s",$0,(NR%2==0)?"\n":"\0")}' | shuf | tr "\0" "\n" | split -d -l "${LINES}" - "${BASENAME}"
samtools view -H "${FILEPATH}" > "${BASENAME}_HEAD.sam"

# Replace header
cat "${BASENAME}_HEAD.sam" "${BASENAME}00" | samtools sort -@ "${THREADS}" -T "${BASENAME}_PR1" -o "${BASENAME}_PR1.bam" -
cat "${BASENAME}_HEAD.sam" "${BASENAME}01" | samtools sort -@ "${THREADS}" -T "${BASENAME}_PR2" -o "${BASENAME}_PR2.bam" -

# Clean up
rm -f "${BASENAME}_HEAD.sam"
rm -f "${BASENAME}00"
rm -f "${BASENAME}01"
rm -f "${BASENAME}02"
pseudo-replicates bam • 1.4k views
ADD COMMENTlink modified 3.2 years ago by Istvan Albert ♦♦ 85k • written 3.2 years ago by James Ashmore3.0k

seqtk can sub-sample files. You will need to convert your BAM to fastq though.

ADD REPLYlink written 3.2 years ago by genomax92k

Hi, I tried your code but right after the first sorting step I get the following error from samtools view

[bam_header_read] invalid BAM binary header (this is not a BAM file). [main_samview] fail to read the header from "-".

Please advice. Many thanks!

ADD REPLYlink written 2.5 years ago by Giakountis0

Without looking at your BAM file I can only guess that it's because you have a different version of samtools. Since writing this script there has been a few updates and I believe the way you pass in data through standard input may have changed. You don't specify which line the script fails on, if it's the first instance of samtools view (line 27) try replacing "samtools view -" with "samtools view - -", if it's the second instance of samtools view (line 28) it seems that your original BAM file doesn't have a header or is corrupted in someway.

ADD REPLYlink written 2.5 years ago by James Ashmore3.0k

Thanks James, it is the first instance (line 27). I will try your fix later and keep you posted. The same BAM has been used subsequently for other BASH and R based analyses without problems. That been said I am using samtools-0.1.19 which also generated the known "EOF marker is absent" bug which I ignored since my hexdump -C output is correct.

ADD REPLYlink written 2.5 years ago by Giakountis0

Hi James, as guessed it was a compatibility issue with samtools, it runs smoothly now. I have another question though. I notice a significant size difference between the original BAM file and the two pseudo-reps. For example an original BAM may have a size of 6 Gb and the sum of the two pseudo-rep BAMs may total to 4.5 Gb. My original mapping is performed with tophat (I an doing RNA-seq analysis) with a concordant read percentage up to 90%, therefore the discorcordant reads cannot simply account for this difference. Have you observed something similar? Is there a certain bias that I should keep in mind? I will perform severral diagnostics on my side I just wanted to have your opinion about it.

Once again thank for your time

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Giakountis0

Glad to heard it still works! The script only extracts properly paired reads. Have a look how many reads are classed as properly paired in your original BAM file, and then double check the number in each of the pseudo BAM files.

ADD REPLYlink written 2.5 years ago by James Ashmore3.0k

Yeap the maths do sum up now thanks James!

ADD REPLYlink written 2.5 years ago by Giakountis0
0
gravatar for genomax
3.2 years ago by
genomax92k
United States
genomax92k wrote:

Give reformat.sh from BBMap suite a try. It should be able to do this. Take a look at the sampling parameters and SAM/BAM specific options.

ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by genomax92k
0
gravatar for Istvan Albert
3.2 years ago by
Istvan Albert ♦♦ 85k
University Park, USA
Istvan Albert ♦♦ 85k wrote:

I would say that this is an example where the simplest solution is writing a script (python/perl) that reads the file and randomly keeps some pairs to keep at the cost of storing the read names.

You'd avoid the extra sorting, shuffling etc.

Alternatively, a less generic but likely faster method would be to look for a pattern (spot ids, counts, coordinates) in the read names and subsample by those patterns.

ADD COMMENTlink written 3.2 years ago by Istvan Albert ♦♦ 85k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2073 users visited in the last hour