Creating pseudo-replicates from a paired-end BAM file
2
0
Entering edit mode
3.8 years ago
James Ashmore ★ 3.1k

I want to create pseudo-replicates from a paired-end BAM file (both mates from a pair should be placed in the same output BAM file). I wrote the following script to achieve this, but wondered if there is software available that already provides this functionality, or that avoids loading the whole BAM file into memory (which my script does during the read-shuffling section of the code). I thought it would be great if samtools implemented a pseudo command to deal with this.

#!/bin/bash

set -e
set -u
set -o pipefail

# Command arguments
FILENAME=$1
THREADS=$2

# Global variables
FILEPATH="${PWD}/${FILENAME}"
BASENAME=${FILEPATH%.*}

# Count mapped reads
PAIRS=$(samtools view -c -f 2 "${FILEPATH}")
MATES=$((PAIRS / 2))

# If mates is odd, round down to even number
if [[ $((MATES % 2)) -eq 0 ]]; then
    LINES=$MATES
else
    LINES=$((MATES - 1))
fi

# Shuffle and split file
samtools sort -n -T "${BASENAME}_NAMESORT" -@ "${THREADS}" "${FILEPATH}" \
    | samtools view -f 2 - \
    | awk '{printf("%s%s", $0, (NR % 2 == 0) ? "\n" : "\0")}' \
    | shuf \
    | tr "\0" "\n" \
    | split -d -l "${LINES}" - "${BASENAME}"

# Replace header
samtools view -H "${FILEPATH}" > "${BASENAME}_HEAD.sam"
cat "${BASENAME}_HEAD.sam" "${BASENAME}00" | samtools sort -@ "${THREADS}" -T "${BASENAME}_PR1" -o "${BASENAME}_PR1.bam" -
cat "${BASENAME}_HEAD.sam" "${BASENAME}01" | samtools sort -@ "${THREADS}" -T "${BASENAME}_PR2" -o "${BASENAME}_PR2.bam" -

# Clean up (split may leave a small third chunk when MATES is odd)
rm -f "${BASENAME}_HEAD.sam"
rm -f "${BASENAME}00"
rm -f "${BASENAME}01"
rm -f "${BASENAME}02"
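As an aside, the awk/shuf/tr stage is what keeps mates together during the shuffle: every two consecutive lines are joined with a NUL byte, shuffled as single records, then split back into separate lines. A minimal demonstration of the same trick on plain text (the read names here are made up):

```shell
# Join consecutive line pairs with NUL, shuffle the joined pairs, restore newlines.
printf 'r1/1\nr1/2\nr2/1\nr2/2\nr3/1\nr3/2\n' \
  | awk '{printf("%s%s", $0, (NR % 2 == 0) ? "\n" : "\0")}' \
  | shuf \
  | tr '\0' '\n'
# The order of the pairs is random, but r1/1 is always immediately followed
# by r1/2, and likewise for the other pairs.
```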

bam pseudo-replicates • 1.7k views
0
Entering edit mode

seqtk can sub-sample files. You will need to convert your BAM to FASTQ, though.
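A rough sketch of that route, assuming a reasonably recent samtools (the file names and the 0.5 sampling fraction are placeholders); the key point is that seqtk sample must be given the same seed (-s) for both mate files so pairing is preserved:

```shell
# Convert the BAM to paired FASTQ (name-sort first so mates are adjacent),
# then sub-sample both mate files with the same seed to keep pairs in sync.
samtools sort -n input.bam \
  | samtools fastq -1 reads_1.fq -2 reads_2.fq -0 /dev/null -s /dev/null -
seqtk sample -s100 reads_1.fq 0.5 > sub_1.fq
seqtk sample -s100 reads_2.fq 0.5 > sub_2.fq
```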

0
Entering edit mode

Hi, I tried your code but right after the first sorting step I get the following error from samtools view

0
Entering edit mode

Without looking at your BAM file I can only guess that it's because you have a different version of samtools. Since writing this script there have been a few updates, and I believe the way you pass data through standard input may have changed. You don't say which line the script fails on: if it's the first instance of samtools view (line 27), try replacing "samtools view -" with "samtools view - -"; if it's the second instance of samtools view (line 28), it seems that your original BAM file doesn't have a header or is corrupted in some way.

0
Entering edit mode

Thanks James, it is the first instance (line 27). I will try your fix later and keep you posted. The same BAM has been used subsequently for other Bash- and R-based analyses without problems. That being said, I am using samtools-0.1.19, which also generated the known "EOF marker is absent" bug, which I ignored since my hexdump -C output is correct.

0
Entering edit mode

Hi James, as you guessed it was a compatibility issue with samtools; it runs smoothly now. I have another question, though. I notice a significant size difference between the original BAM file and the two pseudo-replicates. For example, an original BAM may have a size of 6 GB while the two pseudo-rep BAMs may total 4.5 GB. My original mapping was performed with TopHat (I am doing RNA-seq analysis) with a concordant read percentage of up to 90%, so discordant reads cannot simply account for this difference. Have you observed something similar? Is there a certain bias that I should keep in mind? I will perform several diagnostics on my side; I just wanted to have your opinion about it.

Once again, thanks for your time.

0
Entering edit mode

Glad to hear it still works! The script only extracts properly paired reads. Have a look at how many reads are classed as properly paired in your original BAM file, and then double-check the number in each of the pseudo BAM files.
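For anyone following along, that check can be done with the same flag filter the script itself uses (the file names below are placeholders):

```shell
# -c counts records; -f 2 keeps only reads with the "properly paired" SAM flag set.
samtools view -c -f 2 original.bam
samtools view -c -f 2 original_PR1.bam
samtools view -c -f 2 original_PR2.bam
# The first count should equal the sum of the other two (minus at most one
# pair dropped by the odd-number rounding in the script).
```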

0
Entering edit mode

Yep, the maths do add up now, thanks James!

0
Entering edit mode
3.8 years ago
GenoMax 104k

Give reformat.sh from the BBMap suite a try. It should be able to do this. Take a look at the sampling parameters and the SAM/BAM-specific options.
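A hedged sketch of what that might look like; samplerate and sampleseed are among reformat.sh's documented sampling parameters, but the file names are made up and you should check reformat.sh --help for how your version handles paired reads in BAM input:

```shell
# Sub-sample roughly half the reads into each pseudo-replicate, using
# different seeds so the two samples differ (note: unlike the script above,
# plain random sampling gives overlapping, not disjoint, replicates).
reformat.sh in=input.bam out=pseudo_rep1.bam samplerate=0.5 sampleseed=1
reformat.sh in=input.bam out=pseudo_rep2.bam samplerate=0.5 sampleseed=2
```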

0
Entering edit mode
3.8 years ago

I would say that this is an example where the simplest solution is writing a script (Python/Perl) that reads the file and randomly keeps some pairs, at the cost of storing the read names.

You'd avoid the extra sorting, shuffling etc.

Alternatively, a less generic but likely faster method would be to look for a pattern (spot IDs, counts, coordinates) in the read names and subsample by those patterns.
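A rough shell sketch of the name-based idea (using awk rather than Python/Perl): make one random keep/drop decision per read name, so both mates of a pair follow the same decision. The file names are made up, and as noted above the name table is held in memory:

```shell
# Stream the BAM as SAM; pass header lines through unchanged, decide once
# per read name (QNAME, field 1), and reuse that decision for the mate.
samtools view -h input.bam \
  | awk 'BEGIN { srand(42) }
         /^@/ { print; next }
         { if (!($1 in keep)) keep[$1] = (rand() < 0.5) ? 1 : 0
           if (keep[$1]) print }' \
  | samtools view -b -o pseudo_rep1.bam -
```

Because the decision is keyed on the read name, both mates are always kept or dropped together, without any name-sorting or shuffling step.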