Shuffle Reads Prior to Alignment With BWA-mem
7.2 years ago
greenstick ▴ 10

Apologies if this has already been asked; I've looked around and haven't been able to get a clear, definitive answer to this question.

I have a number of interleaved paired-end FASTQ files that are sorted. I read in a (somewhat dated) GATK tutorial that the reads in these files should be sorted randomly (keeping the pairs together, one would assume) prior to alignment with BWA-mem, lest a bias be introduced. Is this correct?

Assuming it is, is there a tool that you know of / can recommend to sort interleaved paired-end FASTQs in such a way? I know there are scripts that can be written, but I'd prefer to stand on the shoulders of a (verified, preferably open-source) giant.

Many thanks!

alignment genome sequencing • 2.5k views

"... GATK tutorial that the reads in these files should be sorted randomly"

That's true if the FASTQ files were generated from a previously sorted BAM: the read order is then non-random, and because BWA-MEM infers the insert-size distribution from the batches of reads it processes, the estimate of the average fragment length is biased.


That is the case with the files in question; they were reverted from aligned BAMs to FASTQs. Is there any software you could recommend to shuffle them, by any chance? Thanks!

7.2 years ago
GenoMax 141k

Use shuffle.sh from the BBMap suite.

Description:  Reorders reads randomly, keeping pairs together.

Usage:  shuffle.sh in=<file> out=<file>

Thanks, this is perfect : )

7.2 years ago

Linearize each read pair onto a single line, prepend a random number, sort on that number, remove the random-number column, convert back to FASTQ, and compress. The call to srand() seeds the generator so that the order differs between runs:

   gunzip -c input.fastq.gz | paste - - - - - - - - | awk -F '\t' 'BEGIN{srand();} {printf("%f\t%s\n",rand(),$0);}' | LC_ALL=C sort -t $'\t' -k1,1g  | cut -f 2- | tr "\t" "\n" | gzip > shuffled.fastq.gz
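
Whichever shuffler is used, it may be worth confirming that mates are still adjacent in the output. Below is a minimal sketch of such a check. It assumes 4-line FASTQ records (no wrapped sequence lines), no tab characters in the read headers, and mates that share the same name up to the first space, optionally followed by /1 or /2. The function name `check_interleaved_pairs` is hypothetical, not part of any existing tool.

```shell
# Hypothetical helper: report whether an interleaved, gzipped FASTQ still has
# each read immediately followed by its mate.
check_interleaved_pairs() {
    gunzip -c "$1" \
      | paste - - - - - - - - \
      | awk -F '\t' '
          {
              # After paste, field 1 is the header of read 1 and field 5 the header of read 2.
              n1 = $1; n2 = $5
              sub(/[ ].*$/, "", n1); sub(/[ ].*$/, "", n2)    # drop anything after the read name
              sub(/\/[12]$/, "", n1); sub(/\/[12]$/, "", n2)  # drop a trailing /1 or /2
              if (n1 != n2) { bad++ }
          }
          END {
              printf("%d pairs, %d mismatched\n", NR, bad + 0)
              exit (bad > 0)   # non-zero exit status if any pair is broken
          }'
}
```

Running `check_interleaved_pairs shuffled.fastq.gz` prints the number of pairs seen and the number whose names do not match, and returns a non-zero status if any mismatch is found. A file with an unpaired trailing read is also counted as mismatched, because the second header field is then empty.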

I haven't tested this, though it does look like it would work. That said, I'm accepting the BBMap answer because it's a documented suite of tools with the specified functionality. Regardless, thank you for clarifying that the shuffling step is necessary.
