Tutorial: Trim & align paired-end reads in a single pass
gravatar for dariober
4.0 years ago by
WCIP | Glasgow | UK
dariober11k wrote:

I thought about sharing this little script to:

  • trim,
  • align,
  • sort,
  • clip overlap

paired-end fastq files in a single stream, i.e. without writing intermediate files.

This is to save time and space on disk, maybe useful to others (but also to check I'm not doing anything wrong or weird...):

paste <(zcat reads_R1.fq.gz | paste - - - -) \
      <(zcat reads_R2.fq.gz | paste - - - -) \
| tr '\t' '\n' \
| cutadapt --interleaved -a AGATCGGAAGAGC -A AGATCGGAAGAGC - \
| bwa mem -p genome.fa - \
| samtools sort -o - -O bam -T aln.tmp.bam \
| bam clipOverlap --in -.bam --out -.bam > aln.bam &&
samtools index aln.bam

The main trick is to convert paired-end fastq files to interleaved format (paste and tr commands), then use cutadapt to read and write interleaved reads and bwa mem to align reads from stdin.

I deliberately left out any optional parameter to each program (I usually set -M in bwa mem). Note that if the two reads in a pair are completely overlapping, clipOverlap will set the cigar string of one of the two reads to fully soft clipped (e.g. 75S). Picard will complain in such cases unless you set the VALIDATION_STRINGENCY to LENIENT or SILENT.

This was with:

cutadapt Version 1.9

bwa Version: 0.7.12-r1039

samtools Version: 1.1 (using htslib 1.1)

bam clipOverlap Version: 1.0.12

ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by dariober11k

check cutadapt doesn't remove the whole sequence from your reads.

ADD REPLYlink written 4.0 years ago by Pierre Lindenbaum128k

Hi Pierre, what do you mean? If a read partially matches the adapter at 3'end, only the overlapping part is trimmed.

ADD REPLYlink written 4.0 years ago by dariober11k

I see. I remember when I used cutadapt, I got some empty reads at the end (and bwa doesn't like this).

ADD REPLYlink written 4.0 years ago by Pierre Lindenbaum128k

Yes, cutadapt can give empty reads if the whole read is trimmed, i.e. it is only adapter sequence. In this cases bwa mem seems to do the right thing by reporting the empty reads as unmapped and with no sequence (older versions might crash). Anyway, cutadapt has -m option to discard reads shorter than INT bases (default is 0) so empty reads never occur. I usually set -m to around 15 or above.

ADD REPLYlink written 4.0 years ago by dariober11k
gravatar for chen
4.0 years ago by
chen2.0k wrote:

Try AfterQC (https://github.com/OpenGene/AfterQC), this tool can do trimming and pair-end overlap analysis in fastq level.

ADD COMMENTlink written 4.0 years ago by chen2.0k

Thanks, but I usually don't (actually never) do paired-end overlap at fastq level. If I have to, I clip paired-end overlaps after alignment using bam clipOverlap, which nicely works with streams (in fact I'm going to edit my post to include it). Does AfterQC work on streams? The nice thing about the cutadapt + bwa mem pipeline above is that you don't need to dump massive (and time consuming) trimmed fastq files on disk.

ADD REPLYlink written 4.0 years ago by dariober11k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1816 users visited in the last hour