I thought about sharing this little script to:
- clip overlap
paired-end fastq files in a single stream, i.e. without writing intermediate files.
This is to save time and space on disk, maybe useful to others (but also to check I'm not doing anything wrong or weird...):
paste <(zcat reads_R1.fq.gz | paste - - - -) \ <(zcat reads_R2.fq.gz | paste - - - -) \ | tr '\t' '\n' \ | cutadapt --interleaved -a AGATCGGAAGAGC -A AGATCGGAAGAGC - \ | bwa mem -p genome.fa - \ | samtools sort -o - -O bam -T aln.tmp.bam \ | bam clipOverlap --in -.bam --out -.bam > aln.bam && samtools index aln.bam
The main trick is to convert paired-end fastq files to interleaved format (
tr commands), then use cutadapt to read and write interleaved reads and bwa mem to align reads from stdin.
I deliberately left out any optional parameter to each program (I usually set -M in
bwa mem). Note that if the two reads in a pair are completely overlapping, clipOverlap will set the cigar string of one of the two reads to fully soft clipped (e.g.
75S). Picard will complain in such cases unless you set the VALIDATION_STRINGENCY to LENIENT or SILENT.
This was with:
cutadapt Version 1.9
bwa Version: 0.7.12-r1039
samtools Version: 1.1 (using htslib 1.1)
bam clipOverlap Version: 1.0.12