I'm trying to pipe my paired-end transcription factor ChIP-seq FASTQ files (about 25-30 GB each) from bowtie2 straight through samtools so that I can get to the point of running MACS2 without creating a bunch of huge intermediate SAM files. I'm only slightly experienced with the command line, so it's been a lot of trial and error so far; I just discovered pipes yesterday. Given how long it takes to run just one set of paired-end reads through, I don't want to waste a lot of time waiting to see whether my piped commands work correctly. I'd be very grateful if someone could comment on my command and let me know whether it is A) syntactically correct, B) efficient, and most importantly C) correct with respect to a typical transcription factor ChIP-seq workflow.
bowtie2 --mm -p12 -x /mnt/Storage-2/Indexes/mm10BT2index -1 /mnt/Storage-2/raw-data/ChIP-Seq/S4_R1_001.fastq -2 /mnt/Storage-2/raw-data/ChIP-Seq/S4_R2_001.fastq 2> S4-bowtie.report | samtools view -b -u -q 30 - | samtools sort -@ 4 -m 8G - ; samtools index *.bam
Starting with bowtie2: I am under the impression that --mm helps speed things up. Is that right? I used -p12 because I have 16 threads available. I send bowtie2's stderr to a file (is that necessary?), and its stdout should then be piped into samtools view.
For samtools view I have -b for BAM output, -u for uncompressed output (why waste time compressing data that goes straight into the next command?), and -q 30 to drop alignments with a mapping quality below 30.
For sorting, I'm under the impression that sorting is necessary prior to MACS2. Is that correct? I allocated 4 more threads to sorting along with 8 GB of memory per thread. The sorted BAM should then go into the samtools index command. Do I need -o in sort to actually save a BAM file to then index? Or will piping work as is?
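For reference, here is a sketch of the -o variant I'm asking about. As far as I can tell from the samtools documentation, sort writes to stdout unless -o is given, and index takes a file name rather than piped input, so this version names the output file explicitly and only indexes it if the pipeline succeeds. The function name align_sort_index and the output name S4.sorted.bam are placeholders made up for illustration; they are not part of the command above. Treat it as a sketch under those assumptions, not a verified recipe.

```shell
#!/bin/sh
# Sketch only: align_sort_index and S4.sorted.bam are hypothetical names.

# Turn on pipefail where the shell supports it, so that a failure in an
# earlier stage (e.g. bowtie2) makes the whole pipeline count as failed.
(set -o pipefail) 2>/dev/null && set -o pipefail

# Usage: align_sort_index INDEX R1_FASTQ R2_FASTQ OUT_BAM REPORT
align_sort_index() {
    # bowtie2 alignments go to stdout; its alignment summary (stderr) goes to REPORT.
    bowtie2 --mm -p 12 -x "$1" -1 "$2" -2 "$3" 2> "$5" \
        | samtools view -b -u -q 30 - \
        | samtools sort -@ 4 -m 8G -o "$4" - \
        && samtools index "$4"   # runs only if the pipeline above succeeded
}

# Example call with the paths from my command (uncomment to run):
# align_sort_index /mnt/Storage-2/Indexes/mm10BT2index \
#     /mnt/Storage-2/raw-data/ChIP-Seq/S4_R1_001.fastq \
#     /mnt/Storage-2/raw-data/ChIP-Seq/S4_R2_001.fastq \
#     S4.sorted.bam S4-bowtie.report
```

Naming one BAM file for samtools index, instead of *.bam, avoids surprises if more than one BAM file is sitting in the working directory.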
So far it has been running for 2+ hours and has only generated eight ~1 GB tmp.0000.bam files, which, from searching Biostars, I believe is normal?
Is there anything I could change in my command as written to speed it up? Is any option I've used redundant? Am I missing a particular input or output option? Or is my attempt at piping doomed to fail as written?
(I'm not terribly restricted on hardware: I'm running Ubuntu 18.04 with an 8C/16T CPU at 4 GHz, 64 GB of memory, and 2 TB of solid-state storage.)
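To make the resource request explicit, here is the arithmetic behind the numbers in my command, written as a small shell sketch. It assumes that samtools sort -m is a per-thread limit, as its documentation describes, and it ignores everything else that uses memory or CPU, such as the bowtie2 index and the samtools view process, so the totals are a rough lower bound rather than a measurement.

```shell
#!/bin/sh
# All numbers are taken from the post; adjust them for other machines.
total_threads=16            # 8C/16T CPU
total_mem_gb=64             # installed memory

bowtie_threads=12           # bowtie2 -p12
sort_threads=4              # samtools sort -@ 4
sort_mem_per_thread_gb=8    # samtools sort -m 8G (limit is per thread)

# Threads requested by the two heaviest stages of the pipeline.
threads_used=$((bowtie_threads + sort_threads))

# Approximate in-memory sort buffer across all sort threads.
sort_mem_gb=$((sort_threads * sort_mem_per_thread_gb))

echo "threads requested: $threads_used of $total_threads"
echo "sort buffer memory: ${sort_mem_gb}G of ${total_mem_gb}G"
```

With the values above, the two stages together request all 16 hardware threads, and the sort buffers alone can grow to roughly half of the installed memory.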
tl;dr: I'd be very grateful if someone could comment on my command and let me know whether it is A) syntactically correct, B) efficient, and most importantly C) correct with respect to a typical transcription factor ChIP-seq workflow.