Question: Not sure if pipe from bowtie2 to samtools is working?
0
gravatar for ejwright
6 weeks ago by
ejwright0
ejwright0 wrote:

I'm trying to pipe my paired end, transcription factor ChIP-seq fastq files (about 25-30GB each) from bowtie2 straight through samtools such that I can get to the point of running MAC2 without creating a bunch of huge intermediate SAM files. I'm only slightly experienced with command line interface so it's been a lot of trial and error so far. Just discovered pipes yesterday. Given how long it can take to run just one set of paired end reads through, I don't want to waste a bunch of time waiting to see if my piped commands work correctly, I'd be very grateful if someone could comment on my command and let me know if it's A) correct syntax B) efficient and most importantly C) correct with respect to typical transcription factor chip-seq workflow.

bowtie2 --mm -p12 -x /mnt/Storage-2/Indexes/mm10BT2index -1 /mnt/Storage-2/raw-data/ChIP-Seq/S4_R1_001.fastq -2 /mnt/Storage-2/raw-data/ChIP-Seq/S4_R2_001.fastq 2> S4-bowtie.report | samtools view -b -u -q 30 - | samtools sort -@ 4 -m 8G - ; samtools index *.bam

Starting with bowtie, I am under the impression that -mm helps speed things up? -p12 because I have 16 threads available. I have the stderr of bowtie going to a file (necessary?) and then the stdout should be piped in to samtools view.

For samtools view I have -b for BAM, -u for uncompressed because why waste time compressing the output, -q 30 to ignore weaker maps.

For sorting, I'm under the impression that Sorting is necessary prior to MACS2? I allocated 4 more threads to sorting along with 8GB of memory per thread, the sorted BAM should then be piped into the samtools index command. Do I need -o in sort to actually save a bam file to then sort? Or will piping as is, work?

So far it's been running 2+ hours and it's only generated eight ~1GB tmp.0000.bam files, which from searching biostars, I believe is normal?

Is there anything I could do to speed it up as far as my command as I have it written? Is there some option i've used that's redundant? or am I lacking a particular input or output option? Or is my attempt at piping doomed to fail as it is written?

(I'm not terribly restricted on hardware, I'm running in Ubuntu 18.04, my cpu is 8C/16T @ 4GHz, I have 64GB of memory, 2TB of solid state storage).

tl;dr I'd be very grateful if someone could comment on my command and let me know if it's A) correct syntax B) efficient and most importantly C) correct with respect to typical transcription factor chip-seq workflow.

ADD COMMENTlink modified 6 weeks ago by ATpoint30k • written 6 weeks ago by ejwright0

Just a quick comment while I am pressed for time.

I have, coincidentally, been running exactly this for the past few days and it should work. Here was my code (a loop, which loops over single-end FASTQ files in a text file, fastq.list):

paste fastq.list | while read fastq ;
do
  echo -e "--input file is ""${fastq}" ;
  outfile=$(echo "${fastq}" | sed 's/\.\/raw\///g' | sed 's/\/[0-9A-Za-z]*\.fastq\.gz$//g') ;
  echo -e "--output file is Aligned/""${outfile}"".bam\n" ;

  /Programs/bowtie2-2.2.6/bowtie2 \
    --threads 2 \
    -x /ReferenceMaterial/1000Genomes/hg19/bowtie2_idx \
    -U "${fastq}" | /Programs/samtools-1.9/samtools view -bS - > Aligned/"${outfile}".bam ;
done ;

Doing the sort step in the pipe sequence is overkill, as that [samtools sort] breaks up the BAM into chunks.... but how can it do that while the final BAM is still being generated?

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Kevin Blighe55k

sort only produces chunks if the allocated memory is full (controlled by the -m parameter). These sorted chunks will be spilled to disk and then later merged once all reads are processed. So essentially this is sorting followed by merging and the merged output is then again sent to stdout or captured to a file with -o. Sorting is a bottleneck, no doubt about that. You are lucky if you are spoiled with large-memory servers and everything can be done in RAM :) If OP needs indexed files sorting is inevitable though.

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by ATpoint30k
0
gravatar for ATpoint
6 weeks ago by
ATpoint30k
Germany
ATpoint30k wrote:

Currently, you are not capturing the output of sort to a file. Output currently will be send to stdout so to screen and this will crash everything. You need the option -o. like -o output_sorted.bam. You can simply go through view without converting to bam as sort can do that as well. You can even pipe into index and save the sorted file via tee:

bowtie2 (...) | samtools view -h -q30 | samtools sort -O BAM | tee output_sorted.bam | samtools index - output_sorted.bam.bai

In addition, bowtie2 will accept gzip-compressed files so to save space you do not need to use decompress the fastq files.

ADD COMMENTlink modified 6 weeks ago • written 6 weeks ago by ATpoint30k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1258 users visited in the last hour