Question

Common practice for aligning hundreds of FASTQ files? Would the resulting BAM files need to be merged into one? Is there any benefit to aligning 1000 samples at the same time rather than in batches?

2.1 years ago

' ▴ 300

I have a few questions regarding alignment with tools such as STAR, Kallisto, etc.

Assuming samples are from the same experiment.

What do people commonly do when they have, for example, 1000 paired-end samples (i.e. 2000 FASTQ files)? Do they break the run into batches and call STAR several times with very small batches of samples (for example, only 10 files at a time)? Is there a downside to this?
Is there a preference or a difference if I run STAR with 1000 samples simultaneously or if I instead call STAR 100 times each time with just 10 samples? Obviously, I will end up with 100 BAM files as opposed to only one BAM file. But are my alignment results different now?
Does merging the BAM files from 100 separate runs into one BAM file give me the same result as if I ran STAR with exactly 1000 samples simultaneously?
Is there a scenario where it would be 100% preferable to run STAR or Kallisto with 1000 samples at once rather than breaking the run into hundreds of separate batches?

fastq alignment bam rna-seq STAR • 1.0k views

ADD COMMENT • link 2.1 years ago by ' ▴ 300

First of all, what kind of computational power or infrastructure do you have? Is an HPC cluster available, or a powerful workstation? Please share the number of CPUs and the amount of memory available, because the answer really depends on that. Ideally, on an HPC cluster you would just spawn one job per file pair and let the scheduler take care of the execution, with something like Nextflow or Snakemake orchestrating the whole thing.
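
To make the "one job per file pair" idea concrete, here is a minimal Python sketch that groups FASTQ files into pairs and builds one STAR command per sample. The `_R1`/`_R2` naming scheme, the helper names `pair_fastqs` and `star_command`, and the index directory are assumptions for illustration, not something from your setup. Submitting each command to a scheduler (or wrapping it in a Nextflow/Snakemake rule) is left out.

```python
import re
from collections import defaultdict

# Assumed naming scheme (hypothetical): <sample>_R1.fastq.gz / <sample>_R2.fastq.gz
PAIR_RE = re.compile(r"^(?P<sample>.+)_R(?P<mate>[12])\.fastq\.gz$")


def pair_fastqs(filenames):
    """Group FASTQ file names into {sample: (R1, R2)}; raise on bad or incomplete pairs."""
    mates = defaultdict(dict)
    for name in filenames:
        match = PAIR_RE.match(name)
        if match is None:
            raise ValueError(f"unrecognised FASTQ name: {name}")
        mates[match["sample"]][match["mate"]] = name

    pairs = {}
    for sample, found in sorted(mates.items()):
        if set(found) != {"1", "2"}:
            raise ValueError(f"incomplete pair for sample {sample}")
        pairs[sample] = (found["1"], found["2"])
    return pairs


def star_command(sample, r1, r2, genome_dir, threads=16):
    """Build the argument list for one independent STAR run (one sample, one BAM)."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,          # same index for every sample
        "--readFilesIn", r1, r2,
        "--readFilesCommand", "zcat",       # inputs assumed to be gzipped
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", f"{sample}.",
    ]


if __name__ == "__main__":
    files = ["a_R1.fastq.gz", "a_R2.fastq.gz", "b_R1.fastq.gz", "b_R2.fastq.gz"]
    for sample, (r1, r2) in pair_fastqs(files).items():
        print(" ".join(star_command(sample, r1, r2, genome_dir="star_index")))
```

Each command is independent of the others, so the scheduler can run as many at once as the resources allow, and every sample ends up with its own BAM file.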

ADD REPLY • link 2.1 years ago by ATpoint 82k

ATpoint The answer to that question is really complicated in my situation. In short, cloud is available, so I can use Google Cloud or Microsoft Azure with as much memory and CPU as needed. However, without going into too much detail, I have certain other limits, such as being restricted to 250GB of disk space at a time (so essentially this would mean 100GB worth of FASTQ, to leave enough space for temp files and the BAM file that STAR/Kallisto writes to disk). Available memory is around 128GB, with 16 CPU cores.

ADD REPLY • link 2.1 years ago by ' ▴ 300

Then it seems you have to do it in batches sized to fit the available storage space. Something like kallisto will be way faster while not consuming lots of space, because it does not produce BAM files by default. As long as the software version and the index are the same, I see no reason why splitting into batches should be bad. It seems you have to do that anyway.
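
As a rough sketch of planning such batches, the Python snippet below greedily groups samples so that the combined FASTQ size of each batch stays under a budget. The 100GB default comes from the figure mentioned in this thread; the function name `plan_batches` and the sample sizes are made up for illustration.

```python
def plan_batches(sample_sizes_gb, budget_gb=100.0):
    """Greedily group samples, in input order, so each batch's FASTQ total fits budget_gb.

    sample_sizes_gb maps a sample name to the combined size (in GB) of its R1 + R2 files.
    """
    batches, current, used = [], [], 0.0
    for sample, size in sample_sizes_gb.items():
        if size > budget_gb:
            raise ValueError(f"sample {sample} alone exceeds the {budget_gb} GB budget")
        # Close the current batch if adding this sample would overflow the budget.
        if current and used + size > budget_gb:
            batches.append(current)
            current, used = [], 0.0
        current.append(sample)
        used += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Hypothetical sizes, for illustration only.
    sizes = {"s1": 40.0, "s2": 40.0, "s3": 40.0, "s4": 10.0}
    for number, batch in enumerate(plan_batches(sizes), start=1):
        print(f"batch {number}: {batch}")
```

For each batch you would download the FASTQ files, run the aligner once per sample, copy the results off the disk, and delete the inputs before starting the next batch.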

ADD REPLY • link 2.1 years ago by ATpoint 82k