Efficient way for Multiple Sample Mapping with STAR ?
1
0
7 weeks ago

Hello, everyone !

While studying the STAR manual, I learned that multiple samples can be mapped at once by passing comma-separated file lists (no spaces around the commas) to --readFilesIn:

--readFilesIn sample1read1.fq,sample2read1.fq sample1read2.fq,sample2read2.fq


But until now I have mapped multiple samples with a "for" loop. That is, the samples have been mapped one by one.

Mapping multiple samples at once using the parameter

vs

mapping multiple samples one by one using a "for" loop

If I use the same number of threads, which way is more efficient?
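
For readers comparing the two approaches, here is a minimal sketch of what each could look like. The sample names, the index directory `star_index`, the thread count and the output prefixes are hypothetical placeholders, and the commands are only printed with `echo` (a dry run); remove the `echo` to actually run STAR. As far as I know, the single invocation writes all reads into one set of output files, so the samples would have to be told apart afterwards with read-group tags (`--outSAMattrRGline`).

```shell
#!/bin/sh
# Hypothetical inputs: adjust to your own data.
samples="sample1 sample2"
genome_dir=star_index
threads=8

# Option A: one STAR run per sample (the "for" loop).
loop_cmds() {
  for s in $samples; do
    echo STAR --runThreadN "$threads" --genomeDir "$genome_dir" \
      --readFilesIn "${s}read1.fq" "${s}read2.fq" \
      --outFileNamePrefix "${s}_"
  done
}

# Option B: one STAR run for all samples.
# Build comma-separated mate lists with no spaces around the commas.
single_cmd() {
  r1=""
  r2=""
  for s in $samples; do
    r1="${r1:+$r1,}${s}read1.fq"
    r2="${r2:+$r2,}${s}read2.fq"
  done
  echo STAR --runThreadN "$threads" --genomeDir "$genome_dir" \
    --readFilesIn "$r1" "$r2" \
    --outFileNamePrefix all_samples_
}

loop_cmds
single_cmd
```

Both mate lists must name the samples in the same order, which is why the sketch builds them in a single loop.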

star • 863 views
0

I think the STAR developer is best positioned to answer this, as he has most probably run tests. In any case, I think any gain in the former case would come from the genome being loaded into memory only once, although that could be enforced in the latter case too by keeping the genome in shared memory.
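
As a rough sketch of how the shared-memory idea could be applied to a loop, STAR's `--genomeLoad` option can load the index once, let each per-sample run attach to it, and remove it at the end. The sample names, the index directory `star_index` and the thread count are hypothetical placeholders, and the commands are only printed with `echo` (a dry run); remove the `echo` to run them. Whether this helps in practice depends on the machine and the STAR settings used, so treat it as an illustration rather than a tested recipe.

```shell
#!/bin/sh
# Hypothetical inputs: adjust to your own data.
samples="sample1 sample2"
genome_dir=star_index
threads=8

shared_genome_cmds() {
  # Load the index into shared memory once, without mapping anything.
  echo STAR --genomeDir "$genome_dir" --genomeLoad LoadAndExit

  # Each per-sample run reuses the genome that is already in shared memory.
  for s in $samples; do
    echo STAR --runThreadN "$threads" --genomeDir "$genome_dir" \
      --genomeLoad LoadAndKeep \
      --readFilesIn "${s}read1.fq" "${s}read2.fq" \
      --outFileNamePrefix "${s}_"
  done

  # Free the shared memory when all samples are done.
  echo STAR --genomeDir "$genome_dir" --genomeLoad Remove
}

shared_genome_cmds
```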

The latter case, when modified to run one sample per node, allows for better parallelization. A loop is the least efficient way to do things IMO.

3
7 weeks ago

I would suggest not calling this multiple mapping. Multi-mapping typically means something else entirely. I am making this point for future readers who get here via a Google search on multiple mapping :-)

This is a question about the advantages of listing all files at once versus separately. There is no "multiple mapping" here; every sample is mapped only once.

If you think about it, there will be:

1. fixed cost of starting the mapping
2. then you align N1 + N2 + N3 + ... reads with T threads

In both cases the work done in stage 2 is the same: by the end, you have mapped N1 + N2 + N3 + ... reads with T threads, so that time won't change.

What will change is that the fixed startup cost is paid again in each loop iteration. This may or may not be a substantial addition to the total runtime.
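
To make the argument concrete, here is a back-of-the-envelope calculation. All three numbers are made-up placeholders, not measurements; plug in your own startup and alignment times.

```shell
#!/bin/sh
# Hypothetical numbers for illustration only.
n_samples=10      # number of samples
startup_s=120     # fixed startup cost per STAR run, in seconds
align_s=3600      # time to align all reads from all samples, in seconds

# Loop: the startup cost is paid once per sample.
loop_total=$(( n_samples * startup_s + align_s ))

# Single run: the startup cost is paid once.
once_total=$(( startup_s + align_s ))

echo "loop: ${loop_total}s, single run: ${once_total}s, extra: $(( loop_total - once_total ))s"
```

With these placeholder values the loop costs (n_samples - 1) extra startups, so the difference only matters when the startup time is large relative to the alignment time per sample.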

Long story short, listing all samples at once is probably more advantageous.

0

Thank you sooo much !! : )