Question

STAR aligner very slow for human genome index generation

4.4 years ago

moustafa_abohawya ▴ 20

Hello all

I have a problem with the STAR aligner. The human genome index generation step is very slow: it has been running for days and never finishes.

I have access to 230 G of RAM and 20 cores. I even raised the RAM limit to 210 G, but STAR still uses only about 10% of it.

Here is the command I am using:

./STAR --runMode genomeGenerate --sjdbGTFfile /home/moustafa/RNAseq/Homo_sapiens.GRCh38.98.gtf --sjdbOverhang 74 --limitGenomeGenerateRAM 210000000000 --genomeDir /home/moustafa/RNAseq/Human/ --genomeFastaFiles /home/moustafa/RNAseq/Homo_sapiens.GRCh38.dna.primary_assembly.fa

Here is the free -h output

$ free -h 

              total        used        free      shared  buff/cache   available
Mem:           230G         20G        204G        1.0M        5.1G        208G    
Swap:          8.0G          0B        8.0G

I hope you can help me with this.

RNA-Seq STAR genomeindex • 4.6k views

ADD COMMENT • link updated 4.4 years ago by i.sudbery 19k • written 4.4 years ago by moustafa_abohawya ▴ 20

If you want to use premade STAR indexes you can download them from Alex Dobin's (author of STAR) site here.

ADD REPLY • link 4.4 years ago by GenoMax 142k

Interestingly, it seems he was able to do the whole job in about 20 min! Something is indeed quite wrong, because for me it always takes days. I want to figure out where the problem is, because I believe it might happen again in the mapping step.

ADD REPLY • link 4.4 years ago by moustafa_abohawya ▴ 20

GenoMax · Answer 1 · 2019-12-23

4.4 years ago

i.sudbery 19k

If you want STAR to use more than one CPU, you need to tell it that with the --runThreadN option, followed by the number of CPU cores you would like it to use. e.g. --runThreadN 20 would run with 20 cores.

It won't ever use more memory than is required to fit the genome index in RAM.
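
As a minimal sketch of the above, the command from the question could be rebuilt with an explicit thread count. The paths are copied from the question; the helper name `star_index_cmd` is hypothetical, and the RAM limit option is left out on the assumption that it is not what is holding the job back. The function only prints the command, so nothing runs until you paste it into the shell.

```shell
# Print the genomeGenerate command with an explicit thread count.
# Paths are copied from the question; the thread count is the first argument.
star_index_cmd() {
    threads=$1
    printf '%s ' ./STAR --runMode genomeGenerate \
        --runThreadN "$threads" \
        --genomeDir /home/moustafa/RNAseq/Human/ \
        --genomeFastaFiles /home/moustafa/RNAseq/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
        --sjdbGTFfile /home/moustafa/RNAseq/Homo_sapiens.GRCh38.98.gtf \
        --sjdbOverhang 74
    echo
}

# Dry run: show the command for 20 threads.
star_index_cmd 20
```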

ADD COMMENT • link 4.4 years ago by i.sudbery 19k

I tried that before with --runThreadN 11; it also ran for 4 days and never finished. It generated 13 G of files, but very slowly, and it just never stopped.

What do you mean that it won't use more memory than required? From what I have read, the human genome should need more than 32 G of RAM, but STAR is only using about 20 G, even though I increased the limit to 210 G.

ADD REPLY • link 4.4 years ago by moustafa_abohawya ▴ 20

Something is not right. With 11 cores the process should not need 4+ days. It may need 6-8 h, but that should be the upper end. Aside from memory, a lack of disk space could be a problem. Check on this: STAR creates a _STARtemp directory (and files) while it is working.

You are correct in that STAR should need 30-40G of RAM for human genome. I don't recollect it needing more for genomeGenerate step.
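
A rough way to check both points before starting a run is sketched below. It assumes a Linux system with `/proc/meminfo`; the function names are made up for illustration, and the script only reports numbers rather than judging whether they are enough.

```shell
# Available space (GiB, rounded down) on the filesystem holding a directory.
avail_disk_gib() {
    df -Pk "$1" | awk 'NR==2 { print int($4 / 1024 / 1024) }'
}

# Available RAM (GiB, rounded down) as reported by the kernel.
avail_ram_gib() {
    awk '/^MemAvailable:/ { print int($2 / 1024 / 1024) }' /proc/meminfo
}

# Directory to check: first script argument, or the current directory.
dir=${1:-.}
echo "disk available in $dir: $(avail_disk_gib "$dir") GiB"
echo "RAM available: $(avail_ram_gib) GiB"
```

Saved as a script, it would be run with the genome directory as its argument, since that is where the index files end up.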

ADD REPLY • link 4.4 years ago by GenoMax 142k

I am trying to figure this out, but it doesn't seem to be the disk space. Here is my df -h output:

Filesystem      Size  Used Avail Use% Mounted on
udev            116G     0  116G   0% /dev
tmpfs            24G  1.1M   24G   1% /run
/dev/sda2       251G   68G  171G  29% /
tmpfs           116G     0  116G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           116G     0  116G   0% /sys/fs/cgroup
/dev/loop0       90M   90M     0 100% /snap/core/8268
/dev/loop1       89M   89M     0 100% /snap/core/7270
tmpfs            24G     0   24G   0% /run/user/1000

ADD REPLY • link updated 4.4 years ago by GenoMax 142k • written 4.4 years ago by moustafa_abohawya ▴ 20

It is unclear which directory you are working in, but it is likely on /. If this is a spinning hard drive then it is going to be significantly slower than a solid-state drive. If you want to move on with your analysis and have enough internet bandwidth then you could download Alex's pre-made indexes.

ADD REPLY • link 4.4 years ago by GenoMax 142k

The point is that I think mapping will have the same issue, so I need to make sure it is solved before moving on to mapping. The issue is pretty weird.

ADD REPLY • link 4.4 years ago by moustafa_abohawya ▴ 20

I seem to remember it takes about 4 hours on 4 cores, so even if it's only a single core, it shouldn't take longer than 1 day, unless they are really slow cores. You could run it with time -v, and that might give us some idea whether the process is CPU, memory or I/O bound.
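
A sketch of how that could look in practice. Note that the `time` keyword built into bash does not accept `-v`; the verbose report comes from GNU time, which is assumed here to be installed at `/usr/bin/time`. The helper names `timed` and `summarise` are hypothetical.

```shell
# Run a command under GNU time and write the verbose report to a log file.
# Assumes GNU time is installed at /usr/bin/time; the shell's built-in
# `time` keyword does not understand -v.
timed() {
    log=$1
    shift
    /usr/bin/time -v -o "$log" "$@"
}

# Show the report lines that help separate CPU, memory and I/O limits.
summarise() {
    grep -E 'Elapsed \(wall clock\)|Percent of CPU|Maximum resident set size|File system (inputs|outputs)|Major \(requiring I/O\) page faults' "$1"
}
```

You would call `timed star_time.log` followed by the full STAR command, then `summarise star_time.log` once it exits. As a rule of thumb, a "Percent of CPU this job got" value far below 100% times the thread count suggests the process spends most of its time waiting, typically on I/O.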

One possibility, I guess, is that your home directory is mounted over a really slow network connection. Or perhaps something in the OS is limiting the speed? Is this a machine sitting on your desk, or is it a remote server of some sort? Do you know if it's bare metal, or are you using a VM of some sort?
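
To test the network-mount idea, something like the following could report the filesystem type behind a directory. It assumes a `df` that supports `-T` (GNU coreutils does), and the list of network filesystem names is an illustrative assumption, not an exhaustive one.

```shell
# Filesystem type of the mount that holds a directory (needs df with -T).
fs_type() {
    df -PT "$1" | awk 'NR==2 { print $2 }'
}

# Succeeds if the type name looks like a network filesystem.
# The list below is an assumption and certainly incomplete.
is_network_fs() {
    case "$1" in
        nfs*|cifs|smb*|fuse.sshfs|ceph|fuse.glusterfs|lustre) return 0 ;;
        *) return 1 ;;
    esac
}

# Directory to check: first script argument, or the current directory.
dir=${1:-.}
fstype=$(fs_type "$dir")
if is_network_fs "$fstype"; then
    echo "$dir is on a network filesystem ($fstype)"
else
    echo "$dir is not on a known network filesystem ($fstype)"
fi
```

Run it against both the genome directory and the directory STAR is started from, since the temporary files may land in either.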

ADD REPLY • link 4.4 years ago by i.sudbery 19k

I am actually using a VM with Ubuntu on this remote server! It is worth mentioning that it is quite slow to connect to or work on. I guess the step that takes the most time is writing the suffix index to disk. Do you have any idea how this might cause the problem? Also, do you have any idea how I might fix it?

ADD REPLY • link 4.4 years ago by moustafa_abohawya ▴ 20

If you have a VM assigned with not enough resources then you are not going to be able to complete this job. You will need to figure out what exact resources your VM has. Having the VM run on a server with 20 cores and 210 G of RAM does not give you access to all those resources automatically.
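
One quick check from inside the VM is to compare the thread count you plan to request with the cores the guest can actually see. The sketch below assumes `nproc` from coreutils is available; the helper names are hypothetical.

```shell
# Number of CPU cores the current process is allowed to use.
available_cores() {
    nproc
}

# Never ask for more threads than there are usable cores.
clamp_threads() {
    requested=$1
    cores=$(available_cores)
    if [ "$requested" -gt "$cores" ]; then
        echo "$cores"
    else
        echo "$requested"
    fi
}

echo "cores visible to this machine: $(available_cores)"
echo "threads to pass to --runThreadN: $(clamp_threads 20)"
```

Core count alone does not show how much of the host's CPU time the guest really gets. If the `st` (steal) column in `vmstat 1` stays high while STAR is running, the hypervisor is handing CPU time to other guests, which only the server's administrator can change.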

ADD REPLY • link 4.4 years ago by GenoMax 142k