Question: STAR aligner index generation issues
Uday Rangaswamy (Indian Institute of Technology, Madras, India) wrote, 17 days ago:

Hi,

I am running the following command to generate a genome index using the STAR aligner:

Softwares/STAR/bin/Linux_x86_64/STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /scratch/urangasw/data/hg38_index --genomeFastaFiles /home/urangasw/data/genome/human/Homo_sapiens.GRCh38.dna.primary_assembly.fa --sjdbGTFfile /home/urangasw/data/annotations/Homo_sapiens.GRCh38.102.gtf --sjdbOverhang 74 --limitGenomeGenerateRAM 25000000000

However, the process is getting killed at this particular step:

Jan 09 07:02:25 ..... started STAR run
!!!!! WARNING: Could not move Log.out file from ./Log.out into /scratch/urangasw/data/hg38_index/Log.out. Will keep ./Log.out

Jan 09 07:02:26 ... starting to generate Genome files
Jan 09 07:03:28 ..... processing annotations GTF
Jan 09 07:04:08 ... starting to sort Suffix Array. This may take a long time...
Jan 09 07:04:31 ... sorting Suffix Array chunks and saving them to disk...
Killed

I have tried varying the number of threads as well as --limitGenomeGenerateRAM, but that didn't help. Available RAM is as follows:

             total        used        free      shared  buff/cache   available
Mem:       65778304    21084068    34436756      357952    10257480    43839268
Swap:      33031164       54140    32977024

Please help me understand what this issue is about and how to go about it. Thanks in advance :)

rna-seq star alignment genome
written 17 days ago by Uday Rangaswamy

You're running out of disk space, not RAM. See if you have write permissions and sufficient space in the location you're trying to write into.
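Both conditions can be checked directly from the shell. The sketch below uses a placeholder GENOME_DIR variable; substitute your actual --genomeDir path:

```shell
# Sketch: check free space and write access for the STAR index directory.
# GENOME_DIR is a placeholder; point it at your actual --genomeDir path.
GENOME_DIR=${GENOME_DIR:-.}

# Free space on the filesystem that holds the index directory
df -h "$GENOME_DIR"

# Write-permission probe: try to create (and then remove) a test file
if touch "$GENOME_DIR/.star_write_test" 2>/dev/null; then
    rm -f "$GENOME_DIR/.star_write_test"
    echo "writable"
else
    echo "not writable"
fi
```

Note that df reports on the filesystem the directory is mounted on, which on clusters is often a different device from your home directory.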

written 17 days ago by _r_am

Hi _r_am,

I have 5 TB of disk space to myself on the scratch partition, and I do have write permission for the output folder:

[urangasw@login1 data]$ ls -al
total 12
drwxr-xr-x 3 urangasw fsg 4096 Jan  8 11:17 .
drwxr-xr-x 3 urangasw fsg 4096 Jan  8 11:16 ..
drwxrwxrwx 2 urangasw fsg 4096 Jan  8 11:20 hg38_index

Other files, such as chrLength.txt, have already been generated in the target output folder, which wouldn't be possible without write permission. Any alternate thoughts/suggestions, please?

modified 17 days ago • written 17 days ago by Uday Rangaswamy

That's odd indeed. Unless ./hg38_index is on a different mount point, you should not have disk-space-related problems. Maybe check with your admin - they might have something set up that kills jobs exceeding certain resource limits. This is often the case if you're running on a login node rather than on a designated compute node of the cluster.
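One way to look for such limits yourself, assuming a Linux host (the cgroup file path below varies between systems and is an assumption):

```shell
# Sketch: look for limits that could explain a process being "Killed".

# Shell-imposed memory limits ("unlimited" means no cap is set):
ulimit -v   # virtual memory, in kB
ulimit -m   # max resident set size, in kB

# On cgroup v1 systems this file, if present, holds the memory cap in
# bytes; the exact path differs per system, so treat it as an assumption:
cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null || true

# The kernel OOM killer logs its victims to the kernel ring buffer
# (reading it may require root on some systems):
dmesg 2>/dev/null | grep -i 'out of memory' | tail -n 5 || true
```

If dmesg shows an "Out of memory" line naming the STAR process, the kernel OOM killer (or a cgroup cap) is the culprit rather than disk space.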

written 17 days ago by _r_am

Will check with the admin and get back to you. Thanks for your time.

written 17 days ago by Uday Rangaswamy

You seem to be limiting your RAM request to 25 GB. Can you either remove that option or increase the number to, say, 35000000000 (35 GB) and check?

written 17 days ago by GenoMax

Hi GenoMax,

I tried both options. The process still gets killed at the same step.

written 17 days ago by Uday Rangaswamy

Are you the only user on this machine? Can you increase the number to 40 GB? I have not done this recently, but this page seems to indicate that should work.

written 17 days ago by GenoMax

No, it's a shared cluster. However, there are no other processes running in parallel (checked using the top command). I tried 40 GB as well, but unfortunately it still gets killed at the same point.

written 17 days ago by Uday Rangaswamy

If this is a cluster and you are using a job scheduler, then you need to make sure that your job-submission wrapper accounts for this additional RAM request. Since you posted just the bare STAR command above, all of the diagnostic advice you have received so far applies only to that command.

modified 17 days ago • written 17 days ago by GenoMax

Oh, alright. I was under the impression that the command would distribute its load across the cluster's available resources by itself. So basically, on such shared clusters, I need to write a job script with an explicit resource request, in addition to the limits set within the command itself. Is that correct?

written 17 days ago by Uday Rangaswamy

You should find out which job scheduler your cluster uses (e.g. SLURM, PBS, LSF). If this is a shared cluster, it almost certainly uses one. Every job scheduler has its own syntax for requesting resources, which is done outside of the program you are trying to run. Ask fellow users/admins.

written 17 days ago by GenoMax

Our cluster uses SLURM for scheduling. The following resource request worked for me:

sbatch --ntasks=1 --cpus-per-task=32 --mem=32000mb --partition=regular1 --time=05:00:00 genome_generate.sh

Content of genome_generate.sh:

#!/bin/sh

STAR --runThreadN 32 --runMode genomeGenerate --genomeDir /scratch/urangasw/data/hg38_index --genomeFastaFiles /home/urangasw/data/genome/human/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
--sjdbGTFfile /home/urangasw/data/annotations/Homo_sapiens.GRCh38.102.gtf --sjdbOverhang 74

Thanks for the help :)

written 15 days ago by Uday Rangaswamy
Powered by Biostar version 2.3.0