STAR aligner index generating issues
3.3 years ago
bioinfo456 ▴ 150

Hi,

I am running the following command to generate a genome index with the STAR aligner:

Softwares/STAR/bin/Linux_x86_64/STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /scratch/urangasw/data/hg38_index --genomeFastaFiles /home/urangasw/data/genome/human/Homo_sapiens.GRCh38.dna.primary_assembly.fa --sjdbGTFfile /home/urangasw/data/annotations/Homo_sapiens.GRCh38.102.gtf --sjdbOverhang 74 --limitGenomeGenerateRAM 25000000000

However, the process gets killed at this step:

Jan 09 07:02:25 ..... started STAR run
!!!!! WARNING: Could not move Log.out file from ./Log.out into /scratch/urangasw/data/hg38_index/Log.out. Will keep ./Log.out

Jan 09 07:02:26 ... starting to generate Genome files
Jan 09 07:03:28 ..... processing annotations GTF
Jan 09 07:04:08 ... starting to sort Suffix Array. This may take a long time...
Jan 09 07:04:31 ... sorting Suffix Array chunks and saving them to disk...
Killed

I have tried varying the number of threads as well as --limitGenomeGenerateRAM, but that didn't help. Available RAM is as follows:

             total        used        free      shared  buff/cache   available
Mem:       65778304    21084068    34436756      357952    10257480    43839268
Swap:      33031164       54140    32977024

Please help me understand what this issue is about and how to go about it. Thanks in advance :)
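
A quick sanity check on the numbers above, as a sketch only: `free` reports kibibytes by default, while STAR's `--limitGenomeGenerateRAM` is given in bytes. Assuming the table is default `free` output, the two can be compared as follows (`kib_to_bytes` is a hypothetical helper, not part of STAR):

```shell
#!/bin/sh
# Hypothetical helper: convert a KiB value (free's default unit) to bytes.
kib_to_bytes() {
    echo $(( $1 * 1024 ))
}

avail_kib=43839268        # "available" column from the output above
limit_bytes=25000000000   # value passed to --limitGenomeGenerateRAM

avail_bytes=$(kib_to_bytes "$avail_kib")
echo "available: $avail_bytes bytes"

# Compare the STAR limit with what the node reports as available.
if [ "$avail_bytes" -ge "$limit_bytes" ]; then
    echo "limit fits within available memory"
else
    echo "limit exceeds available memory"
fi
```

With these values the 25 GB limit fits inside the roughly 44.9 GB reported as available, so the free memory on this node does not by itself explain the kill.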

RNA-Seq alignment star genome • 2.6k views
You're running out of disk space, not RAM. See if you have write permissions and sufficient space in the location you're trying to write into.

Hi _r_am,

I have 5 TB of disk space to myself in the scratch section. I do have write permission on the output folder:

[urangasw@login1 data]$ ls -al
total 12
drwxr-xr-x 3 urangasw fsg 4096 Jan  8 11:17 .
drwxr-xr-x 3 urangasw fsg 4096 Jan  8 11:16 ..
drwxrwxrwx 2 urangasw fsg 4096 Jan  8 11:20 hg38_index

Other text files, such as chrLength.txt, have been generated in the target output folder, which I don't think would be possible if I didn't have write permission. Any alternative thoughts or suggestions, please?

That's odd indeed. Unless ./hg38_index is mounted on a different point, you should not have disk space related problems. Maybe check with your admin - they might have something set up to kill jobs if they exceed certain parameters. This could be the case if you're using login nodes and not specially designated compute nodes on a cluster.
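
One quick check that can be run from the same shell, as a sketch only: print the per-process limits that the shell would pass on to STAR. `show_limit` is a hypothetical helper. Shell limits are just one of several ways a site can cap jobs, so an "unlimited" result here does not rule out a policy enforced some other way.

```shell
#!/bin/sh
# Hypothetical helper: print one resource limit of the current shell.
#   $1 = label to print, $2 = ulimit flag to query
show_limit() {
    printf '%s: %s\n' "$1" "$(ulimit "$2")"
}

# Limits most likely to matter for a memory- and disk-heavy indexing run.
show_limit "virtual memory" -v
show_limit "file size" -f
```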

Will check with the admin and get back to you. Thanks for your time.

You seem to be limiting your RAM request to 25 GB. Can you either remove that limit or increase the number to, say, 35000000000 and check?

Hi Genomax,

I checked for both conditions. The process is still getting killed at the same step.

Are you the only user on this machine? Can you increase the number to 40GB? I have not done this recently but this page seems to indicate that should work.

No, it's a shared cluster. However, there are no other processes running in parallel (checked using top command). I tried it using 40GB as well, but unfortunately, it still gets killed at the same point.

3.3 years ago
GenoMax 141k

If this is a cluster and you are using a job scheduler, then you need to make sure that your scheduler command wrapper accounts for this additional request for RAM. Since you posted just the bare STAR command above, all of the diagnostic advice you have received so far only accounts for that command.

Oh alright. I was under the impression that the command would distribute its load onto the available resources of the cluster by itself. So basically, I need to write a job script containing explicit resource allocation apart from the one contained in the command itself while using such shared clusters. Is that correct?

You should find out which job scheduler your cluster uses (e.g. SLURM, PBS, LSF). If this is a shared cluster, it almost certainly uses one. Every job scheduler has a different syntax for requesting resources, and the request is made outside of the program you are trying to run. Ask fellow users or admins.

Our cluster is using SLURM for scheduling. The following resource request worked for me :

sbatch --ntasks=1 --cpus-per-task=32 --mem=32000mb --partition=regular1 --time=05:00:00 genome_generate.sh

Content of genome_generate.sh:

#!/bin/sh

STAR --runThreadN 32 --runMode genomeGenerate --genomeDir /scratch/urangasw/data/hg38_index --genomeFastaFiles /home/urangasw/data/genome/human/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
--sjdbGTFfile /home/urangasw/data/annotations/Homo_sapiens.GRCh38.102.gtf --sjdbOverhang 74

Thanks for the help :)
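
For reference, the same request can also be kept inside the script as `#SBATCH` directives, so that a plain `sbatch genome_generate.sh` carries the resources with it. The sketch below is a variant of the script above, not what was actually run. `THREADS` is an illustrative variable name, the memory is written with SLURM's `M` suffix, and `SLURM_CPUS_PER_TASK` is read so that `--runThreadN` follows whatever `--cpus-per-task` grants.

```shell
#!/bin/sh
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=32000M
#SBATCH --partition=regular1
#SBATCH --time=05:00:00

# Use the CPU count SLURM granted; fall back to 1 if run outside a job.
THREADS="${SLURM_CPUS_PER_TASK:-1}"

# Same STAR invocation as above, with the thread count taken from SLURM.
STAR --runThreadN "$THREADS" --runMode genomeGenerate \
    --genomeDir /scratch/urangasw/data/hg38_index \
    --genomeFastaFiles /home/urangasw/data/genome/human/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
    --sjdbGTFfile /home/urangasw/data/annotations/Homo_sapiens.GRCh38.102.gtf \
    --sjdbOverhang 74
```

If a job still ends with `Killed`, `sacct -j <jobid> --format=JobID,State,ReqMem,MaxRSS` may show whether it ran into its memory request, assuming job accounting is enabled on the cluster.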
