STAR - genome indexes generation, genome file not created
7.9 years ago ▴ 70

Hi All,

I am trying to generate genome indexes using STAR (v2.5.0a). To do so, I use this command:

'STAR --runThreadN 24 --runMode genomeGenerate --genomeDir /path/genomeDir --genomeFastaFiles /path/Homo_sapiens.GRCh38.dna.primary_assembly.fa --sjdbGTFfile /path/Homo_sapiens.GRCh38.86.gtf --sjdbOverhang 74'

Both fa and gtf files are from ENSEMBL.

The generation seems to work (no error is displayed, either on the command line or in the log file), but when I look at the generated files there is no Genome file as there should be, only: chrLength.txt, chrNameLength.txt, chrName.txt, chrStart.txt and genomeParameters.txt.
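
A quick way to confirm which index files are missing is to compare the directory against an expected set. The following is only a minimal sketch: the file list is taken from this thread (the five text files above plus the Genome, SA and SAindex files that a completed run writes), and `missing_index_files` is a hypothetical helper name, not part of STAR.

```python
from pathlib import Path

# Files a finished genomeGenerate run is expected to leave in --genomeDir,
# based on the listings and logs in this thread. Not exhaustive: runs with
# --sjdbGTFfile write additional sjdb files as well.
EXPECTED_FILES = (
    "Genome", "SA", "SAindex",
    "chrLength.txt", "chrName.txt", "chrNameLength.txt",
    "chrStart.txt", "genomeParameters.txt",
)

def missing_index_files(genome_dir):
    """Return the expected index files that are absent or empty in genome_dir."""
    root = Path(genome_dir)
    missing = []
    for name in EXPECTED_FILES:
        path = root / name
        # An empty file is treated like a missing one, since it suggests
        # the run stopped before the data was written.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

Calling `missing_index_files("/path/genomeDir")` on the directory described above would be expected to report Genome, SA and SAindex.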

For the genome file generation step, the Log.out file shows the following, which I found a bit odd because of the high chr numbers:

Nov 17 09:20:27 ... starting to generate Genome files
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 0  "1" chrStart: 0
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 1  "10" chrStart: 249036800
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 2  "11" chrStart: 382992384
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 3  "12" chrStart: 518258688
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 4  "13" chrStart: 651689984
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 5  "14" chrStart: 766246912
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 6  "15" chrStart: 873463808
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 7  "16" chrStart: 975699968
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 8  "17" chrStart: 1066139648
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 9  "18" chrStart: 1149501440
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 10  "19" chrStart: 1229979648
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 11  "2" chrStart: 1288699904
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 12  "20" chrStart: 1530920960
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 13  "21" chrStart: 1595408384
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 14  "22" chrStart: 1642332160
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 15  "3" chrStart: 1693188096
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 16  "4" chrStart: 1891631104
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 17  "5" chrStart: 2081947648
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 18  "6" chrStart: 2263613440
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 19  "7" chrStart: 2434531328
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 20  "8" chrStart: 2593914880
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 21  "9" chrStart: 2739142656
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 22  "MT" chrStart: 2877554688
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 23  "X" chrStart: 2877816832
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 24  "Y" chrStart: 3034054656
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 25  "KI270728.1" chrStart: 3091464192
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 26  "KI270727.1" chrStart: 3093561344
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 27  "KI270442.1" chrStart: 3094085632
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 191  "KI270423.1" chrStart: 3137601536
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 192  "KI270392.1" chrStart: 3137863680
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 193  "KI270394.1" chrStart: 3138125824
Number of SA indices: 5891698134
Nov 17 09:21:39 ... starting to sort Suffix Array. This may take a long time...

I tried changing both the FASTA and GTF files, and also the STAR version, without success. I cannot figure out how to make this work properly. Any ideas?

Thank you, L.

RNA-Seq star • 9.0k views

STAR needs ~30 GB of RAM for the human genome. Did you ensure that sufficient RAM was available?

How long did you wait? With GRCh37 (Ensembl fasta and GTF) and 8 threads and 40Gb RAM, it took me ~50mins to generate the index.

There is 192 GB of available RAM on the Linux server I am using. It took an hour to complete the generation step.

Do you need the alt contigs/haplotypes? Otherwise you could take those out of the reference and generate the index.

Alex has some ready-made genome indexes available here (they include GRCh38).

I haven't worked with the GRCh38 version of the genome. But by 'high chr numbers', did you mean the chr # values? Those are alt. contigs, and in the case of GRCh37_primary_assembly there are >80. Here is a snippet of the log file, in case it helps:

Finished loading and checking parameters
Mar 25 21:46:26 ... Starting to generate Genome files
/mnt/lustre/scratch/amitm/wrk_dir/GRCh37/GRCh37_Ens73_primary_assembly_karyotype_order.fa : chr # 0  "1" chrStart: 0
/mnt/lustre/scratch/amitm/wrk_dir/GRCh37/GRCh37_Ens73_primary_assembly_karyotype_order.fa : chr # 1  "2" chrStart: 249298944
/mnt/lustre/scratch/amitm/wrk_dir/GRCh37/GRCh37_Ens73_primary_assembly_karyotype_order.fa : chr # 23  "Y" chrStart: 3039559680
/mnt/lustre/scratch/amitm/wrk_dir/GRCh37/GRCh37_Ens73_primary_assembly_karyotype_order.fa : chr # 24  "MT" chrStart: 3099066368
/mnt/lustre/scratch/amitm/wrk_dir/GRCh37/GRCh37_Ens73_primary_assembly_karyotype_order.fa : chr # 25  "GL000191.1" chrStart: 3099328512
/mnt/lustre/scratch/amitm/wrk_dir/GRCh37/GRCh37_Ens73_primary_assembly_karyotype_order.fa : chr # 83  "GL000249.1" chrStart: 3115057152
Number of SA indices: 5729570440
Mar 25 21:48:11 ... starting to sort  Suffix Array. This may take a long time...
Number of chunks: 27;   chunks size limit: 1857700600 bytes
Mar 25 21:48:33 ... sorting Suffix Array chunks and saving them to disk...
Writing 1582872280 bytes into /mnt/lustre/scratch/amitm/wrk_dir/GRCh37_STAR_index_w_Anno_sjOH130/SA_1 ; empty space on disk = 145871004508160 bytes ... done
Writing 709672768 bytes into /mnt/lustre/scratch/amitm/wrk_dir/GRCh37_STAR_index_w_Anno_sjOH130/SA_26 ; empty space on disk = 145825114284032 bytes ... done
Mar 25 22:20:19 ... loading chunks from disk, packing SA...
Mar 25 22:22:55 ... Finished generating suffix array
Mar 25 22:22:55 ... Generating Suffix Array index
Mar 25 22:27:17 ... Completed Suffix Array index
Mar 25 22:27:17 ..... Processing annotations GTF
Processing sjdbGTFfile=/mnt/lustre/scratch/amitm/wrk_dir/GRCh37_GTF/Homo_sapiens.GRCh37.73_Only-REF.gtf, found:
                195565 transcripts
                1193676 exons (non-collapsed)
                343963 collapsed junctions
Mar 25 22:27:33 ..... Finished GTF processing
Mar 25 22:27:33   Loaded database junctions from the GTF file: /mnt/lustre/scratch/amitm/wrk_dir/GRCh37_GTF/Homo_sapiens.GRCh37.73_Only-REF.gtf: 343963 total junctions

Mar 25 22:27:34   Finished preparing junctions
Mar 25 22:27:34 ..... Inserting junctions into the genome indices
Mar 25 22:29:16   Finished SA search: number of new junctions=343885, old junctions=0
Mar 25 22:30:51   Finished sorting SA indicesL nInd=178819900
Mar 25 22:32:24   Finished inserting junction indices
Mar 25 22:33:01   Finished SAi
Mar 25 22:33:01 ..... Finished inserting junctions into genome
Mar 25 22:33:01 ... writing Genome to disk ...
Writing 3205073281 bytes into /mnt/lustre/scratch/amitm/wrk_dir/GRCh37_STAR_index_w_Anno_sjOH130/Genome ; empty space on disk = 145869759139840 bytes ... done
SA size in bytes: 24372110156
Mar 25 22:33:13 ... writing Suffix Array to disk ...
Writing 24372110156 bytes into /mnt/lustre/scratch/amitm/wrk_dir/GRCh37_STAR_index_w_Anno_sjOH130/SA ; empty space on disk = 145866566606848 bytes ... done
Mar 25 22:34:33 ... writing SAindex to disk
Writing 8 bytes into /mnt/lustre/scratch/amitm/wrk_dir/GRCh37_STAR_index_w_Anno_sjOH130/SAindex ; empty space on disk = 145842171707392 bytes ... done
Writing 120 bytes into /mnt/lustre/scratch/amitm/wrk_dir/GRCh37_STAR_index_w_Anno_sjOH130/SAindex ; empty space on disk = 145842171707392 bytes ... done
Writing 1565873491 bytes into /mnt/lustre/scratch/amitm/wrk_dir/GRCh37_STAR_index_w_Anno_sjOH130/SAindex ; empty space on disk = 145842171707392 bytes ... done
Mar 25 22:34:39 ..... Finished successfully
DONE: Genome generation, EXITING
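
As a rough cross-check of the ~30 GB RAM figure mentioned earlier in the thread, the byte counts in this log can simply be added up. The snippet below is only arithmetic on the numbers printed above. Treating the combined size of the three files as a proxy for the memory needed is an assumption of this sketch, not something the log states.

```python
# Byte counts copied from the "Writing ... bytes" lines of the log above.
genome_bytes = 3_205_073_281     # Genome
sa_bytes = 24_372_110_156        # SA (suffix array)
sa_index_bytes = 1_565_873_491   # SAindex (largest of its three writes)

def total_index_gb(*sizes):
    """Sum byte counts and convert to decimal gigabytes (1 GB = 1e9 bytes)."""
    return sum(sizes) / 1e9

total = total_index_gb(genome_bytes, sa_bytes, sa_index_bytes)
print(f"{total:.1f} GB")  # 29.1 GB
```

The suffix array dominates the total, which fits with the suggestion elsewhere in the thread to make it sparser when memory is tight.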

Maybe an obvious question, but do you have enough space available on disk (and/or in /tmp)? I am not sure whether STAR uses /tmp to temporarily hold files/data.

Currently I have approximately 600 GB of free space; I guess that should be more than enough for the generation step.

My suspicion is that the sorting process (last line of your log file) is getting killed by the kernel itself. The kernel can kill a resource-hungry process without giving it any chance to catch the signal and report an error. You may try these (in your order of preference):

1) Could you re-run with restricted memory usage, using these parameters: --genomeSAindexNbases 12 (or even 10) --genomeSAsparseD 3 (see manual)? If that doesn't work, also try setting --limitGenomeGenerateRAM to a lower limit (25 GB?):

--limitGenomeGenerateRAM 25000000000

2) Even if your machine has huge RAM and disk space, you might be bound by personal quotas. Could you paste the output of the ulimit -a and quota commands here?

3) If you have permissions, check the kernel log (grep STAR /var/log/kern.log) and the system log (grep STAR /var/log/syslog) for any unusual words like killed or aborted.
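
The log check in the last step can also be sketched as a small filter. This is just an illustration of the grep idea, under a few assumptions: "killed" and "aborted" come from the suggestion above, "out of memory" is an extra keyword added here, and `suspicious_star_lines` is a hypothetical helper name.

```python
import re

# Keywords worth flagging next to a STAR entry. "killed" and "aborted" are
# from the suggestion above; "out of memory" is an assumption of this sketch.
SUSPICIOUS = re.compile(r"killed|aborted|out of memory", re.IGNORECASE)

def suspicious_star_lines(log_lines):
    """Return log lines that mention STAR together with a suspicious keyword."""
    return [
        line.rstrip("\n")
        for line in log_lines
        if "STAR" in line and SUSPICIOUS.search(line)
    ]
```

It could be used as `suspicious_star_lines(open("/var/log/kern.log"))`, keeping in mind that the log location differs between distributions and that reading it may require elevated permissions.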

Also, just to be sure, is that your complete log file?

Thanks for the advice. I will try all your suggestions and see how it goes.

For ulimit -a I get:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 256
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 709
virtual memory          (kbytes, -v) unlimited

and for quota I get 'none'.

It is just an extract of the log file.

Everything looks normal here. Please also post the last 20-30 lines of the log file.
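
For pulling out those last lines and checking whether the run actually completed, something like the following could work. It is a minimal sketch: the "Finished successfully" marker is taken from the completed GRCh37 log posted earlier in this thread, and both function names are hypothetical.

```python
from collections import deque

def tail(lines, n=30):
    """Return the last n lines of an iterable without keeping the rest in memory."""
    return list(deque(lines, maxlen=n))

def finished_successfully(lines):
    """True if any line carries the completion message seen in a finished run."""
    return any("Finished successfully" in line for line in lines)
```

For example, `tail(open("Log.out"))` gives the lines to paste here, and `finished_successfully(open("Log.out"))` returning False would suggest the run stopped before writing the Genome file.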

Open files/stack size may be things to follow up on (increase both) in case you are not able to find anything else.
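
If the open-files limit is worth testing, one option is to raise the soft limit up to the hard limit before launching STAR. The sketch below uses Python's standard `resource` module. The target of 4096 is an arbitrary illustrative value, and the function names are hypothetical. The new limit only applies to the Python process and anything it starts (for example STAR launched through `subprocess`).

```python
import resource

def target_soft_limit(soft, hard, wanted=4096):
    """Pick a new soft limit: never below the current one, never above the hard limit."""
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to raise
    if hard == resource.RLIM_INFINITY:
        return max(soft, wanted)
    return max(soft, min(wanted, hard))

def raise_open_files_limit(wanted=4096):
    """Raise the soft open-files limit for this process and its children."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = target_soft_limit(soft, hard, wanted)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft
```

With the 256 open files reported above, this would raise the soft limit only as far as the hard limit allows; going beyond that needs the sys admins.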

You may want to check with your sys admins to see if they are able to look in kernel/system logs for any other clues as suggested by @Santosh before.

The stack limit is not the culprit. It has the same (default) value on my box, where I can easily do indexing. On second thought, open files could be the culprit if the sort or another process is creating too many tmp files. The kernel log / syslog might be the way to go. You may also post this on the STAR mailing list. Alex is usually very responsive. And please post the answer when you get it. It's a curious case!

I tried your 1), without success so far. Apparently there has been a mix-up in the installed versions and I may be using 2.4.1c rather than 2.5.0a. I will try to change it and see how it goes.

What do the last lines of the log say in the two cases? The same?

Same thing, yes. With the 2.5.0a version it worked. Thanks for the help.
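
Since the root cause here was a mix-up between installed versions, it may help to check which STAR binary is actually picked up before indexing. This is a sketch under assumptions: it relies on the installed build supporting `STAR --version`, it assumes the version is printed on stdout, and it strips a "STAR_" prefix that some builds appear to print. The function names are hypothetical.

```python
import shutil
import subprocess

def normalize_version(raw):
    """Strip whitespace and an optional 'STAR_' prefix from a version string."""
    text = raw.strip()
    return text[len("STAR_"):] if text.startswith("STAR_") else text

def star_version(executable="STAR"):
    """Return (resolved path, version) for the STAR executable found on PATH."""
    path = shutil.which(executable)
    if path is None:
        raise FileNotFoundError(f"{executable} not found on PATH")
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=True
    )
    return path, normalize_version(result.stdout)
```

Comparing the returned version with the one you expect (2.5.0a in this thread), and looking at the resolved path, should reveal whether an older install is shadowing the new one.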

Thank you. Good to know that it finally worked.
