STAR genomeGenerate hangs at sorting Suffix Array chunks
16 months ago
typist001 • 0

I'm trying to generate a custom genome of all protein-coding genes in hg38. The fasta file I generated is structured so that each gene, rather than each chromosome, is an individual record. In effect, I have 25k different "chromosomes". I then wanted to create a custom genome from this fasta file to use for mapping. To do this I used the command below:

STAR --runThreadN 2 --runMode genomeGenerate --genomeDir ./star --genomeFastaFiles refseq_uniqgenes.fasta --genomeSAindexNbases 14 --genomeSAsparseD 3 --genomeChrBinNbits 17 --limitGenomeGenerateRAM 16357785866

I calculated both --genomeSAindexNbases and --genomeChrBinNbits based on the STAR manual. However, whenever I run genomeGenerate, STAR hangs at 'sorting Suffix Array chunks and saving them to disk' and just sits there for days. Is there something that I'm doing wrong? Any suggestions on how I might get this to work? I'm desperate for any tips. I've been trying for days, including increasing RAM to 128GB and threads to 12 by running it on AWS Batch. Here's the beginning and end of the log.out file:

STAR version=2.7.10b
STAR compilation time,server,dir= :/Users/distiller/project/STARcompile/source
STAR git: On branch master ; commit c6f8efc2c7043ef83bf8b0d9bed36bbb6b9b1133 ; diff files: 
##### Command Line:
STAR --runThreadN 2 --runMode genomeGenerate --genomeDir ./star --genomeFastaFiles refseq_uniqgenes.fasta --genomeSAindexNbases 14 --genomeSAsparseD 3 --genomeChrBinNbits 17 --limitGenomeGenerateRAM 16357785866
##### Initial USER parameters from Command Line:
###### All USER parameters from Command Line:
runThreadN                    2     ~RE-DEFINED
runMode                       genomeGenerate        ~RE-DEFINED
genomeDir                     ./star     ~RE-DEFINED
genomeFastaFiles              refseq_uniqgenes.fasta        ~RE-DEFINED
genomeSAindexNbases           14     ~RE-DEFINED
genomeSAsparseD               3     ~RE-DEFINED
genomeChrBinNbits             17     ~RE-DEFINED
limitGenomeGenerateRAM        16357785866     ~RE-DEFINED
##### Finished reading parameters from all sources 

...

ZYG11A  54236
ZYG11B  102884
ZYX 11767
ZZEF1   140586
ZZZ3    123012
Genome sequence total length = 3564669569
Genome size with padding = 6134169600
Estimated genome size with padding and SJs: total=genome+SJ=6335169600 = 6134169600 + 201000000
GstrandBit=33
Number of SA indices: 2209505227
Dec 28 10:48:41 ... starting to sort Suffix Array. This may take a long time...
Number of chunks: 15;   chunks size limit: 1226833936 bytes
Dec 28 10:48:58 ... sorting Suffix Array chunks and saving them to disk...
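
For reference, the two parameters the poster mentions calculating can be checked with a small sketch. It follows the STAR manual's formulas, min(14, log2(GenomeLength)/2 - 1) for --genomeSAindexNbases and min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength))) for --genomeChrBinNbits. The function name, the rounding choices, the default read length of 100, and the use of exactly 25000 references (the post only says "25k") are assumptions made for illustration, not part of STAR.

```python
import math


def star_index_params(genome_length, n_references, read_length=100):
    """Suggest (genomeSAindexNbases, genomeChrBinNbits) from the STAR manual formulas.

    Hypothetical helper: the rounding (floor for the first value, round-to-nearest
    for the second) and the default read_length are assumptions, not STAR behaviour.
    """
    # --genomeSAindexNbases: min(14, log2(GenomeLength)/2 - 1)
    sa_index_nbases = min(14, int(math.log2(genome_length) / 2 - 1))
    # --genomeChrBinNbits: min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength)))
    bases_per_reference = max(genome_length / n_references, read_length)
    chr_bin_nbits = min(18, round(math.log2(bases_per_reference)))
    return sa_index_nbases, chr_bin_nbits


# Total length taken from the log above; 25000 approximates the "25k" records.
print(star_index_params(3564669569, 25000))  # prints (14, 17)
```

Under these assumptions the result matches the values passed on the command line (14 and 17), so the parameter arithmetic itself appears consistent with the manual.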

This sounds like an "off-label" application of STAR. Only the developer may be able to offer a definitive answer (I see that you have already opened an issue on the STAR GitHub repository, so wait and see what Alex says).

That said, if you want to use protein coding genes then why not try salmon instead with a transcriptome reference?


I'll give salmon a try. My main mapping/processing pipeline has always relied on STAR in the past, so I was originally trying to stick with it, but I'll definitely make the switch if I need to. Do you have a suggestion on how I might get this to work? Do you think the main issue is the number of "chromosomes" (or genes, in my case)?


Your use case seems to be along the lines of what rsem-prepare-reference does. Maybe check that out?


Alex Dobin (STAR author) responded: https://github.com/alexdobin/STAR/issues/1731
