I'm trying to generate a custom genome of all protein-coding genes in hg38. The FASTA file I generated is structured so that each gene is an individual record, instead of each chromosome. In effect, I have 25k different "chromosomes". I then wanted to build a custom STAR genome from this FASTA file to use for mapping. To do this I used the command below:
STAR --runThreadN 2 --runMode genomeGenerate --genomeDir ./star --genomeFastaFiles refseq_uniqgenes.fasta --genomeSAindexNbases 14 --genomeSAsparseD 3 --genomeChrBinNbits 17 --limitGenomeGenerateRAM 16357785866
I calculated both --genomeSAindexNbases and --genomeChrBinNbits based on the STAR manual. However, whenever I run genomeGenerate, STAR hangs at 'sorting Suffix Array chunks and saving them to disk' and just sits there for days. Is there something that I'm doing wrong? Any suggestions on how I might get this to work? I'm desperate for any tips. I've been trying for days, including increasing RAM to 128 GB and threads to 12 by running it on AWS Batch. Here's the beginning and end of the Log.out file:
STAR version=2.7.10b
STAR compilation time,server,dir= :/Users/distiller/project/STARcompile/source
STAR git: On branch master ; commit c6f8efc2c7043ef83bf8b0d9bed36bbb6b9b1133 ; diff files:
##### Command Line:
STAR --runThreadN 2 --runMode genomeGenerate --genomeDir ./star --genomeFastaFiles refseq_uniqgenes.fasta --genomeSAindexNbases 14 --genomeSAsparseD 3 --genomeChrBinNbits 17 --limitGenomeGenerateRAM 16357785866
##### Initial USER parameters from Command Line:
###### All USER parameters from Command Line:
runThreadN 2 ~RE-DEFINED
runMode genomeGenerate ~RE-DEFINED
genomeDir ./star ~RE-DEFINED
genomeFastaFiles refseq_uniqgenes.fasta ~RE-DEFINED
genomeSAindexNbases 14 ~RE-DEFINED
genomeSAsparseD 3 ~RE-DEFINED
genomeChrBinNbits 17 ~RE-DEFINED
limitGenomeGenerateRAM 16357785866 ~RE-DEFINED
##### Finished reading parameters from all sources
...
ZYG11A 54236
ZYG11B 102884
ZYX 11767
ZZEF1 140586
ZZZ3 123012
Genome sequence total length = 3564669569
Genome size with padding = 6134169600
Estimated genome size with padding and SJs: total=genome+SJ=6335169600 = 6134169600 + 201000000
GstrandBit=33
Number of SA indices: 2209505227
Dec 28 10:48:41 ... starting to sort Suffix Array. This may take a long time...
Number of chunks: 15; chunks size limit: 1226833936 bytes
Dec 28 10:48:58 ... sorting Suffix Array chunks and saving them to disk...
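For anyone wanting to double-check the two calculated parameters, here is a minimal sketch of the scaling formulas as I understand them from the STAR manual: `min(14, log2(GenomeLength)/2 - 1)` for `--genomeSAindexNbases` and `min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength)))` for `--genomeChrBinNbits`. The function name, the read length of 100, the record count of 25000 (from "25k" in the post), and rounding down are all assumptions for illustration, not something the manual or the log states.

```python
import math

def star_index_params(genome_length, n_references, read_length=100):
    """Return (genomeSAindexNbases, genomeChrBinNbits).

    Hypothetical helper; rounding down (int) is an assumption, since the
    formulas do not say how to round.
    """
    sa_index_nbases = min(14, int(math.log2(genome_length) / 2 - 1))
    avg_reference_length = genome_length / n_references
    chr_bin_nbits = min(18, int(math.log2(max(avg_reference_length, read_length))))
    return sa_index_nbases, chr_bin_nbits

# Total length is taken from the log above; 25000 records is approximate.
sa, chr_bits = star_index_params(3564669569, 25000)
print(f"--genomeSAindexNbases {sa} --genomeChrBinNbits {chr_bits}")
```

Under these assumptions the sketch gives the same 14 and 17 used in the command, so the parameter choice itself looks consistent with the formulas.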
This sounds like an "off-label" application of STAR. Only the developer may be able to offer a definite answer (I see that you have already created an issue there so wait and see what Alex says).
That said, if you want to use protein-coding genes, why not try salmon with a transcriptome reference instead?
I'll give salmon a try. My main mapping/processing pipeline has always relied on STAR in the past, so I was originally trying to stick with it, but I'll definitely make the switch if I need to. Do you have a suggestion for how I might get this to work? Do you think the main issue is the number of "chromosomes" (genes, in my case)?
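On the question of record count, the log does allow one rough sanity check. The sketch below only redoes arithmetic on numbers copied from the log; the variable names are made up. The padded size turns out to be an exact multiple of 2^17, which would be consistent with each record being padded out to a bin boundary set by `--genomeChrBinNbits`, but that interpretation is an assumption and it does not by itself explain the hang.

```python
# All values below are copied from the Log.out excerpt above.
genome_length = 3564669569   # "Genome sequence total length"
padded_length = 6134169600   # "Genome size with padding"
sj_reserve = 201000000       # SJ term in the "total=genome+SJ" line
chr_bin_nbits = 17           # --genomeChrBinNbits from the command line

bin_size = 1 << chr_bin_nbits                 # 131072 bases per bin
n_bins, remainder = divmod(padded_length, bin_size)
overhead = padded_length / genome_length      # padded size relative to sequence
total_with_sj = padded_length + sj_reserve

print(f"bins: {n_bins}, remainder: {remainder}")
print(f"padding overhead: {overhead:.2f}x")
print(f"total with SJs: {total_with_sj}")
```

So with many short records, the padded genome is roughly 1.7 times the actual sequence length here, which is worth keeping in mind when sizing RAM.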
Your use case seems to be along the lines of what rsem-prepare-reference does. Maybe check that out?
Alex Dobin (STAR author) responded: https://github.com/alexdobin/STAR/issues/1731