STAR index file for GRCh37
12 months ago

Hello everyone

I am trying to generate the index file for STAR alignment using the hg19 genome. I used the following command:

STAR  --runThreadN 30    --runMode genomeGenerate  --genomeDir /data/shilpia2/STAR.index/ --genomeFastaFiles /data/shilpia2/STAR.index/GRCh37.primary_assembly.genome.fa --sjdbGTFfile /data/shilpia2/gff/gencode.v24.basic.annotation.gtf  --sjdbOverhang 100 --limitGenomeGenerateRAM 30000000000  --outFileNamePrefix /data/shilpia2/STAR.index/hg19


However, the program stops after a while without giving any error and without generating the index file. Could anyone suggest what the reason might be, or whether there is a problem with my command?

Thanks

I would drop the --limitGenomeGenerateRAM and --outFileNamePrefix flags. You could reduce --runThreadN to, say, 8 (it might be a resource issue with your cluster). Also make sure that the --genomeDir directory exists. Let me know how you get on.
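The suggestions above could be sketched as a small shell wrapper. The function name `generate_index` and its argument order are illustrative assumptions; the STAR options are the ones from the original command, minus the two dropped flags and with fewer threads.

```shell
# Hypothetical wrapper: the name and argument order are illustrative, not part of STAR.
# Usage: generate_index <genome_dir> <genome_fasta> <annotation_gtf> [threads]
generate_index() {
    local genome_dir="$1"
    local fasta="$2"
    local gtf="$3"
    local threads="${4:-8}"

    # Make sure the --genomeDir directory exists, as suggested above.
    mkdir -p "$genome_dir" || return 1

    # Same options as the original command, without --limitGenomeGenerateRAM
    # and --outFileNamePrefix, and with a lower default thread count.
    STAR --runThreadN "$threads" \
         --runMode genomeGenerate \
         --genomeDir "$genome_dir" \
         --genomeFastaFiles "$fasta" \
         --sjdbGTFfile "$gtf" \
         --sjdbOverhang 100
}
```

It would be called with the genome directory, FASTA, and GTF paths from the question, for example `generate_index /data/shilpia2/STAR.index/ /data/shilpia2/STAR.index/GRCh37.primary_assembly.genome.fa /data/shilpia2/gff/gencode.v24.basic.annotation.gtf`.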

How much memory do you have? You need at least 30 GB of RAM for the index generation.
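A quick way to answer the memory question on a Linux machine is to read the total from /proc/meminfo. This is only a sketch: the function name `total_ram_gib` and its optional file argument are illustrative, and it assumes the usual `MemTotal: <n> kB` line.

```shell
# Print total RAM in whole GiB, read from a /proc/meminfo-style file (Linux only).
# The function name and the optional file argument are illustrative.
total_ram_gib() {
    # MemTotal is reported in kB, so divide by 1024 twice to get GiB.
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}
```

Running `total_ram_gib` with no argument reports the total for the current machine. On a cluster, the memory granted to a job by the scheduler can be lower than the node total, so the job's own limit is worth checking as well.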

Thank you so much for your response. I used 30 GB of RAM and ran the program for 3 days, but it still did not generate the index. Do you think I should let it run longer?

Did you have 30 cores available for the job? Did you get anything in log/error log?

Alex has pre-made hg19/GRCh37 indexes available at this link, if you can't make them.

I do have 30 cores available. The log file does not show any error. STAR terminates after reading the GTF file. I also tried the index from the link you provided, but it reports an error in the genome file.

This is what appears in the log file.

 ..... processing annotations GTF
!!!!! WARNING: while processing sjdbGTFfile=/data/shilpia2/gff/gencode.v24.basic.annotation.gtf, line:
chr3    HAVANA  exon    198024658   198024788   .   +   .   gene_id "ENSG00000185621.11"; transcript_id "ENST00000482695.5"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "LMLN"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "LMLN-002"; exon_number 15; exon_id "ENSE00003689636.1"; level 2; protein_id "ENSP00000418324.1"; tag "basic"; transcript_support_level "1"; tag "appris_alternative_2"; havana_gene "OTTHUMG00000155375.2"; havana_transcript "OTTHUMT00000339702.1";
exon end = 198024788 is larger than the chromosome chr3 length = 198022430 , will skip this exon
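To see how widespread warnings like this are, the skipped exons can be tallied per chromosome from the STAR log. A minimal sketch, assuming the warning wording matches the lines above (other STAR versions may phrase it differently); the function name `skipped_exons_by_chrom` is illustrative.

```shell
# Tally "will skip this exon" warnings per chromosome in a STAR log file.
# Usage: skipped_exons_by_chrom <path to STAR log>
skipped_exons_by_chrom() {
    awk '/will skip this exon/ {
             # The chromosome name follows the word "chromosome" in the warning.
             for (i = 1; i < NF; i++)
                 if ($i == "chromosome") { n[$(i + 1)]++; break }
         }
         END { for (c in n) print c "\t" n[c] }' "$1" | sort
}
```

Skipped exons spread over many chromosomes would suggest that the annotation and the genome come from different builds rather than a single bad record.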

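https://www.gencodegenes.org/human/release_24.html has a file named 'gencode.v24.basic.annotation.gtf'.

The files there are all for hg38 (GRCh38), not hg19 (GRCh37), which is why the exon in the warning ends beyond the chr3 length of your GRCh37 genome.

The hg19/GRCh37 versions are here: https://www.gencodegenes.org/human/release_24lift37.html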
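One way to catch a build mismatch before running STAR is to compare the annotation coordinates with the sequence lengths in the FASTA. The following is a minimal sketch, assuming an uncompressed FASTA and GTF; the function names `fasta_lengths` and `gtf_out_of_range` are illustrative and not part of any tool.

```shell
# Print one "name<TAB>length" line per sequence in a FASTA file.
fasta_lengths() {
    awk '/^>/ { if (name != "") print name "\t" len
                name = substr($1, 2); len = 0; next }
         { len += length($0) }
         END { if (name != "") print name "\t" len }' "$1"
}

# Count GTF features that end beyond their chromosome, or whose chromosome
# is absent from the FASTA. Output: chromosome, count, reason.
# Usage: gtf_out_of_range <genome.fa> <annotation.gtf>
gtf_out_of_range() {
    fasta_lengths "$1" | awk -F '\t' '
        NR == FNR { len[$1] = $2 + 0; next }   # first input: chromosome lengths
        /^#/ { next }                          # skip GTF header comments
        !($1 in len) { missing[$1]++; next }   # chromosome name not in the FASTA
        ($5 + 0) > len[$1] { over[$1]++ }      # feature end past the chromosome end
        END {
            for (c in over) print c "\t" over[c] "\tbeyond_end"
            for (c in missing) print c "\t" missing[c] "\tnot_in_fasta"
        }' - "$2" | sort
}
```

An empty result means every feature fits inside its chromosome. That does not prove the builds match, but any `beyond_end` or `not_in_fasta` line is a strong hint that they do not.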

From the link you provided, should I download the "Comprehensive gene annotation" GTF file and the "Genome sequence, primary assembly (GRCh37)" file?

It's up to you and depends on the goals of your study. I primarily use annotations from Ensembl and am thus not familiar with the basic vs comprehensive gene annotations. I think you should be fine doing a standard differential gene expression analysis with the basic set, but some features could be missing.

I have another question: which is better for alignment? I know people recommend STAR, but what if I use Bowtie? I was trying to compare the two tools to see how much they differ, and I would like your suggestion. I only need to do a simple differential gene expression analysis, so is it OK if I use Bowtie?

No. Bowtie is used for genomic alignments (i.e. DNA). For transcriptomic alignments (RNA), most would recommend a splice-aware aligner like STAR, but you could also use TopHat2 (which uses Bowtie under the hood).

Ok. Thank you so much for your response.

12 months ago
GenoMax

Are you mixing and matching sequences and annotations by any chance? Are they all for the same build?

Hi

I did mix the annotations, which caused the problem. With the right GTF file, the index was generated. Thank you so much for your response.