Question: Stuck creating reference genome with STAR
nash.claire (Canada), 2.4 years ago, wrote:

Hi again,

I want to use STAR to run my RNA-seq analysis; however, I'm having issues at the first hurdle, trying to generate a reference genome.

I want to use the newest rat rn6 build but keep getting errors with genomeGenerate. Here is my command:

--runMode genomeGenerate --genomeDir /path/to/directory --genomeFastaFiles ~/path/to/directory/rn6_chr1.fa rn6_chr2.fa rn6_chr3.fa rn6_chr4.fa rn6_chr5.fa rn6_chr6.fa rn6_chr7.fa rn6_chr8.fa rn6_chr9.fa rn6_chr10.fa rn6_chr11.fa rn6_chr12.fa rn6_chr13.fa rn6_chr14.fa rn6_chr15.fa rn6_chr16.fa rn6_chr17.fa rn6_chr18.fa rn6_chr19.fa rn6_chr20.fa rn6_chrMT.fa rn6_chrX.fa rn6_chrY.fa --sjdbGTFfile ~/path/to/directory/rn6.gtf --sjdbOverhang 49 --runThreadN 12 --outFileNamePrefix /path/to/directory/rn6

And here is my error:

EXITING because of INPUT ERROR: could not open genomeFastaFile: path/to/directory/rn6_chr1.fa

So here are some points and errors I've already covered after reading posts and forums:

- I'm using separate chromosome FASTA files, as I read that using toplevel.dna files is not good and there isn't a primary.dna file for rn6 yet. I tried the toplevel .fa file with no success.

- I've gone through and checked that every directory where my files are stored, as well as my output directories, is fully readable, writable and executable with chmod.

- My genomeDir is completely empty and is situated on a RAID with tons of free space.

- My FASTA files and GTF file were downloaded from Ensembl and both look fine.

- I'm running this on a Mac Pro with a 12-core processor and 64 GB of RAM, and have played with the thread settings, which had no effect.

- my reads are 50 bp in length and paired end hence me using the 49 sjdbOverhang setting

I'm completely stuck and lost, guys. The manual isn't helping, and I've exhausted all the STAR Google group and Biostars posts relating to this. Can anyone help?

harold.smith.tarheel (United States), 2.4 years ago, wrote:

That error is returned when the path is incorrect. Are the genomeFastaFiles nested in your home directory (~) as indicated, or should the path be from the top level like --genomeDir? You can check the path from the desired directory using 'pwd'.
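
One way to test the paths before running STAR, as a minimal sketch: loop over the expected file names and report any that the shell cannot read. The helper name check_fastas is hypothetical; the rn6_chr*.fa names and the directory are taken from the question and should be adjusted to the real location.

```shell
# check_fastas is a hypothetical helper, not part of STAR.
# It sets the variable `missing` to the number of unreadable files.
check_fastas() {
    dir=$1
    missing=0
    for chr in $(seq 1 20) MT X Y; do
        f="$dir/rn6_chr${chr}.fa"
        if [ -r "$f" ]; then
            echo "ok       $f"
        else
            echo "MISSING  $f"
            missing=$((missing + 1))
        fi
    done
    echo "$missing file(s) could not be read in $dir"
}

# Directory as written in the question; "$HOME" is what "~" expands to.
check_fastas "$HOME/path/to/directory"
```

If every file is reported as MISSING, the directory itself is probably wrong; if only some are, the individual file names are the likelier culprit.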

Constantine (Germany), 2.4 years ago, wrote:

Check your home directory as harold.smith.tarheel said. If you are still experiencing problems then your fasta files might be corrupted.

Download the Illumina iGenome for rn6 here:

ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Rattus_norvegicus/UCSC/rn6/Rattus_norvegicus_UCSC_rn6.tar.gz

Then run on your cluster:

STAR --runMode genomeGenerate \
--genomeDir /path/to/directory  \
--genomeFastaFiles /path/to/directory/Rattus_norvegicus/UCSC/rn6/Sequence/WholeGenomeFasta/genome.fa \
--runThreadN 12 --outFileNamePrefix /path/to/directory/rn6
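
Before re-downloading, it may be worth a quick sanity check of the FASTA files already on disk. The sketch below is only a rough check, not a full validation: it confirms that the file starts with a '>' header line and counts the sequences it contains. The helper name fasta_summary is hypothetical, and the tiny demo file stands in for a real FASTA.

```shell
# fasta_summary is a hypothetical helper, not part of STAR.
# It sets `n_seqs` to the number of '>' header lines in the file.
fasta_summary() {
    file=$1
    n_seqs=0
    if [ ! -r "$file" ]; then
        echo "cannot read $file"
        return 0
    fi
    # A FASTA file should begin with a '>' header line.
    first_char=$(head -n 1 "$file" | cut -c 1)
    if [ "$first_char" != ">" ]; then
        echo "warning: $file does not start with a '>' header line"
    fi
    n_seqs=$(grep -c '^>' "$file" || true)
    echo "$file: $n_seqs sequence(s)"
    # Show the first few headers for a visual check.
    grep '^>' "$file" | head -n 5 || true
}

# Tiny stand-in file so the sketch runs anywhere; point it at a real FASTA instead.
demo=$(mktemp)
printf '>chr1\nACGT\n>chr2\nGGCC\n' > "$demo"
fasta_summary "$demo"
rm -f "$demo"
```

A per-chromosome file would be expected to report a single sequence, while a whole-genome file reports one per chromosome or scaffold.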

Michael Dondrup (Bergen, Norway), 2.4 years ago, wrote:

In addition:

  •  my reads are 50 bp in length and paired end hence me using the 49 sjdbOverhang setting

sjdbOverhang should be 99 as of mate length -1, that's 2*read length for paired end, afaik, just check with the documentation

  • Why do you want to break down the full FASTA file? It just makes things more complicated. There are other ways to save memory, and I am not sure that splitting reduces memory requirements at all.
  • If you still want per-chromosome files, each one of them needs the correct path, not just the first one, as in ~/path/to/directory/rn6_chr1.fa ~/path/to/directory/rn6_chr2.fa ... ~/path/to/directory/rn6_chrY.fa

not ~/path/to/directory/rn6_chr1.fa rn6_chr2.fa ... rn6_chrY.fa
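
The point about per-file paths can be sketched as follows. The shell expands ~ only at the start of a word, and bare names such as rn6_chr2.fa are looked up relative to the current working directory, so building the list in a loop avoids typing the prefix 23 times. The variable names are hypothetical, the paths and flags are those from the question, and the command is only printed for inspection, not run.

```shell
# Hypothetical location taken from the question; adjust to the real directory.
fasta_dir="$HOME/path/to/directory"

# Build a space-separated list in which every file carries the full path.
fasta_files=""
for chr in $(seq 1 20) MT X Y; do
    fasta_files="$fasta_files $fasta_dir/rn6_chr${chr}.fa"
done

# Print the command instead of running it, so the paths can be checked first.
# $fasta_files is deliberately unquoted so that it splits into separate arguments;
# this assumes the directory name contains no spaces.
echo STAR --runMode genomeGenerate \
    --genomeDir /path/to/directory \
    --genomeFastaFiles $fasta_files \
    --sjdbGTFfile "$fasta_dir/rn6.gtf" \
    --sjdbOverhang 49 \
    --runThreadN 12
```

Once the printed command looks right, removing the leading echo runs it. The --sjdbOverhang value shown is the one from the original command.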


For anyone else reading this thread: an sjdbOverhang of 49 seems right to me. Here's a quote from the STAR manual:

--sjdbOverhang specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads. For instance, for Illumina 2x100b paired-end reads, the ideal value is 100-1=99. In case of reads of varying length, the ideal value is max(ReadLength)-1. In most cases, a generic value of 100 will work as well as the ideal value.

Reply written 9 weeks ago by skhan
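
The rule in the manual quote above works out as simple arithmetic. A minimal sketch, with a hypothetical helper name, that takes one or more read lengths and prints max(ReadLength)-1:

```shell
# sjdb_overhang_for is a hypothetical helper, not part of STAR.
# It prints max(ReadLength) - 1 for the read lengths given as arguments.
sjdb_overhang_for() {
    max_len=0
    for len in "$@"; do
        if [ "$len" -gt "$max_len" ]; then
            max_len=$len
        fi
    done
    echo $((max_len - 1))
}

sjdb_overhang_for 50     # 2x50 paired-end reads from the question: prints 49
sjdb_overhang_for 100    # 2x100 example from the manual quote: prints 99
```

The length that matters here is that of a single mate, not the combined length of the pair.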
nash.claire (Canada), 2.4 years ago, wrote:

Hi guys,

Thanks so much for the help. I'll try playing around with the file path later and see if that works, and I'll change the Overhang setting as suggested. The reason I have the separate chromosome files is that I started off with the toplevel.dna.fa file from Ensembl and genomeGenerate wasn't working. I read that we shouldn't use toplevel FASTA files, as they contain all the haplotype data etc. and can cause issues. Since there is no primary.dna.fasta file available on Ensembl, I went for the separate chromosome files instead. However, I'd appreciate your opinion on the matter.
