Question: Awkward chromosome numbering when generating a genome index in STAR
Asked 24 months ago by chet:

Dear All,

I intend to use STAR to map my Ion Torrent RNA-Seq data. I am new to RNA-Seq data analysis and I am confused by the content of the Log.out file that STAR writes.

First, I am trying to generate the mouse genome index using data downloaded from Ensembl (GRCm38.90). I am using the single FASTA file (not the individual chromosome files) to build the genome index (2.8 GB un-tarred). The matching GTF annotation file is in the same directory as well.

My command-line is as follows (12 core CPU & 32 GB RAM):

STAR --runThreadN 20 --runMode genomeGenerate --genomeDir ~/Documents/Mus_Musculus_Genome --genomeFastaFiles Mus_musculus.GRCm38.dna.chromosome.1.fa --sjdbGTFfile Mus_musculus.GRCm38.90.gtf --sjdbOverhang 100
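Because the file name passed to --genomeFastaFiles mentions chromosome 1 while the file is described as the whole genome, it can help to list the sequence names the FASTA file actually contains, in file order. The following is a minimal Python sketch, assuming standard FASTA headers in which the name is the first whitespace-delimited token after `>`. The function name and the example headers are illustrative and are not part of STAR.

```python
def fasta_sequence_names(lines):
    """Return sequence names in file order (first token after '>')."""
    names = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # An empty header line yields an empty name rather than an error.
            names.append(fields[0] if fields else "")
    return names


# Made-up in-memory example; real headers carry more description text.
example = [
    ">1 dna:chromosome REF",
    "ACGTACGT",
    ">10 dna:chromosome REF",
    "ACGTACGT",
    ">MT dna:chromosome REF",
    "ACGT",
]
print(fasta_sequence_names(example))
```

For a real file, the same function can be given an open file handle, for example `with open(path) as fh: names = fasta_sequence_names(fh)`, where `path` is the FASTA file.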

STAR seems to generate the genome index successfully, and the terminal shows the following:

Nov 22 19:34:11 ..... started STAR run
Nov 22 19:34:11 ... starting to generate Genome files
Nov 22 19:34:50 ... starting to sort Suffix Array. This may take a long time...
Nov 22 19:34:59 ... sorting Suffix Array chunks and saving them to disk...
Nov 22 19:49:30 ... loading chunks from disk, packing SA...
Nov 22 19:51:30 ... finished generating suffix array
Nov 22 19:51:30 ... generating Suffix Array index
Nov 22 19:54:23 ... completed Suffix Array index
Nov 22 19:54:23 ..... processing annotations GTF
Nov 22 19:54:30 ..... inserting junctions into the genome indices
Nov 22 19:56:53 ... writing Genome to disk ...
Nov 22 19:57:03 ... writing Suffix Array to disk ...
Nov 22 19:58:00 ... writing SAindex to disk
Nov 22 19:58:04 ..... finished successfully

However, the Log.out file contains the following, which confused me:

Finished loading and checking parameters
Nov 22 19:34:11 ... starting to generate Genome files
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 0 "1" chrStart: 0
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 1 "10" chrStart: 195559424
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 2 "11" chrStart: 326369280
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 3 "12" chrStart: 448528384
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 4 "13" chrStart: 568852480
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 5 "14" chrStart: 689438720
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 6 "15" chrStart: 814481408
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 7 "16" chrStart: 918552576
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 8 "17" chrStart: 1016856576
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 9 "18" chrStart: 1112014848
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 10 "19" chrStart: 1202978816
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 11 "2" chrStart: 1264582656
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 12 "3" chrStart: 1446772736
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 13 "4" chrStart: 1606942720
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 14 "5" chrStart: 1763704832
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 15 "6" chrStart: 1915748352
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 16 "7" chrStart: 2065694720
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 17 "8" chrStart: 2211184640
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 18 "9" chrStart: 2340683776
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 19 "MT" chrStart: 2465464320
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 20 "X" chrStart: 2465726464
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 21 "Y" chrStart: 2636906496
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 22 "JH584299.1" chrStart: 2728656896
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 23 "GL456233.1" chrStart: 2729705472
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 24 "JH584301.1" chrStart: 2730229760
...................
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 60 "GL456382.1" chrStart: 2739666944
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 61 "GL456359.1" chrStart: 2739929088
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 62 "GL456396.1" chrStart: 2740191232
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 63 "GL456368.1" chrStart: 2740453376
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 64 "JH584292.1" chrStart: 2740715520
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 65 "JH584295.1" chrStart: 2740977664
Number of SA indices: 5305567000

I wonder whether those awkward "chr #" chromosome numbers are erroneous.
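One way to sanity-check the Log.out listing is to parse it. In these lines, `chr #` appears to be a 0-based counter in the order the sequences occur in the FASTA file, the quoted string is the sequence name from the FASTA header, and `chrStart` is the offset of that sequence in STAR's concatenated genome. The Python sketch below takes the first four lines from the log and checks that the counter is consecutive and that every `chrStart` is a multiple of 2^18 (262144), which would be consistent with STAR's default `--genomeChrBinNbits 18` padding. The regular expression and function names are illustrative, and the interpretation is an assumption rather than something the log states.

```python
import re

# Matches e.g.: chr # 1 "10" chrStart: 195559424
LINE_RE = re.compile(r'chr # (\d+)\s+"([^"]+)"\s+chrStart:\s+(\d+)')


def parse_chr_lines(log_text):
    """Return (index, name, chr_start) tuples found in Log.out text."""
    return [(int(i), name, int(start))
            for i, name, start in LINE_RE.findall(log_text)]


# First four lines copied from the Log.out shown in the question.
log_excerpt = """
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 0 "1" chrStart: 0
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 1 "10" chrStart: 195559424
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 2 "11" chrStart: 326369280
Mus_musculus.GRCm38.dna.chromosome.1.fa : chr # 3 "12" chrStart: 448528384
"""

BIN_SIZE = 1 << 18  # 2^18 bases; assumed default bin size

records = parse_chr_lines(log_excerpt)
# Is "chr #" simply 0, 1, 2, ... in file order?
indices_consecutive = [r[0] for r in records] == list(range(len(records)))
# Does every sequence start on a bin boundary?
starts_aligned = all(start % BIN_SIZE == 0 for _, _, start in records)

for index, name, start in records:
    print(index, name, start)
print(indices_consecutive, starts_aligned)
```

Under that reading, the order 1, 10, 11, ..., 19, 2, 3, ... would reflect the order of the sequences in the FASTA file and not a numbering error.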

I would also be grateful if you could tell me whether using a single FASTA genome file is fine. I am not using the toplevel files, as I have read in several discussions that they might not be the best choice.

Best,

Chet


One fasta containing the whole genome is fine. Rather than try to troubleshoot who knows what, try aligning something to it.

Reply written 23 months ago by swbarnes2
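The suggestion to try aligning something against the new index could be sketched as follows. This Python snippet only builds a STAR alignment command for a single-end read file and prints it. The read file name `reads.fastq`, the output prefix, and the thread count are hypothetical placeholders, and the function name is illustrative.

```python
import os


def build_star_align_cmd(genome_dir, reads_fastq, threads=12, out_prefix="test_"):
    """Build an argument list for a basic single-end STAR alignment run."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        # Expand "~" here because it is not expanded when no shell is involved.
        "--genomeDir", os.path.expanduser(genome_dir),
        "--readFilesIn", reads_fastq,
        "--outFileNamePrefix", out_prefix,
    ]


cmd = build_star_align_cmd("~/Documents/Mus_Musculus_Genome", "reads.fastq")
print(" ".join(cmd))
# To actually run it, something like subprocess.run(cmd, check=True) could be used.
```

If the index is usable, STAR should finish and write its alignment output and a Log.final.out summary under the chosen prefix.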