Question: Warnings while generating STAR indices
skhan wrote (11 months ago):

I'd like to align 2x75 bp TruSeq RNA-seq data collected on an Illumina instrument to the rat reference genome using STAR, for downstream differential expression analysis. I obtained the reference genome through iGenome and ran the following command to generate the STAR indices:

STAR --runMode genomeGenerate \
--genomeDir STAR_indices \
--genomeFastaFiles iGenome/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/WholeGenomeFasta/genome.fa \
--sjdbGTFfile iGenome/Rattus_norvegicus/Ensembl/Rnor_6.0/Annotation/Genes/genes.gtf \
--sjdbOverhang 74 \
--runThreadN 8

I get the following warnings:

WARNING: while processing sjdbGTFfile=iGenome/Rattus_norvegicus/Ensembl/Rnor_6.0/Annotation/Genes/genes.gtf: chromosome 'AABR07022620.1' not found in Genome fasta files for line:
AABR07022620.1  ensembl exon    122 427 .   -   .   exon_id "ENSRNOE00000544043"; exon_number "1"; exon_version "1"; gene_biotype "protein_coding"; gene_id "ENSRNOG00000058846"; gene_name "AABR07022620.1"; gene_source "ensembl"; gene_version "1"; p_id "P25520"; transcript_biotype "protein_coding"; transcript_id "ENSRNOT00000091897"; transcript_name "AABR07022620.1-201"; transcript_source "ensembl"; transcript_version "1"; tss_id "TSS27633";

I get about 800 warnings of this type. It turns out the iGenome .fa file contains only chromosomes 1-20, MT, X, and Y (23 sequences in total), while the iGenome .gtf has entries for hundreds of additional "chromosomes" beyond those 23. One example is AABR07022620.1, which appears in the .gtf file but not in the .fa file.

Should I be concerned about this? Or can I ignore these warnings, and be confident in the differential expression results I get while using the iGenome files?
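A mismatch like this can be checked before building the index by comparing the sequence names in the two files. The following is a minimal sketch, not part of the original post: the function name `missing_seqnames` is hypothetical, and it assumes an uncompressed FASTA whose sequence name is the first whitespace-delimited token of each header line, plus a tab-delimited GTF with `#` comment lines.

```shell
# Print sequence names that occur in the GTF but have no FASTA header.
# Usage (paths are placeholders): missing_seqnames genome.fa genes.gtf
missing_seqnames() {
    tmp=$(mktemp) || return 1
    # FASTA side: take the first token of every header line and drop the '>'.
    awk '/^>/ {sub(/^>/, "", $1); print $1}' "$1" | sort -u > "$tmp"
    # GTF side: column 1 of every non-comment, non-empty line.
    # comm -13 keeps only the lines unique to the second (GTF) list.
    awk -F'\t' '!/^#/ && NF {print $1}' "$2" | sort -u | comm -13 "$tmp" -
    rm -f "$tmp"
}
```

An empty result means every GTF record sits on a sequence present in the FASTA; piping the output through `wc -l` gives the number of unmatched names.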

Tags: rna-seq, star
Answer by Devon Ryan (11 months ago):

If you want to be on the safe side you can download the genome and annotation file from Ensembl and rerun your analysis with that. Those won't produce the warnings that you've seen. I'd be surprised if there were any tangible change in the results, but "better safe than sorry" as the saying goes.

For reference, the primary issue with omitting those contigs from the reference genome is that it encourages false-positive alignments of reads originating from those contigs to other areas of the genome. In mouse and human this isn't a high risk, but it's >0 and I assume it's a higher risk still in the rat genome, which isn't going to be quite as high quality. So I would personally reprocess everything with a more comprehensive genome.


Thank you, Devon. I switched over to these two reference files directly from Ensembl:

ftp://ftp.ensembl.org/pub/release-86/fasta/rattus_norvegicus/dna/Rattus_norvegicus.Rnor_6.0.dna_sm.toplevel.fa.gz
ftp://ftp.ensembl.org/pub/release-86/gtf/rattus_norvegicus/Rattus_norvegicus.Rnor_6.0.86.gtf.gz
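For anyone repeating this step, here is a rough sketch of preparing the two downloads for indexing. It assumes the .gz files have already been fetched, and that STAR needs plain-text (uncompressed) FASTA and GTF input for `genomeGenerate`, which is my understanding of the manual. The function name `prepare_ensembl_index` is hypothetical; the STAR options are the ones from the original command, and the function only prints the command so it can be reviewed before running.

```shell
# Decompress the Ensembl FASTA and GTF next to the downloads, then print
# the genomeGenerate command that uses them.
# Usage (paths are placeholders):
#   prepare_ensembl_index genome.fa.gz genes.gtf.gz STAR_indices
prepare_ensembl_index() {
    fa="${1%.gz}"
    gtf="${2%.gz}"
    # Keep the original .gz files and write plain-text copies.
    gzip -dc "$1" > "$fa" || return 1
    gzip -dc "$2" > "$gtf" || return 1
    mkdir -p "$3" || return 1
    # Same options as in the question; only the input paths differ.
    printf 'STAR --runMode genomeGenerate --genomeDir %s --genomeFastaFiles %s --sjdbGTFfile %s --sjdbOverhang 74 --runThreadN 8\n' \
        "$3" "$fa" "$gtf"
}
```

The printed line can be pasted into the shell once the paths look right; paths containing spaces would need quoting by hand.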

The only remaining warnings are the following 14 (which I get when using the above two files):

WARNING: long repeat for junction # 14982 : 1 197264488 197264790; left shift = 41; right shift = 255
WARNING: long repeat for junction # 43096 : 11 61465604 61510671; left shift = 255; right shift = 255
WARNING: long repeat for junction # 46600 : 12 5574666 5575140; left shift = 62; right shift = 255
WARNING: long repeat for junction # 56758 : 13 83101461 83101841; left shift = 255; right shift = 45
WARNING: long repeat for junction # 63259 : 14 72889778 72948202; left shift = 2; right shift = 255
WARNING: long repeat for junction # 105009 : 2 232136350 232136965; left shift = 255; right shift = 255
WARNING: long repeat for junction # 142601 : 5 69246706 69247220; left shift = 72; right shift = 255
WARNING: long repeat for junction # 144259 : 5 109501377 109501940; left shift = 52; right shift = 255
WARNING: long repeat for junction # 144261 : 5 109501888 109502451; left shift = 255; right shift = 2
WARNING: long repeat for junction # 169613 : 7 120763801 120764138; left shift = 31; right shift = 255
WARNING: long repeat for junction # 169876 : 7 122160547 122162660; left shift = 2; right shift = 255
WARNING: long repeat for junction # 195236 : X 22418971 22419698; left shift = 255; right shift = 31
WARNING: long repeat for junction # 197712 : X 84667620 84667985; left shift = 39; right shift = 255
WARNING: long repeat for junction # 199658 : X 153065750 153066229; left shift = 255; right shift = 86

I'm guessing I can ignore these?
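When comparing index builds, it can help to tally the warnings by kind rather than read them one by one. This is a small sketch that assumes the STAR output was saved to a text file; the function name `tally_star_warnings` and the file name in the usage line are hypothetical.

```shell
# Count WARNING lines by kind, most frequent first.
# Usage (file name is a placeholder): tally_star_warnings star_output.txt
tally_star_warnings() {
    grep '^WARNING' "$1" |
        # Strip the per-line details so identical warning kinds collapse.
        sed -e 's/ for junction.*//' -e 's/ sjdbGTFfile=.*//' |
        sort | uniq -c | sort -rn
}
```

With the two kinds of messages quoted in this thread, the output has one line per kind ("WARNING: while processing" and "WARNING: long repeat"), each preceded by its count.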

written 11 months ago by skhan

I think you can ignore those warnings.

written 11 months ago by Devon Ryan