Question

Mouse Genome Build And Gtf Conflict In Tophat Downloads

10.2 years ago

k.nirmalraman ★ 1.1k

Dear All,

I have been using the mouse sequences and annotations from the TopHat website, specifically the UCSC mm10 version.

I see that the sequence file (genome.fa) contains chr1-chr19, chrX, chrY and chrM, while the corresponding GTF file also contains features on random scaffold sequences such as chr4_JH584294_random.

I was trying to calculate RPKM values for my alignments (TopHat BAM files aligned against this genome sequence) using the GTF.

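 library(IRanges)
 library(Rsamtools)
 library(GenomicFeatures)
 library(GenomicRanges)
 # Read the TopHat alignments ("filepath" is a placeholder for the real directory)
 aligns <- readBamGappedAlignments("filepath/accepted_hits.bam")
 # Build the transcript database from the iGenomes GTF
 txdb <- makeTranscriptDbFromGFF("genes.gtf", format="gtf", species="Mus musculus", dataSource="http://tophat.cbcb.umd.edu/igenomes.shtml")
 # Group exons by gene
 exonRanges.gene <- exonsBy(txdb, "gene")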

seqlevels(aligns)
[1] "chr1"  "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chr2"  "chr3"  "chr4"  "chr5"  "chr6" 
[17] "chr7"  "chr8"  "chr9"  "chrM"  "chrX"  "chrY" 

seqlevels(exonRanges.gene)
[1] "chr13"                "chr9"                 "chr6"                 "chrX"                 "chr17"               
[6] "chr2"                 "chr7"                 "chr18"                "chr8"                 "chr4"                
[11] "chr19"                "chr5"                 "chr16"                "chr11"                "chr10"               
[16] "chr14"                "chr1"                 "chr3"                 "chr15"                "chr12"               
[21] "chrY"                 "chrX_GL456233_random" "chr5_JH584299_random" "chr5_JH584298_random" "chr4_GL456216_random"
[26] "chr4_GL456350_random" "chr4_JH584294_random" "chr4_JH584293_random" "chr5_GL456354_random" "chr7_GL456219_random"
[31] "chr5_JH584296_random" "chr5_JH584297_random" "chr4_JH584292_random" "chr1_GL456221_random" "chrUn_JH584304"

Because the two objects have different chromosome sets (the GTF has the extra chrN_XXXXX_random scaffolds), I get the following warnings.

1: In .deduceExonRankings(exs, format = "gtf") :
  Infering Exon Rankings.  If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName
2: In matchCircularity(chroms, circ_seqs) :
  None of the strings in your circ_seqs argument match your seqnames.
3: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chrX_GL456233_random, chr5_JH584299_random, chr5_JH584298_random, chr4_GL456216_random, chr4_GL456350_random, chr4_JH584294_random, chr4_JH584293_random, chr5_GL456354_random, chr7_GL456219_random, chr5_JH584296_random, chr5_JH584297_random, chr4_JH584292_random, chr1_GL456221_random, chrUn_JH584304
  - in 'y': chrM
  Make sure to always combine/compare objects based on the same reference
  genome (use suppressWarnings() to suppress this warning).

I understand that this is due to the mismatch in chromosome names, and that I can handle it either by deleting the chrN_XXXXXX_random features from the .gtf file or by aligning to a genome that includes these sequences. Are there any advantages or disadvantages (right/wrong) to the former or the latter?

Is there any other workaround for this? Additionally, I would like to know whether there is any genome/GTF build available for download that is consistent in this respect for the mouse, rat and human genomes. Thanks in advance!

updated 10.2 years ago by Devon Ryan 104k • written 10.2 years ago by k.nirmalraman ★ 1.1k

score 2 · Answer 1 · 2014-02-26

Since the XXX_random scaffolds likely exist somewhere in the genome, it's probably better to align against a reference containing them. I should note that there's not much of a difference in practical terms (they're only ~5.3 megabases total), so your results aren't likely to be influenced much.
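If re-aligning isn't an option, the other route from the question (dropping the scaffold features from the GTF) could look roughly like the sketch below. It uses only base R so it runs without Bioconductor. `filter_gtf_by_seqname` is a hypothetical helper name, and the three GTF records are made-up toy data standing in for genes.gtf, so treat this as an illustration rather than a tested pipeline.

```r
# Hypothetical helper: keep only GTF records whose seqname (column 1)
# is listed in `keep`. Comment lines starting with "#" pass through.
filter_gtf_by_seqname <- function(gtf_lines, keep) {
  is_comment <- startsWith(gtf_lines, "#")
  seqname <- sub("\t.*$", "", gtf_lines)  # text before the first tab
  gtf_lines[is_comment | seqname %in% keep]
}

# Toy records (assumption: real input would come from readLines("genes.gtf"))
gtf <- c(
  "chr4\tunknown\texon\t100\t200\t.\t+\t.\tgene_id \"A\";",
  "chr4_JH584294_random\tunknown\texon\t10\t50\t.\t-\t.\tgene_id \"B\";",
  "chrM\tunknown\texon\t1\t90\t.\t+\t.\tgene_id \"C\";"
)

# The sequence names reported by seqlevels(aligns) in the question
bam_seqlevels <- c(paste0("chr", 1:19), "chrX", "chrY", "chrM")

filtered <- filter_gtf_by_seqname(gtf, bam_seqlevels)
print(filtered)
```

On the real files, the same helper could be fed `readLines("genes.gtf")` and `seqlevels(aligns)`, with the result written out via `writeLines()` before building the transcript database. Note that this silently discards any gene annotated only on a scaffold.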

You might have better luck with the Ensembl annotations and genome. There, the scaffolds present in the GTF are all in the fasta file (though the reverse isn't always true, which is appropriate). The Ensembl annotation has some other benefits, like not causing DEXSeq to choke due to gene IDs being in multiple orientations on multiple chromosomes.
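Whichever source you pick, it's easy enough to check the GTF against the fasta before aligning. Here is a minimal base R sketch of that comparison. The helper names `fasta_seqnames` and `gtf_seqnames` are hypothetical, and the inputs are toy lines rather than real files.

```r
# Hypothetical helper: sequence names from FASTA header lines
# (the first whitespace-delimited token after ">").
fasta_seqnames <- function(fasta_lines) {
  headers <- fasta_lines[startsWith(fasta_lines, ">")]
  sub("^>([^[:space:]]+).*$", "\\1", headers)
}

# Hypothetical helper: unique seqnames (column 1) from GTF records,
# skipping comment lines.
gtf_seqnames <- function(gtf_lines) {
  records <- gtf_lines[!startsWith(gtf_lines, "#")]
  unique(sub("\t.*$", "", records))
}

# Toy inputs (assumption: real input would be read from genome.fa and genes.gtf)
fasta <- c(">chr4 toy description", "ACGT", ">chrM", "GGCC")
gtf <- c(
  "#!toy header",
  "chr4\tunknown\texon\t100\t200\t.\t+\t.\tgene_id \"A\";",
  "chr4_JH584294_random\tunknown\texon\t10\t50\t.\t-\t.\tgene_id \"B\";"
)

# Annotated sequences that the aligner would never see (the problem case)
missing_in_fasta <- setdiff(gtf_seqnames(gtf), fasta_seqnames(fasta))
# Sequences without any annotation (harmless)
unused_in_gtf <- setdiff(fasta_seqnames(fasta), gtf_seqnames(gtf))

print(missing_in_fasta)
print(unused_in_gtf)
```

An empty `missing_in_fasta` is the property you want. For a full-size genome, reading the names from a `samtools faidx` index (first column of the .fai file) would be lighter than reading the whole fasta into R.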

Edit: I should note that I have no experience with the premade indices from iGenomes, so I can't say whether the Ensembl versions are any good or not.