Question: Mouse Genome Build And Gtf Conflict In Tophat Downloads
k.nirmalraman (Germany) wrote, 4.6 years ago:

Dear All,

I have been using the mouse sequences and annotations from the TopHat website, specifically the UCSC mm10 version.

I see that the sequence file (genome.fa) contains chr1 to chr19, chrX, chrY and chrM, while the corresponding GTF file also contains features on "random" scaffold sequences such as chr4_JH584294_random.

I was trying to calculate RPKM values for my alignments (TopHat BAM files aligned against these genome sequences) using the GTF.

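 library(IRanges)
 library(Rsamtools)
 library(GenomicFeatures)
 library(GenomicRanges)
 # "filepath" is a placeholder for the directory holding the TopHat output.
 # In newer Bioconductor releases this function is readGAlignments() (GenomicAlignments).
 aligns <- readBamGappedAlignments("filepath/accepted_hits.bam")
 # In newer Bioconductor releases this function is makeTxDbFromGFF().
 txdb <- makeTranscriptDbFromGFF("genes.gtf", format="gtf", species="Mus musculus", dataSource="http://tophat.cbcb.umd.edu/igenomes.shtml")
 # Group exons by gene.
 exonRanges.gene <- exonsBy(txdb, "gene")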
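For context, RPKM is conventionally defined as reads per kilobase of feature length per million mapped reads. The following is only a minimal base R sketch of that arithmetic, not the code used in this post: the function name `rpkm` and its argument names are hypothetical, and it assumes per-gene read counts and gene lengths have already been obtained elsewhere.

```r
# Hypothetical helper: RPKM = counts * 1e9 / (length_bp * total_mapped_reads).
# counts       : reads assigned to each gene
# lengths_bp   : gene (summed exon) length in base pairs
# total_mapped : total number of mapped reads in the library
rpkm <- function(counts, lengths_bp, total_mapped) {
  stopifnot(length(counts) == length(lengths_bp),
            all(lengths_bp > 0),
            total_mapped > 0)
  counts * 1e9 / (lengths_bp * total_mapped)
}

# 100 reads on a 2 kb gene in a library of 10 million mapped reads.
example_value <- rpkm(100, 2000, total_mapped = 1e7)
print(example_value)  # 5
```

Which reads count towards `total_mapped` (all mapped reads or only those assigned to genes) is a choice the sketch leaves to the caller.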

seqlevels(aligns)
[1] "chr1"  "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chr2"  "chr3"  "chr4"  "chr5"  "chr6" 
[17] "chr7"  "chr8"  "chr9"  "chrM"  "chrX"  "chrY" 

seqlevels(exonRanges.gene)
[1] "chr13"                "chr9"                 "chr6"                 "chrX"                 "chr17"               
[6] "chr2"                 "chr7"                 "chr18"                "chr8"                 "chr4"                
[11] "chr19"                "chr5"                 "chr16"                "chr11"                "chr10"               
[16] "chr14"                "chr1"                 "chr3"                 "chr15"                "chr12"               
[21] "chrY"                 "chrX_GL456233_random" "chr5_JH584299_random" "chr5_JH584298_random" "chr4_GL456216_random"
[26] "chr4_GL456350_random" "chr4_JH584294_random" "chr4_JH584293_random" "chr5_GL456354_random" "chr7_GL456219_random"
[31] "chr5_JH584296_random" "chr5_JH584297_random" "chr4_JH584292_random" "chr1_GL456221_random" "chrUn_JH584304"

Because the two objects have different sets of chromosomes (the annotation has the extra chrN_XXXXX_random entries), I get the following warnings.

1: In .deduceExonRankings(exs, format = "gtf") :
  Infering Exon Rankings.  If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName
2: In matchCircularity(chroms, circ_seqs) :
  None of the strings in your circ_seqs argument match your seqnames.
3: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chrX_GL456233_random, chr5_JH584299_random, chr5_JH584298_random, chr4_GL456216_random, chr4_GL456350_random, chr4_JH584294_random, chr4_JH584293_random, chr5_GL456354_random, chr7_GL456219_random, chr5_JH584296_random, chr5_JH584297_random, chr4_JH584292_random, chr1_GL456221_random, chrUn_JH584304
  - in 'y': chrM
  Make sure to always combine/compare objects based on the same reference
  genome (use suppressWarnings() to suppress this warning).
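The mismatch reported in the third warning can be reproduced by hand with base R set operations. This is a small sketch using the sequence names listed above; the variable names are made up for illustration, and with real objects the two vectors would come from seqlevels(aligns) and seqlevels(exonRanges.gene).

```r
# Sequence names in the BAM file (as printed by seqlevels(aligns) above).
bam_levels <- paste0("chr", c(1:19, "X", "Y", "M"))

# Sequence names in the annotation (as printed by seqlevels(exonRanges.gene)).
gtf_levels <- c(
  paste0("chr", c(1:19, "X", "Y")),
  "chrX_GL456233_random", "chr5_JH584299_random", "chr5_JH584298_random",
  "chr4_GL456216_random", "chr4_GL456350_random", "chr4_JH584294_random",
  "chr4_JH584293_random", "chr5_GL456354_random", "chr7_GL456219_random",
  "chr5_JH584296_random", "chr5_JH584297_random", "chr4_JH584292_random",
  "chr1_GL456221_random", "chrUn_JH584304"
)

only_in_gtf <- setdiff(gtf_levels, bam_levels)  # the 14 extra scaffolds
only_in_bam <- setdiff(bam_levels, gtf_levels)  # "chrM"
shared      <- intersect(bam_levels, gtf_levels)

print(only_in_bam)
print(length(only_in_gtf))
print(length(shared))
```

Only the names in `shared` can produce overlaps between reads and annotated exons.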

I understand that this is due to the mismatch in chromosome names, and that I can handle it either by deleting the chrN_XXXXXX_random features from the .gtf file or by aligning to a genome that includes these sequences. Are there any advantages or disadvantages (right/wrong) to the former or the latter?

Is there any other workaround for this? Additionally, I would like to know whether there is any genome/GTF build available for download that is consistent in these respects for the mouse, rat and human genomes. Thanks in advance!

Devon Ryan (Freiburg, Germany) wrote, 4.6 years ago:

Since the XXX_random scaffolds likely exist somewhere in the genome, it's probably better to align against a reference containing them. I should note that there's not much of a difference in practical terms (they're only ~5.3 megabases total), so your results aren't likely to be influenced much.
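If you go with the simpler route of dropping the scaffold features from the GTF instead, something along these lines should work. It is a minimal base R sketch: the function name `filter_gtf_by_seqname` and the output file name are hypothetical, and it assumes a standard tab-separated GTF whose first column is the sequence name.

```r
# Hypothetical helper: keep only GTF records whose sequence name (column 1)
# is in `keep`; comment lines starting with "#" are passed through unchanged.
# Returns (invisibly) the number of lines that were dropped.
filter_gtf_by_seqname <- function(gtf_in, gtf_out, keep) {
  lines <- readLines(gtf_in)
  is_comment <- startsWith(lines, "#")
  seqname <- sub("\t.*$", "", lines)  # text before the first tab
  kept <- lines[is_comment | seqname %in% keep]
  writeLines(kept, gtf_out)
  invisible(length(lines) - length(kept))
}

# Example usage (not run here): keep the chromosomes present in genome.fa.
# main_chroms <- paste0("chr", c(1:19, "X", "Y", "M"))
# filter_gtf_by_seqname("genes.gtf", "genes.main_chroms.gtf", main_chroms)
```

Writing to a new file rather than overwriting genes.gtf keeps the original annotation available for comparison.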

You might have better luck with the Ensembl annotations and genome. There, the scaffolds present in the GTF are all in the fasta file (though the reverse isn't always true, which is appropriate). The Ensembl annotation has some other benefits, like not causing DEXSeq to choke due to gene IDs being in multiple orientations on multiple chromosomes.
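Whichever source you use, it is easy to check that every sequence named in the GTF is actually present in the FASTA file. The sketch below is one hedged way to do that in base R. It assumes a FASTA index (.fai, as written by `samtools faidx`, with the sequence name in the first tab-separated column) sits next to the genome; the function name `missing_from_fasta` is made up for illustration.

```r
# Hypothetical helper: return the GTF sequence names that do not appear in
# the FASTA index. An empty result means the two files are consistent.
missing_from_fasta <- function(gtf_path, fai_path) {
  # Read every .fai column as character so names like "1" stay strings.
  fai <- read.delim(fai_path, header = FALSE, colClasses = "character")
  fasta_names <- fai[[1]]

  gtf_lines <- readLines(gtf_path)
  gtf_lines <- gtf_lines[nzchar(gtf_lines) & !startsWith(gtf_lines, "#")]
  gtf_names <- unique(sub("\t.*$", "", gtf_lines))  # column 1 of the GTF

  setdiff(gtf_names, fasta_names)
}

# Example usage (not run here):
# missing_from_fasta("genes.gtf", "genome.fa.fai")
```

Reading the index instead of the FASTA itself avoids loading the whole genome into memory just to get the sequence names.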

Edit: I should note that I have no experience with the premade indices from iGenomes, so I can't say whether the Ensembl versions are any good or not.


k.nirmalraman replied, 4.6 years ago:

Thanks! That sounds promising. I should nevertheless try the Ensembl annotations!