How To Get Length Data For Goseq
2
1
Entering edit mode
10.5 years ago
hoangtv ▴ 30

Hi all I am woking with my RNA-seq data on GOSeq, however I got stuck now. There is support for mm10 genome but not Gene ID ( geneSymbol). I am trying to get length information by following the Goseq manual but I still dont understand. So, could you please show me how to get the length information for mm10 genome and geneID geneSymbol ?

genes = as.integer(all.genes %in% F.genes) names(genes) = all.genes head(genes) Cryba1 Cryba4 Cryga Crygb Crygc Crygd 1 1 1 1 1 1 pwf=nullp(genes,"mm10", "geneSymbol") Can't find mm10/geneSymbol length data in genLenDataBase... Trying to download from UCSC. This might take a couple of minutes. Error in value[3L] : Length information for genome mm10 and gene ID geneSymbol is not available. You will have to specify bias.data manually. Thank you so much

rna-seq • 7.0k views
ADD COMMENT
3
Entering edit mode
10.5 years ago

Before getting to your question, it's generally a bad idea to count by gene symbol (or even UCSC ID). They're not actually unique, so you end up counting multiple things (the Ensembl annotations make much more sense!).

This is more of a work around than a direct answer, since I've never looked to see where the ucscTableQuery() function actually sources its data. Having said that, once it gets the data from UCSC, the getlength() function in goseq just adds up the exons. Assuming you have a GTF and fasta file for mm10 sitting around somewhere (quite likely if you're doing RNAseq), then something like the following will work. This also outputs %GC content, in case you ever need to use the cqn package. The original version of this used gene_id to calculate everything over, so I've quickly changed the instances of that to gene_name. I've not debugged the version pasted below, but it should more or less work.

library(GenomicRanges)
library(rtracklayer)
library(Rsamtools)

GTFfile = "something.gtf"
FASTAfile = "something.fa"

#Load the annotation and reduce it
GTF <- import.gff(GTFfile, format="gtf", genome="mm10", asRangedData=F, feature.type="exon")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_name))
reducedGTF <- unlist(grl, use.names=T)
elementMetadata(reducedGTF)$gene_name <- rep(names(grl), elementLengths(grl))

#Open the fasta file
FASTA <- FaFile(FASTAfile)
open(FASTA)

#Add the GC numbers
elementMetadata(reducedGTF)$nGCs <- letterFrequency(getSeq(FASTA, reducedGTF), "GC")[,1]
elementMetadata(reducedGTF)$widths <- width(reducedGTF)

#Create a list of the ensembl_id/GC/length
calc_GC_length <- function(x) {
    nGCs = sum(elementMetadata(x)$nGCs)
    width = sum(elementMetadata(x)$widths)
    c(width, nGCs/width)
}
output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_name), calc_GC_length))
colnames(output) <- c("Length", "GC")

write.table(output, file="GC_lengths.tsv", sep="\t")
ADD COMMENT
0
Entering edit mode

How would you load the GC_lengths.tsv file into goseq?

ADD REPLY
0
Entering edit mode

read.delim(), as you would any other tab-separated file.

ADD REPLY
0
Entering edit mode

Would you be able to share more of your code how you used goseq because here it seem they did it differently but their code did not work for me?

ADD REPLY
0
Entering edit mode

They just did some extra work because they don't know how to use R, it's the same otherwise. For goseq, just follow its manual.

ADD REPLY
0
Entering edit mode

Unfortunately, in the manual I found only the explanation for : * 5.1 Length data format (your code is working for it) * 5.2 Category mapping format but not how to actually use it for an non model species.

ADD REPLY
0
Entering edit mode

You put the lengths in nullp() as the bias.data= parameter.

ADD REPLY
0
Entering edit mode

Should I merge sp0000001-mRNA-x:exon:xxx together in order to get sp0000001 which would represent the gene length?

"Length"    "GC"
"sp0000001-mRNA-1:exon:204" 17  0.529411764705882
"sp0000001-mRNA-1:exon:205" 631 0.393026941362916
"sp0000001-mRNA-1:exon:206" 146 0.417808219178082
"sp0000001-mRNA-1:exon:207" 187 0.379679144385027
"sp0000001-mRNA-1:exon:208" 2046    0.389540566959922
"sp0000001-mRNA-1:exon:209" 20  0.45
"sp0000001-mRNA-1:exon:210" 4   0.5
"sp0000002-mRNA-1:exon:211" 333 0.54954954954955
"sp0000003-mRNA-1:exon:212" 216 0.458333333333333
"sp0000004-mRNA-1:exon:213" 357 0.560224089635854
"sp0000005-mRNA-1:exon:258" 200 0.6
"sp0000005-mRNA-1:exon:259" 82  0.390243902439024
"sp0000005-mRNA-1:exon:260" 65  0.4
"sp0000005-mRNA-1:exon:261" 98  0.428571428571429
"sp0000005-mRNA-1:exon:262" 137 0.401459854014599
"sp0000005-mRNA-1:exon:263" 68  0.485294117647059
"sp0000005-mRNA-1:exon:264" 64  0.40625
"sp0000005-mRNA-1:exon:265" 369 0.414634146341463
"sp0000006-mRNA-1:exon:266" 333 0.501501501501502
"sp0000006-mRNA-1:exon:267" 37  0.459459459459459
"sp0000006-mRNA-1:exon:268" 214 0.546728971962617
"sp0000006-mRNA-1:exon:269" 203 0.433497536945813
"sp0000006-mRNA-1:exon:270" 100 0.32
"sp0000006-mRNA-1:exon:271" 123 0.48780487804878
"sp0000006-mRNA-1:exon:272" 143 0.496503496503497
"sp0000006-mRNA-1:exon:273" 390 0.433333333333333
"sp0000006-mRNA-2:exon:274" 553 0.455696202531646
"sp0000006-mRNA-3:exon:275" 165 0.460606060606061
"sp0000006-mRNA-3:exon:276" 120 0.483333333333333
"sp0000007-mRNA-1:exon:1386"    65  0.446153846153846
"sp0000007-mRNA-1:exon:1387"    272 0.639705882352941
"sp0000007-mRNA-1:exon:1388"    93  0.612903225806452
"sp0000007-mRNA-1:exon:1389"    195 0.584615384615385
"sp0000007-mRNA-1:exon:1390"    294 0.693877551020408
"sp0000007-mRNA-1:exon:1391"    114 0.517543859649123
"sp0000007-mRNA-1:exon:1392"    896 0.646205357142857

My gff file looks like this:

lcl|ScwjSwM_6572    maker    gene    14106    17663    .    -    .    ID=sp0000006;Name=sp0000006
lcl|ScwjSwM_6572    maker    mRNA    14106    17663    .    -    .    ID=sp0000006-mRNA-1;Parent=sp0000006;Name=sp0000006-mRNA-1;_AED=0.16;_eAED=0.16;_QI=299|0.71|0.62|1|0.57|0.37|8|395|282
lcl|ScwjSwM_6572    maker    mRNA    14106    17663    .    -    .    ID=sp0000006-mRNA-2;Parent=sp0000006;Name=sp0000006-mRNA-2;_AED=0.16;_eAED=0.16;_QI=299|0.66|0.57|1|0.66|0.57|7|415|282
lcl|ScwjSwM_6572    maker    mRNA    14106    17663    .    -    .    ID=sp0000006-mRNA-3;Parent=sp0000006;Name=sp0000006-mRNA-3;_AED=0.19;_eAED=0.19;_QI=299|1|0.85|1|0.83|0.71|7|395|235
lcl|ScwjSwM_6572    maker    exon    14752    14871    .    -    .    ID=sp0000006-mRNA-3:exon:276;Parent=sp0000006-mRNA-3
lcl|ScwjSwM_6572    maker    exon    15581    15745    .    -    .    ID=sp0000006-mRNA-3:exon:275;Parent=sp0000006-mRNA-3
...
lcl|ScwjSwM_6572    maker    exon    17152    17188    .    -    .    ID=sp0000006-mRNA-1:exon:267;Parent=sp0000006-mRNA-1,sp0000006-mRNA-2,sp0000006-mRNA-3
lcl|ScwjSwM_6572    maker    exon    17331    17663    .    -    .    ID=sp0000006-mRNA-1:exon:266;Parent=sp0000006-mRNA-1,sp0000006-mRNA-2,sp0000006-mRNA-3
lcl|ScwjSwM_6572    maker    five_prime_UTR    17365    17663    .    -    .    ID=sp0000006-mRNA-1:five_prime_utr;Parent=sp0000006-mRNA-1
lcl|ScwjSwM_6572    maker    CDS    17331    17364    .    -    0    ID=sp0000006-mRNA-1:cds;Parent=sp0000006-mRNA-1
...
lcl|ScwjSwM_6572    maker    CDS    14752    14874    .    -    0    ID=sp0000006-mRNA-1:cds;Parent=sp0000006-mRNA-1
lcl|ScwjSwM_6572    maker    CDS    14521    14658    .    -    0    ID=sp0000006-mRNA-1:cds;Parent=sp0000006-mRNA-1
lcl|ScwjSwM_6572    maker    three_prime_UTR    14516    14520    .    -    .    ID=sp0000006-mRNA-1:three_prime_utr;Parent=sp0000006-mRNA-1
lcl|ScwjSwM_6572    maker    three_prime_UTR    14106    14495    .    -    .    ID=sp0000006-mRNA-1:three_prime_utr;Parent=sp0000006-mRNA-1
lcl|ScwjSwM_6572    maker    five_prime_UTR    17365    17663    .    -    .    ID=sp0000006-mRNA-2:five_prime_utr;Parent=sp0000006-mRNA-2
lcl|ScwjSwM_6572    maker    CDS    17331    17364    .    -    0    ID=sp0000006-mRNA-2:cds;Parent=sp0000006-mRNA-2
lcl|ScwjSwM_6572    maker    CDS    17152    17188    .    -    2    ID=sp0000006-mRNA-2:cds;Parent=sp0000006-mRNA-2
...
lcl|ScwjSwM_6572    maker    CDS    14521    14658    .    -    0    ID=sp0000006-mRNA-2:cds;Parent=sp0000006-mRNA-2
lcl|ScwjSwM_6572    maker    three_prime_UTR    14106    14520    .    -    .    ID=sp0000006-mRNA-2:three_prime_utr;Parent=sp0000006-mRNA-2
lcl|ScwjSwM_6572    maker    five_prime_UTR    17365    17663    .    -    .    ID=sp0000006-mRNA-3:five_prime_utr;Parent=sp0000006-mRNA-3
ADD REPLY
0
Entering edit mode

My code was made for GTF files, you'll need to change it to match yours.

ADD REPLY
0
Entering edit mode

Would you be able to share few lines of your GTF file and few lines of the output file created by your script?

ADD REPLY
0
Entering edit mode

Download any mammalian GTF file from Ensembl for a good example.

ADD REPLY
0
Entering edit mode

Apparently, nullp expects a vector for bias.data, not a data.frame. What is the best way to convert it to a vector after I read it with read.delim()?

ADD REPLY
0
Entering edit mode

Give it the appropriate vector from the dataframe (e.g., d$length). I should note that you might have to either (A) ensure the order of the dataframe/length-vector matches the rest of the data you input and/or (B) ensure you vector has labels that match the rest of the data (in case nullp handles this). I think (A) suffices, but it's been a while so if you get an error message then you probably need (B) too.

ADD REPLY
0
Entering edit mode
9.0 years ago
rahilsethi • 0

Thanks a lot Ryan, That was very helpful. I am trying to use GTF file from cuffmerge and trying to do normalization w.r.t GC content and Gene length. For getting lengths and GC content of all genes generated by cufflinks (XLOC ids) I had to rely on the script that you provided. I know you mentioned you used reduce function to take care of overlapping exons which is vital for calculating gene lengths. Can you please explain roughly how does it take care of overlapping regions? Does it ignore overlapping regions or just counts overlapping region only once? I assume it is here where the overlapping exon regions are taken care of:

grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))

I am also thinking of doing transcript level normalization and expression for all the transcript assemblies that Cuffmerge output contains (TCONS id). For that I will substitute $gene_id with $transcript_id because that is the name its lists in GTF viewed with head(GTF). I am using HTSseq count for gene level, but for transcript level, perhaps I would have to switch to featureCounts since it prevents losing overlapping reads with isoforms of genes.

Thanks again, it worked fine by removing the option genome= in import.gff since I used Ensembl database as reference and this uses UCSC to look for genome information.

Rahil

ADD COMMENT

Login before adding your answer.

Traffic: 2457 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6