Can anybody help me find the S. pombe dataset on BioMart? I would also appreciate some help on how to use makeTranscriptDbFromBiomart to create a TranscriptDb.
cheers,
Looks like you figured out another way of getting what you needed, but, for the record, here is the answer to your question:
S. pombe is at http://fungi.ensembl.org/index.html
The biomart is here: http://fungi.ensembl.org/biomart/martview/248a3d2deec76fa7be1e94e32b3972df
Access it using Bioconductor's GenomicFeatures as follows. Note the warnings in the output.
library(GenomicFeatures)
library(biomaRt)
# Build the TranscriptDb from the Ensembl Fungi BioMart
txdb <- makeTranscriptDbFromBiomart(
    biomart = "fungi_mart_22",
    dataset = "spombe_eg_gene",
    host = "fungi.ensembl.org"
)
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'splicings' data frame ... OK
Download and preprocess the 'genes' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TranscriptDb object ... OK
Warning messages:
1: In .normargSplicings(splicings, transcripts_tx_id) :
no CDS information for this TranscriptDb object
2: In .normargChrominfo(chrominfo, transcripts$tx_chrom, splicings$exon_chrom) :
chromosome lengths and circularity flags are not available for this TranscriptDb object
> transcriptsBy(txdb)
GRangesList of length 7017:
$SPAC1002.01
GRanges with 1 range and 2 metadata columns:
      seqnames             ranges strand |     tx_id        tx_name
         <Rle>          <IRanges>  <Rle> | <integer>    <character>
  [1]        I [1798347, 1799015]      + |       510  SPAC1002.01.1

$SPAC1002.02
GRanges with 1 range and 2 metadata columns:
      seqnames             ranges strand |     tx_id        tx_name
  [1]        I [1799061, 1800053]      + |       511  SPAC1002.02.1

$SPAC1002.03c
GRanges with 1 range and 2 metadata columns:
      seqnames             ranges strand |     tx_id        tx_name
  [1]        I [1799915, 1803141]      - |      2075 SPAC1002.03c.1

...
<7014 more elements>
---
seqlengths:
        I       II      III       MT      MTR AB325691
       NA       NA       NA       NA       NA       NA
It's strange: yesterday I tried these commands and they built the TranscriptDb, but today I am getting an error! Do you see any problem?
> txdb<-makeTranscriptDbFromBiomart(biomart="fungi_mart_22", dataset="spombe_eg_gene", host="fungi.ensembl.org")
Error in useDataset(mart = mart, dataset = dataset, verbose = verbose) :
The given dataset: spombe_eg_gene , is not valid. Correct dataset names can be obtained with the listDatasets function.
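As the error message suggests, one way to see what the server currently offers is to list its marts and datasets with biomaRt. The following is only a sketch: the thread does not establish why the dataset name stopped being valid, the helper names find_datasets and list_pombe_datasets are made up for illustration, and picking the first mart returned by listMarts is an assumption that may not suit every host.

```r
# Hypothetical helper: filter a listDatasets()-style data frame
# (columns 'dataset' and 'description') by a case-insensitive pattern.
find_datasets <- function(datasets, pattern) {
  hits <- grepl(pattern, datasets$dataset, ignore.case = TRUE) |
    grepl(pattern, datasets$description, ignore.case = TRUE)
  datasets[hits, , drop = FALSE]
}

# Hypothetical wrapper: needs the biomaRt package and network access.
list_pombe_datasets <- function(host = "fungi.ensembl.org") {
  marts <- biomaRt::listMarts(host = host)
  print(marts)  # shows the mart names the host currently exposes
  # Assumption: the first mart listed is the gene mart you want.
  mart <- biomaRt::useMart(as.character(marts$biomart[1]), host = host)
  find_datasets(biomaRt::listDatasets(mart), "pombe")
}

# Usage (not run here because it contacts the server):
# list_pombe_datasets()
```

If the mart or dataset name printed differs from the one in your call, pass the current names to makeTranscriptDbFromBiomart instead.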
I don't know that it's in BioMart, given that it's not in Ensembl. Just download the GTF or GFF file from PomBase and then use makeTranscriptDbFromGFF() from GenomicFeatures.
Edit: I take that back, it is in Ensembl. Here's an example biomart query.
Yes, I saw that, thanks! But do I need to set a lot of parameters? I am new to this field, and when I look at the parameters it all seems very complex to me at this point. Is there a straightforward script for it, or should I go through all the arguments and choose carefully?
Do you mean parameters for makeTranscriptDbFromGFF()? It only needs the file name.
Yes, because when I checked ?makeTranscriptDbFromGFF it listed a lot of options. That's why I asked! However, when I try with the file name only, I end up with errors for both the GFF3 and GTF formats.
> makeTranscriptDbFromGFF("Schizosaccharomyces_pombe.ASM294v2.22.gff3")
extracting transcript information
Error in .prepareGFF3TXS(data, useGenesAsTranscripts) :
No Transcript information found in gff file
> makeTranscriptDbFromGFF("Schizosaccharomyces_pombe.ASM294v2.21.gtf")
Error in .parse_attrCol(attrCol, file, colnames) :
Some attributes do not conform to 'tag=value' format
txdb <- makeTranscriptDbFromGFF("Schizosaccharomyces_pombe.ASM294v2.22.gtf", format="gtf")
works. I'd have to look into why it doesn't like the gff3 file.
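The 'tag=value' error above is consistent with the GTF file being parsed as GFF3 when no format is given, which, as far as I know, is the default of the format argument in this version of GenomicFeatures. Below is a small sketch that picks the format from the file extension so it does not have to be typed each time. The helper names guess_gff_format and make_txdb_from_file are hypothetical, not part of GenomicFeatures, and the sketch does not address the separate gff3 failure.

```r
# Hypothetical helper: map a file name to a value for the 'format' argument.
guess_gff_format <- function(path) {
  # Ignore a trailing .gz, then take the lower-cased extension.
  ext <- tolower(tools::file_ext(sub("\\.gz$", "", path)))
  if (ext == "gtf") {
    "gtf"
  } else if (ext %in% c("gff", "gff3")) {
    "gff3"
  } else {
    stop("cannot guess format from extension: ", path)
  }
}

# Hypothetical wrapper: needs GenomicFeatures installed.
# In newer GenomicFeatures releases the function is named makeTxDbFromGFF().
make_txdb_from_file <- function(path) {
  GenomicFeatures::makeTranscriptDbFromGFF(path, format = guess_gff_format(path))
}

# Usage (not run here):
# txdb <- make_txdb_from_file("Schizosaccharomyces_pombe.ASM294v2.22.gtf")
```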