I am trying to build a mapping file between C. elegans gene names and transcript names. To do so, I use the Bioconductor package biomaRt, which I freshly reinstalled. I have also freshly downloaded the latest C. elegans transcriptome from Ensembl here: ftp://ftp.ensembl.org/pub/release-86/fasta/caenorhabditis_elegans/cdna/
Here is the code:
# Download C. elegans cDNA file from www.ensembl.org
download.file(paste0('ftp://ftp.ensembl.org/pub/release-', ensemblRelease,
                     '/fasta/caenorhabditis_elegans/cdna/Caenorhabditis_elegans.WBcel235.cdna.all.fa.gz'),
              'output/transcriptome/sequence/celegans.fa.gz')
system('gunzip output/transcriptome/sequence/celegans.fa.gz')
# Create a mapping file containing gene names in the first
# column and the associated transcript name in the second
# column. There should be only one name in each cell. Gene
# names can occur more than once and be associated with more
# than one associated transcript name but only one transcript
# name per line.
martWorm <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
                             dataset = "celegans_gene_ensembl",
                             host = 'ensembl.org')
g2t <- biomaRt::getBM(attributes = c('ensembl_gene_id', 'ensembl_transcript_id'),
                      mart = martWorm)
write.table(g2t, 'output/counts/rsem/ref/geneToTxMapping.txt',
            quote = FALSE, row.names = FALSE)
However, there is a problem. In my FASTA transcriptome (cDNA) file, I have the transcript ID F52H2.2. It is not found in my mapping table, although F52H2.2a and F52H2.2b are. Conversely, F52H2.2a is not found in the FASTA file. This causes problems in my downstream analysis. Does anybody know what causes this? Is there perhaps a way to download the transcriptome from within R using the biomaRt package, so that it is compatible with the biomaRt database?
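
To make the mismatch concrete, here is a minimal sketch of how the two sets of IDs can be compared. It assumes the transcript ID is the first whitespace-delimited token of each FASTA header line. The vectors below are toy stand-ins (the gene IDs geneA and geneB and the second transcript are made up for illustration); in practice the header lines would come from readLines() on the unzipped FASTA file and the mapping from g2t.

```r
# Toy stand-ins (hypothetical data); replace with
#   fastaLines <- readLines('output/transcriptome/sequence/celegans.fa')
# and the real g2t table returned by getBM().
fastaLines <- c(">F52H2.2 cdna", "ATGC", ">Y74C9A.3 cdna", "GGCA")
g2tToy <- data.frame(
  ensembl_gene_id       = c("geneA", "geneA", "geneB"),
  ensembl_transcript_id = c("F52H2.2a", "F52H2.2b", "Y74C9A.3"),
  stringsAsFactors = FALSE
)

# Keep only header lines and extract the first token after '>'
headers  <- grep("^>", fastaLines, value = TRUE)
fastaIds <- sub("^>([^[:space:]]+).*$", "\\1", headers)

# IDs present on one side only
onlyInFasta   <- setdiff(fastaIds, g2tToy$ensembl_transcript_id)
onlyInMapping <- setdiff(g2tToy$ensembl_transcript_id, fastaIds)

print(onlyInFasta)    # "F52H2.2"
print(onlyInMapping)  # "F52H2.2a" "F52H2.2b"
```

Running the same comparison on the real files should list every transcript ID that appears on only one side, which would show whether the problem is limited to a few genes or affects the whole annotation.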