Question: Use BioMart to get gene names for transcripts with isoform ID (e.g. NM_000014.6)
dk031920 wrote:

Is there a workaround for using biomaRt to extract gene names for transcripts with the isoform (version) suffix in the name, for example NM_000014.6? The query works if I run it without the decimal, but for consistency I would prefer not to edit my identifiers. Alternatively, if someone can recommend a good human transcriptome in FASTA format with gene names, I would really appreciate it. Cheers

rna-seq • 299 views
ADD COMMENTlink modified 8 weeks ago by vkkodali2.4k • written 8 weeks ago by dk031920

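Not that I am aware of, but you should be able to match the biomaRt output back to the original data with simple functions, for example by querying with the version suffix stripped and then joining the results to your versioned IDs.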
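
As a rough sketch of that matching step in shell: keep each versioned ID, derive an unversioned copy to join on, and carry the gene name back. The file names and the two-column biomart.tsv (unversioned ID, gene name) are assumptions standing in for whatever you export from biomaRt.

```shell
# Sample versioned accessions, as in the question.
printf '%s\n' NM_000014.6 NM_001347425.2 > ids.txt

# Stand-in for a biomaRt export (hypothetical file): unversioned ID <TAB> gene name.
printf 'NM_000014\tA2M\nNM_001347425\tA2M\n' > biomart.tsv

# Column 1: original versioned ID; column 2: same ID with the ".N" suffix removed.
awk '{ base = $1; sub(/\.[0-9]+$/, "", base); print $1 "\t" base }' ids.txt > id_map.tsv

# Load the export into a lookup table, then print versioned ID + gene name.
awk -F'\t' 'NR == FNR { gene[$1] = $2; next }
            $2 in gene { print $1 "\t" gene[$2] }' biomart.tsv id_map.tsv > ids_with_genes.tsv

cat ids_with_genes.tsv
```

IDs with no hit in the export are silently dropped here, so it is worth comparing line counts between ids.txt and ids_with_genes.tsv.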

With regard to a FASTA transcriptome, I would recommend those provided by GENCODE: https://www.gencodegenes.org/

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by Kevin Blighe69k
vkkodali2.4k wrote:

You can use NCBI Datasets for this, specifically its command-line tool, as shown below:

$ cat accs.txt
XM_006719056.3
NM_001347425.2
NM_000014.6
NM_001347424.2
NM_001347423.2

$ datasets download gene accession --inputfile accs.txt --exclude-gene --exclude-protein
Downloading: ncbi_dataset.zip    12.3kB done

$ unzip ncbi_dataset.zip 
Archive:  ncbi_dataset.zip
inflating: README.md               
inflating: ncbi_dataset/data/rna.fna  
inflating: ncbi_dataset/data/data_report.jsonl  
inflating: ncbi_dataset/data/data_table.tsv  
inflating: ncbi_dataset/data/dataset_catalog.json  

$ grep -A1 '>' ncbi_dataset/data/rna.fna
>NM_001347423.2 A2M [organism=Homo sapiens] [GeneID=2] [transcript=2]
ATACAAGAGATGTGAGAAGCACCATAAAAGGCGTTGTGAGGAGTTGTGGGGGAGTGAGGGAGAGAAGAGG
--
>XM_006719056.3 A2M [organism=Homo sapiens] [GeneID=2] [transcript=X1]
ATAAAGCCCAGTTGCTTTGGGAAGTGTTTGGGACCAGATGGATTGTAGGGAGTAGGGTACAATACAGTCT
--
>NM_001347425.2 A2M [organism=Homo sapiens] [GeneID=2] [transcript=4]
GGGACCAGATGGATTGTAGGGAGTAGGGTACAATACAGTCTGTTCTCCTCCAGCTCCTTCTTTCTGCAAC
--
>NM_000014.6 A2M [organism=Homo sapiens] [GeneID=2] [transcript=1]
GGGACCAGATGGATTGTAGGGAGTAGGGTACAATACAGTCTGTTCTCCTCCAGCTCCTTCTTTCTGCAAC
--
>NM_001347424.2 A2M [organism=Homo sapiens] [GeneID=2] [transcript=3]
GGGACCAGATGGATTGTAGGGAGTAGGGTACAATACAGTCTGTTCTCCTCCAGCTCCTTCTTTCTGCAAC

The data_table.tsv and the data_report.jsonl files include additional useful information that can be parsed, if needed.
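
If all you need is the accession-to-gene-name mapping, the FASTA headers already carry it. A minimal sketch, assuming the header layout is exactly as in the output above (accession first, gene symbol second); rna_sample.fna and tx2gene.tsv are made-up file names, and the sample holds two records copied from that output:

```shell
# Two records copied from the grep output above, standing in for rna.fna.
cat > rna_sample.fna <<'EOF'
>NM_001347423.2 A2M [organism=Homo sapiens] [GeneID=2] [transcript=2]
ATACAAGAGATGTGAGAAGCACCATAAAAGGCGTTGTGAGGAGTTGTGGGGGAGTGAGGGAGAGAAGAGG
>XM_006719056.3 A2M [organism=Homo sapiens] [GeneID=2] [transcript=X1]
ATAAAGCCCAGTTGCTTTGGGAAGTGTTTGGGACCAGATGGATTGTAGGGAGTAGGGTACAATACAGTCT
EOF

# For each header line, print the accession (without ">") and the gene symbol.
awk '/^>/ { print substr($1, 2) "\t" $2 }' rna_sample.fna > tx2gene.tsv

cat tx2gene.tsv
```

On the real download, point the awk command at ncbi_dataset/data/rna.fna instead of the sample file.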

ADD COMMENTlink written 8 weeks ago by vkkodali2.4k

So I was able to use this for individual IDs, but not for a .txt file with a list of IDs.

Here is the error: 'Error: Internal Server Error (939B3A4AC5BE35B500003E0B2C4C4B0C.1) No valid gene identifiers - Exiting'.

And this is the format of my .txt input file:
NM_001322239.1
NM_001322240.2
NM_001322242.2

Any ideas how to resolve this?

Cheers

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by dk031920

Just tested the original solution and it works for me. Of the three example accessions you provided above, one generates an error and the other two work.

Some of the accessions you provided (e.g. NM_001322239.1) are not currently in NCBI Gene or do not have an associated NCBI GeneID.

ADD REPLYlink written 8 weeks ago by GenoMax95k

I was able to get it to run using an input file, but only with two IDs. For some reason it runs into an error when I run all the IDs.

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by dk031920

What OS are you running this on? Do you always get a "Server error"? I wonder if the line endings are a problem (PC/Unix difference).

ADD REPLYlink written 8 weeks ago by GenoMax95k

I am using a Mac. I was able to get it to run using an input file, but only with two IDs. For some reason it runs into an error when I run all the IDs.

ADD REPLYlink written 8 weeks ago by dk031920

I just tried the accessions @vkkodali posted in the original answer on a Mac and did not have any issues. You may have some non-printable characters in your data. You should see only $ at the end of each line if you try the following.

$ cat -vet acc.txt
XM_006719056.3$
NM_001347425.2$
NM_000014.6$
NM_001347424.2$
NM_001347423.2$
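
For a list too long to check by eye, the same inspection can be scripted. A small sketch, with made-up file names and an assumed accession pattern (two capital letters, underscore, digits, dot, version) that may need loosening for other ID types:

```shell
# Demo file with Windows (CRLF) line endings; the name acc_crlf.txt is made up.
printf 'NM_000014.6\r\nNM_001347425.2\r\n' > acc_crlf.txt

# Count lines that contain a carriage return (0 means the file is clean).
cr=$(printf '\r')
crlf_lines=$(grep -c "$cr" acc_crlf.txt || true)
echo "lines with CR: $crlf_lines"

# Strip carriage returns into a cleaned copy.
tr -d '\r' < acc_crlf.txt > acc_clean.txt

# Count lines that do not look like a versioned RefSeq accession.
bad_lines=$(grep -vcE '^[A-Z]{2}_[0-9]+\.[0-9]+$' acc_clean.txt || true)
echo "unexpected lines after cleaning: $bad_lines"
```
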
ADD REPLYlink written 8 weeks ago by GenoMax95k

I tried this and all the lines look good. I don't know if it's a length issue: I am running 161323 lines, essentially the whole transcriptome. I am going to cut some lines and see if that corrects it.

ADD REPLYlink written 8 weeks ago by dk031920

That is likely the issue. Let us see if @vkkodali has any recommendations on that end.

ADD REPLYlink written 8 weeks ago by GenoMax95k

Got it to work for 1000 IDs. I am guessing it's something to do with runtime.

ADD REPLYlink written 8 weeks ago by dk031920

"Got it to work for 1000 IDs, I am guessing its something to do with runtime"

I am glad you were able to get it to work with shorter lists. That may be the workaround for now. I have been able to reproduce the Internal Server Error you are seeing by using an input list that's ~160k accessions. I will post here if I learn anything new.
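
If you want to script the shorter-lists workaround, something along these lines should do. It is only a sketch: the 1000-line batch size is simply the size reported to work in this thread, the input is a placeholder list of made-up IDs, and the loop prints the download commands rather than running them.

```shell
# Placeholder list of 2500 made-up IDs standing in for the real accession file.
awk 'BEGIN { for (i = 1; i <= 2500; i++) printf "ACC_%04d.1\n", i }' > all_accs.txt

# Split into files of at most 1000 lines each: batch_aa, batch_ab, batch_ac, ...
split -l 1000 all_accs.txt batch_

# Print one download command per batch, using the flags from the answer above.
# Each run writes ncbi_dataset.zip, so rename it before starting the next batch.
for f in batch_*; do
    echo "datasets download gene accession --inputfile $f --exclude-gene --exclude-protein"
done
```
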

"I am running 161323 lines essentially the whole transcriptome."

If it is the entire transcriptome that you are interested in, why not just download the entire thing? You can use the datasets download genome command to download the entire transcriptome of a taxon in FASTA format. Performance-wise, this will be the better option if you have hundreds of thousands of accessions: download the transcriptome in FASTA format and use something like seqkit to extract only the data you are interested in.
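
With seqkit installed, the extraction step is usually written as seqkit grep -f wanted_ids.txt transcripts.fna, which matches on the first word of each header. As a dependency-free sketch of the same filtering in plain awk (file names are made up, and the sample records are copied from the output in the answer above):

```shell
# Three records copied from the answer above, standing in for a full transcriptome.
cat > transcripts.fna <<'EOF'
>NM_001347423.2 A2M [organism=Homo sapiens] [GeneID=2] [transcript=2]
ATACAAGAGATGTGAGAAGCACCATAAAAGGCGTTGTGAGGAGTTGTGGGGGAGTGAGGGAGAGAAGAGG
>XM_006719056.3 A2M [organism=Homo sapiens] [GeneID=2] [transcript=X1]
ATAAAGCCCAGTTGCTTTGGGAAGTGTTTGGGACCAGATGGATTGTAGGGAGTAGGGTACAATACAGTCT
>NM_000014.6 A2M [organism=Homo sapiens] [GeneID=2] [transcript=1]
GGGACCAGATGGATTGTAGGGAGTAGGGTACAATACAGTCTGTTCTCCTCCAGCTCCTTCTTTCTGCAAC
EOF

# Accessions to keep, one per line.
printf '%s\n' NM_000014.6 XM_006719056.3 > wanted_ids.txt

# Load the wanted IDs, then print each record whose header ID is in the list.
awk 'NR == FNR { want[$1] = 1; next }
     /^>/ { keep = (substr($1, 2) in want) }
     keep' wanted_ids.txt transcripts.fna > subset.fna

grep '>' subset.fna
```
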

ADD REPLYlink written 8 weeks ago by vkkodali2.4k

That may be an option. I appreciate your help.

ADD REPLYlink written 8 weeks ago by dk031920