Question

Use BioMart to get gene names for transcripts with isoform ID (e.g. NM_000014.6)

0

3.4 years ago

dk0319 ▴ 70

Is there a way to use biomaRt to extract gene names for transcripts that have the isoform identifier in the name, for example NM_000014.6? I can run it without the decimal and that works, but ideally I would prefer not to edit my identifiers, for consistency. Alternatively, if someone can recommend a good human transcriptome in FASTA format with gene names, I would really appreciate it. Cheers

rna-seq • 1.4k views

ADD COMMENT • link updated 3.4 years ago by vkkodali_ncbi ★ 3.7k • written 3.4 years ago by dk0319 ▴ 70

1

Not that I am aware of, but you should be able to match the biomaRt output back to the original data with simple functions.

For a FASTA transcriptome, I would recommend those provided by GENCODE: https://www.gencodegenes.org/
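
A minimal sketch of that matching step, assuming the biomaRt result has been exported to a tab-separated file with the unversioned accession in column 1 and the gene symbol in column 2. The function name, file names, and column layout are made up for illustration. The original identifiers are never edited; the version is only stripped for the lookup.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print "versioned accession<TAB>gene symbol" for every ID in the list.
# $1: file with one versioned accession per line (e.g. NM_000014.6)
# $2: hypothetical biomaRt export, unversioned accession<TAB>gene symbol
map_versioned_ids() {
  local ids=$1 table=$2
  awk -F'\t' -v OFS='\t' '
    # First file: remember the gene symbol for each unversioned accession.
    FILENAME == ARGV[1] { gene[$1] = $2; next }
    # Second file: strip the ".N" version only for the lookup key.
    {
      stem = $1
      sub(/\.[0-9]+$/, "", stem)
      print $1, ((stem in gene) ? gene[stem] : "NA")
    }
  ' "$table" "$ids"
}

# Small demo with placeholder files.
workdir=$(mktemp -d)
printf '%s\n' NM_000014.6 NM_001347423.2 > "$workdir/ids.txt"
printf 'NM_000014\tA2M\nNM_001347423\tA2M\n' > "$workdir/biomart.tsv"
map_versioned_ids "$workdir/ids.txt" "$workdir/biomart.tsv"
```

IDs with no match in the table are reported with NA rather than dropped, so the output keeps one line per input identifier.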

ADD REPLY • link 3.4 years ago by Kevin Blighe 87k

score 2 · Accepted Answer · 2020-11-29

2

3.4 years ago

vkkodali_ncbi ★ 3.7k

You can use the NCBI Datasets command-line tool for this, as shown below:

$ cat accs.txt
XM_006719056.3
NM_001347425.2
NM_000014.6
NM_001347424.2
NM_001347423.2

$ datasets download gene accession --inputfile accs.txt --exclude-gene --exclude-protein
Downloading: ncbi_dataset.zip    12.3kB done

$ unzip ncbi_dataset.zip 
Archive:  ncbi_dataset.zip
inflating: README.md               
inflating: ncbi_dataset/data/rna.fna  
inflating: ncbi_dataset/data/data_report.jsonl  
inflating: ncbi_dataset/data/data_table.tsv  
inflating: ncbi_dataset/data/dataset_catalog.json  

$ grep -A1 '>' ncbi_dataset/data/rna.fna
>NM_001347423.2 A2M [organism=Homo sapiens] [GeneID=2] [transcript=2]
ATACAAGAGATGTGAGAAGCACCATAAAAGGCGTTGTGAGGAGTTGTGGGGGAGTGAGGGAGAGAAGAGG
--
>XM_006719056.3 A2M [organism=Homo sapiens] [GeneID=2] [transcript=X1]
ATAAAGCCCAGTTGCTTTGGGAAGTGTTTGGGACCAGATGGATTGTAGGGAGTAGGGTACAATACAGTCT
--
>NM_001347425.2 A2M [organism=Homo sapiens] [GeneID=2] [transcript=4]
GGGACCAGATGGATTGTAGGGAGTAGGGTACAATACAGTCTGTTCTCCTCCAGCTCCTTCTTTCTGCAAC
--
>NM_000014.6 A2M [organism=Homo sapiens] [GeneID=2] [transcript=1]
GGGACCAGATGGATTGTAGGGAGTAGGGTACAATACAGTCTGTTCTCCTCCAGCTCCTTCTTTCTGCAAC
--
>NM_001347424.2 A2M [organism=Homo sapiens] [GeneID=2] [transcript=3]
GGGACCAGATGGATTGTAGGGAGTAGGGTACAATACAGTCTGTTCTCCTCCAGCTCCTTCTTTCTGCAAC

The data_table.tsv and the data_report.jsonl files include additional useful information that can be parsed, if needed.
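
If only the accession-to-gene mapping is needed, the FASTA headers can be parsed directly. A minimal sketch, assuming every header follows the layout shown above (accession first, gene symbol second); the function name and output file name are made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print "accession<TAB>gene symbol" for every FASTA header line.
# Assumes headers look like: >ACCESSION SYMBOL [key=value] ...
fasta_to_gene_table() {
  awk -v OFS='\t' '/^>/ { print substr($1, 2), $2 }' "$1"
}

# Demo on two headers copied from the output above (sequences shortened).
demo=$(mktemp)
cat > "$demo" <<'EOF'
>NM_000014.6 A2M [organism=Homo sapiens] [GeneID=2] [transcript=1]
GGGACCAGATGGATTGTAGG
>XM_006719056.3 A2M [organism=Homo sapiens] [GeneID=2] [transcript=X1]
ATAAAGCCCAGTTGCTTTGG
EOF
fasta_to_gene_table "$demo"
rm -f "$demo"
```

On the real download this would be run as `fasta_to_gene_table ncbi_dataset/data/rna.fna > tx2gene.tsv`.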

ADD COMMENT • link 3.4 years ago by vkkodali_ncbi ★ 3.7k

0

So I was able to use this for individual IDs, but not with a .txt file containing a list of IDs.

Here is my error: 'Error: Internal Server Error (939B3A4AC5BE35B500003E0B2C4C4B0C.1) No valid gene identifiers - Exiting'.

And this is the format of my .txt input file:
NM_001322239.1
NM_001322240.2
NM_001322242.2

Any ideas how to resolve this?

Cheers

ADD REPLY • link 3.4 years ago by dk0319 ▴ 70

0

Just tested the original solution and it works for me. Of the three example accessions you provided above, one generates an error and the other two work.

Some of the accessions you provided (e.g. NM_001322239.1) are not currently in NCBI Gene or do not have an associated NCBI GeneID.

ADD REPLY • link 3.4 years ago by GenoMax 141k

0

I was able to get it to run using an input file, but with only two IDs. For some reason it runs into an error when I run all the IDs.

ADD REPLY • link 3.4 years ago by dk0319 ▴ 70

0

What OS are you running this on? Do you always get a "Server error"? I wonder whether the line endings are the problem (PC/Unix difference).

ADD REPLY • link 3.4 years ago by GenoMax 141k

0

I am using a Mac. I was able to get it to run using an input file, but with only two IDs. For some reason it runs into an error when I run all the IDs.

ADD REPLY • link 3.4 years ago by dk0319 ▴ 70

0

I just tried the accessions @vkkodali posted in the original answer on a Mac and did not have any issues. You may have some non-printable characters in your data. If you try the following, you should see only a $ at the end of each line and nothing else after the accession.

$ cat -vet acc.txt
XM_006719056.3$
NM_001347425.2$
NM_000014.6$
NM_001347424.2$
NM_001347423.2$
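
A minimal sketch of automating that check and cleaning the file, assuming accessions contain only letters, digits, underscores, and dots. The function names are made up for illustration, and the cleaner only handles carriage returns, stray whitespace, and empty lines.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print "line_number:content" for lines containing any character other than
# letters, digits, "_" or ".". Exit status follows grep: 0 if offending lines
# were found, 1 if the file looks clean.
find_bad_accession_lines() {
  LC_ALL=C grep -n '[^A-Za-z0-9_.]' "$1"
}

# Write a cleaned copy to stdout: drop carriage returns (Windows line endings),
# strip whitespace, and skip empty lines.
clean_accessions() {
  tr -d '\r' < "$1" | awk '{ gsub(/[[:space:]]/, ""); if (length($0) > 0) print }'
}

# Demo: one CRLF line, one trailing space, one empty line.
demo=$(mktemp)
printf 'NM_000014.6\r\nNM_001347425.2 \n\nNM_001347424.2\n' > "$demo"
find_bad_accession_lines "$demo" | cat -v || echo "no suspicious lines"
clean_accessions "$demo"
rm -f "$demo"
```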

ADD REPLY • link 3.4 years ago by GenoMax 141k

0

I tried this and all the lines look good. I don't know if it's a length issue: I am running 161323 lines, essentially the whole transcriptome. I am going to cut some lines and see if that corrects it.

ADD REPLY • link 3.4 years ago by dk0319 ▴ 70

0

That is likely the issue. Let us see if @vkkodali has any recommendations on that end.

ADD REPLY • link 3.4 years ago by GenoMax 141k

0

Got it to work for 1000 IDs. I am guessing it's something to do with runtime.

ADD REPLY • link 3.4 years ago by dk0319 ▴ 70

0

"Got it to work for 1000 IDs, I am guessing it's something to do with runtime"

I am glad you were able to get it to work with shorter lists. That may be the workaround for now. I have been able to reproduce the Internal Server Error you are seeing by using an input list that's ~160k accessions. I will post here if I learn anything new.
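
A minimal sketch of the shorter-lists workaround: split the accession file into chunks and download one package per chunk. The function names, directory layout, and chunk size are made up for illustration. The download flags are the ones from the answer above, plus `--filename`, which I am assuming is available to name each package; check `datasets download gene accession --help` for your release.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Split an accession list into files of at most $2 lines each,
# named <prefix>aa, <prefix>ab, ...
split_accessions() {
  local list=$1 chunk_size=$2 prefix=$3
  split -l "$chunk_size" "$list" "$prefix"
}

# Run one download per chunk, writing <outdir>/<chunk name>.zip.
# Not called in this sketch: it needs the datasets CLI and network access.
download_chunks() {
  local prefix=$1 outdir=$2 chunk
  mkdir -p "$outdir"
  for chunk in "$prefix"*; do
    datasets download gene accession --inputfile "$chunk" \
      --exclude-gene --exclude-protein \
      --filename "$outdir/$(basename "$chunk").zip"
  done
}

# Demo of the splitting step on placeholder lines (not real accessions).
workdir=$(mktemp -d)
seq 1 2500 | sed 's/^/ID_/' > "$workdir/accs.txt"
split_accessions "$workdir/accs.txt" 1000 "$workdir/chunk_"
ls "$workdir"/chunk_*
```

With a real list, `download_chunks "$workdir/chunk_" "$workdir/zips"` would then fetch the packages one chunk at a time.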

"I am running 161323 lines essentially the whole transcriptome."

If it is the entire transcriptome that you are interested in, why not just download the entire thing? You can use the datasets download genome command to download the entire transcriptome of a taxon in FASTA format. Performance-wise, this will be the better option if you have hundreds of thousands of accessions: download the transcriptome in FASTA format and use something like seqkit to extract only the data you are interested in.
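
A minimal sketch of the extraction step. With seqkit installed, `seqkit grep -f accs.txt rna.fna > subset.fna` selects records by ID. The awk function below does the same thing without extra dependencies, assuming the accession is the first word of each header; the function name and file names are made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Keep only FASTA records whose ID (first word of the header, without ">")
# is listed in the accession file.
# $1: file with one accession per line, $2: FASTA file
extract_by_accession() {
  local ids=$1 fasta=$2
  awk '
    # First file: collect the wanted accessions.
    FILENAME == ARGV[1] { if ($1 != "") wanted[$1] = 1; next }
    # Header line: decide whether this record is kept.
    /^>/ { keep = (substr($1, 2) in wanted) }
    # Print header and sequence lines of kept records.
    keep
  ' "$ids" "$fasta"
}

# Demo with headers from the answer above (sequences shortened).
workdir=$(mktemp -d)
printf '%s\n' NM_000014.6 > "$workdir/accs.txt"
cat > "$workdir/rna.fna" <<'EOF'
>NM_001347423.2 A2M [organism=Homo sapiens] [GeneID=2] [transcript=2]
ATACAAGAGATGTGAGAAGC
>NM_000014.6 A2M [organism=Homo sapiens] [GeneID=2] [transcript=1]
GGGACCAGATGGATTGTAGG
EOF
extract_by_accession "$workdir/accs.txt" "$workdir/rna.fna"
```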