Use BioMart to get gene names for transcripts with isoform ID (e.g. NM_000014.6)
9 months ago
dk0319 ▴ 30

Is there a workaround for using biomaRt to extract gene names for transcripts that have the isoform identifier in the name, for example NM_000014.6? I can run it without the decimal and that works, but ideally I would prefer not to edit my identifiers, for consistency. Alternatively, if someone can recommend a good human transcriptome in FASTA format with gene names, I would really appreciate it. Cheers

rna-seq • 470 views

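Not that I am aware of, but you should be able to match the biomaRt output back to the original data with simple string functions: strip the version suffix before querying, then join the results back to the versioned IDs.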

With regard to a FASTA transcriptome, I would recommend those provided by GENCODE: https://www.gencodegenes.org/
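
For the matching step mentioned above, here is a minimal sketch, written for the shell rather than R and not tested against a real biomaRt export. It assumes two hypothetical files: accs.txt with one versioned accession per line, and biomart_result.tsv, a tab-separated export whose first column is the unversioned accession and whose second column is the gene symbol. The function names are made up for this example.

```shell
# Build a lookup table: versioned ID <TAB> unversioned ID.
# $1 = file with one versioned accession per line (e.g. NM_000014.6)
strip_versions() {
    awk '{ id = $1; sub(/\.[0-9]+$/, "", id); print $1 "\t" id }' "$1"
}

# Attach gene symbols to the original versioned IDs.
# $1 = biomaRt export (assumed: unversioned ID <TAB> gene symbol)
# $2 = lookup table produced by strip_versions
# IDs without a match are reported as NA.
map_back() {
    awk -F'\t' 'NR == FNR { sym[$1] = $2; next }
                { print $1 "\t" (($2 in sym) ? sym[$2] : "NA") }' "$1" "$2"
}

# Usage:
#   strip_versions accs.txt > lookup.tsv
#   cut -f2 lookup.tsv | sort -u > query_ids.txt   # query biomaRt with these
#   map_back biomart_result.tsv lookup.tsv > accs_with_symbols.tsv
```

This keeps the original identifiers untouched; only the copy used for the query loses its version suffix.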

9 months ago
vkkodali ★ 2.7k

You can use NCBI Datasets for this. Specifically, the datasets command-line tool accepts versioned accessions directly, as shown below:

$ cat accs.txt
XM_006719056.3
NM_001347425.2
NM_000014.6
NM_001347424.2
NM_001347423.2

$ datasets download gene accession --inputfile accs.txt --exclude-gene --exclude-protein
Downloading: ncbi_dataset.zip    12.3kB done

$ unzip ncbi_dataset.zip 
Archive:  ncbi_dataset.zip
inflating: README.md               
inflating: ncbi_dataset/data/rna.fna  
inflating: ncbi_dataset/data/data_report.jsonl  
inflating: ncbi_dataset/data/data_table.tsv  
inflating: ncbi_dataset/data/dataset_catalog.json  

$ grep -A1 '>' ncbi_dataset/data/rna.fna
>NM_001347423.2 A2M [organism=Homo sapiens] [GeneID=2] [transcript=2]
ATACAAGAGATGTGAGAAGCACCATAAAAGGCGTTGTGAGGAGTTGTGGGGGAGTGAGGGAGAGAAGAGG
--
>XM_006719056.3 A2M [organism=Homo sapiens] [GeneID=2] [transcript=X1]
ATAAAGCCCAGTTGCTTTGGGAAGTGTTTGGGACCAGATGGATTGTAGGGAGTAGGGTACAATACAGTCT
--
>NM_001347425.2 A2M [organism=Homo sapiens] [GeneID=2] [transcript=4]
GGGACCAGATGGATTGTAGGGAGTAGGGTACAATACAGTCTGTTCTCCTCCAGCTCCTTCTTTCTGCAAC
--
>NM_000014.6 A2M [organism=Homo sapiens] [GeneID=2] [transcript=1]
GGGACCAGATGGATTGTAGGGAGTAGGGTACAATACAGTCTGTTCTCCTCCAGCTCCTTCTTTCTGCAAC
--
>NM_001347424.2 A2M [organism=Homo sapiens] [GeneID=2] [transcript=3]
GGGACCAGATGGATTGTAGGGAGTAGGGTACAATACAGTCTGTTCTCCTCCAGCTCCTTCTTTCTGCAAC

The data_table.tsv and the data_report.jsonl files include additional useful information that can be parsed, if needed.
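
If only the accession-to-gene-name mapping is needed, it can be pulled straight from the FASTA headers. A minimal sketch, assuming every header follows the layout shown above (accession, then gene symbol, then bracketed fields); fasta_to_gene_table is a name made up for this example.

```shell
# Print "accession<TAB>gene symbol" for every FASTA header line.
# Assumes headers look like: >NM_000014.6 A2M [organism=...] [GeneID=...] ...
# $1 = FASTA file, e.g. ncbi_dataset/data/rna.fna
fasta_to_gene_table() {
    awk '/^>/ { print substr($1, 2) "\t" $2 }' "$1"
}

# Usage:
#   fasta_to_gene_table ncbi_dataset/data/rna.fna > acc2gene.tsv
```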


I was able to use this for individual IDs, but not with a .txt file containing a list of IDs.

Here is my error: 'Error: Internal Server Error (939B3A4AC5BE35B500003E0B2C4C4B0C.1) No valid gene identifiers - Exiting'.

And this is the format of my .txt input file:
NM_001322239.1
NM_001322240.2
NM_001322242.2

Any ideas how to resolve this?

Cheers


I just tested the original solution and it works for me. Of the three example accessions you provided above, one generates an error and the other two work.

Some of the accessions you provided (e.g. NM_001322239.1) are not currently in NCBI Gene or do not have an associated NCBI GeneID.
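
To find out which accessions in a list were not returned, the requested IDs can be compared against the headers of the downloaded rna.fna. A rough sketch, assuming a download that completed and FASTA headers that start with the versioned accession; missing_accessions is a hypothetical helper name.

```shell
# List requested accessions that have no record in the downloaded FASTA.
# $1 = accession list (one versioned ID per line)
# $2 = FASTA file from the data package (e.g. ncbi_dataset/data/rna.fna)
missing_accessions() {
    # First pass reads the FASTA and remembers every header ID;
    # second pass prints list entries that were never seen.
    awk 'NR == FNR { if (/^>/) seen[substr($1, 2)] = 1; next }
         !($1 in seen)' "$2" "$1"
}

# Usage:
#   missing_accessions accs.txt ncbi_dataset/data/rna.fna > not_found.txt
```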


I was able to get it to run using an input file, but with only two IDs. For some reason it runs into an error when I run all the IDs.


What OS are you running this on? Do you always get a "Server error"? I wonder whether the line endings are the problem (Windows/Unix difference).


I am using a Mac. I was able to get it to run using an input file, but with only two IDs. For some reason it runs into an error when I run all the IDs.


I just tried the accessions @vkkodali posted in the original answer on a Mac and did not have any issues. You may have some non-printable characters in your data. You should see only a $ at the end of each line when you run the following:

$ cat -vet acc.txt
XM_006719056.3$
NM_001347425.2$
NM_000014.6$
NM_001347424.2$
NM_001347423.2$
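
If cat -vet does show stray characters, such as ^M before the $, the list can be cleaned before passing it to datasets. A small sketch using standard tools; clean_accessions is a name made up here.

```shell
# Normalize an accession list:
#   - remove carriage returns (Windows line endings)
#   - strip leading and trailing whitespace
#   - drop empty lines
# $1 = accession list; cleaned list is written to stdout
clean_accessions() {
    tr -d '\r' < "$1" \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        | awk 'NF'
}

# Usage:
#   clean_accessions acc.txt > acc.clean.txt
#   cat -vet acc.clean.txt   # every line should now end in a bare $
```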

I tried this and all the lines look good. I don't know if it's a length issue: I am running 161323 lines, essentially the whole transcriptome. I am going to cut some lines and see if that corrects it.


That is likely the issue. Let us see if @vkkodali has any recommendations on that end.


Got it to work for 1000 IDs. I am guessing it's something to do with runtime.
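
Since 1000 IDs per request worked, one way to process a longer list is to split it into batches of that size and download each batch separately. This is only a sketch: the function name, the batches directory, and the output file names are arbitrary, the download flags are copied from the answer above, and it assumes the installed datasets version accepts --filename to name the output package.

```shell
# Download a long accession list in fixed-size batches.
# $1 = accession list, $2 = batch size (default 1000), $3 = output directory
download_in_batches() {
    input=$1
    batch_size=${2:-1000}
    outdir=${3:-batches}
    mkdir -p "$outdir"
    # split writes batch_aa, batch_ab, ... with at most batch_size lines each
    split -l "$batch_size" "$input" "$outdir/batch_"
    for f in "$outdir"/batch_*; do
        # skip packages left over from an earlier run
        case $f in *.zip) continue ;; esac
        datasets download gene accession --inputfile "$f" \
            --exclude-gene --exclude-protein --filename "$f.zip" \
            || echo "batch $f failed" >&2
    done
}

# Usage:
#   download_in_batches accs.txt 1000 batches
```

Failed batches are reported on stderr and the loop continues, so a single bad batch does not stop the rest.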


> Got it to work for 1000 IDs. I am guessing it's something to do with runtime.

I am glad you were able to get it to work with shorter lists. That may be the workaround for now. I have been able to reproduce the Internal Server Error you are seeing by using an input list of ~160k accessions. I will post here if I learn anything new.

> I am running 161323 lines, essentially the whole transcriptome.

If it is the entire transcriptome that you are interested in, why not just download the entire thing? You can use the datasets download genome command to download the entire transcriptome of a taxon in FASTA format. Performance-wise, this will be the better option if you have hundreds of thousands of accessions: download the transcriptome in FASTA format and use something like seqkit to extract only the data you are interested in.
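
The extraction step can also be sketched with awk when seqkit is not at hand. This assumes the ID list holds versioned accessions exactly as they appear at the start of each FASTA header; filter_fasta_by_id is a hypothetical name.

```shell
# Keep only the FASTA records whose ID (first word after ">") is in a list.
# $1 = ID list (one accession per line), $2 = FASTA file
filter_fasta_by_id() {
    awk 'NR == FNR { keep[$1] = 1; next }
         /^>/      { printing = (substr($1, 2) in keep) }
         printing' "$1" "$2"
}

# Usage:
#   filter_fasta_by_id accs.txt rna.fna > subset.fna
```

With seqkit installed, seqkit grep -f accs.txt rna.fna should select the same records by ID.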


That may be an option. I appreciate your help.
