Is there a workaround for using biomaRt to extract gene names for transcripts that have the version suffix in the name, for example NM_000014.6? I can run it without the decimal and that works, but ideally I would prefer not to edit my identifiers, for consistency. Alternatively, if someone can recommend a good human transcriptome in FASTA format with gene names, I would really appreciate it. Cheers
You can use the NCBI Datasets command-line tool for this, as shown below:
$ cat accs.txt
XM_006719056.3
NM_001347425.2
NM_000014.6
NM_001347424.2
NM_001347423.2
$ datasets download gene accession --inputfile accs.txt --exclude-gene --exclude-protein
Downloading: ncbi_dataset.zip 12.3kB done
$ unzip ncbi_dataset.zip
Archive: ncbi_dataset.zip
inflating: README.md
inflating: ncbi_dataset/data/rna.fna
inflating: ncbi_dataset/data/data_report.jsonl
inflating: ncbi_dataset/data/data_table.tsv
inflating: ncbi_dataset/data/dataset_catalog.json
$ grep -A1 '>' ncbi_dataset/data/rna.fna
>NM_001347423.2 A2M [organism=Homo sapiens] [GeneID=2] [transcript=2]
ATACAAGAGATGTGAGAAGCACCATAAAAGGCGTTGTGAGGAGTTGTGGGGGAGTGAGGGAGAGAAGAGG
--
>XM_006719056.3 A2M [organism=Homo sapiens] [GeneID=2] [transcript=X1]
ATAAAGCCCAGTTGCTTTGGGAAGTGTTTGGGACCAGATGGATTGTAGGGAGTAGGGTACAATACAGTCT
--
>NM_001347425.2 A2M [organism=Homo sapiens] [GeneID=2] [transcript=4]
GGGACCAGATGGATTGTAGGGAGTAGGGTACAATACAGTCTGTTCTCCTCCAGCTCCTTCTTTCTGCAAC
--
>NM_000014.6 A2M [organism=Homo sapiens] [GeneID=2] [transcript=1]
GGGACCAGATGGATTGTAGGGAGTAGGGTACAATACAGTCTGTTCTCCTCCAGCTCCTTCTTTCTGCAAC
--
>NM_001347424.2 A2M [organism=Homo sapiens] [GeneID=2] [transcript=3]
GGGACCAGATGGATTGTAGGGAGTAGGGTACAATACAGTCTGTTCTCCTCCAGCTCCTTCTTTCTGCAAC
The data_table.tsv and data_report.jsonl files include additional useful information that can be parsed, if needed.
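Since the original question was about getting gene names for versioned transcript accessions, note that the FASTA headers shown above already carry both. Here is a minimal sketch that pulls them into a two-column table. It assumes every header follows the layout in the output above (accession first, gene symbol second, then the bracketed fields); the function name is only for illustration.

```shell
# Print "accession<TAB>gene symbol" for each record in a FASTA file.
# Assumes headers look like:
#   >NM_000014.6 A2M [organism=Homo sapiens] [GeneID=2] [transcript=1]
fasta_to_gene_table() {
    awk '/^>/ { print substr($1, 2) "\t" $2 }' "$1"
}
```

For example, fasta_to_gene_table ncbi_dataset/data/rna.fna > transcript_to_gene.tsv would write one line per transcript.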
I was able to use this for individual IDs, but not with a .txt file containing a list of IDs.
Here is my error: 'Error: Internal Server Error (939B3A4AC5BE35B500003E0B2C4C4B0C.1) No valid gene identifiers - Exiting'.
And this is the format of my .txt input file:
NM_001322239.1
NM_001322240.2
NM_001322242.2
Any ideas how to resolve this?
Cheers
I just tested the original solution and it works for me. Of the three example accessions you provided above, one generates an error and the other two work.
Some of the accessions you provided (NM_001322239.1) are not currently in NCBI Gene or do not have an associated NCBI GeneID.
I was able to get it to run using an input file, but with only two IDs. For some reason it runs into an error when I run all the IDs.
What OS are you running this on? Do you always get a "Server error"? I wonder if the line endings are a problem (PC/Unix difference).
I am using a Mac. I was able to get it to run using an input file, but with only two IDs. For some reason it runs into an error when I run all the IDs.
I just tried the accessions @vkkodali posted in the original answer on a Mac and did not have any issues. You may have some non-printable characters in your data. You should only see $ at the end of each line if you try the following.
$ cat -vet acc.txt
XM_006719056.3$
NM_001347425.2$
NM_000014.6$
NM_001347424.2$
NM_001347423.2$
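If cat -vet does show something other than a bare $ at the end of a line (for example ^M$ from Windows line endings), a cleaned copy of the list can be made along these lines. This is only a sketch: it handles carriage returns, stray whitespace, and blank lines, not every possible non-printable character, and the function name is made up for this example.

```shell
# Write a cleaned accession list to stdout:
#   - tr removes carriage returns (Windows line endings)
#   - awk skips blank lines and keeps only the first whitespace-delimited
#     token, which drops leading and trailing spaces or tabs
clean_accessions() {
    tr -d '\r' < "$1" | awk 'NF { print $1 }'
}
```

Usage would be something like clean_accessions acc.txt > acc.clean.txt, followed by another cat -vet on the new file to confirm.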
I tried this and all the lines look good. I don't know if it's a length issue: I am running 161323 lines, essentially the whole transcriptome. I am going to cut some lines and see if that corrects it.
That may well be the issue. Let us see if @vkkodali has any recommendations on that end.
Got it to work for 1000 IDs; I am guessing it's something to do with runtime.
I am glad you were able to get it to work with shorter lists. That may be the workaround for now. I have been able to reproduce the Internal Server Error you are seeing by using an input list that's ~160k accessions. I will post here if I learn anything new.
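For anyone who wants to automate the shorter-lists workaround, here is a rough sketch that splits the accession file into chunks and runs one download per chunk. Assumptions: the datasets flags are the ones used earlier in this thread (other releases of the tool may name them differently), --filename is used to give each chunk its own zip, and the default chunk size of 1000 is simply the size reported to work above. The function name is hypothetical.

```shell
# Split an accession list into chunks and download each chunk separately.
#   $1 = file with one accession per line
#   $2 = lines per chunk (defaults to 1000)
# Each chunk is saved in the current directory as chunk_aa.zip, chunk_ab.zip, ...
download_in_chunks() {
    infile=$1
    chunk_size=${2:-1000}
    workdir=$(mktemp -d) || return 1
    split -l "$chunk_size" "$infile" "$workdir/chunk_"
    for chunk in "$workdir"/chunk_*; do
        [ -e "$chunk" ] || continue   # nothing to do for an empty input file
        out="$(basename "$chunk").zip"
        # Flags copied from the command earlier in this thread (assumption:
        # your installed version still accepts them).
        datasets download gene accession --inputfile "$chunk" \
            --exclude-gene --exclude-protein --filename "$out" \
            || echo "failed: $chunk" >&2
    done
}
```

Calling download_in_chunks accs.txt 1000 would then leave one zip per chunk. Failed chunks are reported on stderr rather than stopping the loop, so they can be retried later.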
I am running 161323 lines essentially the whole transcriptome.
If it is the entire transcriptome that you are interested in, why not just download the entire thing? You can use the datasets download genome command to download the entire transcriptome of a taxon in FASTA format. Performance-wise, this will be the better option if you have hundreds of thousands of accessions: download the transcriptome in FASTA format and use something like seqkit to extract only the data you are interested in.
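The extraction step could look roughly like this, assuming the transcriptome FASTA has already been downloaded and unzipped. The first function uses seqkit as suggested; the second is a plain awk stand-in for machines without seqkit. Both function names are made up for this sketch.

```shell
# Option 1: seqkit. "grep -f" reads one pattern per line from a file and,
# by default, matches it against the sequence ID (first word of the header).
#   $1 = file of accessions, $2 = FASTA file
subset_with_seqkit() {
    seqkit grep -f "$1" "$2"
}

# Option 2: the same filter in plain awk.
# While reading the first file, remember each accession; while reading the
# FASTA, switch printing on or off at every header line.
subset_with_awk() {
    awk 'FILENAME == ARGV[1] { want[$1] = 1; next }
         /^>/                { keep = (substr($1, 2) in want) }
         keep' "$1" "$2"
}
```

Either one writes the matching records to stdout, so the result can be redirected to a new FASTA file.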
Not that I am aware of, but you should be able to match the biomaRt output back to the original data via simple functions.
With regard to a FASTA transcriptome, I would recommend those provided by GENCODE: https://www.gencodegenes.org/
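On matching the biomaRt output back to the original identifiers: one simple way is to build a lookup table before querying, so the versioned names never have to be edited in place. A minimal sketch, assuming one accession per line and a version suffix of the form ".digits"; the function name is hypothetical.

```shell
# For each accession, print "versioned<TAB>unversioned".
# Lines without a ".digits" suffix are passed through with both columns equal.
strip_versions() {
    awk 'NF { base = $1; sub(/\.[0-9]+$/, "", base); print $1 "\t" base }' "$1"
}
```

The second column can be used for the biomaRt query and the results joined back on it, which keeps the original versioned identifier in the first column.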