BLAST IDs into gene names? (Diamond)
7.5 years ago
BioBing ▴ 150

Hi all,

I have a tabular .m8 file from running a DIAMOND annotation on a "reference" transcriptome for a non-model species (assembled with Trinity).

How do I get gene names from the GenBank IDs?

Example from the data sheet: gi|736186330|ref|XP_010770183.1|

The goal is to lift a differential gene expression analysis from transcript level to gene level, and to do that I would really like to use gene names rather than IDs like the one in the example.

Thank you!

The file consists of the following columns (maybe that will help in picturing what the data sheet looks like):

# qseqid means Query Seq-id
# sseqid means Subject Seq-id 
# pident means Percentage of identical matches
# length means Alignment length
# mismatch means Number of mismatches
# gapopen means Number of gap openings
# qstart means Start of alignment in query
# qend means End of alignment in query
# sstart means Start of alignment in subject
# send means End of alignment in subject
# evalue means Expect value
# bitscore means Bit score
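
For readers who want to pull the accession out of the sseqid column, here is a minimal Python sketch. It assumes the file is tab-separated with exactly the 12 columns listed above and that subject IDs look like the gi|...|ref|...| example in the question. The function names and the example row (including its Trinity-style query ID and alignment values) are made up for illustration.

```python
import csv
import io

# Column order as listed above (12-column tabular layout).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def accession_from_sseqid(sseqid):
    """Extract the accession from a pipe-delimited NCBI-style identifier.

    'gi|736186330|ref|XP_010770183.1|' -> 'XP_010770183.1'
    Identifiers without pipes are returned unchanged.
    """
    fields = [f for f in sseqid.split("|") if f]
    if len(fields) >= 4 and fields[0] == "gi":
        return fields[3]   # gi|<number>|<db>|<accession>|
    if len(fields) >= 2:
        return fields[1]   # <db>|<accession>|
    return sseqid


def read_hits(handle):
    """Yield one dict per hit, with an extra 'accession' key."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["accession"] = accession_from_sseqid(hit["sseqid"])
        yield hit


# Made-up example row; only the sseqid comes from the question.
example = ("TRINITY_DN1_c0_g1_i1\tgi|736186330|ref|XP_010770183.1|\t85.0\t200\t30\t0"
           "\t1\t600\t1\t200\t1e-50\t350\n")
hits = list(read_hits(io.StringIO(example)))
print(hits[0]["accession"])  # XP_010770183.1
```

With a real file, `read_hits(open("hits.m8"))` would be used instead of the in-memory example; the file name is hypothetical.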
R RNA-Seq Assembly gene blast • 5.1k views

You want to get the 'gene names' from your transcriptome that match the 'reference' from Ensembl? Or do I have this backwards? If you don't want to parse the BLAST result, you can parse your Trinity fasta to find the longest isoform per transcript cluster, use that as the representative of the cluster, and re-run BLAST.

Please see this for how Trinity defines Genes vs. Transcripts in the output fasta.


I am working with a non-model species with no available genome or transcriptome. I did a very deep sequencing and did a de novo assembly with Trinity. None of the available Ensembl references are close to the species I am working on.

I ran a DIAMOND annotation on the assembled Trinity transcripts and now have a list of NCBI IDs corresponding to each of the transcripts (those that the software was able to identify with the desired e-value cutoffs etc.).

What I am looking for is some sort of tool that can convert the NCBI IDs into the corresponding gene names.

For instance, convert gi|736186330|ref|XP_010770183.1| into PREDICTED: opsin-5-like for each of the transcripts (it is a long list, so there must be an easier way than doing each of them manually?)


Use NCBI eUtils. Something like:

esearch -db protein -query "XP_010770183" | efetch -db protein -format docsum -id XP_010770183 | grep Title

produces:

<Title>PREDICTED: opsin-5-like [Notothenia coriiceps]</Title>

Edit: If you have access to the BLAST+ software and the nr BLAST database, it would be easier to run:

blastdbcmd -db /path_to/nr -entry XP_010770183 -outfmt %t

This will produce:

PREDICTED: opsin-5-like [Notothenia coriiceps]


That sounds interesting! I will definitely check it out - do you know if it is able to run "bulk IDs" as well? or is it only one at a time?
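
On the bulk question: blastdbcmd also accepts a file of identifiers through its -entry_batch option, so a whole list can be looked up in one call. Below is a rough Python sketch of how that could be wired up, assuming BLAST+ is installed and a local nr database exists. The helper names and the ID file name are hypothetical, and the exact text printed for %a (for example, whether the version suffix is included) may depend on the BLAST+ version, so the parsing step is an assumption to check against real output.

```python
import subprocess
from pathlib import Path


def write_id_list(accessions, path):
    """Write unique accessions, one per line, and return them sorted."""
    unique = sorted(set(accessions))
    Path(path).write_text("\n".join(unique) + "\n")
    return unique


def blastdbcmd_batch_args(db, id_file):
    """Build the blastdbcmd argument list for a batch title lookup."""
    # %a = accession, %t = sequence title
    return ["blastdbcmd", "-db", db, "-entry_batch", str(id_file),
            "-outfmt", "%a %t"]


def fetch_titles(db, id_file):
    """Run blastdbcmd and return {accession: title}.

    Needs BLAST+ on PATH and a local database; not executed in this sketch.
    Assumes each output line is '<accession> <title>'.
    """
    result = subprocess.run(blastdbcmd_batch_args(db, id_file),
                            check=True, capture_output=True, text=True)
    titles = {}
    for line in result.stdout.splitlines():
        accession, _, title = line.partition(" ")
        titles[accession] = title
    return titles
```

Usage would be along the lines of `write_id_list(accessions, "ids.txt")` followed by `fetch_titles("/path_to/nr", "ids.txt")`, where the database path is the same placeholder as in the command above.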


You could also run the Trinotate pipeline.


I would love to use Trinotate! :-) But unfortunately, I am only a guest on the server that I am using for data analyses and have no permission to install Trinotate and its dependencies (my laptop is not able to run the analyses on its own without exploding ;-) or at least it sounds like that when I try). The reason for using DIAMOND over Trinotate is that I have spent the past months trying to get Trinotate installed on the server together with one of the bioinformaticians who run it, but things are going very (very) slowly, and time does not allow me to be patient much longer. I am aware that there are better tools, but the bottom line is that I have to use whatever is available on the server, or whatever my laptop can run, to get the job done in time.

7.5 years ago
Sej Modha 5.3k

I am not sure whether you are using the latest version of DIAMOND, but it has the following option to include the subject title in the tabular DIAMOND output. DIAMOND version checked: diamond v0.8.20.82

stitle means Subject Title
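
Once stitle is in the output, the titles can be collapsed to one per Trinity gene. The Python sketch below assumes DIAMOND was run with the 12 standard columns plus stitle as a 13th (in recent versions this is requested with something like `--outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle`; check `diamond help` for your version). It also assumes Trinity-style transcript IDs ending in `_i<number>`, and it keeps the highest-bitscore hit per gene, which is only one possible rule. The function names and example rows are made up; only the opsin title comes from the thread.

```python
import csv
import io
import re

COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore", "stitle"]


def trinity_gene(transcript_id):
    """Drop the trailing isoform suffix: '..._g1_i2' -> '..._g1'."""
    return re.sub(r"_i\d+$", "", transcript_id)


def best_title_per_gene(handle):
    """Return {trinity_gene: stitle of the highest-bitscore hit}."""
    best = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        gene = trinity_gene(hit["qseqid"])
        score = float(hit["bitscore"])
        if gene not in best or score > best[gene][0]:
            best[gene] = (score, hit["stitle"])
    return {gene: title for gene, (score, title) in best.items()}


# Made-up rows: two isoforms of the same Trinity gene with different hits.
example = (
    "TRINITY_DN1_c0_g1_i1\tgi|736186330|ref|XP_010770183.1|\t85.0\t200\t30\t0"
    "\t1\t600\t1\t200\t1e-50\t350\tPREDICTED: opsin-5-like [Notothenia coriiceps]\n"
    "TRINITY_DN1_c0_g1_i2\tsubject_2\t60.0\t100\t40\t0"
    "\t1\t300\t1\t100\t1e-10\t90\thypothetical protein [Example species]\n"
)
print(best_title_per_gene(io.StringIO(example)))
```

The resulting dictionary can then be joined to a gene-level count table on the Trinity gene ID.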

Thank you! That option is also available for "my" version! This is just what I needed!


You can accept my answer as a correct answer then!


I think I already did that?
