Question: From Sequence To Gene Id
7.9 years ago by
Quebec, Canada
Eric Normandeau10k wrote:

Hi,

I have around 10000 fish EST sequences in a fasta file and want to have an Entrez gene ID for as many as possible of these sequences. The reason I want Entrez gene IDs is to facilitate gene ontology searches and analyses.

My usual approach is to BLAST the sequences against the Swiss-Prot and nr databases, retrieve the hit identifiers, and convert them into Entrez gene IDs. However, using different tools (DAVID, UniProt ID conversion...), I typically recover gene IDs for only a small percentage of the sequences.

How could I go efficiently from the EST sequences to Entrez gene IDs?

My goal is to automate the process and get as many gene IDs as possible for my gene ontology analyses. Alternatively, if you know of a way to get another, equally useful gene identifier that integrates well with gene ontology tools, I am also interested.

Thanks!

ADD COMMENTlink modified 7.9 years ago by Larry_Parnell16k • written 7.9 years ago by Eric Normandeau10k

Running BLAST on the NCBI website does return a hyperlink to the EntrezGene entry. Perhaps you can also see this when you run BLAST locally? That hyperlink could be parsed to pull out the ID you want.

ADD REPLYlink written 7.9 years ago by Larry_Parnell16k

Hi @Larry. I'll dig into my blastx documentation to see if I can get this link in my output. This may be the quickest way of doing it. Thanks!

ADD REPLYlink written 7.9 years ago by Eric Normandeau10k

@Larry. If you care to add your comment as an answer, I will credit you the answer. I just used another output format for blastx and, of course, there is all the info I need. Thank you!

ADD REPLYlink written 7.9 years ago by Eric Normandeau10k

Hello Eric,

Could you please help me? I am doing similar work to yours: I have more than 10000 peptide sequences, I run a local blastp against the nr database, and I plan to get the Entrez gene IDs. However, using the option -outfmt '6 qseqid qgi sseqid pident evalue', I do not see the Entrez gene ID in the output. The output looks like this:

Bv1_000310_ofuz XP_010669526.1 100.000 106 0 0 1 106 1 106 3.10e-69 215
Bv1_000320_hyix KMT20011.1 100.000 346 0 0 1 346 1 346 0.0 718

My command line is:

blastp -query beta_extracted_fpkm_peps.txt -db /home/clingyun/CHEN/database/peps_13species_06152017.txt -num_threads 4 -num_descriptions 1 -num_alignments 1 -outfmt '6 qseqid qgi sseqid pident evalue' -out /home/clingyun/CHEN/beta_rnaseq/blastp_12species/beta_extracted_fpkm_peps_blastp_13species

On the web page for "XP_010669526.1" I can see the hyperlink that includes the gene ID, but I do not see it in the local blastp output.

"XP_010669526.1" is an NCBI Reference Sequence ID. I have tried to convert more than 100 of these IDs to Entrez gene IDs using DAVID, but none of them were converted. Could you please help me figure out how I can get the Entrez gene IDs locally? Thanks

Best wishes.

Chen Lingyun

ADD REPLYlink written 2.2 years ago by clingyun0
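
One local route, offered here as an assumption rather than something confirmed in this thread, is to look up each protein accession in a tab-delimited accession-to-gene table such as NCBI's gene2accession file. The sketch below assumes the table has a header row with columns named `protein_accession.version` and `GeneID`; check the header of the file you actually download. The accessions and gene IDs in the example table are placeholders, not real records.

```python
import csv
import io


def load_accession_to_gene(handle,
                           acc_col="protein_accession.version",
                           gene_col="GeneID"):
    """Build {protein accession: gene ID} from a tab-delimited table.

    The column names are assumptions about the table layout; they are
    looked up in the header row instead of being hard-coded as positions.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = [name.lstrip("#") for name in next(reader)]
    acc_idx = header.index(acc_col)
    gene_idx = header.index(gene_col)
    mapping = {}
    for row in reader:
        if len(row) <= max(acc_idx, gene_idx):
            continue  # skip short or malformed rows
        accession = row[acc_idx]
        if accession != "-":  # "-" is treated as "no protein accession"
            mapping[accession] = row[gene_idx]
    return mapping


# Placeholder table: these accessions and gene IDs are invented.
table = io.StringIO(
    "#tax_id\tGeneID\tstatus\tprotein_accession.version\n"
    "9999\t111\tMODEL\tXP_000000001.1\n"
    "9999\t222\tMODEL\t-\n"
    "9999\t333\tMODEL\tXP_000000003.1\n"
)
mapping = load_accession_to_gene(table)

# Accessions as they would appear in the sseqid column of the BLAST output.
hits = ["XP_000000001.1", "XP_000000003.1", "KMT00000.1"]
converted = {acc: mapping.get(acc) for acc in hits}
print(converted)
```

For a real, compressed table, the handle could come from `gzip.open(path, "rt")`. Accessions missing from the table map to `None`, so they are easy to count and inspect afterwards.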
7.9 years ago by
Larry_Parnell16k
Boston, MA USA
Larry_Parnell16k wrote:

OK, at Eric's suggestion, here goes:

Running BLAST on the NCBI website does return a hyperlink to the EntrezGene entry. Perhaps you can also see this when you run BLAST locally? That hyperlink could be parsed to pull out the ID you want. An example BLASTX output for a 180-bp query (where I altered two nucleotides) is below. Note the "GENE ID" field. This has the following hyperlink: http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=80303&RID=8VAWAEY2015&log$=geneexplicitprot&blast_rank=1 and from this one could parse the EntrezGene ID at the "term=" part.

ref|NP_079478.1| EF-hand domain-containing protein D1 isoform 1 [Homo sapiens] Length=239

GENE ID: 80303 EFHD1 | EF-hand domain family, member D1 [Homo sapiens] (10 or fewer PubMed links)

Score = 52.4 bits (124), Expect = 6e-10
Identities = 58/59 (98%), Positives = 58/59 (98%), Gaps = 0/59 (0%)
Frame = +2

Query  2    IKDLESMFKLYDVGRDGFIDlmelklmmeklGAPQTHLGLKSMIKEVDEDFDGKLSFRE  178
            IKDLESMFKLYD GRDGFIDLMELKLMMEKLGAPQTHLGLKSMIKEVDEDFDGKLSFRE
Sbjct  92   IKDLESMFKLYDAGRDGFIDLMELKLMMEKLGAPQTHLGLKSMIKEVDEDFDGKLSFRE  150
ADD COMMENTlink written 7.9 years ago by Larry_Parnell16k
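
The parsing step suggested in this answer could look roughly like the sketch below. It pulls the ID from the `term=` query parameter of the hyperlink and, as a second option, from any "GENE ID:" line in a saved text report. The function names are made up for illustration, and the regular expression assumes the "GENE ID: <number>" layout shown in the example output above.

```python
import re
from urllib.parse import urlparse, parse_qs

# Matches lines such as "GENE ID: 80303 EFHD1 | ..."
GENE_ID_LINE = re.compile(r"GENE ID:\s*(\d+)")


def gene_id_from_url(url):
    """Return the value of the 'term' query parameter, or None if absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("term")
    return values[0] if values else None


def gene_ids_from_report(text):
    """Return every ID found on 'GENE ID: <number>' lines of a text report."""
    return GENE_ID_LINE.findall(text)


url = ("http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=80303"
       "&RID=8VAWAEY2015&log$=geneexplicitprot&blast_rank=1")
report = "GENE ID: 80303 EFHD1 | EF-hand domain family, member D1 [Homo sapiens]"

print(gene_id_from_url(url))         # 80303
print(gene_ids_from_report(report))  # ['80303']
```

Both functions return the ID as a string, which avoids surprises if the IDs are later used as dictionary keys or joined against text tables.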

@Larry Thanks. I used blastx locally including the following option: -outfmt '6 qseqid qgi sseqid pident evalue' to get the information that I needed.

ADD REPLYlink written 7.9 years ago by Eric Normandeau10k
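
For anyone automating this step, here is a minimal sketch of reading the tab-delimited output produced with that -outfmt string and keeping the best hit (lowest e-value) per query. It assumes the columns appear in the order given in the option (qseqid, qgi, sseqid, pident, evalue); the function name and the example rows are invented for illustration.

```python
import csv
import io

# Column order assumed to match: -outfmt '6 qseqid qgi sseqid pident evalue'
FIELDS = ["qseqid", "qgi", "sseqid", "pident", "evalue"]


def best_hits(handle, fields=FIELDS):
    """Return {query ID: row dict}, keeping the lowest e-value hit per query."""
    best = {}
    for values in csv.reader(handle, delimiter="\t"):
        if len(values) < len(fields) or values[0].startswith("#"):
            continue  # skip blank, short, or comment lines
        row = dict(zip(fields, values))
        row["pident"] = float(row["pident"])
        row["evalue"] = float(row["evalue"])
        current = best.get(row["qseqid"])
        if current is None or row["evalue"] < current["evalue"]:
            best[row["qseqid"]] = row
    return best


# Invented rows standing in for a real blastx output file.
example = io.StringIO(
    "est_0001\t0\tref|NP_000001.1|\t98.3\t6e-10\n"
    "est_0001\t0\tref|NP_000002.1|\t75.0\t2e-04\n"
    "est_0002\t0\tref|NP_000003.1|\t88.1\t1e-30\n"
)
hits = best_hits(example)
for query, row in hits.items():
    print(query, row["sseqid"], row["evalue"])
```

With a real file, `open(path, newline="")` can replace the `io.StringIO` object. The subject IDs collected this way are what would then be converted to gene IDs.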
7.9 years ago by
Casey Bergman18k
Athens, GA, USA
Casey Bergman18k wrote:

This sounds like a job for blast2go: http://www.blast2go.org/

ADD COMMENTlink written 7.9 years ago by Casey Bergman18k

Hi @Casey. I am already using blast2go for gene ontology analyses. However, I am trying a new R package called WGCNA (http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/Rpackages/WGCNA/), and I need gene IDs. They suggest Entrez gene IDs, and I don't know how to get these from blast2go. Is it possible? It would integrate pretty well with my current pipeline if it is :) Cheers

ADD REPLYlink written 7.9 years ago by Eric Normandeau10k