Output File Format "6" - Blastplus - Gene Name ?
3
2
12.1 years ago

I am working on a second design for a sequence capture array (similar to a microarray, but with genomic DNA), and I want to BLAST a few sequences that seem problematic in my design. The sequences are various ESTs or cDNA sequences that will help me design the probes. I am using BLAST+ (blastplus) to find the names of the genes corresponding to these sequences.

When I use BLAST+ with the nr database and output format "6", I get these fields, separated by tabs:

Query name, Gene accession number, % of identity, Length, Mismatch, etc...


I would like to have the same format, but with the names of the genes, which are more useful to me than the accession numbers.

Is it possible to modify format 6 to include this information?

Thank you very much... sometimes it's hard to be a beginner in the world of bioinformatics!

0

Is it really an NCBI gene ID, or is it the accession number of the EST/cDNA?

0

Yeah, I was looking to get the NCBI gene ID, but the only info I had was the accession number. I do have a homemade Python script that can help me with that: it parses the info I need and writes it to a new file. I didn't write the script myself, though, and I wanted to find another solution, which Marina gave me further down. Thanks!

4
12.1 years ago
User 59 13k

Blast will only return the information it knows about from the query and the database you're querying against. The output formats are pretty immutable as far as I know.

Not every entry in nr is going to have a gene NAME, nor should it, so why would you expect BLAST to return it? It is sensible to return the accession your query matches against, so that you can look it up and do further analysis on it. It's the annotations attached to the accessions that may give you a clue to the identity of the gene.
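A note for readers on newer BLAST+ releases: as far as I'm aware, the tabular formats there accept a list of column specifiers after the format number, and `stitle` (the subject title, i.e. the description line of the hit) is one of them. Whether your installed version supports it is worth checking in the program's `-help` output. Below is a minimal Python sketch that only builds and prints such a command without running it. The function name `build_blast_command`, the file names `problem_seqs.fasta` and `hits.tsv`, and the choice of `blastx` (nucleotide queries against the protein nr database) are assumptions for illustration.

```python
import shlex

# Columns for the tabular report: some of the standard ones plus
# "stitle", the description line of the matched database entry.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch",
           "evalue", "bitscore", "stitle"]


def build_blast_command(program, query, database, out_path):
    """Return the argument list for a tabular BLAST+ run (nothing is executed)."""
    outfmt = "6 " + " ".join(COLUMNS)
    return [
        program,
        "-query", query,
        "-db", database,
        "-outfmt", outfmt,
        "-out", out_path,
    ]


# Hypothetical file names; adjust program and database to your own data.
cmd = build_blast_command("blastx", "problem_seqs.fasta", "nr", "hits.tsv")
print(shlex.join(cmd))  # shell-quoted form you could paste into a terminal
```

The title column is free text taken from the database, so it will only contain a gene name when the database entry's description happens to include one.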

2
12.1 years ago
Marina Manrique ★ 1.3k

I think that with the XML output (the -outfmt 5 option, I believe), the tag called Hit_def contains the whole FASTA header of the sequence your query hit.

I'm not completely sure, but you could try. I blasted a sequence against UniProt and got a hit to the protein A8MSE8, whose FASTA header is

>tr|A8MSE8|A8MSE8_ARATH Elongation factor 1-alpha OS=Arabidopsis thaliana GN=At5g60390 PE=3 SV=1


and I got this in the XML Hit_def tag:

tr|A8MSE8|A8MSE8_ARATH Elongation factor 1-alpha OS=Arabidopsis thaliana GN=At5g60390 PE=3 SV=1


So if the gene name is in the FASTA header, you should be able to get it.
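A minimal Python sketch of pulling that information out of the XML report, using only the standard library. It assumes the usual BLAST XML layout (`Iteration`, `Hit`, `Hit_def`, `Hit_accession`) and a UniProt-style `GN=` field in the header; the embedded sample is a hand-trimmed stand-in rather than real BLAST output, and the query name `my_est_001` and the function name `hits_with_descriptions` are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hand-trimmed stand-in for a "-outfmt 5" report (not real BLAST output).
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>my_est_001</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_def>tr|A8MSE8|A8MSE8_ARATH Elongation factor 1-alpha OS=Arabidopsis thaliana GN=At5g60390 PE=3 SV=1</Hit_def>
          <Hit_accession>A8MSE8</Hit_accession>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def hits_with_descriptions(xml_text):
    """Return (query, accession, gene name, full header) for every hit."""
    root = ET.fromstring(xml_text)
    rows = []
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def", default="")
        for hit in iteration.iter("Hit"):
            hit_def = hit.findtext("Hit_def", default="")
            accession = hit.findtext("Hit_accession", default="")
            # UniProt-style headers carry the gene name as "GN=...";
            # other databases may not, in which case this stays empty.
            match = re.search(r"\bGN=(\S+)", hit_def)
            gene = match.group(1) if match else ""
            rows.append((query, accession, gene, hit_def))
    return rows


# Print one tab-separated line per hit.
for row in hits_with_descriptions(SAMPLE_XML):
    print("\t".join(row))
```

For a real report you would read the file contents instead of `SAMPLE_XML`. Headers from nr will not follow the `GN=` convention, so the regular expression would need adapting to whatever your database's headers look like.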

HTH,
Marina

0

Thanks! I think I can work with that!

1
12.1 years ago
Darked89 4.2k

As far as I know, BLAST does not know which part of your query FASTA header is the gene ID, the clone ID, or whatever. It simply keeps the first field and truncates the rest. So you have the choice of reformatting your query FASTA (not recommended unless you are not getting unique IDs in your BLAST output) or combining the information you already have in your FASTA headers with the tabular BLAST output, with a bit of scripting.

The problem is that FASTA headers are far from consistent and may not even contain a gene ID (e.g. for ESTs), so sticking with accessions is much safer.
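The "bit of scripting" could look roughly like this minimal Python sketch: index the full FASTA headers by their first whitespace-delimited field, then append the matching header to each line of the tabular output. The sequence names, header text, and BLAST rows below are invented for illustration, and the function names `read_fasta_headers` and `annotate_tabular` are hypothetical.

```python
import io

# Invented example data: a query FASTA and two tabular (format 6) rows.
FASTA_TEXT = (
    ">est_001 hypothetical clone ABC, leaf cDNA library\n"
    "ACGTACGT\n"
    ">est_002\n"
    "GGCCGGCC\n"
)
BLAST_TEXT = (
    "est_001\tsubj_A\t98.50\t200\t3\t0\t1\t200\t11\t210\t1e-100\t360\n"
    "est_002\tsubj_B\t91.00\t150\t13\t1\t5\t154\t1\t150\t2e-50\t190\n"
)


def read_fasta_headers(handle):
    """Map the first field of each FASTA header to the full header line."""
    headers = {}
    for line in handle:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                seq_id = header.split(None, 1)[0]
                headers[seq_id] = header
    return headers


def annotate_tabular(blast_handle, headers, id_column=0):
    """Yield each tabular row with the full FASTA header appended."""
    for line in blast_handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        yield fields + [headers.get(fields[id_column], "")]


headers = read_fasta_headers(io.StringIO(FASTA_TEXT))
for row in annotate_tabular(io.StringIO(BLAST_TEXT), headers):
    print("\t".join(row))
```

With real files you would pass open file handles instead of the `io.StringIO` objects. Setting `id_column=1` and indexing the database FASTA instead would annotate the subject side, provided the IDs in the tabular output match the first header field exactly.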

1

Are you sure BLAST truncates the FASTA header no matter which output format you choose? I think the XML format includes the whole FASTA header, but I'm not sure.

1

I have just checked with blastn from BLAST+ 2.2.24. The XML output (-outfmt 5) does indeed keep the untruncated FASTA header (at least 200 characters, with spaces in it). But this does not work in any of the tabular formats, where the name gets truncated at the first whitespace.

0

It's good to know :-)