Question: Recover fasta sequences in database from the blast output 6
1
gravatar for Darill
3 months ago by
Darill30
Darill30 wrote:

Hello everyone, I made a blast research and I got the output in the table format 6.

So I got several hits and I would like to retrieve each fasta sequences from the blast database with the start and end corresponding of the hit.

I saw that that the command blastdbcmd should be able do to the trick but I cannot find where you put the blast output to indicate where to take the sequence in the database (with start and end informations).

Thank you for your help.

output blast fasta • 259 views
ADD COMMENTlink modified 3 months ago • written 3 months ago by Darill30

I don't think you can do exactly what you want to do with blastdbcmd, as I don't think takes blast output specifically. According to the docs, it looks like you can specify coordinates based on the identifiers though:

Extract different sequence ranges from the BLAST databases

The command below will extract two different sequences: bases 40-80 in human chromosome Y (GI 13626247) with the masked regions in lowercase characters (notice argument 30, the masking algorithm ID which is available in this BLAST database) and bases 1-10 in the minus strand of human chromosome 20 (GI 14772189).

$ printf "%s %s %s %s\n%s %s %s\n" 13626247 40-80 plus 30 14772189 1-10 minus \
| blastdbcmd -db GPIPE/9606/current/all_contig -entry_batch -
>gi|13626247|ref|NT_025975.2|:40-80 Homo sapiens chromosome Y genomic contig, GRCh37.p10 Primary Assembly
tgcattccattctattctcttctACTGCATACAatttcact
>gi|14772189|ref|NT_025215.4|:c10-1 Homo sapiens chromosome 20 genomic contig, GRCh37.p10 Primary Assembly
GCTCTAGATC
$

I'm rusty when it comes to blastdbcmd but I think broadly what you'll need to do is: - Convert blast output to just a list of IDs of interest. - Pass this file with coordinates in a manner that matches the IDs of the sequences you created the database from, and print each sequence accordingly.

For instance, in the above, you're replace the printf command with a manipulation of the columnar data from the blastoutput (see cut/awk etc).

ADD REPLYlink modified 3 months ago • written 3 months ago by jrj.healey12k

Hi everyone, I have some news concerning the idea you proposed me. The issue is in term on time because if I write a command line for each hit blast, it is about 310 000 commande lines to execute in order to recover the fasta sequences in the database corresponding to the range start and end.

So I have my file with the 310 000 blastdbcmd -db .... to execute by doing for line in read -r file_script.txt line ; do command "$line"; done and it takes a lot a time, at least 4 hours to execute all the blastdbcmd commande in the file.

So I was wondering if there where not another faster way to recover a fasta sequences in database with the range start and end ?

ADD REPLYlink modified 3 months ago • written 3 months ago by Darill30
1

Please don’t add answers unless you are actually answering the top level post.

As for your speed issue, retrieving that many hits from a blast database is never going to be especially fast, but you should be able to parallelise the process usingGNU parallel instead of a loop.

ADD REPLYlink written 3 months ago by jrj.healey12k
1

You could use a RAM disk or something (if you have plenty of RAM available) and read the database into memory. That would be about the fastest way to do this using blast.

You could also look into using pyfaidx or samtools faidx (e.g. Extract User Defined Region From An Fasta File ) with intervals to get the sequence from original database file. This would likely be the fastest way.

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax65k
1
gravatar for genomax
3 months ago by
genomax65k
United States
genomax65k wrote:

Assuming your blast 6 format matches this. You will need to cut column 2, 9, 10 and then use something like blastdbcmd -db database -entry ID -range start-stop -strand (if needed) -outfmt %f for a single ID. ID will be in column 2. Start and stop values will come from columns 9 and 10. You can use cut or awk to get the values you need. You will have to wrap blastbdcmd command in a loop so you can pass in ID, Star and Stop values for each call.

Note: This may result in duplicates (if two sequences hit the same ID and range) unless you take care of removing redundancy before you run blastdbcmd.

If the database is from NCBI you could also use EntrezDirect to get this info as well.

Edit: ID comes from column 2. Changed.

ADD COMMENTlink modified 3 months ago • written 3 months ago by genomax65k

Hi, thanks for your answer. Since the ID of my database are in the column 2 should I not cut the column 2 instead and do :

awk '{print $2 $9 $10}' blast_lh_all.out > blast_lh_all_new.out

With in the blast_lh_all_new.out in the column 1 the ID of the hit blast in my database and une the 2 and 3 columns the start and end.

I guess now this commande:

blastdbcmd -db /path/makedb_genome -entry blast_lh_all_new.out -range start-stop -outfmt test.fst

Allows me to "parse" my database used for blast and take all sequences corresponding to the ID from start to end?

But I had 2 questions: 1- Have I to put the -db in fasta format or in makeblastdb format ? 2- When I run the script above I get the following error:

Error: NCBI C++ Exception:
    T0 "/build/ncbi-blast+-baP3Js/ncbi-blast+-2.2.30/c++/src/corelib/ncbistr.cpp", line 705: Error: ncbi::NStr::StringToInt8() - Cannot convert string 'start' to Int8 (m_Pos = 0)
ADD REPLYlink modified 3 months ago • written 3 months ago by Darill30

Thanks for catching the column 2 error. -entry needs to be just the ID from column 2, not the entire file. -db is the same database you searched against. You can use xargs to pass the output from your awk command to blastdbcmd. See the note on duplicates.

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax65k

Ok but if in the -entry there is only the ID, where is the file blast_lh_all_new.out where are available the coordinates "end and start"

ADD REPLYlink written 3 months ago by Darill30

That is why I said that you will need to do some something like (test and let us know)

awk '{OFS="\t"}{print $2,$9,$10}' blast_lh_all.out | xargs -n 3 sh -c 'blastdbcmd -db database -entry "$0" -range "$1"-"$2" -outfmt %f'
ADD REPLYlink modified 3 months ago • written 3 months ago by genomax65k

I get error messages such as:

IDBA_scaffold_10046632436: 1: IDBA_scaffold_10046632436: Syntax error: Unterminated quoted string
IDBA_scaffold_2263239172667: 1: IDBA_scaffold_2263239172667: Syntax error: Unterminated quoted string
IDBA_scaffold_2469820637: 1: IDBA_scaffold_2469820637: Syntax error: Unterminated quoted string
IDBA_scaffold_497971217288: 1: IDBA_scaffold_497971217288: Syntax error: Unterminated quoted string
IDBA_scaffold_91814141500: 1: IDBA_scaffold_91814141500: Syntax error: Unterminated quoted string
IDBA_scaffold_12094127310: 1: IDBA_scaffold_12094127310: Syntax error: Unterminated quoted string
IDBA_scaffold_2186214351956: 1: IDBA_scaffold_2186214351956: Syntax error: Unterminated quoted string
IDBA_scaffold_796657856366: 1: IDBA_scaffold_796657856366: Syntax error: Unterminated quoted string

etc

ADD REPLYlink modified 3 months ago • written 3 months ago by Darill30

I'm not sure to understand what you mean by duplicate, here is an exemple of the blast output :

YP_006347591.1 IDBA_scaffold_19039 30.319 188 119 6 1772 1953 1083 1628 8.97e-24 80.5

YP_006347591.1  IDBA_scaffold_19039 37.500  88  54  1   1652    1738    712 975 8.97e-24    54.7
YP_006347732.1  IDBA_scaffold_19039 30.319  188 119 6   312 493 1083    1628    1.79e-24    80.5
YP_006347732.1  IDBA_scaffold_19039 37.500  88  54  1   192 278 712 975 1.79e-24    55.5
YP_006347732.1  IDBA_scaffold_10061 27.907  129 79  5   319 438 876 505 0.004   44.3
YP_006383549.1  scaffold_816    56.897  58  25  0   86  143 75036   74863   5.78e-16    76.6
YP_006383549.1  scaffold_816    50.877  57  28  0   33  89  76042   75872   1.59e-09    58.2
YP_006383565.1  IDBA_scaffold_120202    42.857  56  28  3   21  75  272 114 0.003   39.3
YP_006383565.1  IDBA_scaffold_106955    38.182  55  32  1   21  75  186 28  0.007   38.5
YP_006383578.1  IDBA_scaffold_2397  39.474  76  34  2   112 187 1192    1383    3.45e-05    50.8
YP_006383578.1  IDBA_scaffold_427   30.435  115 60  5   92  194 69542   69222   0.001   45.4
YP_009238535.1  scaffold_650    36.462  277 158 6   211 474 1779    964 2.47e-36    153
YP_009238535.1  scaffold_650    35.593  59  38  0   16  74  2208    2032    9.70e-07    56.6

So If I well understood I have ton isolate the column 2 to create a ID_file.txt -entrey ID_file.txt

and then for -db Path/makedb_genome (containing the 3 blast extensions)

but I do not understand the -range option

ADD REPLYlink written 3 months ago by Darill30

I thought you only wanted to extract the exact range of nucleotides where the hit occured on the target sequence.

I would like to retrieve each fasta sequences from the blast database with the start and end corresponding of the hit.

If you don't want to then just ID is all you need (which can be provided as a file with one ID per line). You will get the entire sequence for target ID.

Take a look at blastdbcmd -help to get detailed info on the parameters.

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax65k

I edited the command above (there was an error in it). When I run

 awk '{OFS="\t"}{print $2,$9,$10}' blast_lh_all.out | xargs -n 3 sh -c 'blastdbcmd -db database -entry "$0" -range "$1"-"$2" -outfmt %f'

I get this.

blastdbcmd -db database -entry IDBA_scaffold_19039 -range 712-975 -outfmt %f
blastdbcmd -db database -entry IDBA_scaffold_19039 -range 1083-1628 -outfmt %f
blastdbcmd -db database -entry IDBA_scaffold_19039 -range 712-975 -outfmt %f
blastdbcmd -db database -entry IDBA_scaffold_10061 -range 876-505 -outfmt %f
blastdbcmd -db database -entry scaffold_816 -range 75036-74863 -outfmt %f
blastdbcmd -db database -entry scaffold_816 -range 76042-75872 -outfmt %f
blastdbcmd -db database -entry IDBA_scaffold_120202 -range 272-114 -outfmt %f
blastdbcmd -db database -entry IDBA_scaffold_106955 -range 186-28 -outfmt %f
blastdbcmd -db database -entry IDBA_scaffold_2397 -range 1192-1383 -outfmt %f
blastdbcmd -db database -entry IDBA_scaffold_427 -range 69542-69222 -outfmt %f
blastdbcmd -db database -entry scaffold_650 -range 1779-964 -outfmt %f
blastdbcmd -db database -entry scaffold_650 -range 2208-2032 -outfmt %f

NOTE: In cases where the start > end you may need to include strand directive, if you want the sequence on right strand.

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax65k

Ok it works now but in cases where the start is larger than stop I guet a message error event if I use the -strand plus option

And when the start is lower than stop I guet:

blastdbcmd -db path/db_lh/makedb_lh -entry IDBA_scaffold_91814 -range 141-500 -outfmt %f

the message:

Error: IDBA_scaffold_91814: OID not found
Error: IDBA_scaffold_91814: OID not found
BLAST query/options error: Entry or entries not found in BLAST database
Please refer to the BLAST+ user manual.
ADD REPLYlink modified 3 months ago • written 3 months ago by Darrill0

Looks like IDBA_scaffold_91814 is not found in your database.

ADD REPLYlink written 3 months ago by genomax65k

And yet I checked and it does exist :

>IDBA_scaffold_91814
CGACAAAAACCTTATCCTGCATCT.....

I'v yet put in -db the makeblasdb format with the .nhr, .nin and .nsq format I used to make the blast.

ADD REPLYlink written 3 months ago by Darrill0
1

Check that your blastdb has respected the names as you have them in your input file:

blastdbcmd -db database -entry all -outfmt "%t"

Did you use -parse_seqids with your database construction command?

ADD REPLYlink modified 3 months ago • written 3 months ago by jrj.healey12k

Yes it was the issue thank you it works now. Just one last thing, because I creat several commande lines such as:

blastdbcmd -db path/makedb_lh2 -entry IDBA_scaffold_19039 -range 1083-1628 -out file.fst
blastdbcmd -db /path/makedb_lh2 -entry IDBA_scaffold_19039 -range 712-975 -out file.fst
blastdbcmd -db path/makedb_lh2 -entry IDBA_scaffold_19039 -range 1083-1628 -out file.fst
etc

each time that I'll run the blastdbcmd, the last blastdbcmd will overwrite on the file.fst instead to add the fasta sequence. Have you a solution to overcome this issue?

ADD REPLYlink written 3 months ago by Darrill0
1

I am not sure if using >>out.fst will cause the data to be appended to that file. If that does not work then use the ID vatiable, which is $0 in your command line for -out "$0".fst to create a new file for each ID.

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax65k

yes the >> method works thank you very much for your help. Now there is juste the start and end size issue

When start is higher than end it says:

BLAST engine error: Failed to parse sequence range (start cannot be larger than stop)

When I do:

blastdbcmd -db path/db_lh/makedb_lh2 -entry scaffold_650 -range 6386-5643 -strand plus >> file.fst
ADD REPLYlink modified 3 months ago • written 3 months ago by Darrill0

Read the error. Your start coordinate is larger than your stop coordinate, and you’ve specified the forward/plus strand. This is impossible.

ADD REPLYlink modified 3 months ago • written 3 months ago by jrj.healey12k

with strand minus there is the same error.

ADD REPLYlink written 3 months ago by Darrill0

You will need to reverse the coordinates for hits on the reverse strand I believe. The strand argument to blastdbcmd let’s it know whether to reverse complement the sequence it returns (I think), but it stores sequences on the assumption they’re 5’->3’ forward stranded.

Someone may need to correct me on this.

Try reversing the coordinates, specify strand minus, and sanity check that you get back the sequence you expect.

ADD REPLYlink written 3 months ago by jrj.healey12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2014 users visited in the last hour