How To Get Sequence Around Blast Result?
1
5
Entering edit mode
9.7 years ago
dustar1986 ▴ 380

Hi,

I blasted a serires of inquiry sequences against the a pre-built blast-database file using blastall and the xml result was parsed by biopython.

I got the genomic coordinates information of the inquiry sequences.

Now I also want to know the 500bp sequences upstream and downstream of each inquiry sequences.

I know biopython can achieve this by extracting sequences from fasta file of each chromosome or from online database.

But I don't want to do this because I don't have enough spaces to keep fasta file of each chromosome locally. And online fetching is too slow when the inquiry file is large.

Is that possible to fetch such sequences using genomic coordinates from the pre-built blast-database file (So I can just keep this database file on the disk for each species)?

biopython • 7.1k views
21
Entering edit mode
9.7 years ago

Of course, it's very easy using BLAST+, so install it right away!

First, make sure you use the -parse_seqids parameter while creating blast database:

makeblastdb -in tair10.fa -dbtype prot -parse_seqids


Then, use blastdbcmd to fetch sequences with a specific range.

blastdbcmd -db tair10.fa -dbtype prot -entry AT1G50920.1 -range 1-10


The output is.

>AT1G50920.1:1-10 | Symbols:  | Nucleolar GTP-binding protein | chr1:18870555-18872570 FORWARD LENGTH=671
MVQYNFKRIT

1
Entering edit mode

Thanks Zielezinski. This is really helpful.

0
Entering edit mode

0
Entering edit mode

What if we have a query with 100 sequences? Manual parsing may not be feasible.

1
Entering edit mode

If you have a query with multiple accession numbers, you provide them in a text file (e.g., query.txt):

AT1G50920.1
AT1G50930.1
AT1G50940.1
AT1G50950.1


and you run the command:

blastdbcmd -db tair10.fa -dbtype prot -entry_batch query.txt


You will get FASTA sequences for the accession numbers from the query.txt file. However, when you use -batch_entry argument, you can't use the -range argument.

0
Entering edit mode

Thank you so much. What if we want to parse 50 nt flanks with the aligned sequence?