Question

Sequence extract from multifasta using blastall coordinates

0

Entering edit mode

6.0 years ago

chland • 0

Hi everyone.

This question has been asked in various forms before, however am new to the field and trying to learn and the answers posted don't seem to be working. Sorry, I need to specify that I'm trying to extract the sequence corresponding to the Subject start and subject end coordinates for each hit from a multifasta file of concatenated sequences. Thanks for the useful command though, I will make note of that for other purposes.

I ran blastall search that returned a file of hits that look like this with up to 500 hits ( -b 500) to the database of concatenated fasta files. I'm looking for a method to extract the sequence using the s start and s end for each hit along with the corresponding subject ID from the database of concatenated sequences.
Thanking you in advance. Chandly

# BLASTN 2.2.26 [Sep-21-2011]
# Query: querysequence.fasta
# Database: blast_db.fna
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
NCTC11168   NZ_FPEE01000064.1   100.00  960 0   0   1   960 16113   15154   0.0 1861
NCTC11168   NZ_PSND01000007.1   100.00  960 0   0   1   960 49840   50799   0.0 1861
NCTC11168   NZ_PIBG01000017.1   100.00  960 0   0   1   960 12161   13120   0.0 1861
NCTC11168   NZ_PHZN01000005.1   100.00  960 0   0   1   960 49939   50898   0.0 1861
NCTC11168   NZ_NFPZ01000033.1   100.00  960 0   0   1   960 16585   15626   0.0 1861

sequence blastall extract • 2.3k views

ADD COMMENT • link updated 6.0 years ago by h.mon 35k • written 6.0 years ago by chland • 0

0

Entering edit mode

Something like awk 'BEGIN{OFS="\t"}{print $2,$9,$10}' your_blast_file will get you the start and stops. Are the subject sequences in a local multi-fasta file? You can find many threads here to extract sequences from that point on.

ADD REPLY • link 6.0 years ago by GenoMax 141k

0

Entering edit mode

Thank you genomax for trying to help. I've been looking though answers posted and am not finding an answer that works and that I can understand. Yes, the sequences are in a local multi-fasta file created by running the cat command on sequence data ( complete, contigs/scaffold) from NCBI assembly.

ADD REPLY • link 6.0 years ago by chland • 0

0

Entering edit mode

BioPythons slice notation will be want you need here.

I have a gist which employs a blastfile to slice up genbanks, but you could do the same with a Rasta read in to Biopython as a SeqRecord:

ADD REPLY • link 6.0 years ago by Joe 21k

0

Entering edit mode

Thank you, I've downloaded the script from github, but by any chance would you be able to provide a usage statement example? Also, the assemblies were grabbed from refseq in case that matters. I have a database consisting of multifasta sequences called something like mydb.fasta and blastall hits as in the original post.

ADD REPLY • link 6.0 years ago by chland • 0

0

Entering edit mode

I’m not in front of a computer right now to put together the full solution.

The script will explain how it works if you run python script.py -h. But it wont work for fast as as it is. You’ll need to write something a little bespoke yourself.

The line you’ll be most interested in is 89, the slice function

ADD REPLY • link 6.0 years ago by Joe 21k

0

Entering edit mode

Thanks all, but I'm starting from the very beginning here and have not made it to any kind of scripting so it's all greek.

ADD REPLY • link 6.0 years ago by chland • 0

score 1 · Answer 1 · 2018-04-13

1

Entering edit mode

6.0 years ago

h.mon 35k

My experience (but I never seriously benchmarked) is that, for big fasta files, method 2 bellow is faster than method 1.

Method 1:

a) index the fasta file containing the original sequences:

samtools faidx file.fasta

b) extract sequences interval of interest:

samtools faidx file.fasta $(cat blast.txt | grep -v "^#" | awk 'BEGIN{OFS="\t"}{print $2":"$9"-"$10}')

Method 2:

a) extract the subject hit accessions:

cat blast.txt | grep -v "^#" | cut -f2 | sort | uniq > accessions.txt

b) extract the subject hit sequences with formatdb (I will illustrate here with blastdbcmd, which is the equivalent command for the Blast+ suite):

blastdbcmd -db nt -entry_batch accessions.txt -out subjects.fasta

c) index the subjects fasta:

samtools faidx subjects.fasta

d) extract sequences interval of interest:

samtools faidx subjects.fasta $(cat blast.txt | grep -v "^#" | awk 'BEGIN{OFS="\t"}{print $2":"$9"-"$10}')

ADD COMMENT • link 6.0 years ago by h.mon 35k

0

Entering edit mode

Spoke too soon. Thanks @h.mon. I've used the blast suite of tools and will try this method. The python script is too advanced for a complete beginner

ADD REPLY • link 6.0 years ago by chland • 0

0

Entering edit mode

Hi, Thank you for this useful method @h.mon

what if the $9>$10. As you can see above blast out. when i tried with my data i am getting

665:1927-1645 [faidx] Zero length sequence: 665:1927-1645

when there is $9>$10.

ADD REPLY • link 4.6 years ago by baurumon ▴ 30