Question

Extracting upstream and downstream bases from a Blastn hit.

0

4.1 years ago

jamie.pike ▴ 80

I have the output from BLASTN searches and want to extract 2500 bases upstream and downstream of each BLASTN hit from an assembled genome.

I have generated fastas containing each BLASTN sequence, and have a fasta for the assembled genome.

I have been trying to use pcregrep for this:

pcregrep -i -A0 -B0 -M -f Blastn_hit.fna Assembled_genome.fna > Blastn_hit_+_bases.fna

However, there is no output.

I believe this is because the Blastn_hit.fna lines are longer than those in Assembled_genome.fna, so I have to indicate a new line using (\n|.) in the BLASTN file. The only problem is I don’t know where the new lines are, and so don’t know where to enter (\n|.) in Blastn_hit.fna. Is there a way to use pcregrep without indicating where new lines are, or is there an alternative tool or script I can use that will find the BLASTN hit and print 2500 bases upstream and downstream?

I am very new to this and have very limited knowledge, so answers with more of a ‘for dummies’ approach would be appreciated.

(I know that -A and -B will print lines, not characters, but I can work out how many characters there are to a line and so know how many lines should be printed)

blastn extracting bases pcregrep • 1.4k views

ADD COMMENT • link updated 4.1 years ago by gayachit ▴ 200 • written 4.1 years ago by jamie.pike ▴ 80

0

I'm not sure which BLAST command you ran, but if you haven't already, you should work with the tabular output format (e.g. -outfmt 6).

From that format you can easily get the columns giving the start and stop of each hit; then, using e.g. awk, subtract X from the start and add X to the stop to get the coordinates of the region you want.
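As a rough illustration of that coordinate arithmetic, here is a minimal Python sketch (rather than awk). It assumes the default 12-column -outfmt 6 layout, where columns 9 and 10 are the subject start and end; the function name `flank_coords` and the example hit are made up for illustration.

```python
def flank_coords(outfmt6_line, flank=2500):
    """Return (subject_id, start, end, strand) for a hit widened by `flank` bases.

    Coordinates are 1-based and inclusive, as in BLAST tabular output.
    Assumes the default 12-column -outfmt 6 layout.
    """
    fields = outfmt6_line.rstrip("\n").split("\t")
    subject_id = fields[1]
    sstart, send = int(fields[8]), int(fields[9])
    # For minus-strand hits BLAST reports sstart > send, so order them first.
    strand = "+" if sstart <= send else "-"
    low, high = min(sstart, send), max(sstart, send)
    start = max(1, low - flank)  # do not run off the start of the contig
    end = high + flank           # NOT clamped: the contig length is unknown here
    return subject_id, start, end, strand


# Made-up example hit on the minus strand.
hit = "query1\tcontig_7\t98.5\t400\t6\t0\t1\t400\t10400\t10001\t1e-150\t700"
print(flank_coords(hit))  # ('contig_7', 7501, 12900, '-')
```

Note that the upper coordinate can exceed the contig length, so it still needs to be checked against the assembly before extracting sequence.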

ADD REPLY • link 4.1 years ago by lieven.sterck 15k

1

Thank you - I have now used blast outfmt 6 and managed to create the fastas required.

ADD REPLY • link 4.1 years ago by jamie.pike ▴ 80

score 1 · Answer 1 · 2020-03-16

1

4.1 years ago

GenoMax 142k

One reliable way of doing this is the bedtools approach, using blastn with -outfmt 6: Finding upstream or downstream sequences on BLAST on linux

This thread adds more detail on how to do this: A: Extract flanking region of -500 nt upstream and downstream of BLAST result on ge
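The linked threads are not reproduced here, so the following is only a sketch of the general idea: turn each tabular hit into a BED interval widened by 2500 bases, which a tool such as bedtools can then use. BED coordinates are 0-based and half-open, unlike BLAST's 1-based inclusive ones. The function name `hit_to_bed`, the `contig_lengths` dictionary and the example values are assumptions for illustration.

```python
def hit_to_bed(outfmt6_line, contig_lengths, flank=2500):
    """Convert one default -outfmt 6 line into a 6-column BED line with flanks.

    `contig_lengths` maps subject (contig) IDs to their lengths so the
    interval can be clamped to the contig boundaries.
    """
    fields = outfmt6_line.rstrip("\n").split("\t")
    query, contig = fields[0], fields[1]
    sstart, send = int(fields[8]), int(fields[9])
    strand = "+" if sstart <= send else "-"
    low, high = sorted((sstart, send))
    # BLAST is 1-based inclusive; BED is 0-based half-open.
    bed_start = max(0, low - 1 - flank)
    bed_end = min(contig_lengths[contig], high + flank)
    return "\t".join([contig, str(bed_start), str(bed_end), query, "0", strand])


# Made-up contig length and hit.
lengths = {"contig_7": 12000}
hit = "query1\tcontig_7\t98.5\t400\t6\t0\t1\t400\t10400\t10001\t1e-150\t700"
print(hit_to_bed(hit, lengths))
```

Written to a file (say hits.bed, a hypothetical name), these lines could then be passed to something like `bedtools getfasta -fi Assembled_genome.fna -bed hits.bed -s -fo flanked_hits.fna`, where `-s` reverse-complements minus-strand intervals.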

ADD COMMENT • link 4.1 years ago by GenoMax 142k

0

Thank you for your advice - the links were very useful.

ADD REPLY • link 4.1 years ago by jamie.pike ▴ 80

score 0 · Answer 2 · 2020-03-17

0

4.1 years ago

gayachit ▴ 200

You could also try the Python code in this thread: Extracting An Up Stream Or Downstream Sequence From Given Position

If you need, I can tweak the code to do what you want.

ADD COMMENT • link 4.1 years ago by gayachit ▴ 200

0

Thank you - I have since managed to get what I needed. But I have a question about the Python code, which I don't fully understand as I am very new to Python. In the linked code you provided, I assume that est_fasta_file and est_mirna_file would be the files I have generated. If so, how do I know which BLAST output format is correct?

Thank you

ADD REPLY • link 4.1 years ago by jamie.pike ▴ 80

1

Your est_fasta_file is the FASTA file that you are blasting, and est_mirna_file is the BLAST output generated. The BLAST output format used is tab-separated (-outfmt 6).
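To make the two inputs concrete, here is a minimal, self-contained sketch of the same idea. It is not the linked script (which is not reproduced in this thread), and all function names are made up for illustration. It assumes the default 12-column -outfmt 6 layout and takes the hit coordinates from the subject columns (9 and 10). The FASTA reader joins wrapped lines, so it does not matter where the line breaks fall.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def read_fasta(lines):
    """Read FASTA lines into {sequence_id: sequence}, joining wrapped lines."""
    sequences, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                sequences[name] = "".join(chunks)
            # Keep only the ID (text up to the first whitespace), as BLAST does.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


def reverse_complement(seq):
    """Reverse-complement a DNA string (other characters are left as-is)."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_with_flanks(genome, blast_lines, flank=2500):
    """Yield (header, sequence) for each hit plus `flank` bases on either side."""
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.rstrip("\n").split("\t")
        contig, sstart, send = f[1], int(f[8]), int(f[9])
        low, high = min(sstart, send), max(sstart, send)
        start = max(1, low - flank)                    # clamp to contig start
        end = min(len(genome[contig]), high + flank)   # clamp to contig end
        seq = genome[contig][start - 1:end]            # 1-based inclusive -> slice
        if sstart > send:                              # minus-strand hit
            seq = reverse_complement(seq)
        yield f"{f[0]}|{contig}:{start}-{end}", seq


# Tiny made-up example; with real data, pass open file handles instead of lists.
genome = read_fasta([">contig_1 demo", "AAACCCGGGT", "TTACGTACGT"])
hits = ["q1\tcontig_1\t100\t5\t0\t0\t1\t5\t8\t12\t1e-5\t10"]
for header, seq in extract_with_flanks(genome, hits, flank=3):
    print(f">{header}\n{seq}")
```

With real files this would be called as `read_fasta(open("Assembled_genome.fna"))` and `extract_with_flanks(genome, open("blast_hits.tsv"))`, where blast_hits.tsv is a hypothetical name for the -outfmt 6 output.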

ADD REPLY • link 4.1 years ago by gayachit ▴ 200