Question

upstream and downstream regions extraction

0

Entering edit mode

9.3 years ago

harpreetmanku04 ▴ 10

Hello, I have a query related to upstream and downstream regions extraction. See I have done blastn for 30 nt small query sequences against 500 nt long db sequences now my question is after running blastn, how can I extract upstream and downstream regions for my 30 nt small query sequences?

alignment blast • 5.8k views

ADD COMMENT • link updated 23 months ago by Ram 44k • written 9.3 years ago by harpreetmanku04 ▴ 10

Ram · Answer 1 · 2015-07-07

1

Entering edit mode

9.3 years ago

Sam ★ 4.8k

If you have the coordinate of the alignment e.g. loc, then you can extract the sequence from the reference 500nt sequence starting from loc-30 with length 30nt+2*30nt providing that you have no insertion or deletion in the alignment. If you have the indel, then just add the corresponding length.

To actually do the extraction, you can use the substr function from most programme, e.g. R, perl, awk.

ADD COMMENT • link updated 23 months ago by Ram 44k • written 9.3 years ago by Sam ★ 4.8k

0

Entering edit mode

but i want to extract after the alignment with standalone blast and also how can i come to know that in my local database this is the reference sequence for that small sequence?

ADD REPLY • link 9.3 years ago by harpreetmanku04 ▴ 10

0

Entering edit mode

After the alignment, you should be able to get the alignment file, which should contain the sequence id for the reference sequence. You can then get the reference sequence fasta file which you can use to extract the required sequence.

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.3 years ago by Sam ★ 4.8k

0

Entering edit mode

Yes I got the subject id along with query id. Thank you. And is there any R script or perl script (understandable) for extracting the sequence (particular region)?

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.3 years ago by harpreetmanku04 ▴ 10

0

Entering edit mode

I normally do it using awk, but I guess you can try the Biostrings package in R

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.3 years ago by Sam ★ 4.8k

0

Entering edit mode

hello, I'm stuck with the blast command, see when I do blastn for 300 bp long seq then it retains result but when I do blastn for 20-25 nucl seq then it retains no result. Do you have any idea why it is so? I want to blast small sequences with my local database.

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.3 years ago by harpreetmanku04 ▴ 10

0

Entering edit mode

Maybe you can post a new question to get an answer to that? I am not familiar with blast so I cannot give you definitive answer. However, the most straightforward guess will be that there simply be no sequence similarity with your small sequence and your local database.

ADD REPLY • link 9.3 years ago by Sam ★ 4.8k

0

Entering edit mode

Can you post awk command?

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.3 years ago by harpreetmanku04 ▴ 10

0

Entering edit mode

Hello Sam Sir, can you operate my file to extract upstream and downstream regions using your command(s). I shall be thankful to you. actually I tried but I failed and for me it is taking too much time if you could do it or you can provide me awk command I can try that one also. please help me out here.

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.3 years ago by harpreetmanku04 ▴ 10

0

Entering edit mode

Assuming your input is a fasta file containing only your target sequence, then you can use this script:

awk -v seq="" '{if(!($1~"^>")){seq=seq$0}else{print seq; seq="";}}END{print substr(seq, loc-30, 30+2*30)}' <input.fa>

Where loc is the coordinate of the alignment. If you need to do the for different sequence and different coordinates, you will need a more complicated script.

The first part of the script is essentially concatenating the whole sequence into one string and the second part (substr) is where you extract the region of interest

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 9.3 years ago by Sam ★ 4.8k

0

Entering edit mode

Actually I have excel file which contains sequence id, start and stop respectively (generated by blastn) position separately and fasta file which contain sequences separately and I want to extract region from that fasta file.

This command gives error.

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 9.3 years ago by harpreetmanku04 ▴ 10

0

Entering edit mode

I tried this command but it also generates an error.

awk 'NR==FNR{if($0~/^>/){i=substr($0,2);getline};a[i]=a[i] $0;next}{print ">" $1 ORS substr(a[$1], $2, $3-$2+1)}' file1 FS=\\t file2

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 9.3 years ago by harpreetmanku04 ▴ 10

0

Entering edit mode

Can't we connect on team viewer, if you can solve it over there? Thanku :)

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 9.3 years ago by harpreetmanku04 ▴ 10

0

Entering edit mode

If you put your excel content into a text file (tab delimited), you can do the following:

samtools faidx reference.fasta
awk '{print $1">>output.fasta";print "samtools faidx reference.fasta "$2":"$3"-"$4" >> output.fa"}' <input> |bash

Your input file need to be in the format of <id>\t<chr>\t<start>\t<end>

If you start and end position doesn't account for the flanking region, then you can use this

samtools faidx reference.fasta
awk '{print $1">>output.fa"; print "samtools faidx reference.fasta "$2":"$3-30"-"$4+30" >> output.fa"}' <input> | bash

The result will be print to output.fa do delete the file or rename it in the awk code before each run, otherwise the result will only keep appending onto the file

And for your information Extract User Defined Region From An Fasta File

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.3 years ago by Sam ★ 4.8k

0

Entering edit mode

do i need to install sam tool for this?

ADD REPLY • link 9.3 years ago by harpreetmanku04 ▴ 10

Ram · Answer 2 · 2015-08-05

0

Entering edit mode

9.3 years ago

harpreetmanku04 ▴ 10

no my problem is like this:

I have file1 is this

GL3482.1 GAACTTGAGATCCGGGGA GCAGTGGATCTCCACCAG CGGCCAGAACTGGTGCAC CTCCAGGCCAGCCTCGTC CTGCGTGTC GL3550.1 GGCATTTTTGTGTAATTTTTGGCTGGATGAGGT GACATTTTCATTACTACCATTTTGGAGTACA GL3472.1 TTTTCCTGTTCACTGCTGCTTTTCTATAGACAGCA GCAGCAAGCAGTAAGAGAAAGTA ```

file2 is this:

seq id     start    end
GL3482.1   323100   323743
GL3550.1   41911    40888
GL3472.1   274408   272617

and I want result like this:

>GL3482.1 
TTGAGA
>GL3550.1 
GGCATTTTTGTG
>GL3472.1
TTCCTGTT

ADD COMMENT • link updated 5.0 years ago by Ram 44k • written 9.3 years ago by harpreetmanku04 ▴ 10

0

Entering edit mode

samtools faidx reference.fasta
awk '{print ">"$1">>output.fa"; print "samtools faidx reference.fasta "$2":"$3-30"-"$4+30" | tail -n +2 >> output.fa"}' <input> | bash

Your example output actually make no sense to me because:

The examples doesn't corresponds to your coordinates (I understand that you want to make the post short).
You didn't state if you want the upstream/downstream
Do you want to include the actual sequence, or just the up/downstream.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.3 years ago by Sam ★ 4.8k

0

Entering edit mode

Yes I have to include sequence between upstream and downstream region too.

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 9.3 years ago by harpreetmanku04 ▴ 10

0

Entering edit mode

Then what is the problem of the above script?

ADD REPLY • link 9.3 years ago by Sam ★ 4.8k