Question

blastn and subject sequence

0

Entering edit mode

7.1 years ago

cheng#13# • 0

Hi, I am trying to find similarities between DNA sequences against a chromosomal sequence with the following command line (version 2.6.0+):

blastn -subject chr7.fasta -query seq.fasta -out out.txt -outfmt "6 qseqid sseqid pident qlen length qstart qend sstart send"

An example of output (after pretreatment):

Accnum    Allele    Label    %Identity    Sstart    Send
Accnum1    Allele1  LL    100.000    99928     100090
Accnum1    Allele1  LP    100.000    99928     99976
Accnum1    Allele1  LVR    99.796     99928     100416
Accnum1    Allele1  VE    99.661     100083     100377
Accnum1    Allele1  VGU    99.701     100083     100416
Accnum1    Allele1  L    99.661     100083     100090
Accnum1    Allele1  VR    99.652     100091     100377
Accnum1    Allele1  R    100.000     100378     100416

Now, I would like to retrieve the subject sequence for each match, let say between 97 and 100% of identity for VE and VR labels, the %identity of the others must be equal to 100%. I added sseq as option in my outfmt and figure out that it is the aligned sequence as explained in blastn doc (with plenty '-' though).

1) Is there a possible way to get the subject sequence with the blastn command line at the same time?

2) With a command line, is there a way to blast my sequences against a specific chromosome without having it in local?

3) In this process, I used 'awk' and 'cut' to get the output (above), I removed and picked the information I need in the sequences fasta head. After that, I would like to check the Sstart (subject start position) and Send (subject end position) according to specific criteria for labels. For example:

LL(Sstart) = LP(Sstart) = LVR(Sstart)
LVR(Send) = VGU(Send) = R(Send)
L(Send) = LL(Send) = VR(Sstart - 1)
VR(Send) = VE(Send) = R(Sstart + 1) etc.

What is the best way to? Is there a faster way to do that in awk or consider to write a bash script? What about samtools?

Any help and advice will be highly appreciated.

Thanks in advance.

blast • 5.5k views

ADD COMMENT • link updated 7.0 years ago by cschu181 ★ 2.8k • written 7.1 years ago by cheng#13# • 0

0

Entering edit mode

1) Is there a possible way to get the subject sequence with the blastn command line at the same time?

Non that I am aware of, but you can easily get the subject sequence ID so that you can write a script to extract the sequence of that ID from your set of sequences. See sseqid.

2) With a command line, is there a way to blast my sequences against a specific chromosome without having it in local?

Again, not that I am aware of, but making a database on some downloaded set of sequences in FASTA format is really quick, so you could give that a try first! Even if I assume you already tried and you're asking because of a problem.

3) In this process... What is the best way to?

I would personally import the table in R, do the transposed matrix of the one you show, with %Identity Sstart Send as row names and LL, LP, LVR, ... as column names, so you can sum up numbers within a single row for each calculation you need.

ADD REPLY • link 7.1 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

Thank you for your answer.

2)

Again, not that I am aware of, but making a database on some downloaded set of sequences in FASTA format is really quick, so you could give that a try first! Even if I assume you already tried and you're asking because of a problem.

In fact, I had the chromosomal sequence in local and had no specific problem for that. I was thinking about a semi-automatic bash script for these main steps:

a. retrieve the chromosomal sequence (or any from NCBI)

b. do a blasnt against my fasta sequences

c. filter the results according to my needs

3)

I would personally import the table in R, do the transposed matrix of the one you show, with %Identity Sstart Send as row names and LL, LP, LVR, ... as column names, so you can sum up numbers within a single row for each calculation you need.

The matrix you mention could also be built in awk by selecting cols 3,4,5 and 6. In fact, some sums concern different rows. For example:

Accnum    Allele    Label    %Identity    Sstart    Send
Accnum1    Allele1  LL    100.000    99928     100090
Accnum1    Allele1  VR    99.652     100091     100377

LL(Send) = VR(Sstart - 1).

LL/VR: For LL it is col 6 and VR col 5 (100090 - 1000091 = -1).

In my case, I have to keep col 1 and 2 in order not to loose the accession numbers and genes information if the same gene has different accession numbers. Did you have a specific idea when you suggested R?

ADD REPLY • link 7.1 years ago by cheng#13# • 0

0

Entering edit mode

2) ...

What I usually do is downloading the fasta file with all the sequences that fit my query. If you find a way to automatize a query to NCBI nucleotide (or protein)... I don't know one right now but I believe one exists.

Did you have a specific idea when you suggested R?

No, I'm just more comfortable with R when working with rows and columns.

ADD REPLY • link 7.1 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

1) Is there a possible way to get the subject sequence with the blastn command line at the same time?

If you only want the aligned portion, then add sseq to your outfmt.

ADD REPLY • link 7.1 years ago by cschu181 ★ 2.8k

0

Entering edit mode

Thank you for your answer.

Yes, I wanted the aligned portion. This is why I meantionned "sseq" above.

Using sseq in outfmt is not the solution to my issue: you get the sequence with non-aligned nucleotids and not the complete sequence.

Illustrative example:

atgatgatgat--gatag-c instead of let say atgatgatgatGCgatagTc

ADD REPLY • link 7.1 years ago by cheng#13# • 0

0

Entering edit mode

I don't understand your example...

ADD REPLY • link 7.1 years ago by cschu181 ★ 2.8k

0

Entering edit mode

What you want looks like the sseq to me... How is that different?

ADD REPLY • link 7.1 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

I gave above a nucleotid sequence example to clarify my need.

Using sseq, you obtain:

atgatgatgat--gatag-c

The subject sequence expected:

atgatgatgatGCgatagTc

You do not obtain the whole subject sequence aligned, there are '-' chars in its sequence

ADD REPLY • link 7.1 years ago by cheng#13# • 0

0

Entering edit mode

Ok. Given that perhaps Blast won't put the whole sequence in the output, can't you just integrate it yourself with a line of code using the sseqid to read back your sequence from the original file?

ADD REPLY • link 7.1 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

Well there are '-' chars in the subject sequence whenever there's an insert in the query seq... if you want the full sseq then just remove the '-' chars. You could also take qseq (but then you will have the gap chars whenever there is an insert in the subject seq). Where do you expect your "GC" and "T" to come from?

Or are the different parts separated by '-' supposed to be individual hsps? Then you're right, you wont have the parts aligned to a gap, but then your only option is to chain together the hsps and missing portions with a script.

ADD REPLY • link 7.1 years ago by cschu181 ★ 2.8k

0

Entering edit mode

If I don't remember wrongly, "-" in the sseq output field represent both indels and mismatches. Removing them won't work.

ADD REPLY • link 7.1 years ago by Matteo Schiavinato ★ 3.6k

2

Entering edit mode

You remember wrongly. '-' are just indels. Mismatches are not shown in qseq/sseq.

ADD REPLY • link 7.1 years ago by cschu181 ★ 2.8k

0

Entering edit mode

@cschu1981: I thought that "GC" and T" are the nucleotids replacing the "-" chars. I could just give another nt I mean It was just an illustrative example.

OK. So if I understand the real subject sequence length we are expecting to obtain with sseq (in my example above) will be 16 nt after removing the '-' chars, so simply by doing:

awk '{gsub ("[-]","") ; print $0 }'  data_file

Is it right?

ADD REPLY • link 7.0 years ago by cheng#13# • 0

score 0 · Answer 1 · 2017-04-12

0

Entering edit mode

7.0 years ago

cschu181 ★ 2.8k

That's correct.

echo "atgatgatgat--gatag-c" | awk '{gsub ("[-]","") ; print $0 }' | wc
   1       1      18

So, 17nt, but, yea.

ADD COMMENT • link 7.0 years ago by cschu181 ★ 2.8k

0

Entering edit mode

Yup in fact! :) thank you.

For my last question (a little bit difficult for me to figure out from where to start):

The issue is that I have two numeric fields ($4 and $5) from row n to compare to the same fields ($4 and $5) in row(s) n+p, by label ($3) and by allele ($2). The rows are not systematically successive.

In a case (in bold), I have to compare between $4 (or $5) in row n and $4 (or $5) in row(s) n+p
In another case (italic), It's field $4 which is compared to $5 or vice-versa

Accnum Allele Label %Identity Sstart Send

Accnum1 Allele1 LP 100.000 99928 99976

Accnum1 Allele1 LVR 99.796 99928 100416

Accnum1 Allele1 VE 99.661 100083 100377

Accnum1 Allele1 VR 99.652 100091 100377

Accnum1 Allele1 LL 100.000 99928 100090

Accnum1 Allele1 VGU 99.701 100083 100416

Do you have any advice that helps me to start resolving this issue?

ADD REPLY • link 7.0 years ago by cheng#13# • 0

0

Entering edit mode

I'd probably do that with a Python/Perl/Ruby (whatever your preference) script. As @Macspider suggested, you might also want to try and reformat your result so that

              LP_pid LP_sstart LP_send LVR_pid LVR_sstart LVR_send etc.
Accnum Allele

That way you can use whatever (e.g. awk) to pull out rows that match your criteria by simply comparing across columns instead of having to compare across rows and different columns.

edit: reformatted table

ADD REPLY • link 7.0 years ago by cschu181 ★ 2.8k

0

Entering edit mode

OK, thank you. I am working on that.

ADD REPLY • link 7.0 years ago by cheng#13# • 0

0

Entering edit mode

Hi cheng,

Did you manage to find out how to retrieve the original subject sequence after a blast match? I am trying to do the same to extract the original subject sequence.

ADD REPLY • link 4.1 years ago by drdrnaveenkumar • 0