Question: Can I use BLASTn to only return hits which start with a perfect match to a particular string?
1
gravatar for maxwhjohn1988
19 months ago by
maxwhjohn198840 wrote:

I'm using SequenceServer to BLAST RADseq reads against a custom BLAST database constructed from a genome which I've just assembled.

However, I'm getting a lot of hits which I do not want. I only want hits with a perfect match to the cut site of the restriction endonuclease used to make the RAD libraries (EcoRI - G'AATT,C), at the start of the hit.

Is there a way to coerce BLASTn to only return hits with a perfect match to this sequence at the beginning of the hit, but which are free to vary "normally" downstream of that sequence?

All my RADseq reads start GAATTC. (Just in case anyone is wondering - I've added a G to the beginning of all my RADseq reads to ensure that I only get hits which match the whole cut site, but I'm still getting hits which begin at e.g. position 7 of the query sequence).

ADD COMMENTlink written 19 months ago by maxwhjohn198840

Have you thought about using bbduk from BBMap suite to extract reads that have that string instead of using BLAST?

ADD REPLYlink written 19 months ago by genomax63k

Hi genomax

Thanks for replying so quickly. No, I hadn't considered doing that, but thanks for the suggestion! I wouldn't do it for the reads used to make the assembly, as I am looking for RAD loci which have been filtered according a set of specific criteria, and I want to know where those loci sit in the assembly. I could do it with the assembly scaffolds, but considering that it's only a 6 character string, and some of the scaffolds are pretty big, do you not think that it would likely return a lot of undesirable sequences?

ADD REPLYlink written 19 months ago by maxwhjohn198840
0
gravatar for st.ph.n
19 months ago by
st.ph.n2.4k
Philadelphia, PA
st.ph.n2.4k wrote:

if you do a tab output, -outfmt 6, you can select the hits that have subject start pos starting at pos 1.

awk -F'\t' '{if($9==1) { print } }' output.txt > selected_hits.txt
ADD COMMENTlink written 19 months ago by st.ph.n2.4k

Hi st.ph.n

Thanks for replying. This looks like a very promising avenue!

Two things though -

Firstly I'm not at all competent with awk so I'm not 100% clear on how this solution works. What is -outfmt 6?

Secondly, this looks like it would only return hits which start at position 1. That's good, but I need hits which also have perfect 100% identity from positions 1-6.

ADD REPLYlink written 19 months ago by maxwhjohn198840

Sorry - should have googled before replying. I understand -outfmt 6 and $9==1 now, but the second point still concerns me (I have some queries with >200 hits, most of which start at position 1 but which don't contain the full cut site).

ADD REPLYlink written 19 months ago by maxwhjohn198840

In that case, you will want to use genomax's solution to extract the reads that have the strings first, and then you can blast looking for similarity downstream of the matched string for each read.

ADD REPLYlink written 19 months ago by st.ph.n2.4k

Hi st.ph.n

Sorry I couldn't reply earlier, as a new user BioStars limits the number of posts I can make every day (which includes replies!)

Either I'm misunderstanding something, or I haven't explained myself very well. I'll try to be more explicit; maybe I'll suddenly understand that what you are suggesting would be useful while doing this.

I have two sets of raw reads from the same species.

One set are RADseq reads - I'm just considering the first reads, so these all (obviously) have an EcoRI cut site at the start. I have filtered these down to a very small subset which contain SNPs of interest, and I only have 121 of them.

The other set are "normal" PE Illumina reads from which I have assembled a genome.

I am BLASTing the 121 RADseq reads against the scaffolds of the genome assembly, because I want to know where the RAD loci of interest are in the genome.

The genome is ~ 300 Mb and ~ 35% GC, so there should be about 340 EcoRI cut sites per Mb of the assembled genome. I suppose I could plausibly pull out the reads used to make the assembly which have the cut site, align those to the genome, and then see which RADseq reads align to the genome in a place where one of the subset of reads which contain the cut site align... that sounds very laborious. Alternatively I could pull out the scaffolds which have an EcoRI cut site in them, and BLAST the RADseq reads against that subset - but given that it's only a 6 bp sequence and given that my scaffolds are not tiny I don't think that this would help very much - I would expect that many of the scaffolds contain this 6 bp sequence multiple times. Besides, the issue that I'm getting with BLAST is that, for some RADseq read queries, I'm getting many (hundreds) of hits which align very well - but which have, for example, a single mismatch inside the EcoRI cut site. As the EcoRI cut sites anchor the RADseq reads to a specific location in the genome, these hits are definitely not what I'm looking for and I want to exclude them. I don't see how either approach would do anything to address that.

I am now trying to do this with ublast from USEARCH as it has a lot of extra options. I've not nailed it yet but I think I'm getting close. That said, if anyone has any ideas for doing exactly this with BLASTn, I'd still love to hear about it!

ADD REPLYlink written 19 months ago by maxwhjohn198840

I suppose I could plausibly pull out the reads used to make the assembly which have the cut site, align those to the genome, and then see which RADseq reads align to the genome in a place where one of the subset of reads which contain the cut site align... that sounds very laborious.

That is where bbduk.sh from BBMap comes in. Searching for reads that contain the EcoRI site + G should be easy. We will need to figure out how to select only those reads where this sequence is at the beginning of the read.

I thought a dedicated pipeline like stacks is meant to handle RADseq data. Have you looked at it?

ADD REPLYlink written 19 months ago by genomax63k

Hi again

Yes, I used Stacks to process my RADseq reads. I'm using pooled samples though so there's only so far I can go with Stacks.

I have actually got the results I needed from USEARCH's ublast now. I used this command:

usearch -ublast RADreads.fa -db assembly.fa -evalue 1.0e-20 -strand both -id 0.92 -mid 0.92 -maxdiffs 1 -mincols 93 -gapopen *LQ -gapext 10.0I -alnout output.aln

While this didn't do precisely what I originally wanted (which was to get a perfect match anywhere in the assembly for the first 6 bp of the query, but allow fairly "normal" variation downstream of that perfect match) it did give me something almost as good. The -maxdiffs 1 argument allowed only results with a single mismatch (this was good as my RADseq reads had been filtered to contain only those with a single SNP) and the -mincols 93 argument ensured I only got hits for my whole RADseq reads (which are 93 bp including the Gs I added at the beginning of each read to complete the EcoRI cut site). -gapopen *LQ prevents any gaps at the left end of the query, and the -gapext 10.0I argument makes it very costly to open a gap inside the alignment.

Thanks for all the help guys, I'm still perplexed that this isn't something easy to do with BLAST itself as looking for conserved sequence motifs and that sort of thing is not an uncommon endeavour. I'd still be interested if anyone has any other ideas as what I have is not 100% what I originally set out to achieve (but it's quite adequate).

FYI I recommend http://www.drive5.com/usearch/ highly - the documentation is fragmented but extensive, it's fast and powerful and has a huge variety of options which BLAST doesn't seem to have.

ADD REPLYlink written 19 months ago by maxwhjohn198840

You can further customize the output to also include the number of mismatches see blastn -help then add another condition to the awk script to filter for those columns as well.

 Options 6, 7, 10 and 17 can be additionally configured to produce
   a custom format specified by space delimited format specifiers.
   The supported format specifiers for options 6, 7 and 10 are:
     ...
    pident means Percentage of identical matches
    nident means Number of identical matches
    mismatch means Number of mismatches
ADD REPLYlink modified 19 months ago • written 19 months ago by Istvan Albert ♦♦ 79k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1900 users visited in the last hour