Question: Blastn raw illumina reads all-against-all and return only non-hits
1
gravatar for Bwremjoe
4 weeks ago by
Bwremjoe20
Bwremjoe20 wrote:

Dear Biostars folk,

I have been trying to come up with an efficient way to return all the NON-hits from a blastn query. Basically, I need to blast all reads in an Illumina library against one another, and only find the reads in set A that have not hit in B. Although I have a solution, it seems to be really slow, and I wonder if you could help me out. Here's what I did so far:

makeblastdb -in ${B} -dbtype nucl
blastn -query ${A} -db ${B} > blast_results.dat
cat blast_results.dat \
  | grep 'No hits' -B 6 \
  | grep 'Query' \
  | cut -d '=' -f2 \
  | cut -d ' ' -f2 \
  > unique_read_ids.dat
cat unique_read_ids.dat \
  | while read line; do sed -n "/${line}/,/>/p" $query \
  | sed '$d'; done \
  > result.fn

Now... this totally works, but it takes about a week to finish, although the blastn-part only take 1-2 days. So that last bit, where I fish out all the non-hits, it probably not so efficient.

Is there any way to let blastn return NON-hits, instead of all queries / all hits? If so, how does that work? I've tried setting something up with the e-values, but it always seems to be a minimal value...

Thanks for your time!

blastn blast illumina reads • 213 views
ADD COMMENTlink modified 15 days ago • written 4 weeks ago by Bwremjoe20
1

Have you considered using tabular format for output instead? It would be much faster to parse. I believe -outfmt 7 includes sequence ID's that show no hits. Will check on that in a bit. You could also use -outfmt 17 to get SAM formatted alignments and use those to get unmapped reads.

Why are you using blast BTW? It may be much faster to do this using a proper NGS aligner and collects reads that don't map.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by genomax89k

I am using blast purely because I am reproducing someone's method and trying to compare it to other approaches (like a NGS aligner, as you say). I think outfmt 7 does in fact NOT include missed hits. Not sure about 17 though.

ADD REPLYlink written 4 weeks ago by Bwremjoe20
1

A test blast with different formats. Format 7 does list an empty record (for query test1 below) when a hit is not found.

Outfmt 7

$ blastn -query query.fa -db seq_dna_one.fa -outfmt '7'
# BLASTN 2.10.1+
# Query: test1
# Database: seq_dna_one.fa
# 0 hits found
# BLASTN 2.10.1+
# Query: test2
# Database: seq_dna_one.fa
# Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 1 hits found
test2   test4   100.000 368     0       0       1       368     1       368     0.0     680

Outfmt 6

$ blastn -query query.fa -db seq_dna_one.fa -outfmt '6'
test2   test4   100.000 368     0       0       1       368     1       368     0.0     680

Outfmt 17 (does not include sequence when a hit is not found so you would need to find ID's that get excluded)

$ blastn -query query.fa -db seq_dna_one.fa -outfmt '17'
@HD     VN:1.2  SO:coordinate   GO:reference
@SQ     SN:Query_2      LN:368
@PG     ID:0    VN:2.10.1+      CL:blastn -query query.fa -db seq_dna_one.fa -outfmt 17         PN:blastn
BL_ORD_ID:0     0       Query_2 1       255     368M1832H       *       0       0       *       *       AS:i:368        EV:f:0  NM:i:0  PI:f:100.00     BS:f:680.687
ADD REPLYlink modified 29 days ago • written 4 weeks ago by genomax89k

I am confused as to what outfmt 7 shows... it says 0 hits found, but then it lists 1 hit, which is a perfect hit. Am I missing something?

EDIT: Oh wait, that's two queries. But it doesn't give the sequence when no hit is found, right?

ADD REPLYlink modified 29 days ago • written 29 days ago by Bwremjoe20

But it doesn't give the sequence when no hit is found, right?

That is correct. It has 0 hits found.

ADD REPLYlink modified 29 days ago • written 29 days ago by genomax89k

So, do you think the stuff I do with gripping / sed is then the only way forward? (again, given that I have to use blastn... which is unfortunately true for now)

It's very inefficient I think...

ADD REPLYlink written 29 days ago by Bwremjoe20

Blast was never designed to deal with NGS data. Try switching to the tabular output format. It should be better/faster than parsing regular blast output you have been using so far.

ADD REPLYlink written 29 days ago by genomax89k

Yeah, but that still doesn't give me the sequence data, which is what I need. In the end, the blastn part is not the bottle neck right now, it's the grepping (which takes 3.5 days...)

Anyway, thanks for your help genomax.

ADD REPLYlink written 29 days ago by Bwremjoe20
1

Grepping time should become much shorter, if you have tabular data. Test it out. Find 0 hits found in the file and go back two lines.

ADD REPLYlink written 29 days ago by genomax89k

Ah, that makes sense. Thank you.

ADD REPLYlink written 29 days ago by Bwremjoe20
1

Did you consider alternative approaches? For example, performing kmer set operations will probably be a lot faster than blasting and parsing the output.

A search for kmer set operations will return a good amount of alternatives. I have used KmerCrompressor a couple of times and it was fast and efficient.

ADD REPLYlink written 29 days ago by h.mon31k

I am definitely going to do this, but I'd like to try the blastn-approach nonetheless. It was used before, and like you, I'm not sure it's the best way to go. But in order to show this, I'd have to use blastn to repeat someone else's analysis.

But thanks for the tip, I think that tool sounds like a very good approach going forward =)

ADD REPLYlink modified 29 days ago • written 29 days ago by Bwremjoe20
1
gravatar for Bwremjoe
15 days ago by
Bwremjoe20
Bwremjoe20 wrote:

For anyone returning to this in the future, there's a biopython tutorial on doing exactly this, in a much more efficient way: https://biopython.org/wiki/Retrieve_nonmatching_blast_queries

ADD COMMENTlink written 15 days ago by Bwremjoe20
1
gravatar for fishgolden
27 days ago by
fishgolden450
fishgolden450 wrote:

1.Use -num_threads option with blastn.

2.I think

cat unique_read_ids.dat \
| while read line; do sed -n "/${line}/,/>/p" $query \
| sed '$d'; done \
> result.fn

scans whole $query file for each ids in unique_read_ids.dat. Thus

makeblastdb -in ${query} -dbtype nucl -parse_seqids
blastdbcmd -entry_batch unique_read_ids.dat -db ${query}

may take shorter time.

ADD COMMENTlink modified 27 days ago • written 27 days ago by fishgolden450

I'll give that a try, thank you =)

ADD REPLYlink written 26 days ago by Bwremjoe20

Yes, that definitely speeds things a lot, thank you. =)

ADD REPLYlink written 26 days ago by Bwremjoe20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1139 users visited in the last hour