Question

Blastn raw illumina reads all-against-all and return only non-hits

1

Entering edit mode

4.9 years ago

Bwremjoe ▴ 30

Dear Biostars folk,

I have been trying to come up with an efficient way to return all the NON-hits from a blastn query. Basically, I need to blast all reads in an Illumina library against one another, and only find the reads in set A that have not hit in B. Although I have a solution, it seems to be really slow, and I wonder if you could help me out. Here's what I did so far:

makeblastdb -in ${B} -dbtype nucl
blastn -query ${A} -db ${B} > blast_results.dat
cat blast_results.dat \
  | grep 'No hits' -B 6 \
  | grep 'Query' \
  | cut -d '=' -f2 \
  | cut -d ' ' -f2 \
  > unique_read_ids.dat
cat unique_read_ids.dat \
  | while read line; do sed -n "/${line}/,/>/p" $query \
  | sed '$d'; done \
  > result.fn

Now... this totally works, but it takes about a week to finish, although the blastn-part only take 1-2 days. So that last bit, where I fish out all the non-hits, it probably not so efficient.

Is there any way to let blastn return NON-hits, instead of all queries / all hits? If so, how does that work? I've tried setting something up with the e-values, but it always seems to be a minimal value...

Thanks for your time!

blast blastn illumina reads • 2.3k views

ADD COMMENT • link 4.8 years ago by Bwremjoe ▴ 30

1

Entering edit mode

Have you considered using tabular format for output instead? It would be much faster to parse. I believe -outfmt 7 includes sequence ID's that show no hits. Will check on that in a bit. You could also use -outfmt 17 to get SAM formatted alignments and use those to get unmapped reads.

Why are you using blast BTW? It may be much faster to do this using a proper NGS aligner and collects reads that don't map.

ADD REPLY • link 4.9 years ago by GenoMax 152k

0

Entering edit mode

I am using blast purely because I am reproducing someone's method and trying to compare it to other approaches (like a NGS aligner, as you say). I think outfmt 7 does in fact NOT include missed hits. Not sure about 17 though.

ADD REPLY • link 4.9 years ago by Bwremjoe ▴ 30

1

Entering edit mode

A test blast with different formats. Format 7 does list an empty record (for query test1 below) when a hit is not found.

Outfmt 7

$ blastn -query query.fa -db seq_dna_one.fa -outfmt '7'
# BLASTN 2.10.1+
# Query: test1
# Database: seq_dna_one.fa
# 0 hits found
# BLASTN 2.10.1+
# Query: test2
# Database: seq_dna_one.fa
# Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 1 hits found
test2   test4   100.000 368     0       0       1       368     1       368     0.0     680

Outfmt 6

$ blastn -query query.fa -db seq_dna_one.fa -outfmt '6'
test2   test4   100.000 368     0       0       1       368     1       368     0.0     680

Outfmt 17 (does not include sequence when a hit is not found so you would need to find ID's that get excluded)

$ blastn -query query.fa -db seq_dna_one.fa -outfmt '17'
@HD     VN:1.2  SO:coordinate   GO:reference
@SQ     SN:Query_2      LN:368
@PG     ID:0    VN:2.10.1+      CL:blastn -query query.fa -db seq_dna_one.fa -outfmt 17         PN:blastn
BL_ORD_ID:0     0       Query_2 1       255     368M1832H       *       0       0       *       *       AS:i:368        EV:f:0  NM:i:0  PI:f:100.00     BS:f:680.687

ADD REPLY • link 4.9 years ago by GenoMax 152k

0

Entering edit mode

I am confused as to what outfmt 7 shows... it says 0 hits found, but then it lists 1 hit, which is a perfect hit. Am I missing something?

EDIT: Oh wait, that's two queries. But it doesn't give the sequence when no hit is found, right?

ADD REPLY • link 4.9 years ago by Bwremjoe ▴ 30

0

Entering edit mode

But it doesn't give the sequence when no hit is found, right?

That is correct. It has 0 hits found.

ADD REPLY • link 4.9 years ago by GenoMax 152k

0

Entering edit mode

So, do you think the stuff I do with gripping / sed is then the only way forward? (again, given that I have to use blastn... which is unfortunately true for now)

It's very inefficient I think...

ADD REPLY • link 4.9 years ago by Bwremjoe ▴ 30

0

Entering edit mode

Blast was never designed to deal with NGS data. Try switching to the tabular output format. It should be better/faster than parsing regular blast output you have been using so far.

ADD REPLY • link 4.9 years ago by GenoMax 152k

0

Entering edit mode

Yeah, but that still doesn't give me the sequence data, which is what I need. In the end, the blastn part is not the bottle neck right now, it's the grepping (which takes 3.5 days...)

Anyway, thanks for your help genomax.

ADD REPLY • link 4.9 years ago by Bwremjoe ▴ 30

1

Entering edit mode

Grepping time should become much shorter, if you have tabular data. Test it out. Find 0 hits found in the file and go back two lines.

ADD REPLY • link 4.9 years ago by GenoMax 152k

0

Entering edit mode

Ah, that makes sense. Thank you.

ADD REPLY • link 4.9 years ago by Bwremjoe ▴ 30

1

Entering edit mode

Did you consider alternative approaches? For example, performing kmer set operations will probably be a lot faster than blasting and parsing the output.

A search for kmer set operations will return a good amount of alternatives. I have used KmerCrompressor a couple of times and it was fast and efficient.

ADD REPLY • link 4.9 years ago by h.mon 35k

0

Entering edit mode

I am definitely going to do this, but I'd like to try the blastn-approach nonetheless. It was used before, and like you, I'm not sure it's the best way to go. But in order to show this, I'd have to use blastn to repeat someone else's analysis.

But thanks for the tip, I think that tool sounds like a very good approach going forward =)

ADD REPLY • link 4.9 years ago by Bwremjoe ▴ 30

1

Entering edit mode

4.9 years ago

fishgolden ▴ 520

1.Use -num_threads option with blastn.

2.I think

cat unique_read_ids.dat \
| while read line; do sed -n "/${line}/,/>/p" $query \
| sed '$d'; done \
> result.fn

scans whole $query file for each ids in unique_read_ids.dat. Thus

makeblastdb -in ${query} -dbtype nucl -parse_seqids
blastdbcmd -entry_batch unique_read_ids.dat -db ${query}

may take shorter time.

ADD COMMENT • link 4.9 years ago by fishgolden ▴ 520

0

Entering edit mode

I'll give that a try, thank you =)

ADD REPLY • link 4.9 years ago by Bwremjoe ▴ 30

0

Entering edit mode

Yes, that definitely speeds things a lot, thank you. =)

ADD REPLY • link 4.9 years ago by Bwremjoe ▴ 30

score 1 · Accepted Answer · 2020-09-08

1

Entering edit mode

4.8 years ago

Bwremjoe ▴ 30

For anyone returning to this in the future, there's a biopython tutorial on doing exactly this, in a much more efficient way: https://biopython.org/wiki/Retrieve_nonmatching_blast_queries

ADD COMMENT • link 4.8 years ago by Bwremjoe ▴ 30