Question

Which NGS aligner to find inexact matches

1

Entering edit mode

5.2 years ago

Anand Rao ▴ 630

I have a set of 100 bp sequence queries (not sequence reads, but data from some analyses).

I need to report coordinates in the genome, that match each of these sequences, but filtered in at

a. >= 80% identity, and

b. >= 80% query length coverage.

I think BLAST, FASTA, BLAT - the traditional aligners may not be suitable for this. But that an NGS aligner might do the trick.

Because there are > 50 NGS aligners out there (https://www.ecseq.com/support/ngs/what-is-the-best-ngs-alignment-software), is there any recommendation for an aligner that can execute these two filters, and not require the query to be a sequence read, but JUST a DNA sequence?

By the way, had I used BLAST for these searches (which I will not), these 2 filters in BLAST would be set as -perc_identity 0.8 -qcov_hsp_perc 0.8

DNA alignment NGS aligner genome coordinates NGS • 1.5k views

ADD COMMENT • link 5.2 years ago by Anand Rao ▴ 630

0

Entering edit mode

I would use bwa mem followed by a pysam script to apply the requested filters.

ADD REPLY • link 5.2 years ago by WouterDeCoster 47k

0

Entering edit mode

I've used neither BWA not pysam, so could you share any code snippet, or links to any examples that will help me understand your suggestion correctly and quickly?

Also, am I right in interpreting your response as BWA MEM reporting not just PERFECT matches, but other matches with < 100% identity and < 100% query sequence coverage in step 1

And piping this to PYSAM for the >= 80% filters, as step 2, correct? Will there be some more parsing to report genomic coords as step 3, or will that be taken care of by the PYSAM step itself? Thank you!

ADD REPLY • link 5.2 years ago by Anand Rao ▴ 630

0

Entering edit mode

BWA MEM reporting not just PERFECT matches, but other matches with < 100% identity and < 100% query sequence coverage

Yes. BWA-MEM has myriad parameters which you can tune, including how many mismatches you allow.

ADD REPLY • link 5.2 years ago by Friederike 8.9k

0

Entering edit mode

Don't think about equivalence between BLAST and NGS alignments, With blast your are looking for local alignments. WIth NGS aligners the alignments are global/glocal due to the small size of the query sequences.

Many NGS alingers can take fasta formatted reads as query (BBMap does).

ADD REPLY • link 5.2 years ago by GenoMax 141k

0

Entering edit mode

I have to do the equivalent of -perc_identity 0.8 -qcov_hsp_perc 0.8 because it has been done in past studies, though strangely using BLAST, not accounting for its local SW algo. I agree with your observation that I need a global aligner that uses the Needleman-Wunsch algo.

However, I still need to apply those 2 filters for consistency and comparison with other studies. Problem is I can't find the appropriate equivalent flags in any NGS mapper yet...

But I think, minid=0.80 in BBmap ~ -perc_identity 0.8 in BLASTn, not sure yet. But I still have not found the equivalent of -qcov_hsp_perc 0.8 in BBmap, or in EMBOSS needle or seq-align, yet.

I am hoping some forum user may have some syntax pointers about how to apply these 2 filters with some global aligner...

ADD REPLY • link 5.2 years ago by Anand Rao ▴ 630

score 1 · Answer 1 · 2019-02-06

One possible piped solution to report inexact matches in a genome to fasta formatted queries =

BBmap (from BBtools) | sam2bed (from bedops)

See https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmap-guide/

Steps and syntax that might be applicable for my case:

Step 1. Build searchable index for reference genome to be scanned for matches

bbmap.sh  ref=Reference.fasta

Step 2. Scan reference genome index with

bbmap.sh minid=0.8 idfilter=0.8 secondary=T maxsites=500 in=Query.fasta out=Query_Vs_Ref_BBmap_minid_idfilter_80per_multi.sam -noheader

Explanation of the flags

minid = is approximate, but can help accelerate BBmap (paraphrasing Brian Bushnell, BBmap's author)

idfilter = "to filter out alignments with under exactly x% identity" (quoting Brian Bushnell)

secondary = enables reporting match details for not just the first or best match

maxsites = provides a ceiling for the number of matching genomic regions for which details will be saved to the output file

noheader = I didn't want bulkier output files than required, therefore suppressing SAM file format header info

Step 3. Convert output SAM file to BED format

sam2bed < BBmap_out.sam > Coords_file.bed

Step 4. Parse BED format for match coords report

Use awk, Perl or Python script, or BBmap reporting option in step 2 itself

AFAIK, NGS mappers like BBmap were not designed with such a goal in mind, so I've still got to QC my suggested little pipeline with control cases. So this is a placeholder, and I will edit details if and when necessary.

Software versions : BBTools - BBMap version 38.35, bedops - version: 2.4.35 (typical)

Downloaded from : https://sourceforge.net/projects/bbmap/ and https://bedops.readthedocs.io/en/latest/index.html

PS. Thanks to Brian Bushnell for advising use of idfilter flag during the BBmap step.

Perhaps Brian can render the final verdict on validity of this syntax?! Though I don't see any posts from him in > 1 year now...

note: This is the solution for the problem I described in the now closed thread at BLAST run parameters and parsing advice