find positions of a short sequence in a genome
3
4
Entering edit mode
5.7 years ago
igor 12k

Is there a simple tool that will find all occurrences of a short sequence (less than 10bp) in a large FASTA file (such as a genome)? You can use BLAT for sequences of at least 20bp, but I can't seem to find anything for shorter sequences. There are also motif-finding algorithms, but those are geared towards imperfect matches.

It's possible to write a small script (and maybe that's what everyone does), but I was hoping for a one-line option.

sequence search • 9.0k views
0
Entering edit mode

any standard NGS mapper could do that

1
Entering edit mode

Not exactly. Most mappers will either report a single hit or top few hits.

3
Entering edit mode

bowtie -a seems to report all matches. http://bowtie-bio.sourceforge.net/manual.shtml#example-1--a

1
Entering edit mode

This is actually great. A true one-line solution (assuming you already have a Bowtie index).

2
Entering edit mode

BBMap will also report all matches when ambig=all option is used. You can add perfectmode flag if you only need perfect matches (and additional options like minid=0.99). It should work with a fasta file but you should confirm.

1
Entering edit mode

grep + genome.fasta ?

1
Entering edit mode

Convert the FASTA to BED in 50k or 100k chunks, with interweaving lines that straddle the window between neighboring chunks, of the same width as the search string (doesn't need to be any longer than 10 bases, say).

For instance:

chr1    0         100000     NNNNNNTTCA...
chr1    99995     100005     AAAGCCTCCG
chr1    100000    200000     CTCCGTGGTA...
chr1    199995    200005     GCTTAATACC
...

When you do a grep call, you'll get the matching interval and the string containing the match. Pipe that to Perl or Python to do a regex to get the match and position index within the ID string. Add the start position of the interval to the position offset of the match within the ID string, and write out any matches, say, in BED, or whatever useful format.

0
Entering edit mode

Nice solution, but would a regex have an added value when looking for exact matches only?

0
Entering edit mode

That gives you the matching lines, not the position.

0
Entering edit mode

what format is the genome file you intended to use?

0
Entering edit mode

Standard genome sequence format: FASTA.

0
Entering edit mode

Will work to find matches but the output isn't very informative I would say.

0
Entering edit mode

I considered to do something similar like you (but haven't yet came to it) and planned to use Hisat2 aligner for it (because of the speed), specify a very high mismatch penalty and disable splicing. Would be interesting to hear your experience with that.

0
Entering edit mode

Although BLAT does not work for short sequences, BLAST does. Check BLAST Command Line Applications User Manual and/or the Blast Settings For Short Sequences post.

0
Entering edit mode

As pointed out here, PatMatch has been offered a while on the web sites of several genomes. But there is also a stand-alone, command line-based version of the software described in the article that works with any FASTA file to find matches to sequence patterns. Find a list of the genomes offering PatMatch that I have seen, and a launchable, active Jupyter environment with the stand-alone, command line version of PatMatch installed and working at https://github.com/fomightez/patmatch-binder.

7
Entering edit mode
5.7 years ago
steve ★ 3.1k

This script will parse an input FASTA file and output coordinates of found matches of the supplied pattern

#!/usr/bin/env python
# python 2.7

from Bio import SeqIO
import re
import sys

# pattern = "TAACCACTCA"
pattern = sys.argv[1]

# file_path = "/local/data/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa"
file_path = sys.argv[2]

def search_fasta(pattern, file_path):
for record in SeqIO.parse(open(file_path, "rU"), "fasta"):
chrom = record.id
for match in re.finditer(pattern, str(record.seq)):
start_pos = match.start() + 1
end_pos = match.end() + 1
print chrom, '\t', start_pos, '\t', end_pos

search_fasta(pattern, file_path)

Assuming you save the script as find_seq.py, usage:

$./find_seq.py TAACCACTCA /local/data/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa > test_loc.txt output:$ head test_loc.txt
chrM    23  33
chrM    14789   14799
chr1    745938  745948
chr1    4134336     4134346
chr1    7907657     7907667
chr1    8671647     8671657
chr1    11256382    11256392
chr1    20426818    20426828
chr1    22886084    22886094
chr1    28288492    28288502

2
Entering edit mode

2
Entering edit mode

Just warning potential users that the coordinates returned need to have both the 'start' and 'end' subtracted by 1 to comply BED format.

1
Entering edit mode

Besides the very accurate comment of Alex, you should have a look at the SeqIO module from Biopython to parse the fasta and properly work with sequences, and to the sys module to get variables (such as the pattern to match). Like this, you would have to edit the script every time when you want to search for another pattern.

0
Entering edit mode

Thanks for the suggestions, I have updated the script to use this module

0
Entering edit mode

Nice improvement, would be interesting to find out how long it takes.

0
Entering edit mode

just timed it, it took 68s to complete the search shown using the UCSC hg19 reference genome; 3GB file size & 61,913,917 lines. Your mileage may vary, of course, based on system specs & search pattern used.

0
Entering edit mode

I'm not sure which step is the most expensive, but the pattern matching in the multiple fasta records could be ran in parallel using the multiprocessing module, in case you would like to try to get the time down.

1
Entering edit mode

For comparison, only a few seconds with Bowtie. And that includes both strands. So I guess there is a reason why specialized tools exist.

0
Entering edit mode

Wouldn't you get a lot of mismatches with Bowtie and very short sequences?

1
Entering edit mode

You can add a -v parameter to specify exactly how many mismatches you are willing to accept. If you specify 0, then only perfect matches will be allowed.

0
Entering edit mode

+1 to igor for pointing out need to consider both strands. And +1 to Alex Reynolds for bringing up overlap/match stradling ends. I fleshed it more here.

1
Entering edit mode

This is the same as grep.

0
Entering edit mode

updated to give the functionality you described

0
Entering edit mode

Comment rendered irrelevant.

5
Entering edit mode
5.7 years ago

Here's a demo Python script you can modify for your use, which suggests the rough principle:

#!/usr/bin/env python

import sys
import re

bed = """chr1\t0\t10\tABCDEFGHIJ
chr1\t5\t15\tFGHIJABCDO
chr1\t10\t20\tABCDOPABCD"""

string_to_match = sys.argv[1]
pattern = re.compile(string_to_match)

for line in bed.split("\n"):
(chr, start, stop, id) = line.split("\t")
for match in pattern.finditer(id):
sys.stdout.write("\t".join([chr, str(int(start) + match.start()), str(int(start) + match.end()), string_to_match]) + "\n")

Some sample runs:

$./test.py HIJABC chr1 7 13 HIJABC$ ./test.py HIJAB
chr1    7   12  HIJAB
$./test.py BCDEF chr1 1 6 BCDEF$ ./test.py ABCD
chr1    0   4   ABCD
chr1    10  14  ABCD
chr1    16  20  ABCD

One would prepare a BED file from the genome's FASTA (with spanning windows), and then modify the Python script to read that BED file on a line-by-line basis.

If a lot of these searches are done, this is also easily parallelizable, splitting work by chromosome or some other unit of work that matches the environment.

So as to avoid "double-hits", It might be worth piping the result to uniq, or testing the length of the input string_to_match to ensure that it is at least half as long as the interleaving spanning elements.

There are perhaps libraries that do a better job of dealing with these and similar edge cases. Still, hopefully this is a useful starting point.

0
Entering edit mode

Wouldn't it be better to compile the pattern outside of the for loop? You only need to do that once...

0
Entering edit mode

Yes, definitely. I just wrote this to test quickly.

3
Entering edit mode
4.5 years ago

here are my 2 cents for a one-liner perl solution to look for a particular motif in a genome (change the $seqMotif variable to look for your motif of interest; note that it accepts a regex pattern) that prints the positions where the motif happens to appear: time perl -ne 'BEGIN {$seqMotif = TAACCACTCA;
$seqPattern = qr/$seqMotif/;
$windowSize = length($seqMotif) - 1;
}
if (/^>(\S+)/) {
$chr =$1; $start = 0;$preSeq = ""
} else {
chomp; $seqLength = length;$seq = $preSeq.$_;
if ( $seq =~$seqPattern ) {
$pos =$start + $-[0];$pos -= $windowSize if$preSeq;
print "$chr\t$pos\n";
}
$start +=$seqLength; $preSeq = substr$seq, -\$windowSize;
}' human_hg19.fa > positions.txt

what this code does is to compile the motif pattern at the beginning, and then look for it directly from a fasta file, adding the final characters of the previous line to the current one in order to find the motif in case it's splitted in 2 lines. this particular motif was found 2512 times in the human genome and took 45 seconds on my personal computer, for benchmarking purposes. no reference index needed, no fasta file preprocessing, no dependencies, just good old pattern matching on perl.