Count repeat sequence
2
0
Entering edit mode
5.3 years ago
sacha ★ 2.1k

Hi,

I just wonder how can I quantify the amount of repeat sequence across a fasta file . For exemple : actgtcccgtaaaACTACTACTACTACTACTACTatcgtcgtcg
This sequence contains a repeat of ACT in more than 40% of the total sequence .

I m looking on fourrier transformation and entropy . Do you have other idea ?

Thanks

repeat dna fasta algorithm • 2.4k views
0
Entering edit mode

If you only want the counts then sed 's/ACT/ACT\n/gi' file | grep -c 'ACT'

0
Entering edit mode

Or shorter using perl ;-)

perl -ne 'print $_=~y/ACT//'  or if you want the percentage: perl -ne '$l=length($_);$m=$_=~y/ACT//; print$m/\$l*100'

0
Entering edit mode

This is working for this case. What I want is to know for different case. For exemple, how much repeat region they are in the human genom

0
Entering edit mode

0
Entering edit mode

yes ! But I want to it myself :D ... for the fun !

1
Entering edit mode
5.3 years ago

You can look at Shannon entropy quite quickly with BBMask, part of the BBMap package. e.g.

bbmask.sh in=genome.fa window=80 entropy=0.70 ke=5


This will tell you the fraction of the genome with entropy below 0.7 when using a window of 80 and 5-mers. The code for the entropy masking of a sequence is in /bbmap/current/jgi/BBMask.java at lines 1067-1144 and the entropy calculation over a window is very short, at 1158-1168.

0
Entering edit mode
5.3 years ago
sacha ★ 2.1k

http://www.repeatmasker.org/ is the tool for doing this. I will probably find my answer there .

1
Entering edit mode

I've never looked into how repeat masker works - i'm curious if it works by finding over-represented kmers in a genome, then applies what it found to the genome tracks (for repeat regions), or if it reads the DNA as a stream of information and then decides if something is a repeat as it goes.

For example, imagine "AACCGGTT" only turns up 10 in a whole genome, however it just so happens to occur in:

"gggggAACCGGTTAACCGGTTAACCGGTTAACCGGTTAACCGGTTAACCGGTTAACCGGTTAACCGGTTAACCGGTTAACCGGTTggggg"

Making it a repeat in this little snippit of DNA, but not a significant kmer for the X-billion bp genome, and thus maybe not called as a repeat in this snippit either.

1
Entering edit mode

maskrepeats=f       (mr) Mask areas covered by exact repeat kmers.
kr=5                Kmer size to use for repeat detection (1-15).  Use minkr and maxkr to sweep a range of kmers.
minlen=40           Minimum length of repeat area to mask.


So, by default, if there are at least 40 consecutive bases made of a repeating 5-mer, it would be masked. In your example, which I named "x.fa":

bbmask.sh in=x.fa mr=t me=f minlen=16 k=8
java -ea -Xmx89571m -Xms89571m -cp /usr/common/jgi/utilities/bbtools/prod-v36.09/lib/BBTools.jar jgi.BBMask in=x.fa mr=t me=f minlen=16 k=8
Executing jgi.BBMask [in=x.fa, mr=t, me=f, minlen=16, k=8]

Masking repeats (to disable, set 'mr=f')
Ref Bases:                        90    0.01m bases/sec

Conversion Time:                0.000 seconds.

Total Time:                     0.067 seconds.


I changed K to 8 because the repeat is 8 bp long. The default of 5 does not mask anything in that case. But since you don't know the length of the repeat kmers ahead of time, you can define a kmer sweep, such as "k=4-9" and it will test all of them.

2
Entering edit mode

That's pretty awesome - and a much better way of doing it than the first method :) The BB suite really seems to cover all the bases (no pun intended!) :)

I'd consider this a good application for the ACGTrie method of counting kmers in a trie/DAG so you don't have to specify a kmer before hand, but it would make things a lot more complicated if you had to prune the trie for every snippit than a kmer table where the oldest kmers are dropped as new ones come in.

0
Entering edit mode

I generally try to avoid tries or linked lists because they have a lot of memory random-accesses. The method I use is brute-force and does not even use a kmer table; it's O(N*W/K) where N is the number of bases, W is the window size, and K is the kmer length. But it's very fast in practice because it does not do any memory random-access, just serial access of the bytes in the sequence. The entropy-masking does make use of a kmer table, though, dropping the oldest kmer as a new one comes in. It's O(N).

0
Entering edit mode

Yeah, trie's certainly aren't fast compared to the kmer table with all that random memory access, however they can be faster if you're doing a range of kmer sizes as you can add/check many kmers sizes in a single walk. But clearly the bruteforce method you've settled upon is the fastest which these days is probably the most important thing, since cores and memory per core is going up, and speed of the RAM access has stayed relatively the same. I feel like their could be a whole "how to do bioinformatics" tutorial series written using entirely the BB suite and a few other select tools. I think thats a really great thing :) Thank you for taking the time to write this software Brian.

0
Entering edit mode

You will also find repeatmasked versions of human genome at UCSC (and other places).