Question: Count repeat sequence
0
gravatar for sacha
3.5 years ago by
sacha1.8k
France
sacha1.8k wrote:

Hi,

I just wonder how can I quantify the amount of repeat sequence across a fasta file . For exemple : actgtcccgtaaaACTACTACTACTACTACTACTatcgtcgtcg
This sequence contains a repeat of ACT in more than 40% of the total sequence .

I m looking on fourrier transformation and entropy . Do you have other idea ?

Thanks

dna algorithm repeat fasta • 1.8k views
ADD COMMENTlink modified 3.5 years ago by Brian Bushnell17k • written 3.5 years ago by sacha1.8k

If you only want the counts then sed 's/ACT/ACT\n/gi' file | grep -c 'ACT'

ADD REPLYlink written 3.5 years ago by genomax75k

Or shorter using perl ;-)

perl -ne 'print $_=~y/ACT//'

or if you want the percentage:

perl -ne '$l=length($_);$m=$_=~y/ACT//; print $m/$l*100'
ADD REPLYlink written 3.5 years ago by nterhoeven120

This is working for this case. What I want is to know for different case. For exemple, how much repeat region they are in the human genom

ADD REPLYlink written 3.5 years ago by sacha1.8k

Hasn't that been done already (repeatmasker)?

ADD REPLYlink written 3.5 years ago by genomax75k

yes ! But I want to it myself :D ... for the fun !

ADD REPLYlink written 3.5 years ago by sacha1.8k
1
gravatar for Brian Bushnell
3.5 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

You can look at Shannon entropy quite quickly with BBMask, part of the BBMap package. e.g.

bbmask.sh in=genome.fa window=80 entropy=0.70 ke=5

This will tell you the fraction of the genome with entropy below 0.7 when using a window of 80 and 5-mers. The code for the entropy masking of a sequence is in /bbmap/current/jgi/BBMask.java at lines 1067-1144 and the entropy calculation over a window is very short, at 1158-1168.

ADD COMMENTlink written 3.5 years ago by Brian Bushnell17k
0
gravatar for sacha
3.5 years ago by
sacha1.8k
France
sacha1.8k wrote:

http://www.repeatmasker.org/ is the tool for doing this. I will probably find my answer there .

ADD COMMENTlink written 3.5 years ago by sacha1.8k
1

I've never looked into how repeat masker works - i'm curious if it works by finding over-represented kmers in a genome, then applies what it found to the genome tracks (for repeat regions), or if it reads the DNA as a stream of information and then decides if something is a repeat as it goes.

For example, imagine "AACCGGTT" only turns up 10 in a whole genome, however it just so happens to occur in:

"gggggAACCGGTTAACCGGTTAACCGGTTAACCGGTTAACCGGTTAACCGGTTAACCGGTTAACCGGTTAACCGGTTAACCGGTTggggg"

Making it a repeat in this little snippit of DNA, but not a significant kmer for the X-billion bp genome, and thus maybe not called as a repeat in this snippit either.

ADD REPLYlink written 3.5 years ago by John12k
1

I don't know about RepeatMasker, but BBMask will call it a repeat in that snippet if you use the maskrepeats flag (as opposed to, or in addition to, the maskentropy flag). Specifically:

maskrepeats=f       (mr) Mask areas covered by exact repeat kmers.
kr=5                Kmer size to use for repeat detection (1-15).  Use minkr and maxkr to sweep a range of kmers.
minlen=40           Minimum length of repeat area to mask.

So, by default, if there are at least 40 consecutive bases made of a repeating 5-mer, it would be masked. In your example, which I named "x.fa":

bbmask.sh in=x.fa mr=t me=f minlen=16 k=8
java -ea -Xmx89571m -Xms89571m -cp /usr/common/jgi/utilities/bbtools/prod-v36.09/lib/BBTools.jar jgi.BBMask in=x.fa mr=t me=f minlen=16 k=8
Executing jgi.BBMask [in=x.fa, mr=t, me=f, minlen=16, k=8]

Loading input
Loading Time:                   0.022 seconds.

Masking repeats (to disable, set 'mr=f')
Repeat Masking Time:            0.010 seconds.
Ref Bases:                        90    0.01m bases/sec
Repeat Bases Masked:              80

Converting masked bases to N
Done Masking
Conversion Time:                0.000 seconds.

Total Bases Masked:               80/90 88.889%
Total Time:                     0.067 seconds.

I changed K to 8 because the repeat is 8 bp long. The default of 5 does not mask anything in that case. But since you don't know the length of the repeat kmers ahead of time, you can define a kmer sweep, such as "k=4-9" and it will test all of them.

ADD REPLYlink written 3.5 years ago by Brian Bushnell17k
2

That's pretty awesome - and a much better way of doing it than the first method :) The BB suite really seems to cover all the bases (no pun intended!) :)

I'd consider this a good application for the ACGTrie method of counting kmers in a trie/DAG so you don't have to specify a kmer before hand, but it would make things a lot more complicated if you had to prune the trie for every snippit than a kmer table where the oldest kmers are dropped as new ones come in.

ADD REPLYlink written 3.5 years ago by John12k

I generally try to avoid tries or linked lists because they have a lot of memory random-accesses. The method I use is brute-force and does not even use a kmer table; it's O(N*W/K) where N is the number of bases, W is the window size, and K is the kmer length. But it's very fast in practice because it does not do any memory random-access, just serial access of the bytes in the sequence. The entropy-masking does make use of a kmer table, though, dropping the oldest kmer as a new one comes in. It's O(N).

ADD REPLYlink written 3.5 years ago by Brian Bushnell17k

Yeah, trie's certainly aren't fast compared to the kmer table with all that random memory access, however they can be faster if you're doing a range of kmer sizes as you can add/check many kmers sizes in a single walk. But clearly the bruteforce method you've settled upon is the fastest which these days is probably the most important thing, since cores and memory per core is going up, and speed of the RAM access has stayed relatively the same. I feel like their could be a whole "how to do bioinformatics" tutorial series written using entirely the BB suite and a few other select tools. I think thats a really great thing :) Thank you for taking the time to write this software Brian.

ADD REPLYlink written 3.5 years ago by John12k

You will also find repeatmasked versions of human genome at UCSC (and other places).

ADD REPLYlink written 3.5 years ago by genomax75k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1142 users visited in the last hour