How to extract intervals for regions that are masked?
4.1 years ago
serpalma.v ▴ 80

Hello

I want to determine the beginning and end of each interval that consists only of Ns (a masked region). Then I would like to split the chromosome into smaller intervals, keeping only the regions that are not masked.

For example:

AGGTCGTTNNNNAACTGNNAGTC

I would like to get three intervals from this sequence:

AGGTCGTT -> from 1 to 8
AACTG -> from 13 to 17
AGTC -> from 20 to 23

Is there a tool I could use to do the task? I have been searching, but I cannot find the right one.

Thanks!

SNP sequencing assembly • 1.2k views

It does not give the intervals of the regions that are not masked. That is what I really need.


The whole genome minus the "masked" intervals gives the "unmasked" intervals. That is what you want, right?

You can get them with the bedtools complement command.

First, you need to generate the genome file (sequence names and lengths) for your genome:

$ cat in.fa   
>fa   
AGGTCGTTNNNNAACTGNNAGTC   

$ samtools faidx in.fa    
$ cat in.fa.fai   
fa      23      4       23      24
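
As far as I know, bedtools only needs the first two columns of the genome file (sequence name and length), which is why the .fai index can be passed to -g directly. If you prefer a plain two-column file, here is a minimal sketch. The file names example.fa.fai and genome.txt are placeholders chosen for illustration, and the index line is recreated by hand so that the snippet runs on its own:

```shell
# Recreate the index line shown above so the snippet is self-contained
# (normally this file comes from `samtools faidx in.fa`).
printf 'fa\t23\t4\t23\t24\n' > example.fa.fai

# Keep only column 1 (sequence name) and column 2 (sequence length).
cut -f1,2 example.fa.fai > genome.txt
cat genome.txt
```

The resulting genome.txt could then be given to bedtools complement with -g in place of the .fai file.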

Second, get the locations of the masked intervals (as answered by "shenwei356" in the previous post, see the link above):

$ seqkit locate  -P -p '[Nn]+' in.fa --bed > masked.bed    
$ cat masked.bed    
fa      8       12      [Nn]+   0       +    
fa      17      19      [Nn]+   0       +

Last, get the unmasked intervals by passing the masked.bed file from the previous step:

$ bedtools complement -i masked.bed -g in.fa.fai
fa      0       8    
fa      12      17    
fa      19      23
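
Note that BED coordinates are 0-based and half-open, so "fa 0 8" covers bases 1 to 8 in the 1-based, inclusive numbering used in the question. A small sketch of that conversion with awk follows. It assumes the complement output was saved to a file; the names unmasked.bed and unmasked.1based.tsv are hypothetical, and the input is recreated by hand so the snippet runs on its own:

```shell
# Recreate the complement output shown above so the snippet is self-contained
# (normally: bedtools complement -i masked.bed -g in.fa.fai > unmasked.bed).
printf 'fa\t0\t8\nfa\t12\t17\nfa\t19\t23\n' > unmasked.bed

# The BED start is 0-based, so add 1. The BED end is exclusive, so it
# already equals the 1-based inclusive end and is printed unchanged.
awk 'BEGIN { FS = OFS = "\t" } { print $1, $2 + 1, $3 }' unmasked.bed > unmasked.1based.tsv
cat unmasked.1based.tsv
```

For the example sequence this gives 1-8, 13-17 and 20-23, the three intervals asked for.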