How to extract intervals for regions that are masked?
20 months ago
serpalma.v ▴ 70

Hello

I want to determine the beginning and end of each interval made up only of Ns (masked regions). Then I would like to split the chromosome into smaller intervals so that I keep only the regions that are not masked.

For example:

AGGTCGTTNNNNAACTGNNAGTC

I would like to get three intervals from this sequence:

AGGTCGTT -> from 1 to 8
AACTG -> from 13 to 17
AGTC -> from 20 to 23

Is there a tool I could use to do the task? I have been searching, but I cannot find the right one.

Thanks!

SNP sequencing assembly

It does not give the intervals of the regions that are not masked. That is what I really need.


The genome minus the masked intervals gives the unmasked intervals. That is what you want, right?

You can get them with the bedtools complement command.

First, generate the genome file for your genome (the FASTA index written by samtools faidx works, since its first two columns are the sequence name and length):

$ cat in.fa
>fa
AGGTCGTTNNNNAACTGNNAGTC
$ samtools faidx in.fa
$ cat in.fa.fai
fa 23 4 23 24

Second, get the locations of the masked intervals (answered by "shenwei356" in the previous post, see the link above):

$ seqkit locate -P -p '[Nn]+' in.fa --bed > masked.bed
$ cat masked.bed
fa 8 12 [Nn]+ 0 +
fa 17 19 [Nn]+ 0 +

Last, get the unmasked intervals from masked.bed:

$ bedtools complement -i masked.bed -g in.fa.fai
fa      0       8
fa      12      17
fa      19      23
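
Note that BED coordinates are 0-based with an exclusive end, so "fa 0 8" is the same region as "from 1 to 8" in the question. If you need the 1-based positions, a minimal sketch is to add 1 to the start column with awk. The function name bed_to_1based is only a hypothetical helper for illustration, not part of bedtools or seqkit:

```shell
# Hypothetical helper (an assumption, not from any tool above):
# convert BED intervals (0-based start, exclusive end) to
# 1-based inclusive coordinates by adding 1 to the start column.
bed_to_1based() {
    awk 'BEGIN { OFS = "\t" } { print $1, $2 + 1, $3 }' "$@"
}

# Demo on the three unmasked intervals from the example sequence.
printf 'fa\t0\t8\nfa\t12\t17\nfa\t19\t23\n' | bed_to_1based
```

For the example this should print 1-8, 13-17 and 20-23, matching the intervals asked for. In practice you could pipe the bedtools complement output straight into the function.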