Question: How to create a bed file from a fasta containing all regions with clusters of >= 50 Ns and flanking 1kb regions
Rubal (Germany) wrote, 11 months ago:

From a genome FASTA file, I would like to create a BED file containing the start and end coordinates of all runs of >= 50 Ns, such as this:

ATGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAATCC

and 1kb of flanking sequence on either side. So for example if there were 100 consecutive Ns at position:

Chr1 1050 1150

Then the output region in the bed would be

Chr1 50 2150

Because it would contain the 100 Ns and the 1kb flanks on either side. I'd like to output a list of all such regions in the genome.

Could anyone suggest an approach or software for this? I've seen a few packages for identifying sequence strings in the genome, such as WordCluster, but I suspect there might be an existing tool specifically for this task, as I imagine it's quite a common filtering approach.

Thanks in advance for any suggestions!

Tags: fasta, filtering, bed, genome
ATpoint wrote, 11 months ago:

Seqkit locate together with bedtools can do that:

First you extract the coordinates, then merge overlaps, and finally extend both coordinates (start/end) by 1000 bp:

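# Build the 50-N search pattern instead of typing it out (bash brace expansion)
pattern=$(printf 'N%.0s' {1..50})
./seqkit locate -F --only-positive-strand --bed -m 0 -p "$pattern" test.fa \
| bedtools merge -i - \
| bedtools slop -b 1000 -g chromSizes.txt > N50.bed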

I tested this only on a small FASTA file, so I am not sure how it scales to an entire genome.
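
The -g chromSizes.txt argument in the pipeline above is a two-column, tab-separated genome file (sequence name, length) that bedtools slop uses to keep the padded intervals inside each sequence. The thread does not say how that file was made; one possible way, sketched here with plain awk, is to sum the line lengths per FASTA record. The function name fasta_chrom_sizes is invented for this sketch, and it assumes each header starts with ">" immediately followed by the sequence ID and that the file has Unix line endings.

```shell
# Hypothetical helper (name invented for this sketch):
# print "name<TAB>length" for every record in a FASTA file.
# Usage: fasta_chrom_sizes test.fa > chromSizes.txt
fasta_chrom_sizes() {
  awk '
    /^>/ {
      if (name != "") print name "\t" len   # flush the previous record
      name = substr($1, 2)                   # ID = header text up to the first whitespace
      len = 0
      next
    }
    { len += length($0) }                    # sequence may be wrapped over many lines
    END { if (name != "") print name "\t" len }
  ' "$1"
}
```

The first two columns of a samtools faidx index (.fai) carry the same information, so that is an alternative if samtools is already installed.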


Rubal replied, 11 months ago:

Brilliant, I will give this a try.

Rubal replied, 11 months ago:

The version of seqkit I am using has no -F flag for locate. Was this a typo, or am I using the wrong version? I just downloaded the latest version from https://bioinf.shenwei.me/seqkit/download/

ATpoint replied, 11 months ago:

Strange, I also downloaded the latest one. Then just leave it out. It is about an index that promises to be faster, but I am not exactly sure what it actually does, to be honest :-D
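
If the installed seqkit build lacks an option, the N runs can also be located without seqkit. The following is a minimal awk sketch, not a benchmarked tool, and not something proposed in the thread. The function name n_runs_to_bed is invented here; the sketch assumes a plain-text FASTA file with Unix line endings and counts both N and n. It scans line by line, tracking runs that continue across wrapped lines, and prints 0-based, half-open BED intervals for every run of at least MIN Ns.

```shell
# Hypothetical helper (name invented for this sketch): print BED3 intervals
# (0-based start, exclusive end) for every run of at least MIN Ns.
# Usage: n_runs_to_bed genome.fa [MIN]    (MIN defaults to 50)
n_runs_to_bed() {
  awk -v min="${2:-50}" '
    # Report the open run if it is long enough, then mark that no run is open.
    function flush_run(stop) {
      if (run_start >= 0 && stop - run_start >= min) {
        print name "\t" run_start "\t" stop
      }
      run_start = -1
    }
    BEGIN { run_start = -1; pos = 0 }
    /^>/ {                       # new record: close any open run, reset offset
      flush_run(pos)
      name = substr($1, 2)
      pos = 0
      next
    }
    /^$/ { next }                # ignore blank lines
    {
      # A run carried over from the previous line ends here
      # unless this line starts with N.
      if (run_start >= 0 && $0 !~ /^[Nn]/) flush_run(pos)
      line = $0
      off = 0                    # characters of this line already consumed
      while (match(line, /[Nn]+/)) {
        s = pos + off + RSTART - 1
        e = s + RLENGTH
        if (run_start < 0) run_start = s
        next_start = RSTART + RLENGTH
        # Bases follow the run on this line, so the run is complete.
        if (next_start <= length(line)) flush_run(e)
        off += next_start - 1
        line = substr(line, next_start)
      }
      pos += length($0)
    }
    END { flush_run(pos) }
  ' "$1"
}
```

A possible use, with the same file names as in the answer: n_runs_to_bed test.fa 50 | bedtools slop -i stdin -b 1000 -g chromSizes.txt > N50.bed. Each reported run is already maximal, so no merge is needed before slop. Padded intervals from neighbouring runs can still overlap, and a bedtools merge after slop would collapse them.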

Rubal replied, 11 months ago:

Haha, OK, thanks for the quick response!