Question: How to extract all non-seqenced positions from a genome (Fasta file)?
0
gravatar for TheBabyBlueEyes
5.6 years ago by
European Union
TheBabyBlueEyes0 wrote:

Hello,

Sorry in advance for my English.

I want extract all positions of non-sequenced nucleotides for each chromosomes from a genome sequence file (Fasta).

To do that, I have a fasta file with the genome sequences, like this :

>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNtaaattgttt
taaattgtttctgtttgcagttgacatgatcttatatatagaaaacacca
ataactctgccaaaaaatttagaattcataaatgaatttagtaaagttgc
 

I want find all N positions and obtain this position in bed format, eg :

chr1 1 200 ...

I didn't found how to do that in Bedtools.

If you have any idea, could you help me?

Thank you.

sequence genome • 3.5k views
ADD COMMENTlink modified 5.6 years ago • written 5.6 years ago by TheBabyBlueEyes0

may be you're looking for http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz ?

ADD REPLYlink written 5.6 years ago by Pierre Lindenbaum130k

Thank you, I think that resolve my problem.

ADD REPLYlink written 5.6 years ago by TheBabyBlueEyes0
Thank you for your quick answers
ADD REPLYlink written 5.6 years ago by TheBabyBlueEyes0
1
gravatar for mxs
5.6 years ago by
mxs530
mxs530 wrote:

Well, this should not be a problematic task if I understood the question correctly. All you need to do is slide along your input fasta file and count the swithches N-> something and something to ->N

perl -ne 'chomp;if( />(.*)/){$head = $1; $i=0; next};@a=split("",$_); foreach(@a){$i++;if($_ eq "N" && $s ==0 ){print "$head\t$i"; $s =1}elsif($s==1 && $_ ne "N"){print "\t$i\n";$s=0}}' infile.fa

This should do the trick.

cheers

mxs

ADD COMMENTlink modified 5.6 years ago • written 5.6 years ago by mxs530

After 4.1 years...

Hi @mxs, thanks for the script. For some reason, I was getting an error in the result. I modified your script slightly. Below are the steps:

$ cat test2.txt
>NZ_CP025963.1
ATGCAGTNNNACGTGCATGACTGTACGTANGTACGTGACTGACTGACTGACTNACTGACTGATCGTACNTACGTAC

$ perl -ne 'chomp;if( />(.*)/){$head = $1; $i=0; next};@a=split("",$_); foreach(@a){$i++;if($_ eq "N" && $s ==0 ){print "$head\t$i"; $s =1}elsif($s==1 && $_ ne "N"){print "\t$i\n";$s=0}}' test2.txt
NZ_CP025963.1   8   11
NZ_CP025963.1   30  31
NZ_CP025963.1   53  54
NZ_CP025963.1   69  70

$ cat >test2.bed
NZ_CP025963.1   8   11
NZ_CP025963.1   30  31
NZ_CP025963.1   53  54
NZ_CP025963.1   69  70

$ bedtools getfasta -fi test2.txt -bed test2.bed
index file test2.txt.fai not found, generating...
>NZ_CP025963.1:8-11
NNA
>NZ_CP025963.1:30-31
G
>NZ_CP025963.1:53-54
A
>NZ_CP025963.1:69-70
T

Instead of giving me the regions of N's, oneliner have extracted other regions also. Or may be it is because bedtools is zero based system.

Modified version:

$ perl -ne 'chomp;if( />(.*)/){$head = $1; $i=0; next};@a=split("",$_); foreach(@a){$i++; if($_ eq "N" && $s ==0 ){$z=$i-1; print "$head\t$z"; $s =1}elsif($s==1 && $_ ne "N"){$j=$i-1;print "\t$j\n";$s=0}}' test2.txt
NZ_CP025963.1   7   10 ("7" here is zero based coordinate and "10" is 1 based coordinate)
NZ_CP025963.1   29  30
NZ_CP025963.1   52  53
NZ_CP025963.1   68  69

$ bedtools getfasta -fi test2.txt -bed test2.bed
>NZ_CP025963.1:7-10
NNN
>NZ_CP025963.1:29-30
N
>NZ_CP025963.1:52-53
N
>NZ_CP025963.1:68-69
N

More info on zerobased and one-based system is here.

ADD REPLYlink modified 17 months ago • written 17 months ago by Prakki Rama2.4k
2

Though this is a nice oneliner it is very slow on large files.

Although I did not search for a particular fast way to do this it took that much time and such a huge amount of memory(!) that I decided to quickly write it in python. It does use Biopython though.

#!/usr/bin/python3.6
import sys

#import the SeqIO module from Biopython
from Bio import SeqIO
with open(sys.argv[1], mode="r") as fasta_handle:
for record in SeqIO.parse(fasta_handle, "fasta"):
    start_pos=0
    counter=0
    gap=False
    gap_length = 0
    for char in record.seq:
        if char == 'N':
            if gap_length == 0:
                start_pos=counter
                gap_length = 1
                gap = True
            else:
                gap_length += 1
        else:
            if gap:
                printrecord.id + "\t" + str(start_pos) + "\t" + str(start_pos + gap_length))
                gap_length = 0
                gap = False
        counter += 1
ADD REPLYlink modified 16 months ago • written 16 months ago by ronaldnieuwenhuis20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1758 users visited in the last hour