Question: How to extract all non-seqenced positions from a genome (Fasta file)?
1
gravatar for TheBabyBlueEyes
6.0 years ago by
European Union
TheBabyBlueEyes10 wrote:

Hello,

Sorry in advance for my English.

I want extract all positions of non-sequenced nucleotides for each chromosomes from a genome sequence file (Fasta).

To do that, I have a fasta file with the genome sequences, like this :

>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNtaaattgttt
taaattgtttctgtttgcagttgacatgatcttatatatagaaaacacca
ataactctgccaaaaaatttagaattcataaatgaatttagtaaagttgc
 

I want find all N positions and obtain this position in bed format, eg :

chr1 1 200 ...

I didn't found how to do that in Bedtools.

If you have any idea, could you help me?

Thank you.

sequence genome • 3.9k views
ADD COMMENTlink modified 6.0 years ago • written 6.0 years ago by TheBabyBlueEyes10

may be you're looking for http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz ?

ADD REPLYlink written 6.0 years ago by Pierre Lindenbaum134k

Thank you, I think that resolve my problem.

ADD REPLYlink written 6.0 years ago by TheBabyBlueEyes10
Thank you for your quick answers
ADD REPLYlink written 6.0 years ago by TheBabyBlueEyes10
1
gravatar for mxs
6.0 years ago by
mxs530
mxs530 wrote:

Well, this should not be a problematic task if I understood the question correctly. All you need to do is slide along your input fasta file and count the swithches N-> something and something to ->N

perl -ne 'chomp;if( />(.*)/){$head = $1; $i=0; next};@a=split("",$_); foreach(@a){$i++;if($_ eq "N" && $s ==0 ){print "$head\t$i"; $s =1}elsif($s==1 && $_ ne "N"){print "\t$i\n";$s=0}}' infile.fa

This should do the trick.

cheers

mxs

ADD COMMENTlink modified 6.0 years ago • written 6.0 years ago by mxs530
1

After 4.1 years...

Hi @mxs, thanks for the script. For some reason, I was getting an error in the result. I modified your script slightly. Below are the steps:

$ cat test2.txt
>NZ_CP025963.1
ATGCAGTNNNACGTGCATGACTGTACGTANGTACGTGACTGACTGACTGACTNACTGACTGATCGTACNTACGTAC

$ perl -ne 'chomp;if( />(.*)/){$head = $1; $i=0; next};@a=split("",$_); foreach(@a){$i++;if($_ eq "N" && $s ==0 ){print "$head\t$i"; $s =1}elsif($s==1 && $_ ne "N"){print "\t$i\n";$s=0}}' test2.txt
NZ_CP025963.1   8   11
NZ_CP025963.1   30  31
NZ_CP025963.1   53  54
NZ_CP025963.1   69  70

$ cat >test2.bed
NZ_CP025963.1   8   11
NZ_CP025963.1   30  31
NZ_CP025963.1   53  54
NZ_CP025963.1   69  70

$ bedtools getfasta -fi test2.txt -bed test2.bed
index file test2.txt.fai not found, generating...
>NZ_CP025963.1:8-11
NNA
>NZ_CP025963.1:30-31
G
>NZ_CP025963.1:53-54
A
>NZ_CP025963.1:69-70
T

Instead of giving me the regions of N's, oneliner have extracted other regions also. Or may be it is because bedtools is zero based system.

Modified version:

$ perl -ne 'chomp;if( />(.*)/){$head = $1; $i=0; next};@a=split("",$_); foreach(@a){$i++; if($_ eq "N" && $s ==0 ){$z=$i-1; print "$head\t$z"; $s =1}elsif($s==1 && $_ ne "N"){$j=$i-1;print "\t$j\n";$s=0}}' test2.txt
NZ_CP025963.1   7   10 ("7" here is zero based coordinate and "10" is 1 based coordinate)
NZ_CP025963.1   29  30
NZ_CP025963.1   52  53
NZ_CP025963.1   68  69

$ bedtools getfasta -fi test2.txt -bed test2.bed
>NZ_CP025963.1:7-10
NNN
>NZ_CP025963.1:29-30
N
>NZ_CP025963.1:52-53
N
>NZ_CP025963.1:68-69
N

More info on zerobased and one-based system is here.

ADD REPLYlink modified 22 months ago • written 22 months ago by Prakki Rama2.4k
3

Though this is a nice oneliner it is very slow on large files.

Although I did not search for a particular fast way to do this it took that much time and such a huge amount of memory(!) that I decided to quickly write it in python. It does use Biopython though.

#!/usr/bin/python3.6
import sys

#import the SeqIO module from Biopython
from Bio import SeqIO
with open(sys.argv[1], mode="r") as fasta_handle:
for record in SeqIO.parse(fasta_handle, "fasta"):
    start_pos=0
    counter=0
    gap=False
    gap_length = 0
    for char in record.seq:
        if char == 'N':
            if gap_length == 0:
                start_pos=counter
                gap_length = 1
                gap = True
            else:
                gap_length += 1
        else:
            if gap:
                printrecord.id + "\t" + str(start_pos) + "\t" + str(start_pos + gap_length))
                gap_length = 0
                gap = False
        counter += 1
ADD REPLYlink modified 22 months ago • written 22 months ago by ronaldnieuwenhuis30

Thanks! this worked awesome!

I had to reindent it a little but thats it:

#!/usr/bin/env  python3

import sys

# import the SeqIO module from Biopython
from Bio import SeqIO

with open(sys.argv[1], mode="r") as fasta_handle:
    for record in SeqIO.parse(fasta_handle, "fasta"):
        start_pos = 0
        counter = 0
        gap = False
        gap_length = 0
    for char in record.seq:
        if char == "N":
            if gap_length == 0:
                start_pos = counter
                gap_length = 1
                gap = True
            else:
                gap_length += 1
        else:
            if gap:
                print(
                    record.id
                    + "\t"
                    + str(start_pos)
                    + "\t"
                    + str(start_pos + gap_length)
                )
                gap_length = 0
                gap = False
        counter += 1
ADD REPLYlink modified 24 days ago • written 24 days ago by gabriel.lichtenstein0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2363 users visited in the last hour
_