Question: Extract only sequences that have N nucleotides
0
4.9 years ago by
waqasnayab200
Pakistan
waqasnayab200 wrote:

Hi,

I have a FASTA file. From that file, I need to extract only those sequences that contain N nucleotides.

>seq1
AGCGGCGTAACGTCGTAGTC
>seq2
ACGCGTACNNNNNNTGCGA

I want output like this:

>seq1
AGCGGCGTAACGTCGTAGTC

Regards,

Waqas

 

sequencing genome • 1.5k views
modified 4.9 years ago • written 4.9 years ago by waqasnayab200

Your example has no Ns, which is the exact opposite of what you say you want.

Are the sequences always on a single line?

written 4.9 years ago by Devon Ryan97k

Do you mean that you want to exclude the sequences with Ns?

awk -v header="" '{ if($1~/^>/){header=$1}else if($1!~/N/){print header; print $1;}}' fastaFile

Note that this only works if the FASTA is not multi-line, i.e. each sequence sits on a single line.

modified 11 months ago by _r_am31k • written 4.9 years ago by Sam3.3k
1
4.9 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum131k wrote:

Linearize the FASTA, filter with awk and a regular expression, then convert back to FASTA:

awk -f linearizefasta.awk < input.fa | awk -F '\t' '($2 ~ /N/)' | tr "\t" "\n"
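
The three stages of that pipeline can also be sketched in Python. The script linearizefasta.awk is not shown in this thread, so the sketch assumes it emits one record per line as header, tab, sequence; the function names linearize, keep_with_n and to_fasta are illustrative only, and headers are assumed to contain no tab characters.

```python
import re
from typing import Iterable, Iterator


def linearize(lines: Iterable[str]) -> Iterator[str]:
    """Yield one 'header<TAB>sequence' row per FASTA record (assumed layout)."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header + "\t" + "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)  # join wrapped sequence lines
    if header is not None:
        yield header + "\t" + "".join(chunks)


def keep_with_n(rows: Iterable[str]) -> Iterator[str]:
    """Keep rows whose sequence column contains an N (mirrors $2 ~ /N/)."""
    for row in rows:
        _header, seq = row.split("\t", 1)
        if re.search("N", seq):
            yield row


def to_fasta(rows: Iterable[str]) -> str:
    """Turn each tab back into a newline, as the tr step does."""
    return "".join(row.replace("\t", "\n") + "\n" for row in rows)


if __name__ == "__main__":
    # Made-up multi-line input based on the sequences in the question.
    example = [">seq1", "AGCGGCGTAA", "CGTCGTAGTC",
               ">seq2", "ACGCGTACNN", "NNNNTGCGA"]
    print(to_fasta(keep_with_n(linearize(example))), end="")
```

Because only the second column is tested, an N in a header does not cause a false match. To drop the N-containing records instead, invert the test in keep_with_n.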
modified 11 months ago by _r_am31k • written 4.9 years ago by Pierre Lindenbaum131k
1
4.9 years ago by
Daniel3.8k
Cardiff University
Daniel3.8k wrote:

With one-line FASTA and sequence names of the form 'seq#' (i.e. containing no letter 'N'), you can simply do:

grep -B 1 'N' input.fasta >output_no_Ns.fasta

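But that assumes your data is in exactly the format above: multi-line FASTA or Ns in your headers would break it. Note also that GNU grep prints a "--" separator line between non-adjacent groups of matches when -B is used (its --no-group-separator option suppresses this). Still, this is what I'd do if I had this data.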

modified 11 months ago by _r_am31k • written 4.9 years ago by Daniel3.8k
1
4.9 years ago by
waqasnayab200
Pakistan
waqasnayab200 wrote:

Hi,

First, I linearized the multi-line FASTA file into a single-line FASTA file:

Multiline Fasta To Single Line Fasta

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < file.fa

Then, I used @Daniel's command:

grep -B 1 'N' input.fasta >output_no_Ns.fasta

I apologize to the community for not explaining my question properly.

Best,
Waqas.

modified 11 months ago by _r_am31k • written 4.9 years ago by waqasnayab200
1

Thanks for letting us know! You can upvote and tick the helpful answers to finish the question thread.

written 4.9 years ago by Daniel3.8k
0
4.9 years ago by
Anima Mundi2.8k
Italy
Anima Mundi2.8k wrote:

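In Python 3, for a file named foo.fa containing linearised FASTA records (this drops the records that contain Ns):

header = ''

# Assumes one header line followed by one sequence line per record.
with open('foo.fa') as fasta:
    for line in fasta:
        if line.startswith('>'):
            header = line
        elif line == '\n':
            continue  # skip blank lines
        elif 'N' not in line.upper():
            # Lines keep their trailing newline, so suppress print's own.
            print(header, end='')
            print(line, end='')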
modified 11 months ago by _r_am31k • written 4.9 years ago by Anima Mundi2.8k
0
4.9 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

With the BBMap package:

reformat.sh in=file.fasta out=fixed.fasta maxns=0
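
For readers without BBMap, the same idea (drop any record with more than a given number of Ns) can be sketched in plain Python. This is not BBMap's implementation: read_fasta, filter_by_n_count and the max_ns parameter are hypothetical names chosen to echo the maxns option above, and counting lower-case n as well as upper-case N is an assumption of this sketch.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs; wrapped sequence lines are joined."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_n_count(records: Iterable[Record], max_ns: int = 0) -> Iterator[Record]:
    """Keep records with at most max_ns N characters (case-insensitive, by assumption)."""
    for header, seq in records:
        if seq.upper().count("N") <= max_ns:
            yield header, seq


if __name__ == "__main__":
    # Made-up input based on the sequences in the question.
    example = [">seq1", "AGCGGCGTAACGTCGTAGTC", ">seq2", "ACGCGTACNNNNNNTGCGA"]
    for header, seq in filter_by_n_count(read_fasta(example), max_ns=0):
        print(header)
        print(seq)
```

With max_ns=0 only seq1 survives. Each sequence is written back on a single line rather than re-wrapped.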
modified 11 months ago by _r_am31k • written 4.9 years ago by Brian Bushnell17k