Question

Removing All Empty Fasta Sequences From A File (Was: Editing The Headers Of The Fasta Format Sequence)

6

Entering edit mode

10.7 years ago

genomelover ▴ 50

Hi to all,

I have a genome sequence in FASTA FORMAT, and I need to edit and omit the headers which do not contain any sequence as I have to run it in a tool which needs the sequences in FASTA FORMAT.

My sequence is pasted below. Can anyone suggest me a script which can edit the FASTA headers in the script. I have tried using online tools but I have too many sequences, so I cannot do it manually. Can someone suggest me any script? Thanks in advance

>Throat_LANL_12_orf00001 begin=81 end=9 rf=-3 score=2.57
>Throat_LANL_14_orf00002 begin=45 end=212 rf=-3 score=4.34
CTTTACGTAGCGGAAAATTAGATACGGACAGATAAATGTTAGAAGAATTAAATATCGATC
TATCTAGCTTAAAAGCAAATAAAGTAGATAACAGAATATTAGGGGTTGTTNNTTTGACGA
TCTCCGCTCAAAAGAAAAAGAAGAGATCATTCAATGTGTTTACAATGT
>Throat_LANL_15_orf00001 begin=407 end=125 rf=-2 score=6.05
>Throat_LANL_19_orf00001 begin=394 end=7 rf=2 score=1.93
>Throat_LANL_20_orf00001 begin=651 end=445 rf=2 score=2.78
>Throat_LANL_20_orf00003 begin=455 end=393 rf=-2 score=2.95
>Throat_LANL_37_orf00001 begin=116 end=41 rf=-2 score=1.40
>Throat_LANL_47_orf00001 begin=328 end=139 rf=2 score=0.63
>Throat_LANL_52_orf00001 begin=514 end=148 rf=-1 score=0.31
>Throat_LANL_58_orf00001 begin=208 end=40 rf=2 score=0.56
>Throat_LANL_59_orf00001 begin=316 end=262 rf=2 score=5.28
>Throat_LANL_59_orf00003 begin=338 end=266 rf=-2 score=4.74
>Throat_LANL_64_orf00001 begin=619 end=602 rf=-1 score=2.92
>Throat_LANL_73_orf00002 begin=91 end=222 rf=1 score=0.18
GATCGTCAAANNAACAACCCCTAATATTCTGTTATCTACTTTATTTGCTTTTAAGCTAGA
TAGATCGATATTTAATTCTTCTAACATTTATCTGTCCGTATCTAATTTTCCGCTACGTAA
AGCGTCAAGTAA

Example of output (manual edit by moderator)

>Throat_LANL_14_orf00002 begin=45 end=212 rf=-3 score=4.34
CTTTACGTAGCGGAAAATTAGATACGGACAGATAAATGTTAGAAGAATTAAATATCGATC
TATCTAGCTTAAAAGCAAATAAAGTAGATAACAGAATATTAGGGGTTGTTNNTTTGACGA
TCTCCGCTCAAAAGAAAAAGAAGAGATCATTCAATGTGTTTACAATGT
>Throat_LANL_73_orf00002 begin=91 end=222 rf=1 score=0.18
GATCGTCAAANNAACAACCCCTAATATTCTGTTATCTACTTTATTTGCTTTTAAGCTAGA
TAGATCGATATTTAATTCTTCTAACATTTATCTGTCCGTATCTAATTTTCCGCTACGTAA
AGCGTCAAGTAA

fasta script • 16k views

ADD COMMENT • link updated 16 months ago by Ram 43k • written 10.7 years ago by genomelover ▴ 50

0

Entering edit mode

can you please add an example of the output that you would expect from the file you have posted? If I have understood it correctly, it would contain only the sequences LAN_73 and LAN14. Can you please confirm it?

ADD REPLY • link 10.7 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

thank you for responding to the query, you are right it would only contain LAN_73 and 14. I have given the detailed explanation in the answers.

ADD REPLY • link 10.7 years ago by genomelover ▴ 50

0

Entering edit mode

You have received many answers now. Please vote the ones that you consider useful.

ADD REPLY • link 10.7 years ago by Giovanni M Dall'Olio 28k

score 12 · Answer 1 · 2013-08-13

12

Entering edit mode

10.7 years ago

Frédéric Mahé ★ 3.2k

Here is a awk command that does the job:

awk 'BEGIN {RS = ">" ; FS = "\n" ; ORS = ""} {if ($2) print ">"$0}' input.fas > output.fas

With RS, we declare that the records are separated by >. If the record contains a second line $2 (i.e. is not an empty fasta record), print the record (add a > in front).

edit: just added a missing space for readability. Note that it is also possible to drop the if:

awk 'BEGIN {RS = ">" ; FS = "\n" ; ORS = ""} $2 {print ">"$0}' input.fas > output.fas

ADD COMMENT • link 10.7 years ago by Frédéric Mahé ★ 3.2k

0

Entering edit mode

thank you for the help

ADD REPLY • link 10.7 years ago by genomelover ▴ 50

score 4 · Answer 2 · 2013-08-13

4

Entering edit mode

10.7 years ago

JC 13k

Perl option:

#!/usr/bin/perl

use strict;
use warnings;

$/="\n>";
while (<>) {
    s/>//g;
    my ($id, $seq) = split (/\n/, $_);
    print ">$_" if (length $seq);
}

run as:

perl filterSeq.pl < FASTA > OUTPUT

ADD COMMENT • link 10.7 years ago by JC 13k

1

Entering edit mode

Here is a one-liner version:

perl -e '$/="\n>"; while (<>) { s/>//g; my ($id, $seq) = split /\n/; print ">$_" if length $seq; }' < in.fasta > out.fasta

Also, I think writing if length $seq is better because it is shorter and does exactly what you want (rejects sequences <1 not reject sequences <= 1).

ADD REPLY • link 10.7 years ago by SES 8.6k

1

Entering edit mode

well, you are right, if length $seq can be used too. btw you have missing the last '

ADD REPLY • link 10.7 years ago by JC 13k

0

Entering edit mode

Good catch. My point was that if length $seq > 1 will skip sequences of length 1. I don't think that is what you want. It's not the same as saying if length $seq;

ADD REPLY • link 10.7 years ago by SES 8.6k

0

Entering edit mode

I understand that too, I changed my previous code.

ADD REPLY • link 10.7 years ago by JC 13k

score 3 · Answer 3 · 2013-08-13

3

Entering edit mode

10.7 years ago

Giovanni M Dall'Olio 28k

You can do it with a simple BioPython script:

#!/usr/bin/env python
from Bio import *
for current_seq in SeqIO.parse("mysequences.fasta", "fasta"):
    if current_seq.seq.tostring() != '':
        print ">", current_seq.id, "\n", current_seq.seq.tostring()

If you have trouble installing Biopython, you can write a plain python script:

#!/usr/bin/env python
current_seq = ''
current_header = ''
for line in open("mysequences.fasta", "r"):
    if line.startswith(">"):
        if current_seq != '':
            print current_header, current_seq,
        current_header = line
        current_seq = ''
    else:
        current_seq += line
if current_seq != '':
    print current_header, current_seq

To run these scripts, just copy and paste them to a file, and then run them from the command line. For example, if you save the second code as "remove_empty_sequences.py", you can execute it by typing:

python remove_empty_sequences.py > outputfile

Note that the script assumes that your input sequences are stored in a file called mysequences.fasta. I'll leave you as an exercise the task of generalizing it for other file names.

ADD COMMENT • link 10.7 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

thanks a lot for the script but can you please write the whole script instead as I am new to this world of linux and bioinformatics. Any help from your side would be greatly appreciated. Thank you once again. I mean it would start with #!/usr/bin/env python and I do not know if any other thing has to be added. Sorry I am a novice to this world please bear with me.

ADD REPLY • link 10.7 years ago by genomelover ▴ 50

0

Entering edit mode

I've modified the answer, please have a look at it. The examples were already complete, I've just added the python shebang ( #!/usr/bin/env python)

ADD REPLY • link 10.7 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

Hi thanks a lot for the help, I used the first script as I have biopython installed but it gives the following error Traceback (most recent call last): File "answer.py", line 3, in <module> for current_seq in SeqIO.parse("mysequences.fasta", "fasta"): NameError: name 'SeqIO' is not defined

I think I have the SeqIO module installed with biopython. Can you plz help? Well,thanks once again

ADD REPLY • link 10.7 years ago by genomelover ▴ 50

0

Entering edit mode

This is weird; are you sure that BioPython is installed correctly? Which version it is? SeqIO should be imported when we do from Bio import *.

ADD REPLY • link 10.7 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

Hi sorry to ask again, I then used the second script,can you please tell which is the current_seq as according to the script the FASTA file which is to be edited is mysequences.fasta? The error which I get is Traceback (most recent call last): File "answer1.py", line 4, in <module> if current_seq != '': NameError: name 'current_seq' is not defined

ADD REPLY • link 10.7 years ago by genomelover ▴ 50

0

Entering edit mode

hi, sorry, indeed I had forgot to copy two lines of the script. Check it out now.

ADD REPLY • link 10.7 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

bravo!!!!!!!!!! the script is working indeed but just one more question can you plz let me know if I can get the output in a separate file as I have to run the output in other tool. Thanks a lot for the help.

ADD REPLY • link 10.7 years ago by genomelover ▴ 50

0

Entering edit mode

just redirect the output to a file. E.g. python remove_empty_sequences.py > outputfile

I've updated the answer and also fixed a subtle error in the code (which caused the last sequence to be not included in the output).

ADD REPLY • link 10.7 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

Sorry for the late reply the script is working and I get the output in a separate file. Thanks a lot for the help

ADD REPLY • link 10.7 years ago by genomelover ▴ 50

Ram · Answer 4 · 2013-08-13

2

Entering edit mode

10.7 years ago

SRKR ▴ 180

PHP code to read sequences from a file (in.txt) and write the output to a file (out.txt)

<?php
$fr = fopen("in.txt", "r");
$text = fread($fr, filesize("in.txt"));
$fw = fopen("out.txt", "w");
while (preg_match("/>[^\n>]{1,}\n>/",$text) === 1)
    {
    $text2 = preg_replace("/>[^\n>]{1,}\n>/",">",$text);
    $text = $text2;
    }
fwrite($fw, $text);
fclose($fr);
fclose($fw);
?>

Hope this helps you too....

ADD COMMENT • link 10.7 years ago by SRKR ▴ 180

0

Entering edit mode

thank you for the help

ADD REPLY • link 10.7 years ago by genomelover ▴ 50

0

Entering edit mode

I have a php code, how can I save my output result in text file at my computer drive.

Thanks

<?php
$file = '/home/sekhwalm/result_Lu4.txt';
$searchfor = '100.00';

// the following line prevents the browser from parsing this as HTML.
header('Content-Type: text/plain');

// get the file contents, assuming the file to be readable (and exist)
$contents = file_get_contents($file);
// escape special characters in the query
$pattern = preg_quote($searchfor, '/');
// finalise the regular expression, matching the whole line
$pattern = "/^.*$pattern.*\$/m";
// search, and store all matching occurences in $matches
if(preg_match_all($pattern, $contents, $matches)){
   echo "Found matches:\n";
   echo implode("\n", $matches[0]);
}
else{
   echo "No matches found";
}

ADD REPLY • link updated 16 months ago by Ram 43k • written 8.8 years ago by Kumar ▴ 170