Removing All Empty Fasta Sequences From A File (Was: Editing The Headers Of The Fasta Format Sequence)
5
6
Entering edit mode
8.4 years ago
genomelover ▴ 50

Hi to all,

I have a genome sequence in FASTA FORMAT, and I need to edit and omit the headers which do not contain any sequence as I have to run it in a tool which needs the sequences in FASTA FORMAT.

My sequence is pasted below. Can anyone suggest me a script which can edit the FASTA headers in the script. I have tried using online tools but I have too many sequences, so I cannot do it manually. Can someone suggest me any script? Thanks in advance

>Throat_LANL_12_orf00001 begin=81 end=9 rf=-3 score=2.57
>Throat_LANL_14_orf00002 begin=45 end=212 rf=-3 score=4.34
CTTTACGTAGCGGAAAATTAGATACGGACAGATAAATGTTAGAAGAATTAAATATCGATC
TATCTAGCTTAAAAGCAAATAAAGTAGATAACAGAATATTAGGGGTTGTTNNTTTGACGA
TCTCCGCTCAAAAGAAAAAGAAGAGATCATTCAATGTGTTTACAATGT
>Throat_LANL_15_orf00001 begin=407 end=125 rf=-2 score=6.05
>Throat_LANL_19_orf00001 begin=394 end=7 rf=2 score=1.93
>Throat_LANL_20_orf00001 begin=651 end=445 rf=2 score=2.78
>Throat_LANL_20_orf00003 begin=455 end=393 rf=-2 score=2.95
>Throat_LANL_37_orf00001 begin=116 end=41 rf=-2 score=1.40
>Throat_LANL_47_orf00001 begin=328 end=139 rf=2 score=0.63
>Throat_LANL_52_orf00001 begin=514 end=148 rf=-1 score=0.31
>Throat_LANL_58_orf00001 begin=208 end=40 rf=2 score=0.56
>Throat_LANL_59_orf00001 begin=316 end=262 rf=2 score=5.28
>Throat_LANL_59_orf00003 begin=338 end=266 rf=-2 score=4.74
>Throat_LANL_64_orf00001 begin=619 end=602 rf=-1 score=2.92
>Throat_LANL_73_orf00002 begin=91 end=222 rf=1 score=0.18
GATCGTCAAANNAACAACCCCTAATATTCTGTTATCTACTTTATTTGCTTTTAAGCTAGA
TAGATCGATATTTAATTCTTCTAACATTTATCTGTCCGTATCTAATTTTCCGCTACGTAA
AGCGTCAAGTAA

Example of output (manual edit by moderator)

>Throat_LANL_14_orf00002 begin=45 end=212 rf=-3 score=4.34
CTTTACGTAGCGGAAAATTAGATACGGACAGATAAATGTTAGAAGAATTAAATATCGATC
TATCTAGCTTAAAAGCAAATAAAGTAGATAACAGAATATTAGGGGTTGTTNNTTTGACGA
TCTCCGCTCAAAAGAAAAAGAAGAGATCATTCAATGTGTTTACAATGT
>Throat_LANL_73_orf00002 begin=91 end=222 rf=1 score=0.18
GATCGTCAAANNAACAACCCCTAATATTCTGTTATCTACTTTATTTGCTTTTAAGCTAGA
TAGATCGATATTTAATTCTTCTAACATTTATCTGTCCGTATCTAATTTTCCGCTACGTAA
AGCGTCAAGTAA
fasta script • 12k views
ADD COMMENT
0
Entering edit mode

can you please add an example of the output that you would expect from the file you have posted? If I have understood it correctly, it would contain only the sequences LAN_73 and LAN14. Can you please confirm it?

ADD REPLY
0
Entering edit mode

thank you for responding to the query, you are right it would only contain LAN_73 and 14. I have given the detailed explanation in the answers.

ADD REPLY
0
Entering edit mode

You have received many answers now. Please vote the ones that you consider useful.

ADD REPLY
12
Entering edit mode
8.4 years ago

Here is a awk command that does the job:

awk 'BEGIN {RS = ">" ; FS = "\n" ; ORS = ""} {if ($2) print ">"$0}' input.fas > output.fas

With RS, we declare that the records are separated by >. If the record contains a second line $2 (i.e. is not an empty fasta record), print the record (add a > in front).

edit: just added a missing space for readability. Note that it is also possible to drop the if:

awk 'BEGIN {RS = ">" ; FS = "\n" ; ORS = ""} $2 {print ">"$0}' input.fas > output.fas

ADD COMMENT
0
Entering edit mode

thank you for the help

ADD REPLY
4
Entering edit mode
8.4 years ago
JC 12k

Perl option:

#!/usr/bin/perl

use strict;
use warnings;

$/="\n>";
while (<>) {
    s/>//g;
    my ($id, $seq) = split (/\n/, $_);
    print ">$_" if (length $seq);
}

run as:

perl filterSeq.pl < FASTA > OUTPUT
ADD COMMENT
1
Entering edit mode

Here is a one-liner version:

perl -e '$/="\n>"; while (<>) { s/>//g; my ($id, $seq) = split /\n/; print ">$_" if length $seq; }' < in.fasta > out.fasta

Also, I think writing if length $seq is better because it is shorter and does exactly what you want (rejects sequences <1 not reject sequences <= 1).

ADD REPLY
1
Entering edit mode

well, you are right, if length $seq can be used too. btw you have missing the last '

ADD REPLY
0
Entering edit mode

Good catch. My point was that if length $seq > 1 will skip sequences of length 1. I don't think that is what you want. It's not the same as saying if length $seq;

ADD REPLY
0
Entering edit mode

I understand that too, I changed my previous code.

ADD REPLY
3
Entering edit mode
8.4 years ago

You can do it with a simple BioPython script:

#!/usr/bin/env python
from Bio import *
for current_seq in SeqIO.parse("mysequences.fasta", "fasta"):
    if current_seq.seq.tostring() != '':
        print ">", current_seq.id, "\n", current_seq.seq.tostring()

If you have trouble installing Biopython, you can write a plain python script:

#!/usr/bin/env python
current_seq = ''
current_header = ''
for line in open("mysequences.fasta", "r"):
    if line.startswith(">"):
        if current_seq != '':
            print current_header, current_seq,
        current_header = line
        current_seq = ''
    else:
        current_seq += line
if current_seq != '':
    print current_header, current_seq

To run these scripts, just copy and paste them to a file, and then run them from the command line. For example, if you save the second code as "remove_empty_sequences.py", you can execute it by typing:

python remove_empty_sequences.py > outputfile

Note that the script assumes that your input sequences are stored in a file called mysequences.fasta. I'll leave you as an exercise the task of generalizing it for other file names.

ADD COMMENT
0
Entering edit mode

thanks a lot for the script but can you please write the whole script instead as I am new to this world of linux and bioinformatics. Any help from your side would be greatly appreciated. Thank you once again. I mean it would start with #!/usr/bin/env python and I do not know if any other thing has to be added. Sorry I am a novice to this world please bear with me.

ADD REPLY
0
Entering edit mode

I've modified the answer, please have a look at it. The examples were already complete, I've just added the python shebang ( #!/usr/bin/env python)

ADD REPLY
0
Entering edit mode

Hi thanks a lot for the help, I used the first script as I have biopython installed but it gives the following error Traceback (most recent call last): File "answer.py", line 3, in <module> for current_seq in SeqIO.parse("mysequences.fasta", "fasta"): NameError: name 'SeqIO' is not defined

I think I have the SeqIO module installed with biopython. Can you plz help? Well,thanks once again

ADD REPLY
0
Entering edit mode

This is weird; are you sure that BioPython is installed correctly? Which version it is? SeqIO should be imported when we do from Bio import *.

ADD REPLY
0
Entering edit mode

Hi sorry to ask again, I then used the second script,can you please tell which is the current_seq as according to the script the FASTA file which is to be edited is mysequences.fasta? The error which I get is Traceback (most recent call last): File "answer1.py", line 4, in <module> if current_seq != '': NameError: name 'current_seq' is not defined

ADD REPLY
0
Entering edit mode

hi, sorry, indeed I had forgot to copy two lines of the script. Check it out now.

ADD REPLY
0
Entering edit mode

bravo!!!!!!!!!! the script is working indeed but just one more question can you plz let me know if I can get the output in a separate file as I have to run the output in other tool. Thanks a lot for the help.

ADD REPLY
0
Entering edit mode

just redirect the output to a file. E.g. python remove_empty_sequences.py > outputfile

I've updated the answer and also fixed a subtle error in the code (which caused the last sequence to be not included in the output).

ADD REPLY
0
Entering edit mode

Sorry for the late reply the script is working and I get the output in a separate file. Thanks a lot for the help

ADD REPLY
2
Entering edit mode
8.4 years ago
SRKR ▴ 180

PHP code to read sequences from a file (in.txt) and write the output to a file (out.txt)

<?php
$fr = fopen("in.txt", "r");
$text = fread($fr, filesize("in.txt"));
$fw = fopen("out.txt", "w");
while (preg_match("/>[^\n>]{1,}\n>/",$text) === 1)
    {
    $text2 = preg_replace("/>[^\n>]{1,}\n>/",">",$text);
    $text = $text2;
    }
fwrite($fw, $text);
fclose($fr);
fclose($fw);
?>

Hope this helps you too....

ADD COMMENT
0
Entering edit mode

thank you for the help

ADD REPLY
0
Entering edit mode

I have a php code, how can I save my output result in text file at my computer drive.

thanks

#######################

<?php
$file = '/home/sekhwalm/result_Lu4.txt';
$searchfor = '100.00';

// the following line prevents the browser from parsing this as HTML.
header('Content-Type: text/plain');

// get the file contents, assuming the file to be readable (and exist)
$contents = file_get_contents($file);
// escape special characters in the query
$pattern = preg_quote($searchfor, '/');
// finalise the regular expression, matching the whole line
$pattern = "/^.*$pattern.*\$/m";
// search, and store all matching occurences in $matches
if(preg_match_all($pattern, $contents, $matches)){
   echo "Found matches:\n";
   echo implode("\n", $matches[0]);
}
else{
   echo "No matches found";
}

##############################

 

ADD REPLY

Login before adding your answer.

Traffic: 2005 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6