Removing All Empty Fasta Sequences From A File (Was: Editing The Headers Of The Fasta Format Sequence)
5
6
Entering edit mode
8.4 years ago
genomelover ▴ 50

Hi to all,

I have a genome sequence in FASTA FORMAT, and I need to edit and omit the headers which do not contain any sequence as I have to run it in a tool which needs the sequences in FASTA FORMAT.

My sequence is pasted below. Can anyone suggest me a script which can edit the FASTA headers in the script. I have tried using online tools but I have too many sequences, so I cannot do it manually. Can someone suggest me any script? Thanks in advance

>Throat_LANL_12_orf00001 begin=81 end=9 rf=-3 score=2.57
>Throat_LANL_14_orf00002 begin=45 end=212 rf=-3 score=4.34
CTTTACGTAGCGGAAAATTAGATACGGACAGATAAATGTTAGAAGAATTAAATATCGATC
TATCTAGCTTAAAAGCAAATAAAGTAGATAACAGAATATTAGGGGTTGTTNNTTTGACGA
TCTCCGCTCAAAAGAAAAAGAAGAGATCATTCAATGTGTTTACAATGT
>Throat_LANL_15_orf00001 begin=407 end=125 rf=-2 score=6.05
>Throat_LANL_19_orf00001 begin=394 end=7 rf=2 score=1.93
>Throat_LANL_20_orf00001 begin=651 end=445 rf=2 score=2.78
>Throat_LANL_20_orf00003 begin=455 end=393 rf=-2 score=2.95
>Throat_LANL_37_orf00001 begin=116 end=41 rf=-2 score=1.40
>Throat_LANL_47_orf00001 begin=328 end=139 rf=2 score=0.63
>Throat_LANL_52_orf00001 begin=514 end=148 rf=-1 score=0.31
>Throat_LANL_58_orf00001 begin=208 end=40 rf=2 score=0.56
>Throat_LANL_59_orf00001 begin=316 end=262 rf=2 score=5.28
>Throat_LANL_59_orf00003 begin=338 end=266 rf=-2 score=4.74
>Throat_LANL_64_orf00001 begin=619 end=602 rf=-1 score=2.92
>Throat_LANL_73_orf00002 begin=91 end=222 rf=1 score=0.18
GATCGTCAAANNAACAACCCCTAATATTCTGTTATCTACTTTATTTGCTTTTAAGCTAGA
TAGATCGATATTTAATTCTTCTAACATTTATCTGTCCGTATCTAATTTTCCGCTACGTAA
AGCGTCAAGTAA


Example of output (manual edit by moderator)

>Throat_LANL_14_orf00002 begin=45 end=212 rf=-3 score=4.34
CTTTACGTAGCGGAAAATTAGATACGGACAGATAAATGTTAGAAGAATTAAATATCGATC
TATCTAGCTTAAAAGCAAATAAAGTAGATAACAGAATATTAGGGGTTGTTNNTTTGACGA
TCTCCGCTCAAAAGAAAAAGAAGAGATCATTCAATGTGTTTACAATGT
>Throat_LANL_73_orf00002 begin=91 end=222 rf=1 score=0.18
GATCGTCAAANNAACAACCCCTAATATTCTGTTATCTACTTTATTTGCTTTTAAGCTAGA
TAGATCGATATTTAATTCTTCTAACATTTATCTGTCCGTATCTAATTTTCCGCTACGTAA
AGCGTCAAGTAA

fasta script • 12k views
0
Entering edit mode

can you please add an example of the output that you would expect from the file you have posted? If I have understood it correctly, it would contain only the sequences LAN_73 and LAN14. Can you please confirm it?

0
Entering edit mode

thank you for responding to the query, you are right it would only contain LAN_73 and 14. I have given the detailed explanation in the answers.

0
Entering edit mode

12
Entering edit mode
8.4 years ago

Here is a awk command that does the job:

awk 'BEGIN {RS = ">" ; FS = "\n" ; ORS = ""} {if ($2) print ">"$0}' input.fas > output.fas

With RS, we declare that the records are separated by >. If the record contains a second line $2 (i.e. is not an empty fasta record), print the record (add a > in front). edit: just added a missing space for readability. Note that it is also possible to drop the if: awk 'BEGIN {RS = ">" ; FS = "\n" ; ORS = ""}$2 {print ">"$0}' input.fas > output.fas ADD COMMENT 0 Entering edit mode thank you for the help ADD REPLY 4 Entering edit mode 8.4 years ago JC 12k Perl option: #!/usr/bin/perl use strict; use warnings;$/="\n>";
while (<>) {
s/>//g;
my ($id,$seq) = split (/\n/, $_); print ">$_" if (length $seq); }  run as: perl filterSeq.pl < FASTA > OUTPUT  ADD COMMENT 1 Entering edit mode Here is a one-liner version: perl -e '$/="\n>"; while (<>) { s/>//g; my ($id,$seq) = split /\n/; print ">$_" if length$seq; }' < in.fasta > out.fasta


Also, I think writing if length $seq is better because it is shorter and does exactly what you want (rejects sequences <1 not reject sequences <= 1). ADD REPLY 1 Entering edit mode well, you are right, if length$seq can be used too. btw you have missing the last '

0
Entering edit mode

Good catch. My point was that if length $seq > 1 will skip sequences of length 1. I don't think that is what you want. It's not the same as saying if length$seq;

0
Entering edit mode

I understand that too, I changed my previous code.

3
Entering edit mode
8.4 years ago

You can do it with a simple BioPython script:

#!/usr/bin/env python
from Bio import *
for current_seq in SeqIO.parse("mysequences.fasta", "fasta"):
if current_seq.seq.tostring() != '':
print ">", current_seq.id, "\n", current_seq.seq.tostring()


If you have trouble installing Biopython, you can write a plain python script:

#!/usr/bin/env python
current_seq = ''
for line in open("mysequences.fasta", "r"):
if line.startswith(">"):
if current_seq != '':
current_seq = ''
else:
current_seq += line
if current_seq != '':


To run these scripts, just copy and paste them to a file, and then run them from the command line. For example, if you save the second code as "remove_empty_sequences.py", you can execute it by typing:

python remove_empty_sequences.py > outputfile


Note that the script assumes that your input sequences are stored in a file called mysequences.fasta. I'll leave you as an exercise the task of generalizing it for other file names.

0
Entering edit mode

thanks a lot for the script but can you please write the whole script instead as I am new to this world of linux and bioinformatics. Any help from your side would be greatly appreciated. Thank you once again. I mean it would start with #!/usr/bin/env python and I do not know if any other thing has to be added. Sorry I am a novice to this world please bear with me.

0
Entering edit mode

I've modified the answer, please have a look at it. The examples were already complete, I've just added the python shebang ( #!/usr/bin/env python)

0
Entering edit mode

Hi thanks a lot for the help, I used the first script as I have biopython installed but it gives the following error Traceback (most recent call last): File "answer.py", line 3, in <module> for current_seq in SeqIO.parse("mysequences.fasta", "fasta"): NameError: name 'SeqIO' is not defined

I think I have the SeqIO module installed with biopython. Can you plz help? Well,thanks once again

0
Entering edit mode

This is weird; are you sure that BioPython is installed correctly? Which version it is? SeqIO should be imported when we do from Bio import *.

0
Entering edit mode

Hi sorry to ask again, I then used the second script,can you please tell which is the current_seq as according to the script the FASTA file which is to be edited is mysequences.fasta? The error which I get is Traceback (most recent call last): File "answer1.py", line 4, in <module> if current_seq != '': NameError: name 'current_seq' is not defined

0
Entering edit mode

hi, sorry, indeed I had forgot to copy two lines of the script. Check it out now.

0
Entering edit mode

bravo!!!!!!!!!! the script is working indeed but just one more question can you plz let me know if I can get the output in a separate file as I have to run the output in other tool. Thanks a lot for the help.

0
Entering edit mode

just redirect the output to a file. E.g. python remove_empty_sequences.py > outputfile

I've updated the answer and also fixed a subtle error in the code (which caused the last sequence to be not included in the output).

0
Entering edit mode

Sorry for the late reply the script is working and I get the output in a separate file. Thanks a lot for the help

2
Entering edit mode
8.4 years ago
SRKR ▴ 180

PHP code to read sequences from a file (in.txt) and write the output to a file (out.txt)

<?php
$fr = fopen("in.txt", "r");$text = fread($fr, filesize("in.txt"));$fw = fopen("out.txt", "w");
while (preg_match("/>[^\n>]{1,}\n>/",$text) === 1) {$text2 = preg_replace("/>[^\n>]{1,}\n>/",">",$text);$text = $text2; } fwrite($fw, $text); fclose($fr);
fclose($fw); ?>  Hope this helps you too.... ADD COMMENT 0 Entering edit mode thank you for the help ADD REPLY 0 Entering edit mode I have a php code, how can I save my output result in text file at my computer drive. thanks ####################### <?php$file = '/home/sekhwalm/result_Lu4.txt';
$searchfor = '100.00'; // the following line prevents the browser from parsing this as HTML. header('Content-Type: text/plain'); // get the file contents, assuming the file to be readable (and exist)$contents = file_get_contents($file); // escape special characters in the query$pattern = preg_quote($searchfor, '/'); // finalise the regular expression, matching the whole line$pattern = "/^.*$pattern.*\$/m";
// search, and store all matching occurences in $matches if(preg_match_all($pattern, $contents,$matches)){
echo "Found matches:\n";
echo implode("\n", \$matches[0]);
}
else{
echo "No matches found";
}

##############################