Any Script To Parse Fasta Headers?
4
4
Entering edit mode
9.6 years ago
SK ▴ 110

I have a 47GB file to parse. The sequences are in following format:

>TSCS_00041 gene0EA_12345_rframe2_ORF
MLAATHYYKFAIRRLFPLLKDTICASYSISIKHHENFMALSNMPKIWEDVEVDGNNMQWTRFQTTPVMPVYFIAAGVFNLSFITNWNTKLLYRKDILPYMTFAYNVAKNIAWFLSHIRKTKITNHI
>TSCS_00044 gene0EA_12341_rframe2_ORF
MTICASYSISIKHHENFMAIKHHENFMALSNMPKIWEDV

I simply want to format this file like:

>TSCS_00041
MLAATHYYKFAIRRLFPLLKDTICASYSISIKHHENFMALSNMPKIWEDVEVDGNNMQWTRFQTTPVMPVYFIAAGVFNLSFITNWNTKLLYRKDILPYMTFAYNVAKNIAWFLSHIRKTKITNHI
>TSCS_00044 
MTICASYSISIKHHENFMAIKHHENFMALSNMPKIWEDV

Could anyone share the script.

bioinformatics fasta next-gen perl perl • 9.0k views
ADD COMMENT
5
Entering edit mode

what have you tried ? hint: 'cut'

ADD REPLY
0
Entering edit mode

can this be done with cut only? the OP seems to shorten the fasta header not the other lines

ADD REPLY
5
Entering edit mode

cut -d" " -f 1 will work as long as no spaces in sequence.

ADD REPLY
0
Entering edit mode

and that's how homework is done ;)

ADD REPLY
10
Entering edit mode
9.6 years ago

Assuming that you do not have spaces in your sequences you can try that:

sed -e '/^>/ s/ .*//' mybigfile

Awk should work too (even if your sequences contain spaces):

awk '{print /^>/ ? $1 : $0}' mybigfile

and just for fun, a pure bash version:

while read l ; do echo "${l%% *}" ; done < mybigfile

but a simple cut command would be faster:

cut -d " " -f 1 mybigfile
ADD COMMENT
0
Entering edit mode

nice, can you explain the bash version?

ADD REPLY
2
Entering edit mode

Sure, "${l%% *}" is a parameter expansion. It says remove the longest suffix string containing a space from the variable l. So, for all lines in mybigfile, remove anything right of a space (if any) and write the result on the standard output.

ADD REPLY
0
Entering edit mode

cheers .

ADD REPLY
5
Entering edit mode
9.6 years ago
Kenosis ★ 1.3k

Try the following:

use strict;
use warnings;

while (<>) {
    s/>\S+\K.+//;
    print;
}

Results:

>TSCS_00041
MLAATHYYKFAIRRLFPLLKDTICASYSISIKHHENFMALSNMPKIWEDVEVDGNNMQWTRFQTTPVMPVYFIAAGVFNLSFITNWNTKLLYRKDILPYMTFAYNVAKNIAWFLSHIRKTKITNHI
>TSCS_00044
MTICASYSISIKHHENFMAIKHHENFMALSNMPKIWEDV

Usage: perl script.pl inFile >outFile

As the fasta inFile is read line-by-line, the substituting regex will keep the > and the non-whitespace characters up to the first whitespace. The >outFile notation directs the printing to the file outFile.

Hope this helps!

ADD COMMENT
6
Entering edit mode

the equivalent perl-one-liner:

  perl -pe 's/>\S+\K.+//' < file.fasta > new_file.fasta
ADD REPLY
2
Entering edit mode

Actually I am not sure why you always use -plane. -p and -n contradict each other, while -a enables autosplit, which causes overhead and is not necessary in this case. -l is useless here, too. The proper one should be perl -pe 's/^>(\S+).*/>$1/' or something similar. Sed also works.

ADD REPLY
0
Entering edit mode

You (as usual) are right, I removed the unnecessary flags. Thanks.

ADD REPLY
3
Entering edit mode
9.6 years ago
KCC ★ 4.0k

Put the following code in a file called parse.py

import sys

for line in open(sys.argv[1]):
    if line.startswith(">"):
        line = line.split()
        print line[0]
    else:
        print line.strip()

Then assuming your file is named "myfile.fa", you type:

python parse.py myfile.fa
ADD COMMENT
1
Entering edit mode

This is good!

Can also use a substutition w/capture here:

import sys, re

for line in open(sys.argv[1]):
    print re.sub(r'(>\S+)\s+.+', r'\1', line) or line,

Sorry, George, about deleting my earlier posting. I didn't know what I was doing...

ADD REPLY
0
Entering edit mode

if you remove "if line.startswith(">"):" from the code, the script will still work.

ADD REPLY
1
Entering edit mode

Fair enough. A little more slick than I was going for, I guess. I guess I would argue the program as written is more robust, explicit and easier to tweak if you need something slightly different done later.

ADD REPLY
0
Entering edit mode

If you keep the "if line.startswith('>'), you can remove the first strip() function. You do not need to remove the newline character (or any trailing spaces) since you only output the first item.

ADD REPLY
0
Entering edit mode

Thanks. I modified it.

ADD REPLY
0
Entering edit mode
9.6 years ago

The Biopieces www.biopieces.org) solution:

read_fasta -i in.fna | split_vals -k SEQ_NAME -d ' ' | rename_keys -k SEQ_NAME_0,SEQ_NAME | write_fasta -o out.fna -x
ADD COMMENT

Login before adding your answer.

Traffic: 1853 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6