Remove part of headers in FASTA file
2.2 years ago

I want to delete everything after ">" up to and including "cds_", as well as the numeric suffix after the accession ("_1" and "_2" in the example below). In the original file this counter runs up to 784, i.e. "_784". Can someone help me with a solution? That would be great.

From this

>lcl|NC_002712.2_cds_NP_268424.1_1 [gene=dnaZA]
ATGACTG
>lcl|NC_002721.2_cds_NP_268453.1_2 [gene=dnaGC]
ATGTTCG

To

>NP_268424.1 [gene=dnaZA]
ATGACTG
>NP_268453.1 [gene=dnaGC]
ATGTTCG

Spaces in FASTA headers may look visually appealing, but keep in mind that if you use this file for alignments etc., most aligners will drop all text after the first space in a FASTA header. So you will lose the gene name.


Thanks for the heads up! I downloaded this from GenBank. I will include tr " " "_" in my unix command.
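Note that a bare tr " " "_" rewrites every line of the stream, not only the headers. A minimal sketch that restricts the substitution to header lines, assuming headers are the lines starting with ">" (the function name underscore_headers is made up for illustration):

```shell
# Replace spaces with underscores on header lines only (lines starting with ">").
# Sequence lines pass through unchanged.
underscore_headers() {
    sed '/^>/ s/ /_/g' "$@"
}

# Demo on a single record; reads stdin when no file is given.
printf '>NP_268424.1 [gene=dnaZA]\nATGACTG\n' | underscore_headers
```

The function takes file names as arguments, or reads standard input when called without any.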


On the same train of thought as @genomax, I would say that brackets, pipes and equals signs are also best avoided. Not because they will hurt you now, but because you never know what you'll need these data for in the future. Those characters could easily mess up your future pipelines!

Something like:

>NP_268424.1_dnaZA

Would probably be the best way to ensure no problems in the future while retaining all the necessary info.
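A possible way to get there with sed, as a sketch only. It assumes the header has already been trimmed to the form ">ACCESSION [gene=NAME]" and that the gene name contains no "]"; anything after the closing bracket is discarded. The function name simplify_header is hypothetical.

```shell
# Turn ">ACCESSION [gene=NAME]" into ">ACCESSION_NAME".
# \1 captures ">" plus the accession (everything up to the first space),
# \2 captures the gene name inside the brackets.
simplify_header() {
    sed 's/^\(>[^ ]*\) \[gene=\([^]]*\)].*/\1_\2/' "$@"
}

# Demo: lines that do not match the pattern (e.g. sequence lines) are left alone.
printf '>NP_268424.1 [gene=dnaZA]\nATGACTG\n' | simplify_header
```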


I want to remove the header lines between FASTA records from the command line and merge the sequences together. For example, I want to turn:

> PHNY00000001.1 Astraceae varex cultivar OL2 scaffold000392, whole genome shotgun sequence 
agttaaacataattaatatatgttattaaatttgatatttatgaggggtaattcagtaatttcaaatgaataaattgtct caaggaaccccctagttgctctattatatatataatagatgtgtgtgtgtataatatatgtattatattcaaatttggtt aaaaaaattataaaatttaatctttGTTGCCCTTTTGTAATCGTTGATAAATTGGTCCGTTGCATATATTAGTACTAGTT GATATTAGTATTATTGTAGTATTTATATAACTGTTCAAAATATTGGTGATTGTTTGACAACTTAATTagtaatatttaat tgatataaattCATGTATTTATTCATGTCTCAAGTAAGAATACAACTATGATAGCgcaaccaattttttttttaaatcaa atgttaaatttgaattgaCACTAAAATTACAAGAATGACAATAACCTTCAACAGGGCCATCAGGGGTGTCCCTTACTATA GGCGAATTAATCTTGCGCGTGGGGTTCCGAACACTTAACAACccccaaaaatttcaaatctcaTATACATAAAGTATAAA ATGTAGTTTAGacatatgataatttgatagaATGTATGACGTCAATACTGAACATTATCTTTCTCTCATGAAAATAAGTTTT 

> PHNY00000001.2 Astraceae varex cultivar OL2 scaffold000393, whole genome shotgun sequence
GATATTAGTATTATTGTAGTATTTATATAACTGTTCAAAATATTGGTGATTGTTTGACAACTTAATTagtaatatttaat tgatataaattCATGTATTTATTCATGTCTCAAGTAAGAATACAACTATGATAGCgcaaccaattttttttttaaatcaa atgttaaatttgaattgaCACTAAAATTACAAGAATGACAATAACCTTCAACAGGGCCATCAGGGGTGTCCCTTACTATA GGCGAATTAATCTTGCGCGTGGGGTTCCGAACACTTAACAACccccaaaaatttcaaatctcaTATACATAAAGTATAAA ATGTAGTTTAGacatatgataatttgatagaATGTATGACGTCAATACTGAACATTATCTTTCTCTCATGAAAATAAGTTTT

> PHNY00000001.3 Astraceae varex cultivar OL2 scaffold000394, whole genome shotgun sequence
GATATTAGTATTATTGTAGTATTTATATAACTGTTCAAAATATTGGTGATTGTTTGACAACTTAATTagtaatatttaat tgatataaattCATGTATTTATTCATGTCTCAAGTAAGAATACAACTATGATAGCgcaaccaattttttttttaaatcaa atgttaaatttgaattgaCACTAAAATTACAAGAATGACAATAACCTTCAACAGGGCCATCAGGGGTGTCCCTTACTATA GGCGAATTAATCTTGCGCGTGGGGTTCCGAACACTTAACAACccccaaaaatttcaaatctcaTATACATAAAGTATAAA ATGTAGTTTAGacatatgataatttgatagaATGTATGACGTCAATACTGAACATTATCTTTCTCTCATGAAAATAAGTTTT

into:

> PHNY00000001.1 Astraceae varex cultivar OL2 scaffold000392, whole genome shotgun sequence
agttaaacataattaatatatgttattaaatttgatatttatgaggggtaattcagtaatttcaaatgaataaattgtct caaggaaccccctagttgctctattatatatataatagatgtgtgtgtgtataatatatgtattatattcaaatttggtt aaaaaaattataaaatttaatctttGTTGCCCTTTTGTAATCGTTGATAAATTGGTCCGTTGCATATATTAGTACTAGTT GATATTAGTATTATTGTAGTATTTATATAACTGTTCAAAATATTGGTGATTGTTTGACAACTTAATTagtaatatttaat tgatataaattCATGTATTTATTCATGTCTCAAGTAAGAATACAACTATGATAGCgcaaccaattttttttttaaatcaa atgttaaatttgaattgaCACTAAAATTACAAGAATGACAATAACCTTCAACAGGGCCATCAGGGGTGTCCCTTACTATA GGCGAATTAATCTTGCGCGTGGGGTTCCGAACACTTAACAACccccaaaaatttcaaatctcaTATACATAAAGTATAAA ATGTAGTTTAGacatatgataatttgatagaATGTATGACGTCAATACTGAACATTATCTTTCTCTCATGAAAATAAGTTTT GATATTAGTATTATTGTAGTATTTATATAACTGTTCAAAATATTGGTGATTGTTTGACAACTTAATTagtaatatttaat tgatataaattCATGTATTTATTCATGTCTCAAGTAAGAATACAACTATGATAGCgcaaccaattttttttttaaatcaa atgttaaatttgaattgaCACTAAAATTACAAGAATGACAATAACCTTCAACAGGGCCATCAGGGGTGTCCCTTACTATA GGCGAATTAATCTTGCGCGTGGGGTTCCGAACACTTAACAACccccaaaaatttcaaatctcaTATACATAAAGTATAAA ATGTAGTTTAGacatatgataatttgatagaATGTATGACGTCAATACTGAACATTATCTTTCTCTCATGAAAATAAGTTTT GATATTAGTATTATTGTAGTATTTATATAACTGTTCAAAATATTGGTGATTGTTTGACAACTTAATTagtaatatttaat tgatataaattCATGTATTTATTCATGTCTCAAGTAAGAATACAACTATGATAGCgcaaccaattttttttttaaatcaa atgttaaatttgaattgaCACTAAAATTACAAGAATGACAATAACCTTCAACAGGGCCATCAGGGGTGTCCCTTACTATA GGCGAATTAATCTTGCGCGTGGGGTTCCGAACACTTAACAACccccaaaaatttcaaatctcaTATACATAAAGTATAAA ATGTAGTTTAGacatatgataatttgatagaATGTATGACGTCAATACTGAACATTATCTTTCTCTCATGAAAATAAGTTTT

How can I do this using sed or another tool? Any help is appreciated! Thanks.


This is a different question, so please post it as such and not as a comment underneath another question. Also, please check that your desired output is exactly what you want (e.g. it contains whitespace which I believe you don't want in there).
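For reference, a rough awk sketch of what the comment above seems to ask for, assuming plain concatenation under the first header is really the goal and that whitespace inside sequence lines is unwanted. The function name merge_records is made up, and the output stays line-wrapped rather than being joined into a single line.

```shell
# Keep the first header, drop all later headers and blank lines,
# and strip whitespace inside sequence lines.
merge_records() {
    awk '
        /^>/ { if (!seen++) print; next }   # print only the first header
        { gsub(/[ \t\r]/, "") }             # remove spaces, tabs, carriage returns
        length($0) > 0 { print }            # print non-empty sequence lines
    ' "$@"
}

# Demo with two tiny records.
printf '> s1 desc\nac gt\n\n> s2 desc\nGG TT\n' | merge_records
```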

2.2 years ago
Macspider ★ 3.4k
sed 's/>.*cds_/>/' "$filename" | sed -e 's/_[0-9]* \[/ \[/'

First sed

Everything from > up to and including cds_ (the .* matches any run of characters) is replaced by a single >. This keeps the line a valid FASTA header but removes everything before NP_.

Second sed

The -e option simply introduces the sed expression (regular expressions are always available in sed). Here we substitute each combination of an underscore (_), a series of digits ([0-9]*), a space ( ) and an escaped bracket (\[) with just a space and a bracket (\[).
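To make the two steps concrete, here is a small sketch that runs them one after the other on the first header from the question and keeps the intermediate result. The variable names are only for illustration, and the replacement uses a plain " [" because a bracket needs no escaping on the replacement side.

```shell
header='>lcl|NC_002712.2_cds_NP_268424.1_1 [gene=dnaZA]'

# Step 1: drop everything between ">" and "cds_" (inclusive).
step1=$(printf '%s\n' "$header" | sed 's/>.*cds_/>/')

# Step 2: drop the "_<digits>" counter that sits right before " [".
step2=$(printf '%s\n' "$step1" | sed 's/_[0-9]* \[/ [/')

printf '%s\n%s\n' "$step1" "$step2"
# >NP_268424.1_1 [gene=dnaZA]
# >NP_268424.1 [gene=dnaZA]
```

The underscore inside NP_268424.1 is safe because it is not followed by digits, a space and a bracket.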


Thanks for the quick response! It almost works perfectly.

>lcl|NC_002712.2_cds_NP_268424.1_10 [gene=dnaZA]

When the "_numerical" goes up to 10 or 100 it is not substituted.


I edited the command in my first comment. It now has a star (*) after the numeric specification [0-9]. This means that any run of digits after the underscore will now be deleted, not just a single digit.
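The difference is easy to check on a header with a two-digit counter. This is just a sketch comparing the pattern with and without the star; the variable names are illustrative.

```shell
header='>NP_268424.1_10 [gene=dnaZA]'

# Without the star: exactly one digit is allowed, so "_10 [" does not match
# and the header comes back unchanged.
single=$(printf '%s\n' "$header" | sed 's/_[0-9] \[/ [/')

# With the star: any run of digits is allowed, so "_10" is removed.
multi=$(printf '%s\n' "$header" | sed 's/_[0-9]* \[/ [/')

printf '%s\n%s\n' "$single" "$multi"
# >NP_268424.1_10 [gene=dnaZA]
# >NP_268424.1 [gene=dnaZA]
```

Since * also matches zero digits, writing [0-9][0-9]* instead would require at least one digit, which is slightly stricter.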
