Question: Remove part of headers in FASTA file
13 months ago, peterlageweg603 wrote:

I want to delete everything after ">" up to and including "cds_", and also the characters after the accession ("_1" and "_2" in the example). In the original file this counter runs up to 784, so "_784". Can someone help me with a solution? That would be great.

From this

>lcl|NC_002712.2_cds_NP_268424.1_1 [gene=dnaZA]
ATGACTG
>lcl|NC_002721.2_cds_NP_268453.1_2 [gene=dnaGC]
ATGTTCG

To

>NP_268424.1 [gene=dnaZA]
ATGACTG
>NP_268453.1 [gene=dnaGC]
ATGTTCG
Tags: unix, counter, header, fasta • 372 views
modified 5 days ago by mergidaba • written 13 months ago by peterlageweg603

Spaces in FASTA headers may look visually appealing, but keep in mind that if you use this file for alignments etc., most aligners drop all text after the first space in a FASTA header. So you will lose the gene name.

written 13 months ago by genomax

Thanks for the heads up! I downloaded this from GenBank. I will include tr " " "_" in my unix command.

written 13 months ago by peterlageweg603
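
A small sketch of the point in this exchange: tr " " "_" applied to the whole file also touches sequence lines, which is harmless only as long as they contain no spaces. Restricting the replacement to header lines is a slightly safer variant. The function name underscore_headers is made up for illustration.

```shell
#!/bin/sh
# Replace spaces with underscores on header lines only (lines starting with ">").
# underscore_headers is a hypothetical helper name, not part of any tool.
underscore_headers() {
    sed '/^>/ s/ /_/g' "$@"
}

# Example on an already shortened header from the question.
printf '>NP_268424.1 [gene=dnaZA]\nATGACTG\n' | underscore_headers
```

With no arguments the function reads standard input; with file names it reads those files instead.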

Along the same lines as @genomax, I would also avoid brackets, pipes and equals signs. Not because they cause problems now, but because you never know what you will need these data for in the future. Those characters can easily break downstream pipelines!

Something like:

>NP_268424.1_dnaZA

Would probably be the best way to ensure no problems in the future while retaining all the necessary info.

written 13 months ago by Macspider
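
One possible way to get headers of the form suggested above, as a sketch only. It assumes every header carries a [gene=...] tag directly after the numeric counter, as in the question's example; headers that do not fit the pattern pass through unchanged. The name compact_headers is hypothetical.

```shell
#!/bin/sh
# Rewrite ">..._cds_ACCESSION_N [gene=NAME] ..." as ">ACCESSION_NAME".
# Assumption: the [gene=...] tag follows the counter immediately.
# compact_headers is a made-up helper name.
compact_headers() {
    sed -E 's/^>.*cds_([^ ]+)_[0-9]+ \[gene=([^]]+)\].*/>\1_\2/' "$@"
}

# Example on the first header from the question.
printf '>lcl|NC_002712.2_cds_NP_268424.1_1 [gene=dnaZA]\nATGACTG\n' | compact_headers
```

The -E flag selects extended regular expressions, so the capture groups can be written with plain parentheses.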

I want to remove the headers between FASTA records from the command line and merge the sequences together under the first header. For example:

> PHNY00000001.1 Astraceae varex cultivar OL2 scaffold000392, whole genome shotgun sequence 
agttaaacataattaatatatgttattaaatttgatatttatgaggggtaattcagtaatttcaaatgaataaattgtct caaggaaccccctagttgctctattatatatataatagatgtgtgtgtgtataatatatgtattatattcaaatttggtt aaaaaaattataaaatttaatctttGTTGCCCTTTTGTAATCGTTGATAAATTGGTCCGTTGCATATATTAGTACTAGTT GATATTAGTATTATTGTAGTATTTATATAACTGTTCAAAATATTGGTGATTGTTTGACAACTTAATTagtaatatttaat tgatataaattCATGTATTTATTCATGTCTCAAGTAAGAATACAACTATGATAGCgcaaccaattttttttttaaatcaa atgttaaatttgaattgaCACTAAAATTACAAGAATGACAATAACCTTCAACAGGGCCATCAGGGGTGTCCCTTACTATA GGCGAATTAATCTTGCGCGTGGGGTTCCGAACACTTAACAACccccaaaaatttcaaatctcaTATACATAAAGTATAAA ATGTAGTTTAGacatatgataatttgatagaATGTATGACGTCAATACTGAACATTATCTTTCTCTCATGAAAATAAGTTTT 

> PHNY00000001.2 Astraceae varex cultivar OL2 scaffold000393, whole genome shotgun sequence
GATATTAGTATTATTGTAGTATTTATATAACTGTTCAAAATATTGGTGATTGTTTGACAACTTAATTagtaatatttaat tgatataaattCATGTATTTATTCATGTCTCAAGTAAGAATACAACTATGATAGCgcaaccaattttttttttaaatcaa atgttaaatttgaattgaCACTAAAATTACAAGAATGACAATAACCTTCAACAGGGCCATCAGGGGTGTCCCTTACTATA GGCGAATTAATCTTGCGCGTGGGGTTCCGAACACTTAACAACccccaaaaatttcaaatctcaTATACATAAAGTATAAA ATGTAGTTTAGacatatgataatttgatagaATGTATGACGTCAATACTGAACATTATCTTTCTCTCATGAAAATAAGTTTT

> PHNY00000001.3 Astraceae varex cultivar OL2 scaffold000394, whole genome shotgun sequence
GATATTAGTATTATTGTAGTATTTATATAACTGTTCAAAATATTGGTGATTGTTTGACAACTTAATTagtaatatttaat tgatataaattCATGTATTTATTCATGTCTCAAGTAAGAATACAACTATGATAGCgcaaccaattttttttttaaatcaa atgttaaatttgaattgaCACTAAAATTACAAGAATGACAATAACCTTCAACAGGGCCATCAGGGGTGTCCCTTACTATA GGCGAATTAATCTTGCGCGTGGGGTTCCGAACACTTAACAACccccaaaaatttcaaatctcaTATACATAAAGTATAAA ATGTAGTTTAGacatatgataatttgatagaATGTATGACGTCAATACTGAACATTATCTTTCTCTCATGAAAATAAGTTTT

as:

> PHNY00000001.1 Astraceae varex cultivar OL2 scaffold000392, whole genome shotgun sequence
agttaaacataattaatatatgttattaaatttgatatttatgaggggtaattcagtaatttcaaatgaataaattgtct caaggaaccccctagttgctctattatatatataatagatgtgtgtgtgtataatatatgtattatattcaaatttggtt aaaaaaattataaaatttaatctttGTTGCCCTTTTGTAATCGTTGATAAATTGGTCCGTTGCATATATTAGTACTAGTT GATATTAGTATTATTGTAGTATTTATATAACTGTTCAAAATATTGGTGATTGTTTGACAACTTAATTagtaatatttaat tgatataaattCATGTATTTATTCATGTCTCAAGTAAGAATACAACTATGATAGCgcaaccaattttttttttaaatcaa atgttaaatttgaattgaCACTAAAATTACAAGAATGACAATAACCTTCAACAGGGCCATCAGGGGTGTCCCTTACTATA GGCGAATTAATCTTGCGCGTGGGGTTCCGAACACTTAACAACccccaaaaatttcaaatctcaTATACATAAAGTATAAA ATGTAGTTTAGacatatgataatttgatagaATGTATGACGTCAATACTGAACATTATCTTTCTCTCATGAAAATAAGTTTT GATATTAGTATTATTGTAGTATTTATATAACTGTTCAAAATATTGGTGATTGTTTGACAACTTAATTagtaatatttaat tgatataaattCATGTATTTATTCATGTCTCAAGTAAGAATACAACTATGATAGCgcaaccaattttttttttaaatcaa atgttaaatttgaattgaCACTAAAATTACAAGAATGACAATAACCTTCAACAGGGCCATCAGGGGTGTCCCTTACTATA GGCGAATTAATCTTGCGCGTGGGGTTCCGAACACTTAACAACccccaaaaatttcaaatctcaTATACATAAAGTATAAA ATGTAGTTTAGacatatgataatttgatagaATGTATGACGTCAATACTGAACATTATCTTTCTCTCATGAAAATAAGTTTT GATATTAGTATTATTGTAGTATTTATATAACTGTTCAAAATATTGGTGATTGTTTGACAACTTAATTagtaatatttaat tgatataaattCATGTATTTATTCATGTCTCAAGTAAGAATACAACTATGATAGCgcaaccaattttttttttaaatcaa atgttaaatttgaattgaCACTAAAATTACAAGAATGACAATAACCTTCAACAGGGCCATCAGGGGTGTCCCTTACTATA GGCGAATTAATCTTGCGCGTGGGGTTCCGAACACTTAACAACccccaaaaatttcaaatctcaTATACATAAAGTATAAA ATGTAGTTTAGacatatgataatttgatagaATGTATGACGTCAATACTGAACATTATCTTTCTCTCATGAAAATAAGTTTT

How can I do this using sed or another tool? Any help is appreciated! Thanks

modified 5 days ago by genomax • written 5 days ago by mergidaba
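
A rough sketch of one way to do this with awk, assuming the goal is to keep only the first header and place every following sequence line under it. The function name merge_records and the short sequences in the example are invented for illustration.

```shell
#!/bin/sh
# Keep only the first header line; print every non-empty sequence line after it.
# merge_records is a hypothetical helper name.
merge_records() {
    awk '/^>/ { if (!seen++) print; next } NF' "$@"
}

# Toy input with two records separated by a blank line.
printf '> rec1 first\nACGT\n\n> rec2 second\nTTGA\n' | merge_records
```

Sequence lines stay on separate lines, which is still valid FASTA; blank lines between records are dropped by the NF condition.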
13 months ago, Macspider (Vienna - BOKU) wrote:
sed 's/>.*cds_/>/' "$filename" | sed -e 's/_[0-9]* \[/ \[/'

First sed

Everything (.*) between > and cds_, together with cds_ itself, is replaced by a single >. Because .* is greedy, the match runs to the last cds_ on the line. This keeps the > that marks the FASTA header but removes everything up to NP_.

Second sed

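The -e option only marks the next argument as a sed script and is optional when there is a single script; sed uses regular expressions either way. The pattern matches an underscore (_), a run of digits ([0-9]*), a space and an opening bracket, which is escaped (\[) so that it is taken literally. The match is replaced by a space and a bracket only, which removes the trailing counter.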

modified 13 months ago • written 13 months ago by Macspider
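
To make the two steps easier to follow, here is a small sketch that applies them one at a time to the first header from the question and prints the intermediate result. The variable names are illustrative only.

```shell
#!/bin/sh
header='>lcl|NC_002712.2_cds_NP_268424.1_1 [gene=dnaZA]'

# Step 1: drop everything between ">" and "cds_", including "cds_" itself.
step1=$(printf '%s\n' "$header" | sed 's/>.*cds_/>/')
printf 'after first sed:  %s\n' "$step1"

# Step 2: drop the "_<digits>" counter that sits just before " [".
step2=$(printf '%s\n' "$step1" | sed 's/_[0-9]* \[/ [/')
printf 'after second sed: %s\n' "$step2"
```

The underscore inside NP_268424.1 is not affected, because the digits after it are followed by a dot rather than by a space and a bracket.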

Thanks for the quick response! It almost works perfectly.

>lcl|NC_002712.2_cds_NP_268424.1_10 [gene=dnaZA]

When the numeric suffix reaches two or three digits (e.g. "_10" or "_100"), it is not substituted.

written 13 months ago by peterlageweg603

I edited the command in my answer. It now has a star (*) after the character class [0-9], so a run of digits of any length is matched and deleted, not just a single digit.

modified 13 months ago • written 13 months ago by Macspider
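
The difference the star makes can be checked with a short sketch. The single-digit pattern in strip_one is an assumption about what the earlier version of the command looked like; strip_many uses the pattern from the edited answer. Both function names are made up.

```shell
#!/bin/sh
# strip_one: single-digit pattern (assumed form of the earlier command).
strip_one()  { sed 's/_[0-9] \[/ [/'; }
# strip_many: pattern with the star, as in the edited answer.
strip_many() { sed 's/_[0-9]* \[/ [/'; }

for h in '>NP_268424.1_1 [gene=dnaZA]' \
         '>NP_268424.1_10 [gene=dnaZA]' \
         '>NP_268424.1_784 [gene=dnaZA]'; do
    printf 'one:  %s\n' "$(printf '%s\n' "$h" | strip_one)"
    printf 'many: %s\n' "$(printf '%s\n' "$h" | strip_many)"
done
```

Only the version with the star removes the counter from the "_10" and "_784" headers; the single-digit version leaves them unchanged.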