Question: Remove part of headers in FASTA file
1
gravatar for peterlageweg603
12 days ago by
peterlageweg60310 wrote:

I want to delete the part of each header that starts after ">" and runs through "cds_", as well as the characters after the accession: "_1" and "_2" in the example below. In the original file this counter goes up to 784, so "_784". Can someone help me with a solution? That would be great.

From this

>lcl|NC_002712.2_cds_NP_268424.1_1 [gene=dnaZA]
ATGACTG
>lcl|NC_002721.2_cds_NP_268453.1_2 [gene=dnaGC]
ATGTTCG

To

>NP_268424.1 [gene=dnaZA]
ATGACTG
>NP_268453.1 [gene=dnaGC]
ATGTTCG
unix counter header fasta • 80 views
modified 12 days ago by Macspider • written 12 days ago by peterlageweg603

Spaces in FASTA headers may look visually appealing, but keep in mind that if you use this file for alignments etc., most aligners will drop all text after the first space they encounter in a FASTA header. So you will lose that gene name.

written 12 days ago by genomax

Thanks for the heads up! I downloaded this from GenBank. I will include tr " " "_" in my unix command.

written 12 days ago by peterlageweg603

On the same train of thought as @genomax, I would say that brackets, pipes and equals signs are also best avoided. Not because they will hurt you now, but because you never know what you will need these data for in the future. Those characters could easily mess up your future pipelines!

Something like:

>NP_268424.1_dnaZA

Would probably be the best way to ensure no problems in the future while retaining all the necessary info.
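A minimal sketch of how such a compact header could be produced with sed. The function name `compact_headers` is hypothetical, and the sketch assumes that `[gene=...]` is the first bracketed field directly after the numeric suffix, as in the example from the question; any further fields on the line are left untouched.

```shell
# Hypothetical helper: reads FASTA on stdin, writes FASTA on stdout.
# Only lines starting with ">" are edited; sequence lines pass through.
compact_headers() {
    sed -e '/^>/ s/^>.*cds_/>/' \
        -e '/^>/ s/_[0-9][0-9]* \[gene=\([^]]*\)]/_\1/'
}

# Try it on the two records from the question.
printf '%s\n' \
    '>lcl|NC_002712.2_cds_NP_268424.1_1 [gene=dnaZA]' \
    'ATGACTG' \
    '>lcl|NC_002721.2_cds_NP_268453.1_2 [gene=dnaGC]' \
    'ATGTTCG' | compact_headers
```

The first expression strips everything up to and including "cds_"; the second swaps the counter and the bracketed gene field for an underscore plus the gene name, giving headers such as >NP_268424.1_dnaZA.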

written 12 days ago by Macspider

12 days ago, Macspider (Vienna - BOKU) wrote:
sed 's/>.*cds_/>/' "$filename" | sed -e 's/_[0-9]* \[/ [/'

First sed

Everything between > and cds_ (matched by .*), together with cds_ itself, is replaced by > alone. This keeps the FASTA header marker intact but removes everything up to NP_.

Second sed

The -e option simply introduces the sed script (it is optional when there is only one). The script substitutes each combination of an underscore (_), a run of digits ([0-9]*), a space ( ) and an opening bracket, escaped in the pattern (\[), with just a space and a bracket ( [).
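The two substitutions described above can be tried on a small sample before running them on the real file. This is a sketch, not the exact command from the answer: the function name `clean_headers` is hypothetical, the `/^>/` address (so that only header lines are edited) is an added precaution, and the counter is matched with `[0-9][0-9]*` so that at least one digit is required.

```shell
# Hypothetical helper: reads FASTA on stdin, writes cleaned FASTA on stdout.
clean_headers() {
    # 1st expression: drop everything from ">" through "cds_", keep ">".
    # 2nd expression: drop the "_<digits>" counter just before " [".
    sed -e '/^>/ s/^>.*cds_/>/' \
        -e '/^>/ s/_[0-9][0-9]* \[/ [/'
}

# Sample records based on the question, with a multi-digit counter added.
printf '%s\n' \
    '>lcl|NC_002712.2_cds_NP_268424.1_1 [gene=dnaZA]' \
    'ATGACTG' \
    '>lcl|NC_002721.2_cds_NP_268453.1_784 [gene=dnaGC]' \
    'ATGTTCG' | clean_headers
```

For the sample above this prints the headers >NP_268424.1 [gene=dnaZA] and >NP_268453.1 [gene=dnaGC], with the sequence lines unchanged. On a real file it could be run as clean_headers < input.fasta > output.fasta, where both file names are placeholders.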

modified 12 days ago • written 12 days ago by Macspider

Thanks for the quick response! It almost works perfectly.

>lcl|NC_002712.2_cds_NP_268424.1_10 [gene=dnaZA]

When the numeric suffix has more than one digit (e.g. "_10" or "_100"), it is not substituted.

written 12 days ago by peterlageweg603

I edited the command in my first comment. It now has a star (*) after the numeric specification [0-9]. This means that any run of digits, of whatever length, will now be deleted.
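A small sketch of the difference the star makes, using a header with a two-digit counter. The variable names are only for illustration.

```shell
header='>NP_268424.1_10 [gene=dnaZA]'

# [0-9] matches exactly one digit, so "_10 [" is not matched
# and the header comes back unchanged.
one_digit=$(printf '%s\n' "$header" | sed 's/_[0-9] \[/ [/')

# [0-9]* matches a run of digits of any length, so "_10" is removed.
any_digits=$(printf '%s\n' "$header" | sed 's/_[0-9]* \[/ [/')

printf '%s\n' "$one_digit" "$any_digits"
```

Since * also allows zero digits, a lone underscore directly before " [" would be removed as well; writing [0-9][0-9]* instead requires at least one digit.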

modified 12 days ago • written 12 days ago by Macspider