How can I remove the ">" sign from a whole genome sequence?
5.3 years ago
Ahmad_Bio • 0

I have a Word file of a whole genome sequence, around 1709 pages long, in which each gene is separated by a line starting with ">". I need to BLAST the whole genome sequence against a protein sequence from another organism to look for homology. Is there any way to remove information lines such as ">gm_orf648 67_127_d_D 579383 580123 + 741_nt 246_aa" all at once, instead of manually deleting them one by one?

sequence • 1.1k views

Do not do not do not do not do not keep your sequences in Office formats. Ever.

I’m actually amazed it’s even opened that many pages without crashing.


You don't have to delete these.


Thank you for the meaningful help. I managed to copy it into Oligo 7, so there is no need to remove the > lines.

5.3 years ago
h.mon 35k

The correct answer to your problem is to create a BLAST database from your file and blast the protein against this database. BLAST can correctly parse the lines starting with >.
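As a rough sketch of what that could look like with the NCBI BLAST+ command-line tools (assuming they are installed and that the file holds nucleotide sequences): the function name, the `genome_db` prefix and the `hits.tsv` output name are placeholders I made up, not anything from your data. If the file actually holds protein sequences, `-dbtype prot` and `blastp` would be the analogous choice.

```shell
# Sketch only: blast_protein_vs_genome, genome_db and hits.tsv are made-up names.
# $1 = genome FASTA file, $2 = protein query FASTA file
blast_protein_vs_genome() {
    genome_fasta=$1
    protein_fasta=$2
    db_name=genome_db   # hypothetical database prefix

    # Build a nucleotide database; the ">" header lines are read as sequence identifiers.
    makeblastdb -in "$genome_fasta" -dbtype nucl -out "$db_name" || return 1

    # tblastn compares a protein query against the translated nucleotide database.
    # -outfmt 6 writes a tab-separated table of hits.
    tblastn -query "$protein_fasta" -db "$db_name" -outfmt 6 -out hits.tsv
}
```

It would be called as `blast_protein_vs_genome genome.fasta protein.fasta`, with your own file names in place of these two.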

Your file is in fasta format (also see Is There A Precise Specification For Fasta Files?). The line you want to remove is part of the format specification:

The first line in a FASTA file started either with a ">" (greater-than) symbol

A fasta file is just a text file. I guess Word is configured to open text files on your computer, but I doubt it is really a Word document.
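If you want to look at such a file without Word, something like the following should work in any Unix-like shell. The file `sample.fasta` and its two records are invented here purely for illustration.

```shell
# Create a tiny made-up FASTA file so the commands below have something to read.
printf '>gene1 made-up header\nATGCATGC\n>gene2 made-up header\nTTAACC\n' > sample.fasta

# Show the first two lines: one header line and the start of its sequence.
head -n 2 sample.fasta

# Count the header lines, i.e. the number of sequences in the file.
grep -c '^>' sample.fasta
```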

Virtually all bioinformatics software can correctly parse fasta format, and there is no need to remove these lines.

For the sake of completeness (even if I shouldn't), here is the answer to your original question:

sed -i.bak '/^>/d' file
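To see what this command does before running it on real data, here is a small sketch on a made-up file (`demo.fasta` and its contents are invented for illustration). It also shows a `grep -v` variant that writes to a new file instead of editing in place. Note that once the headers are gone, nothing marks where one gene ends and the next begins.

```shell
# Build a tiny two-record FASTA file (made-up sequences, for illustration only).
printf '>gene1 example header\nATGCATGC\nGGCC\n>gene2 example header\nTTAACC\n' > demo.fasta

# Non-destructive variant: write every line that does not start with ">" to a new file.
grep -v '^>' demo.fasta > demo_noheaders.txt

# In-place variant from the answer: delete header lines, keeping the original as demo.fasta.bak.
sed -i.bak '/^>/d' demo.fasta
```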