Question

How can I remove the ">" sign from a whole genome sequence?

0

5.3 years ago

Ahmad_Bio • 0

I have a Word file of a whole genome sequence, around 1709 pages long, in which each gene is separated by a line starting with ">". I need to BLAST the whole genome sequence against a protein sequence from another organism to check for homology. Is there any way to remove information lines such as ">gm_orf648 67_127_d_D 579383 580123 + 741_nt 246_aa" all at once, instead of deleting them manually one by one?

sequence • 1.1k views

1

Perhaps bioinformaticians should just give up and rewrite tools to accept sequences in Word fasta format :)
— Mick Watson (@BioMickWatson) January 31, 2012

5.3 years ago by Pierre Lindenbaum 161k

1

Do not do not do not do not do not keep your sequences in Office formats. Ever.

I’m actually amazed it’s even opened that many pages without crashing.

5.3 years ago by Joe 21k

1

You don't have to delete these.

5.3 years ago by WouterDeCoster 47k

0

Thank you for the helpful replies. I managed to copy it into Oligo 7, so there is no need to remove the ">" lines.

5.3 years ago by Ahmad_Bio • 0

score 5 · Answer 1 · 2019-01-19

The correct answer to your problem is to create a BLAST database from your file and BLAST the protein against this database. BLAST correctly parses the lines starting with ">".
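For what it's worth, that workflow could look roughly like the sketch below. It assumes NCBI BLAST+ is installed and that the file holds nucleotide sequences; the function name, the file names, the database name `genome_db`, the output name `hits.tsv` and the e-value cutoff are all made up for illustration.

```shell
# Sketch only: assumes NCBI BLAST+ (makeblastdb, tblastn) is installed.
# The function name, genome_db and hits.tsv are hypothetical names.
blast_protein_against_genome() {
    genome="$1"    # multi-FASTA file with the ">" header lines left in place
    protein="$2"   # FASTA file holding the query protein

    # Build a nucleotide database from the FASTA file, headers included.
    makeblastdb -in "$genome" -dbtype nucl -out genome_db

    # tblastn compares a protein query against a translated nucleotide database.
    # -outfmt 6 writes a tab-separated table; 1e-5 is an arbitrary example cutoff.
    tblastn -query "$protein" -db genome_db -evalue 1e-5 -outfmt 6 -out hits.tsv
}
```

You would call it as `blast_protein_against_genome genome.fasta protein.fasta`. If the file actually contains protein sequences, `-dbtype prot` and `blastp` would be the matching pair instead. Either way, the ">" lines are what allow BLAST to report which gene was hit.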

Your file is in FASTA format (see also "Is There A Precise Specification For Fasta Files?"). The line you want to remove is part of the format specification:

The first line in a FASTA file started either with a ">" (greater-than) symbol

A FASTA file is just a text file. I guess Word is configured to open text files on your computer, but I doubt it is really a Word document.

Virtually all bioinformatics software can correctly parse FASTA format, so there is no need to remove these lines.

For the sake of completeness (even if I shouldn't), here is the answer to your original question:

sed -i.bak '/^>/d' file
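To see what that command does before running it on the real file, here is a small sketch on a throwaway file. The file name `demo.fasta` and the sequences in it are made up.

```shell
# Build a tiny two-record FASTA file (made-up sequences) to try the command on.
printf '>gene1 example\nATGC\nGGTA\n>gene2 example\nTTAA\n' > demo.fasta

# Count the records first: there is one ">" header line per record.
grep -c '^>' demo.fasta

# Delete every line starting with ">" in place.
# The untouched original is kept as demo.fasta.bak.
sed -i.bak '/^>/d' demo.fasta

# Only the sequence lines remain.
cat demo.fasta
```

Note that once the headers are gone, nothing marks where one gene ends and the next begins, so the file reads as one continuous sequence. The `.bak` copy is there in case you need the headers back.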