Question: How To Convert A Fasta Sequence Format Into Binary Format (0-1)?
5.8 years ago by
United States
amritayadav199110 wrote:

How do I convert a FASTA sequence into binary format (0-1)?

modified 7 months ago by vishwaas170430 • written 5.8 years ago by amritayadav199110

What do you really want? NCBI asn1tool, NCBI makeblastdb, UCSC faToTwoBit, even gzip, etc. They all do that job.

written 5.8 years ago by Pierre Lindenbaum122k

Given that any computer representation is a binary encoding, you don't even need to go that far. A FASTA file is already a binary encoding of a biological sequence: the sequence has been encoded into one-letter codes representing each residue/base, with the addition of a little metadata.

I would guess that what you are really looking for is a nucleotide sequence encoding, such as the GCG 2-bit encoding, where each of the four basic DNA bases is represented in two bits:

00 = C, 01 = T, 10 = A, 11 = G

Various forms of this and other encodings providing compression are discussed in "Which Dna Compression Algorithms Are Actually Used?".
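To make the 2-bit idea concrete, here is a minimal Python sketch that packs four bases into each byte using the 00 = C, 01 = T, 10 = A, 11 = G mapping quoted above. The function names `pack_2bit` and `unpack_2bit` are hypothetical, the bit order within a byte (first base in the high bits) is an assumption, and the sketch does not handle N or other ambiguity codes or FASTA header lines.

```python
# Mapping as quoted in the post: 00 = C, 01 = T, 10 = A, 11 = G
CODE = {"C": 0b00, "T": 0b01, "A": 0b10, "G": 0b11}
BASE = {value: base for base, value in CODE.items()}


def pack_2bit(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte.

    Assumption: the first base of each group goes into the high bits.
    Raises KeyError for any character outside A/C/G/T.
    """
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].upper()
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk by padding with zero bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_2bit(data, length):
    """Reverse pack_2bit; `length` is needed to drop the padding."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


packed = pack_2bit("GATTACA")
print(packed.hex(), unpack_2bit(packed, 7))
```

The sequence length has to be stored separately, because zero padding in the last byte is indistinguishable from trailing C bases under this mapping.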

written 5.8 years ago by Hamish3.1k

For that matter, Microsoft Word find/replace would do it...

written 5.8 years ago by Whetting1.5k

The bash program is available at

modified 7 months ago • written 7 months ago by vishwaas170430

I don't want to be harsh, but this is probably not a good method. Printing the text character '0' takes 8 bits, not the 1 bit that a true binary digit would occupy, so your script will double the size of any FASTA file simply by writing two characters in place of each DNA letter. If you are just learning, it's interesting to practice this way, but I would not recommend this script.

written 7 months ago by cmdcolin1.2k
5.8 years ago by
Matt Shirley9.1k
Cambridge, MA
Matt Shirley9.1k wrote:
> xxd -g 0 -b file.fasta
0000000: 001111100110011101101001011111000011001000110010  >gi|22
0000006: 001101000011010100111000001110010011100000110000  458980
000000c: 001100000111110001110010011001010110011001111100  0|ref|
0000012: 010011100100001101011111001100000011000000110000  NC_000

> xxd -g 0 -b file.fasta | cut -d' ' -f2 -
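For readers without xxd, a rough Python equivalent of the idea is sketched below: it renders each byte of the input as eight '0'/'1' characters. The helper name `to_bit_string` is hypothetical, and the input is the first six bytes shown in the dump above rather than a real file.

```python
def to_bit_string(data: bytes) -> str:
    """Render each byte as eight '0'/'1' characters, most significant bit first."""
    return "".join(format(byte, "08b") for byte in data)


# First six bytes of the header line shown in the xxd output above.
print(to_bit_string(b">gi|22"))
# prints 001111100110011101101001011111000011001000110010
```

Like the xxd pipeline, this only displays the ASCII bytes of the FASTA text as bit characters; it does not compress anything.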
modified 5.8 years ago • written 5.8 years ago by Matt Shirley9.1k

:-) .

written 5.8 years ago by Pierre Lindenbaum122k