Re: Bandwidth limitations, DNA binary coding
-----BEGIN PGP SIGNED MESSAGE-----
Perry writes:
>15 symbols, HALF a byte (actually a touch less.) One nybble can express 16
>possible symbols (or one Hex digit, or whatever.)
Oops, I stand corrected: 15 symbols fit in half a byte. What I was trying to
convey is that GenBank (the repository for genomic sequence data) has a
specific format for binary representation of DNA sequence data. Most genomic
analysis programs use GenBank sequence format now (some use EMBL, which is
similar) and probably will in the future. Thus, the half byte per GATC symbol
is defined by convention, not by the fewest binary digits necessary for
encoding them. It may waste bandwidth, but that's no problem for fiber optics,
which is how this thread started.
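To make the half-byte convention concrete, here is a minimal sketch of packing
two symbols per byte. The 15-letter alphabet (the IUPAC nucleotide codes) and
the numeric code assigned to each letter are my assumptions for illustration;
this is not GenBank's actual binary layout.

```python
# Hypothetical 4-bit code table: 15 nucleotide symbols mapped to 1..15.
# The ordering is an assumption for illustration, not GenBank's real layout.
SYMBOLS = "ACGTRYSWKMBDHVN"
CODE = {s: i + 1 for i, s in enumerate(SYMBOLS)}  # 0 is reserved as padding
DECODE = {v: k for k, v in CODE.items()}


def pack(seq):
    """Pack a sequence two symbols per byte (one nybble each)."""
    codes = [CODE[s] for s in seq.upper()]
    if len(codes) % 2:
        codes.append(0)  # pad the final low nybble
    return bytes((hi << 4) | lo for hi, lo in zip(codes[0::2], codes[1::2]))


def unpack(data):
    """Reverse pack(), skipping the zero padding nybble."""
    out = []
    for byte in data:
        for nybble in (byte >> 4, byte & 0x0F):
            if nybble:
                out.append(DECODE[nybble])
    return "".join(out)
```

With this table, "GATC" takes two bytes. A 2-bit code limited to G, A, T and C
would fit the same four bases in one byte, which is the bandwidth the
convention gives up.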
>plus, of course, the genome is highly compressable -- lots of repeated
>sequences, especially in interons.
This brings up an interesting topic. There are four classes of DNA: foldback
DNA, highly repetitive DNA, middle-repetitive DNA and single-copy DNA.
Foldback DNA consists of palindromic sequences which form hairpin-like
structures. Highly repetitive DNA is made up of short sequences, from several
to hundreds of bases long, repeated around 5 x 10^5 times. Middle-repetitive
DNA consists of longer sequences, hundreds or thousands of bases long, which
appear hundreds of times in the genome. Single-copy DNA sequences are usually
genes themselves, of which (in humans) it is estimated that there are around
1 x 10^5.
Since the genome is highly redundant (in mammals up to 60% of the genome is
repetitive sequence), you could probably compress a lot of it just by
designating symbols for specific repetitive elements. Most of the repetitive
content of the genome is highly repetitive sequence localized in tandem
arrays (not in introns). However, a second class of elements, known as SINEs
and LINEs, is found in introns, gene flanking regions and intergenic regions.
The most widely characterized SINE is the Alu sequence, which is
approximately 300 bases long and scattered throughout the genome over
5 x 10^5 times. This constitutes 5-6% of the genome! That's a lot of
compressibility.
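As a rough sketch of the "designate a symbol for a repetitive element" idea,
the toy code below replaces every exact copy of a known element with a single
token. The element here is made up, not a real Alu sequence, and real copies
diverge from one another, so exact matching is a simplification. The last few
lines check the Alu figures quoted above; the genome size of 3 x 10^9 bases is
my assumption and is not stated in the post.

```python
def substitute_repeats(seq, element, token="#"):
    """Replace each exact copy of a known element with a one-character token."""
    if token in seq:
        raise ValueError("token must not already occur in the sequence")
    return seq.replace(element, token)


def restore_repeats(encoded, element, token="#"):
    """Expand every token back into the full element."""
    return encoded.replace(token, element)


# Toy data: a made-up 8-base element, not a real Alu sequence.
element = "GGCCGGGC"
genome = "ATG" + element + "TTA" + element + "CCG" + element
encoded = substitute_repeats(genome, element)

# Back-of-envelope check of the Alu share of the genome.
ALU_LENGTH = 300     # bases, from the text
ALU_COPIES = 5e5     # copies, from the text
GENOME_SIZE = 3e9    # bases; assumed human genome size, not from the text
alu_fraction = ALU_LENGTH * ALU_COPIES / GENOME_SIZE
```

On the toy data, 33 symbols shrink to 12, and under the assumed genome size
the Alu copies come to about 5% of the total, in line with the 5-6% figure.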
I often wonder if the redundancy is a way to encrypt a species' genome, thus
keeping different species from genetic communication. The "key" would be
millions of random base pairings which allow members of the same species to
decrypt their own genetic code and successfully have progeny. Pairings
between species that are too dissimilar would be refractory because the key
is not homologous.
By the way, genes are made up of exons and introns.
Scott G. Morham ! The First,
[email protected] ! Second
PGP Public Key by Request ! and Third Levels
! of Information Storage and Retrieval
! DNA,
! Biological Neural Nets,
! Cyberspace
-----BEGIN PGP SIGNATURE-----
Version: 2.3a
iQCVAgUBLOR1gj2paOMjHHAhAQFNZwP+Lv7Xv4bityeHd2L53fgY4seWKZX/Mkrw
YmHv5hPpusiXx6jt2tVGPnPyH0TVtdFb5Cy1YVnvLydgU4FPblJAO7chWuc5EPXn
7/SQ29AuGrDnWu9gEGaQiqEUgn40idPgvDVVQPikAX8tn5OmWo8vygMwIYgicQUh
Po8BHvPSLfg=
=ek9F
-----END PGP SIGNATURE-----