Re:Bandwidth limitations, DNA binary coding
-----BEGIN PGP SIGNED MESSAGE----- Perry writes:
15 symbols, HALF a byte (actually a touch less.) One nybble can express 16 possible symbols (or one Hex digit, or whatever.)
Oops, I stand corrected, 15 symbols half a byte. What I was trying to convey is that GenBank (the repository for genomic sequence data) has a specific format for binary representation of DNA sequence data. Most genomic analysis programs use GenBank sequence format now (some use EMBL which is similar) and probably will in the future. Thus, the half byte per GATC symbol is defined as convention, not by the fewest binary digits neccesary for encoding them. It may waste bandwidth but that's no problem for fiber optics. Which is how this thread started.
plus, of course, the genome is highly compressable -- lots of repeated sequences, especially in interons.
This brings up an interesting topic. There are four classes of DNA: Foldback DNA, highly repetitive DNA, middle-repetitive DNA and single-copy DNA. Foldback DNA consists of palindromic sequences which form hairpin like structures. Highly repetitive DNA is made up of short sequences from several to hundreds of bases long (repeated around 5 x 10^5 times). Middle repetitive DNA consists of longer sequences, hundreds or thousands of bases long (these appear hundreds of times in the genome). Single-copy DNA sequences are usually genes themselves, of which (in humans) it is estimated that there are around 1 x 10^5. Since the genome is highly redundant (in mammals up to 60% of the genome is repetitive sequence), you could probably compress alot of it just by designating symbols for specific repetitive elements. Most of the repetitive nature of the genome is found as highly repetitive sequence localized as tandem arrays (not in introns). However, a second class of element known as SINEs and LINEs are found in introns, gene flanking regions and intergenic regions. The most widely characterized SINE is the Alu sequence, which is approximately 300 bases long and scattered throughout the genome over 5 x 10^5 times. This constitutes 5-6% of the genome! That's a lot of compressability. I often wonder if the redundancy is a way to encrypt a species genome, thus keeping different species from genetic communication. The "key" being millions of random base pairings which allow like species to decrypt their own genetic code and successfully have progeny. Pairings between species that are too dissimilar would be a refractory event because the key is not homologous. By the way, genes are made up of exons and introns. Scott G. Morham ! The First, Vaccinia@uncvx1.oit.unc.edu ! Second PGP Public Key by Request ! and Third Levels ! of Information Storage and Retrieval ! DNA, ! Biological Neural Nets, ! Cyberspace -----BEGIN PGP SIGNATURE----- Version: 2.3a iQCVAgUBLOR1gj2paOMjHHAhAQFNZwP+Lv7Xv4bityeHd2L53fgY4seWKZX/Mkrw YmHv5hPpusiXx6jt2tVGPnPyH0TVtdFb5Cy1YVnvLydgU4FPblJAO7chWuc5EPXn 7/SQ29AuGrDnWu9gEGaQiqEUgn40idPgvDVVQPikAX8tn5OmWo8vygMwIYgicQUh Po8BHvPSLfg= =ek9F -----END PGP SIGNATURE-----
participants (1)
-
VACCINIA@UNCVX1.OIT.UNC.EDU