subject = Information Theorytitle = Data CompressionData Compression-inbeginners’ terms?Data Compression’ just sounds complicated. Don’t beafraid, compression is our good friend for many reasons. It saves hard drivespace. It makes data files to handle.

It also cuts those immense file downloadtimes from the Internet. Wouldn’t it be nice if we could compress all filesdown to just a few bytes?There is a limit to how much you can compressa file. How random the file is, is the determining factor to how far it canbe compressed. If the file is completely random and no pattern can be found,then the shortest representation of the file is the file it self. The actualproof that proves this is at the end of my paper.

Order nowThe key to compressing afile is to find some sort of exploitable pattern. Most of this paper willbe explaining those patterns that are commonly used. Null suppression isthe most primitive form of data compression that I could find. Basically,it says that if you have different fields that data is in (possibly a spreadsheet), and any of them have only zeros in them, then the program just eliminatesthe data and goes straight from the empty data set to the next.

Only onestep up from null suppression is Run Length Encoding. Run length encodingsimply tells you how many of what you have in a row. It would change a setof binary data like {0011100001} into what the computer reads as (2)zeros,(3)ones, (4)zeros, 1. As you can see, it works on the same basic idea of findinga series of 0’s (null suppression) and 1’s in this case too and abbreviatingthem.

Once the whole idea of data compression caught on, more people startedworking on programs for it. From these people we got some new premises towork with. Substitutional encoding is a big one. It was invented jointlyby two people: Abraham Lempel and Jakob Ziv. Most compression algorithms (bigword meaning roughly ?program’) using substitutional encoding start with ?LZ’for Lempel-Ziv. LZ-77 is a really neat compression in which the programstarts off just copying the source file over to the new target file, but whenit recognizes a phrase of data that it has previously written, it replacesthe second set of data in the target file with directions on how to get tothe first occurrence of it and copy it in the directions’ place.

This is morecommonly called a sliding-window compression because the focus of the programis always sliding all around the file. LZ-78 is the compression that mostpeople have in their homes. Some of the more common ones are ZIP, LHA, ARJ,ZOO, and GZIP. The main idea behind LZ-78 is a ?dictionary’.

Yet it worksquite a bit like the LZ-77. For every phrase it comes across, it indexes thestring by a number and writes it in a ?dictionary’. When the program comesacross the same string, it uses the associated number in the ?dictionary’ insteadof the string. The ?dictionary’ is then written along side the compressedfile to be used in decoding.

There is a combined version of LZ-77 anLZ-78. It is called LZFG. It only writes to the dictionary when it findsthe repeated phrase, not on every phrase. Then instead of LZFG replacing thesecond set of data with directions on how to get to the first occurrence ofit, the program puts in the number reference for the dictionary’s translation. Not only is it faster, but it compresses better because of the fact that itdoesn’t have as big of a dictionary attached.

Statistical encoding is anotherone of the new compression concepts. It is an offshoot of the LZ family ofcompressors; It uses basically the same style as LZFG, but instead of assigningthe numbers in order that the strings come out of the source file, statisticalcompressors do some research. It calculates the number of times each stringis used and then ranks the string with the most number of uses at the top ofthe hash table. The string with the least is ranked at the bottom.

(A hashtable is where the rank is figured) The higher up a string is on this list,the smaller of a reference number it gets to minimize the total bit usage. This gives this compression just a slight edge on the others, but every littlebit helps. (ha ha -bit- )Beware! There are a few compression programsout there that claim wonderful compression ratios; ratios that beat the compressionlimit for that file’s randomness. These programs aren’t really compressionprograms.

They are OWS and WIC. Never compress anything with these. Whatthey do is split up the file that you desired to compress and hide most ofit on another part of your hard drive. OWS puts it in a specific spot on thephysical hard drive disk. WIC puts the extra information in a hidden filecalled winfile.

dll. The real problems with these programs are that if youdon’t have the winfile. dll or the information on the certain spot on your drive,then the program won’t put your file back together. My original intentwith this project was to invent a new compression algorithm. I started withthe idea that if you took the file in its pure binary form and laid it outin a matrix, there were certain rows and columns that you could add up to getan output that would be able to recreate the original matrix.

I was closetoo. I had four different outputs that actually would be what would make upthe compressed file that combined together to create one output for each bit. From this single output I could determine if the bit was 1 or 0. It workedperfectly for matrixes of 1×1, 2×2, and 3×3. Except that with this small ofa matrix, I wasn’t compressing it at all. It was more of a coding system thattook up more space than did the original file.

I even found a way to shrinkthe size of the four outputs but it was not enough to even break even on bitcount. When I got to the 4×4’s I found an overlap. An overlap is a term Imade up for this algorithm. It means that I got the same single output fora 1 as I did a 0. When that happens, I can’t figure out which it is: a 1or 0.

When you can’t recreate the original file, data compression has failed. It becomes lossy. I needed a fifth original output. If you want more informationon how I thought the algorithm would have worked, please refer to my Inventor’sLog that I included. It’s way too much to re-type here and it would serveno real purpose in this paper. If you were paying attention earlier, youwould be saying, ?Why don’t you find a pattern? Otherwise you can’t compressit.

You are treating it like a random file. ? I didn’t find out that it wasimpossible to compress random data until about the time my algorithm was failing. Becauseof my setbacks I started looking for an entirely new way to compress data,using a pattern of some sort. I got to thinking about all of the existingalgorithms.

I wanted to combine a hash table, statistical coder, and a runlength coder. The only hard part that I would see in that would be tryingto get the patent holders of each of those algorithms to allow me to combinethem and actually modify them slightly. In its current algorithm, the statisticalcoder only accepts alpha-numeric phrases. I would like to modify it to notread the characters that the binary code spells out, but the binary code itself.

I don’t know what form the output is aside from compressed, but formy purposes it wouldn’t matter what the form the output is. I would programinto the program all of the 32 combinations of 5 bits (2^5). Each of the combinationswould be labeled in the program 1 through 32. I would then make a hash tableof all of the 5 bit combinations.

This would give me an output which I wouldrun through a run length coder. Since the entire algorithm is reliant on binarycode and not on characters, it can be recursable, or it can compress furtheran already compressed file. LZ’s can’t do that because once they convert astring into it’s dictionary/sliding window form, it ceases to be one of thecharacters that it compresses. Now that you are aquatinted with our friend,Data Compression, I hope he will be able to serve you better. Now you candownload programs faster, save space, and who knows? Maybe you will inventthe next new compression algorithm.

Until then, keep your mice happy and yourmonitors warm. Proof that random data is not compressible:Let’s supposethe file to be compressed is 8 bits long (x works, but this is easier) andis randomThere are exactly 2^8 different possible 8 bit data strings. (2^x)Tocompress the file it must shrink it by at least one bit (2^x)-1So there areat most 2^7 different compressed files 2^(x-1)Thus at least two source filescompress down to the same file.Therefore The compression cannot be lossless.BibliographyAronson, Jules Data Compression- a comparison of MethodsWashington D.C.: Institute for Computer Science and Technologyhttp://simr02.si.ehu.es/DOCS/mice/compression-faq/part1/faq-doc-x.htmlx=Intro,8,9, and 10http://www.epl.co.uk/genit18.htmhttp://www.sees.bangor.ac.uk/~gerry/sp_summary.htmlhttp://Literacy.com//mkp/pages/3468/index.html