make
Please let me know if you need help using the makefile, David.
Then do
./lz [filename]
to run.
We have files in our files directory, so
./lz files/davidcopperfield.txt
is something you can actually type. The outputted file will be saved as filename.WE_LOVE_CS_421 and it will be in the same directory as the original file.
When it decompresses, it will decompress to OUTPUT.txt in the same directory as lzw. It WON'T be in the same directory as your file, only the compressed file will end up there.
Start by constructing the table with the ASCII characters in the first 256 entries. Get the first character and then since it with is only a single character it is in the table so we want to see if that plus the next character is in the table which it is not, so then we are going to add that to the new table and increment the size of the table. Note that newcode is a counter to make sure that we don’t go past the number of possible substrings. If we reach that condition, then we want to just encode the text using the values in the table and stop expanding the table. The encoded sequence is then outputted to file.
Build the same dictionary and then we want to search through the dictionary to find the characters. This function will read in an encoded file and then we will write a decompressed version. We read in the first value which is a number in ASCII so we look it up in the table we have created. Then we go and check if the next character can be read. If it is a value greater than the table size, then we are going to want to add the new value as an entry of the old code + new character. We want to output the string to the output file.
The compressed file will have the extension WE_LOVE_CS_421
The compression and decompression algorithms both successfully were used on DAVID COPPERFIELD, which is Ariel's favorite book!
Book Title | Initial File Size (KB) | Final File Size (KB) | Compression Ratio (ie Compressed Size / Original Size) | Time to Compress (ms) | Time to Decompress (ms) |
---|---|---|---|---|---|
David Copperfield | 1986 | 769 | 38.7% | 766 | 170 |
Dracula | 861 | 349 | 40.5% | 352 | 92 |
Pride and Prejudice | 781 | 277 | 35.5% | 297 | 75 |
Ariel's HUM110 final | 6 | 5 | 83.3% | 3 | 12 |
Ariel's Lit 472 final | 22 | 14 | 63.6% | 11 | 17 |
Ariel's Thesis (as of 10/29/21) | 26 | 17 | 65.4% | 13 | 14 |
lempel_ziv.cc | 8 | 6 | 75% | 4 | 11 |
Ariel: I was curious to see how redundant papers I've written before are. My HUM final was really unique, but I'm glad the crowning achievement of my time at Washington University in St Louis was less redundant than David Copperfield (which I had to read for that 400 level class). I kinda went crazy with it, but it was cool to compress random stuff, including the code for this project.
Canturbury Item | Initial File Size (KB) | Final File Size (KB) | Compression Ratio (ie Compressed Size / Original Size) | Time to Compress (ms) | Time to Decompress (ms) |
---|---|---|---|---|---|
alice29 | 149 | 69 (nice) | 46.3% | 81 | 40 |
asyoulik | 123 | 62 | 50.4% | 78 | 41 |
cp.html | 25 | 15 | 60% | 17 | 22 |
fields.c | 11 | 7 | 63.6% | 8 | 25 |
grammar.lsp | 4 | 3 | 75% | 4 | 17 |
kennedy.xls | 1006 | 344 | 33.3% | 413 | 126 |
lcet10.txt | 417 | 168 | 40.3% | 292 | 173 |
plrabin12.txt | 471 | 201 | 42.7% | 238 | 70 |
ptt5 | 502 | 69 (nice) | 13.7% | 193 | 49 |
sum | 38 | 25 | 65.8% | 23 | 24 |
xargs.1 | 5 | 4 | 80% | 4 | 23 |
As you can see, diff-ing output and alice29.txt returns nothing because there's no difference. When we diff between the compressed file, alice29.WE_LOVE_CS_421, the computer tells us there's a difference. There is a difference there because there better be a difference between the LZWed file and the uncompressed version.
I don't want to include screenshots for every single Canturbury Corpus item, but I did check that diff produced the same results for every file tested. When testing these, we had to add extensions to ptt5 and sum so that we could write normal looking output files.