Data Compression: What it is and how it works




How data compression works

The decompression program automatically reconstructs the original file once it's downloaded. But how much space have we actually saved with this system? In an actual compression scheme, figuring out the various file requirements would be fairly complicated; but for our purposes, let's go back to the idea that every character and every space takes up one unit of memory.

We already saw that the full phrase takes up 79 units. Our compressed sentence, including spaces, takes up 37 units, and the dictionary (words and numbers) also takes up 37 units. This gives us a file size of 74 units, so we haven't reduced the file size by very much. But this is only one sentence! You can imagine that if the compression program worked through the rest of Kennedy's speech, it would find these words and others repeated many more times.
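
To make this concrete, here is a minimal sketch of such a dictionary scheme in Python. The phrase, the function names and the exact savings are illustrative rather than taken from the article, and real compressors work on raw bytes with far cleverer dictionary handling.

```python
def compress(text):
    """Replace each word with its index in a dictionary built from the text."""
    words = text.split()
    dictionary = []
    tokens = []
    for word in words:
        if word not in dictionary:
            dictionary.append(word)
        tokens.append(dictionary.index(word))
    return dictionary, tokens

def decompress(dictionary, tokens):
    """Rebuild the original text from the dictionary and the index list."""
    return " ".join(dictionary[i] for i in tokens)

phrase = ("ask not what your country can do for you "
          "ask what you can do for your country")
dictionary, tokens = compress(phrase)
print(dictionary)   # each distinct word is stored only once
print(tokens)       # the phrase itself becomes a list of small numbers
assert decompress(dictionary, tokens) == phrase   # nothing is lost
```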

And, as we'll see in the next section, it would also be rewriting the dictionary to get the most efficient organization possible. Data compression condenses large files into much smaller ones, and it does so by getting rid of data while keeping the information. Does that mean information is different to data? Let's take an example: I ask Bob who won the Arsenal game. He then launches into a 30-minute monologue about the match, detailing every pass, throw-in, tackle and so on. Only right at the end does he tell me Arsenal won.

I only wanted to know who won, so all the data Bob gave me about the game was useless. Data compression works on the same principle. There are two types of compression: lossy and lossless. As the names suggest, lossy compression loses data in the compression process, while lossless compression keeps all the data.

First, though, you need to understand a bit about encoding and binary. When you want to send a message, you need to translate it into a language a computer will understand.

This is called binary. In order to understand it, we need to take a step back and think about how we usually represent numbers. When you want to represent a number higher than 9, you need to add an extra digit. So the number 27 means you have 2 lots of ten (represented by the first digit) and 7 lots of one (represented by the second digit).

If I want to represent a number higher than 99, I need to add another digit. You can break any number down and see what each digit represents: a units column, a tens column, a hundreds column, and so on. Each column is worth ten times the one to its right. This is not a coincidence: if you choose to use 10 digits (0…9), then each new column must always be worth 10 times the previous one.
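
As a quick illustration (the number 365 is my own example, not one from the original text):

```python
# Base 10 place values: each column is worth ten times the one to its right.
n = 365
print(3 * 100 + 6 * 10 + 5 * 1)           # 365
print(n // 100, (n // 10) % 10, n % 10)   # 3 6 5 -- the digits recovered again
```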

Nothing forces us to use 10 digits, though; we could just as well count in base 7, using only the digits 0 to 6. The digit 8 is meaningless in base 7. It also means that if I want to represent a number higher than 6, I need to use a new column, and to figure out what a base-7 number means you break it down in exactly the same way, with each column worth 7 times the previous one. Binary is simply base 2: the only digits are 0 and 1 and, continuing the logic of the previous examples, each new column is always 2 times larger than the previous one. The first digit represents 1, the second digit represents 2, the third digit represents 4, the fourth 8, the fifth 16 and so on. As before, we can use these place values to translate a binary number into a recognisable number.

For instance, any binary number can be written out as a sum of those place values. Try it for yourself: think of a number between 0 and 63 and work out its binary representation using the place values above.
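
Here is one worked example; 43 is an arbitrary choice, and Python's built-in conversion is only there to check the hand calculation.

```python
# Six-bit place values: 32 16 8 4 2 1
n = 43
print(1*32 + 0*16 + 1*8 + 0*4 + 1*2 + 1*1)   # 43, so 43 is 101011 in binary
print(format(n, "06b"))                      # '101011' -- the built-in conversion agrees
```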

Let me explain with an example in ordinary numbers first; the reason for all this is that it makes things easier for computers. Let's say I want to encode the alphabet into binary so a computer can understand it. The alphabet is 26 letters long, so I need enough bits to count up to 26. Five bits can count from 0 up to 31, and as 26 is less than 31, we can represent the whole alphabet with 5 bits.
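
A minimal sketch of such a five-bit code, assuming the obvious choice of simply numbering the letters in alphabetical order (the article's own table may have differed):

```python
import string

# Number the letters 0..25 and write each number as a fixed-width 5-bit code.
codes = {letter: format(i, "05b") for i, letter in enumerate(string.ascii_uppercase)}

print(codes["A"], codes["B"], codes["Z"])    # 00000 00001 11001
print("".join(codes[c] for c in "HELLO"))    # 25 bits: 5 bits per letter
```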

That gives us a common-sense way to encode the alphabet into binary, and nothing is lost in the process. With lossy compression, on the other hand, we really do get rid of data, which is why we need to differentiate data from information. How can we measure whether compression has been successful? Well, the purpose is to make a file, message, or any other chunk of data smaller.

If we had a 100MB file and could condense it down to 50MB, we have compressed it. Furthermore, we can say it has a compression ratio of 2, because it is half the size of the original file. If we compressed the 100MB file down to 10MB, it would have a compression ratio of 10, because the new file is a tenth of the size of the original. A higher compression ratio means better compression.
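
In code, the ratio is just the original size divided by the compressed size:

```python
def compression_ratio(original_size, compressed_size):
    """Ratio of original to compressed size; higher means better compression."""
    return original_size / compressed_size

print(compression_ratio(100, 50))   # 2.0  -- the file is half its original size
print(compression_ratio(100, 10))   # 10.0 -- the file is a tenth of its original size
```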

Lossless Compression

As I said at the end of the last section, lossless compression involves finding the smartest way to encode data. The best way to explain this is through an example. Suppose we send a message by encoding it letter by letter with the 5-bit code above, and then count how many times each letter was used. That count is called a frequency distribution; it shows how frequently we used each letter.

Can you see a way to cut down the amount of data we transmit? We need to change how we encode the alphabet.
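
One way to do that is to give the most frequent letters the shortest codes, which is the intuition behind Huffman coding. The four-letter alphabet and counts below are hypothetical stand-ins for the article's missing example, but the arithmetic shows the effect:

```python
# Hypothetical message statistics: how many times each letter is sent.
frequencies = {"E": 60, "T": 25, "A": 10, "O": 5}

# Fixed-length encoding: four letters need 2 bits each.
fixed_bits = sum(frequencies.values()) * 2

# Variable-length, prefix-free codes: frequent letters get shorter codes.
codes = {"E": "0", "T": "10", "A": "110", "O": "111"}
variable_bits = sum(count * len(codes[letter]) for letter, count in frequencies.items())

print(fixed_bits)     # 200 bits
print(variable_bits)  # 155 bits -- the skewed frequencies save just over a fifth of the data
```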

If we recalculate how much data we transmit with the new encoding, the total comes out smaller than before. But imagine we happened to use each letter exactly the same number of times: would there be an efficient way to encode the data? The frequency distribution would now be completely flat, and there is no way to encode it more efficiently.

A flat, straight-line distribution is therefore the most efficient shape a frequency distribution can have in terms of information: every letter is carrying as much information as it can. When the frequency distribution is any other shape (which is almost all the time), the data is said to contain redundancy.
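
Repeating the same little calculation with a flat distribution shows why (again with hypothetical counts):

```python
frequencies = {"E": 25, "T": 25, "A": 25, "O": 25}   # every letter equally common
codes = {"E": "0", "T": "10", "A": "110", "O": "111"}

fixed_bits = sum(frequencies.values()) * 2                        # 200 bits
variable_bits = sum(n * len(codes[c]) for c, n in frequencies.items())

print(fixed_bits, variable_bits)   # 200 225 -- the "clever" codes now do worse than 2 bits per letter
```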

Whenever there is redundancy in data, lossless compression can shrink it: lossless compression exploits data redundancy.

Lossy Compression

In lossy compression we get rid of data outright to make our message more efficient. As I said previously, lossy compression tries to get rid of as much data as possible while still retaining as much information as it can.

How do we decide what is information and what is fine to discard? Again it is much easier to understand with an example. Audio is made up of frequencies. These are measured in Hertz.
