Robert Walker
Yes, there's a theorem that backs up your intuition: whatever compression method you use, there must be many almost incompressible strings, that is, strings whose "compressed" versions are nearly as large as, or larger than, the data they encode. And indeed, if you use random data, most strings will be incompressible in this sense.

The proof is quite simple, too.

Suppose, for instance, that you want to compress random data of 100 bits. Count all the possible uncompressed versions of the data you could have: there are 2^100 of them.

Now count all the possible compressed versions. There must be at least as many, 2^100 of them, assuming that you can reconstruct the data unambiguously from its compressed version.

So now suppose that you want to compress it by a factor of two, to 50 bits. That means you have only 2^50 compressed versions available, so only 2^50 of the 2^100 possible inputs, a vanishingly small fraction, can be compressed that far.

Even compressing a file so it is shorter by just 2 bits cuts the number of available descriptions to a quarter, so at most a quarter of the possible files can be compressed by as much as 2 bits.
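The counting argument above can be checked directly. This is just a sketch; `count_strings` is a hypothetical helper name, not anything from a library:

```python
def count_strings(bits: int) -> int:
    """Number of distinct bit strings of exactly `bits` bits."""
    return 2 ** bits

# A lossless compressor needs a distinct output for every input,
# otherwise two inputs would decompress to the same thing.
inputs_100 = count_strings(100)   # 2**100 possible 100-bit files
outputs_50 = count_strings(50)    # 2**50 possible 50-bit files

# Fraction of 100-bit files that could map to 50-bit outputs:
print(outputs_50 / inputs_100)    # about 8.9e-16

# Saving just 2 bits: only 2**98 outputs for 2**100 inputs.
print(count_strings(98) / inputs_100)  # 0.25
```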

And that's a general mathematical truth about all possible methods of data compression.

See Kolmogorov complexity

So our compression methods only work because of the repetitive or predictable nature of the data we typically compress.

E.g. if typical files were filled with truly random data, then zip compression, or any other compression method, would not work at all on most files. But because the files that we want to compress are not random, compression is possible.
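You can see this with Python's built-in zlib, which uses the same DEFLATE algorithm as zip. The sizes in the comments are indicative, not exact:

```python
import os
import zlib

random_data = os.urandom(100_000)              # truly random bytes
repetitive  = b"the quick brown fox " * 5_000  # highly predictable bytes

# Random input: the "compressed" output comes out slightly larger.
print(len(zlib.compress(random_data)))   # a little over 100000
# Repetitive input: shrinks to a tiny fraction of its size.
print(len(zlib.compress(repetitive)))    # a few hundred bytes
```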

It might be that pi compression works well for some types of file we are interested in. But it could also go the other way, with the metadata longer than the file it compresses.

In the case of pi, you need some way to specify the position in the expansion of pi where your data appears, and that position takes up more or fewer bytes depending on how far along you have to go to find the string. By the counting argument, most hexadecimal sequences occur so far along the expansion of pi that the number expressing their position takes up at least as much storage as the sequence itself.
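We can model this with a random hex-digit stream, since pi's digits are conjectured to behave randomly (pi is believed to be normal, though this is unproven). The first occurrence of a random n-digit target typically sits near position 16^n, so writing the position down costs about n digits. A sketch under that assumption, not a proof:

```python
import random

random.seed(0)
HEX = "0123456789abcdef"

# Stand-in for pi's hexadecimal digits: a random stream of a million digits.
stream = "".join(random.choice(HEX) for _ in range(10**6))

for n in (3, 4, 5):
    target = "".join(random.choice(HEX) for _ in range(n))
    pos = stream.find(target)
    if pos < 0:
        # Longer targets may not appear at all in a finite stream.
        print(n, target, "not found in the first million digits")
    else:
        # The position, written in hex, is about as long as the target.
        print(n, target, "at", pos, "- index needs",
              len(format(pos, "x")), "hex digits")
```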

It is a neat idea. It relies on a result about pi: although in general it is really hard to find far-out digits of pi, in the special case where you express it in hexadecimal there is a clever formula that lets you calculate the nth digit of pi without first calculating any of the earlier digits. So in hexadecimal, but not in any other base, we can know digits of pi very far along its expansion.

For details see Bailey–Borwein–Plouffe formula
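Here is a minimal Python sketch of the BBP digit-extraction idea. It uses floating point for the series tails, so it is only reliable for modest values of n:

```python
def pi_hex_digit(n: int) -> str:
    """n-th hexadecimal digit of pi after the point (n >= 1), via the
    Bailey-Borwein-Plouffe formula, without finding earlier digits."""
    d = n - 1

    def frac_series(j: int) -> float:
        # Fractional part of 16**d * sum_k 16**-k / (8k + j).
        s = 0.0
        for k in range(d + 1):
            # Modular exponentiation keeps only the fractional part exact.
            s = (s + pow(16, d - k, 8 * k + j) / (8 * k + j)) % 1.0
        k, term = d + 1, 1.0
        while term > 1e-17:               # fast-converging tail, k > d
            term = 16.0 ** (d - k) / (8 * k + j)
            s += term
            k += 1
        return s % 1.0

    x = (4 * frac_series(1) - 2 * frac_series(4)
         - frac_series(5) - frac_series(6)) % 1.0
    return "%x" % int(16 * x)

# pi in hexadecimal is 3.243f6a8885a308d3...
print("".join(pi_hex_digit(i) for i in range(1, 11)))  # → 243f6a8885
```

Note that computing digit n this way still takes time roughly proportional to n, but it needs almost no memory and never touches the earlier digits.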

So although compressing the data would be slow, since you need to spend ages going through the digits of pi to find the sequence you need, uncompressing it would be far faster: once you know where to look, you can generate the hexadecimal sequence from that point onwards.

But it's not going to work as a method of compression for most sequences. It would depend on whether there is some property of pi that favours the particular interesting sequences we are likely to want to compress.

I've found a page you can use to find the index of a short (say six-digit) sequence in the first 200 million digits of pi. It's in decimal, not hexadecimal, but it still gives an idea.

Sometimes the position is a larger number, sometimes a smaller one: The Pi-Search Page
