An idea I have been pondering recently is using very large data sets as a means of compression. Two parties want to exchange large amounts of data over a very limited amount of bandwidth. They can’t agree up front on what data will be exchanged in the future, but they can agree for now to each receive an identical copy of a very large data set that is indexed using a sliding window, with the index of each piece of data stored in a lookup table.
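A rough sketch of what that indexing step might look like, in Python. The 64-byte chunk size, the stride parameter and the choice of zlib’s CRC-32 are all placeholders for illustration, not a real design:

```python
import zlib

CHUNK_SIZE = 64  # bytes per indexed piece (illustrative choice)

def build_index(shared_data: bytes, stride: int = 1) -> dict[int, int]:
    """Map the CRC-32 of each CHUNK_SIZE window to its offset in the shared data.

    A stride of 1 is a true sliding window; a larger stride keeps the
    table small enough to hold in memory for a very large data set.
    """
    index = {}
    for offset in range(0, len(shared_data) - CHUNK_SIZE + 1, stride):
        window = shared_data[offset:offset + CHUNK_SIZE]
        crc = zlib.crc32(window)
        # First occurrence wins; CRC collisions need a verification step later.
        index.setdefault(crc, offset)
    return index
```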
So now each party has a large lookup table of robust CRCs (or equivalent), and each CRC maps to a piece of data in the larger data set. Both parties go off to opposite sides of the world and connect to each other over a slow link. When one party wants to transmit a sizable amount of data to the other, it scans through its local copy of the large data set, looking either for identical matches, recording the index of the matching piece in the data set, or for “close enough” matches, recording the index and the differences. The first party then transmits these indices, or indices and differences, to the other party.
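Here is a minimal sketch of that exchange, building on the index above. It only handles exact matches plus raw literals (the “close enough” matches with recorded differences are left out), and the token format is something I have made up purely to show the idea:

```python
import zlib

def encode(message: bytes, shared_data: bytes, index: dict[int, int],
           chunk_size: int = 64) -> list[tuple]:
    """Encode a message as references into the shared data set.

    Emits ("match", offset, length) where a chunk of the message appears
    verbatim in the shared data, and ("literal", bytes) otherwise.
    """
    tokens = []
    pos = 0
    while pos < len(message):
        chunk = message[pos:pos + chunk_size]
        offset = index.get(zlib.crc32(chunk))
        # Verify against the actual data: CRC-32 alone is not collision-proof.
        if offset is not None and shared_data[offset:offset + len(chunk)] == chunk:
            tokens.append(("match", offset, len(chunk)))
        else:
            tokens.append(("literal", chunk))
        pos += len(chunk)
    return tokens

def decode(tokens: list[tuple], shared_data: bytes) -> bytes:
    """Rebuild the message from tokens using the receiver's identical copy."""
    out = bytearray()
    for token in tokens:
        if token[0] == "match":
            _, offset, length = token
            out += shared_data[offset:offset + length]
        else:
            out += token[1]
    return bytes(out)
```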
It would also be possible to create this large data set from either a very large prime number or a sufficiently random, non-repeating, seedable sequence, if both parties agree on the random number algorithm and seed in advance.
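For the seeded-sequence variant, something like the following would do. Python’s Mersenne Twister and the seed value here are just stand-ins; a real scheme would pin down an exact, portable generator that both sides implement identically:

```python
import random

def generate_shared_data(seed: int, size: int) -> bytes:
    """Both parties run this with the same seed and size to get byte-identical
    data sets without ever shipping the data itself."""
    rng = random.Random(seed)
    return rng.randbytes(size)

# Example: a 1 MiB stand-in for the multi-hundred-gigabyte data set.
shared_data = generate_shared_data(seed=42, size=1 << 20)
```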
Obviously, if we are going to store this much data, then hard drives need to come down in price. I’m thinking that for this to be a valid experiment you would need at least 500GB, possibly even 1TB, of storage to test out the idea, but it would be interesting to see.
With this technique it would be possible to transmit web pages or binary data across slow connections much faster than is currently possible, though as bandwidth increases in the future the idea becomes less valuable.