How to Not Store the Same File Twice (using JavaScript and PouchDB)

De-duplicating data storage

Table of content
  1. A Working Example
  2. Reducing Storage
  3. Completing the Solution
  4. Further Reading
Photo by Jeremy Bezanger on Unsplash

Many years ago, I was reading an article about Google's internal labs having managed to create a sha1 hash collision between two PDF documents. The documents were very different documents, but through clever bit manipulation, resulted in an identical sha1 code. It was a fascinating read for a Friday afternoon, but over the weekend I started to ask the question: why does Google care?

One significant place where this would impact Google would be on their storage platform, Google Drive. Given they are storing massive numbers of files, on behalf of massive numbers of people, in all probability there are going to be a massive number of duplicate files.

Logically, we can see this will likely be true just through its usage. Assuming I am engaging in a real estate transaction, it is not uncommon to exchange emailed PDFs with scanned signatures. Assuming everyone is using Google Drive to back up their documents, there are four people that have copies of the exact same document: buyer, buyer's agent, seller's agent, the seller. Don't forget to add lawyers and lenders later in the process.

I have worked for organisations whose primary business involved the interchange and storage of data (Oil/Gas Production, Telecommunications, or Data Repositories), and in every case, our solution to the problem was simple: charge the customer. Charge the customer a rate per byte: the amount of drive space they take up * replications space * server cost * electricity * rent * markup. Charging the customer is a good way to offset the cost, and Google does bill its customers, but through de-duplication, it is possible to maintain the same revenue, while drastically reducing the amount of physical storage.

Viral Memes were always a fun way to watch a poorly configured communication network burst into flames

Take a chat service I was involved in. The GSMA-RCC specifies that images can be interchanged between client devices; it also specifies that those images should be retained on the server for later pickup. So if a meme goes viral, thousands of individuals may forward the image to one another, resulting in thousands of copies of that image flying across the network, and being stored on server drives.

Further, if it's a good meme, people are going to forward it back to people they know; people who have already received it. So round and round the image goes, and it doesn't stop until theoretically its been sent on every possible communication between pairs of people.

With 10 billion devices on the planet, that's more than Seven Bridges that need to be crossed.

If some jerk sends that meme as a bitmap each image instance takes up about 1079KB.

assume
- $1.00 per MB revenue
- $0.90 per MB cost
- 1,000,000 user interactions (transfer)
- 1024KB file

= 1 MB * 1M transfers * ($1 - $0.9)
= 1TB of storage
= $100,000 profit

We can be sensible about it and require that everything is converted to PNG (notice Google asks to do this for Google Photos) reducing the size to 615KB (60%). This is useful when dealing in a fixed revenue, but not a promised true copy of data storage.

Storing the object once per person is how the users perceive a file system

Storing the image once per person that receives it, is an inefficient use of resources. If, on the other hand, we can identify that it is actually the same image, we can reduce our cost to storing only one instance, but charging for each transfer.

= (1MB * 1M transfers * $1) - (1 unique file * 1MB * $0.9)
= 1MB of storage
= $999,999.10 profit

There is a big difference between having a cost of 90¢ per transaction, and a cost of 90¢.

Identifying that everyone is storing the same data, allows us to significantly reduce the amount of space we consume while giving the same level of service to our users.

However, de-duplication is not an easy problem to solve.

First of all, we receive the files independently of one another, from different people, with different names. Given a large number of large files, it is going to take a significant amount of processing power to compare the files byte-by-byte to every other file we have stored. In a large-scale system, this simply is not feasible.

This brings us back around to where we began: Google had managed to cause a collision between sha1 in the lab. Why would they have been concerned with researching the extreme possibilities of collision in binary documents? Because they are using the hash to help identify the uniqueness of files.

We can use large hashes as primary keys for the files.

A Working Example

Pulling on some past experience solving this problem using CouchDB, we can implement an in-browser demonstration of the principles involved using JavaScript and PouchDB. If you are not familiar with PouchDB and using its DB interface, I suggest reading the definitive introduction by Nolan Lawson.

We will be creating a simple example with three users (Alice, Bob, and Carol) sharing their favourite lines from a new Opera they just saw.

let db = new PouchDB('filestore');
async function main(){
await db.destroy();
db = new PouchDB('filestore');
await CreateIndex();
await save(
'alice',
'gilbert.txt',
'I am the very model of a modern major general'
);
await save(
'bob',
'sullivan.txt',
'I am the very model of a modern major general'
);
await save(
'carol',
'gilbert.txt',
'I have knowledge of things animal, vegetable and mineral'
);
await save(
'alice',
'sullivan.txt',
'I have knowledge of things animal, vegetable and mineral'
);
let data = await db.allDocs({include_docs:true});
console.log(`DB Size: ${JSON.stringify(data).length}`);

console.log(await dload('alice' ,'gilbert.txt'));
console.log(await dload('bob' ,'sullivan.txt'));
console.log(await dload('carol','gilbert.txt'));
let peruser = await db.query('allfiles',{
reduce:true,
group:true,
group_level: 1
});
console.log(peruser);
}
main();

The process is relatively straightforward: three people store one of two lines of text to their account, and then retrieve them from the database.

Of interest is the total size being stored, as well as the per-user size (billable size) usage:

DB Size: 1813 bytes

| User | Files | Size |
| ----- | ----- | ---- |
| alice | 2 | 101 |
| bob | 1 | 45 |
| carol | 1 | 56 |

The real meat of the program happens in the save and dload which abstract away the interactions with the database. Further CreateIndex defines the mechanism for search and retrieval.

async function CreateIndex(){
return await db.put({
_id: '_design/allfiles',
views: {
'allfiles': {
map: function (doc) {
var userpath = [doc.user,doc.path];
emit(userpath,doc.size);
}.toString(),
reduce:'_stats'
}
}
});
}

async function save(user,filename,blob){
return db.put({
_id: [user,filename].join('@'),
user:user,
path:filename,
size:blob.length,
_attachments: {
'0': {
content_type: 'text/plain',
data: window.btoa(blob)
}
}
});
}

async function dload(user,filename){
let recs = await db.query('allfiles',{
reduce: false,
include_docs: true,
attachments: true,
key: [user,filename]
});
let blob = recs.rows[0].doc._attachments['0'].data;
blob = window.atob(blob);
return blob;
}

The save and download functions work on the assumption that we are going to store a copy of the record for each person, while the index manages a list of users and their files.

A JSFiddle giving an example of the working demonstration

Reducing Storage

To modify this example to reduce our storage, we must first modify the save function to not blindly save for each user, but rather to save each blob as a primary object, and track users observing it as a secondary item.

async function save(user, filename, blob) {
let hash = new TextEncoder().encode(blob);
hash = await crypto.subtle.digest('SHA-256', hash);
hash = Array.from(new Uint8Array(hash));
hash = hash
.map(b => b.toString(16).padStart(2, '0'))
.join('')
.substr(0,4);
let userpath = [user, filename].join('@');
let rec = null;
try {
rec = await db.get(hash);
} catch (e) {
if (e.status !== 404) throw e;
// create the object, with a list of user's using it
rec = {
_id: hash,
userpaths: [],
size: blob.length,
_attachments: {
'0': {
content_type: 'text/plain',
data: window.btoa(blob)
}
}
}
}
if (!rec.userpaths.includes(userpath)) {
rec.userpaths.push(userpath);
}
// finally save the record
return db.put(rec);
}

This ensures that we only ever store a blob once.

Using the hash as the record identifier means that no matter how many times it is submitted, we just keep using the existing record. New users are simply added to the list of users using the item. Users can even make copies of it by submitting the same item with a different name; we just keep adding notes that the user has a name for the record.

Naturally, this breaks the lookup index we were using. The original index used the name of the record (username+path) to look up the file, but the file no longer uses this as its record name. Instead, we need to create a lookup index that is constructed from all the users that use the same file.

async function CreateIndex(){
return await db.put({
_id: '_design/allfiles',
views: {
'allfiles': {
map: function (doc) {
for (let userpath of doc.userpaths) {
userpath = userpath.split('@');
emit(userpath, doc.size);
}

}.toString(),
reduce:'_stats'
}
}
});
}

By looping through each of the user paths, we have updated our index to return the file, for each use of the file. This means no change to our download function, as the view's interface has not changed.

Note that we had included the file size in the view, and used the stats reduce method. This means that when it comes time to do billing we can simply add up the number of blobs the user references, as well as add up their total size.

Re-running main gives us a new total size and billing data:

DB Size: 1109 bytes

| User  | Files | Size |
| ----- | ----- | ---- |
| alice | 2 | 101 |
| bob | 1 | 45 |
| carol | 1 | 56 |

Even in our trivial example, with a very small content size, our database size (and therefore our business expenses) decreased by a whopping 39%. As more of the data structure is taken up with content, and less is taken up with metadata (think MP3, MP4, and PNG), a better return is seen.

Also, notice that the billable sizes never changed.

The same example as previously presented but with the changes to make the storage smaller.

Completing the Solution

After all of that, we cannot forget that Google demonstrated that collisions are maliciously possible, therefore they cannot be absolutely trusted. We must perform complete byte-by-byte comparisons. Hash functions are useful tools for determining dissimilarity between binary objects, not for determining similarity.

Hash functions are useful tools for determining dissimilarity between binary objects

This is a very useful feature. By using a large hash, I can quickly narrow the required comparisons to almost nothing. There is a good chance I will have no more than one match that needs to be checked, and that is a lot less effort than millions. Once I've narrowed it down, the complete file must be checked with potentially an extra marker to distinguish it from one we already have.

Found this useful or interesting? Consider leaving a tip … it helps.

There is a second problem in that we need to delete records at some point. In this case, we can simply drop the reference the user has to the object, though we must also observe for the time when no user has a reference to the object. When all references are removed to the object, the object itself should be removed. Similarly, if a record is renamed, we need to ensure we remove the old reference before creating the new reference.

I leave these as exercises for the reader.

Further Reading

This demonstration uses JavaScript and PouchDB. I love in-browser demonstrations because they are so portable: you always have an IDE and runtime environment available.

PouchDB is a JavaScript implementation of CouchDB, to expand the solution to something useful at the enterprise level, one of the large-scale CouchDB implementations would be a good start.

It is also worth mentioning that this technique is not limited to any particular system. The same technique could be implemented at the Operating System level using symbolic or hard links. In fact, some combinations of database and FS storage may be an optimised way to go.


A sequel also exists that discusses how to efficiently store the same dataset as it changes over time.

De-duplicating Data Storage II: Storing (almost) the same file twice

More content at PlainEnglish.io. Sign up for our free weekly newsletter. Follow us on Twitter and LinkedIn. Join our community Discord.