Paper as a Digital Storage Medium
Distributing Data in the Present and Preserving Data for the Future
A description of reading and writing data in a way that combines the reproducibility of digital data with the long-term storage capabilities of paper.
An experiment in using barcodes as a storage medium. The intent is to create an EPUB reader that stores its data on…gitlab.com
Two Stories
A story about anonymity
In recent times we have seen a war of information. In Russia, news sources are being silenced for criticizing the invasion of Ukraine. In China, online speech is monitored and can result in punishment for individuals. Saudi Arabia asks neighbours to denounce each other.
My grandparents migrated from Europe to North America after WWII. Europe had troubles after the war and, as with so many other refugees, everything my grandparents possessed had been lost so a move to a new land filled with opportunities captured their imaginations. I grew up on stories passed down to me by my Grandmother and various Aunts, and these inspired me to read more.
One narrative that always struck me was the burning of forbidden books. Naturally, we have seen this in several countries over the centuries as certain belief systems are suppressed, but the most famous would be the raid on the Institute for Sexology in Berlin in Germany in 1933.

The Institut für Sexualwissenschaft was the leading organization dedicated to the study and advocacy of alternate sexuality in Europe, and on May 6th, Government Officials raided the facility. Much of the early research (and advocacy) of gender studies was dragged out into the streets and dramatically destroyed for being “Un-German”.
When people first learn of the story, they are rightly distressed about the knowledge that was forever lost, but there is also a lesson that is learned from the later stories of lost treasure troves being recovered from someone's basement after the war. Here is the way the story goes in my head.
Magnus Hirschfeld publishes a great work and gives all his students copies. The professor and his students are arrested and executed, and their personal libraries are looted and destroyed. Fortunately, Li Shiu Tong, one of his students, had lent the book to an acquaintance. The acquaintance was sympathetic to the NSDAP authorities, but also did not want to cause trouble for his friend, so he put it on his bookshelf and forgot about it. Years later, when he died, his wife put all the books into boxes and stored them in the attic, where they stayed for the next 30 years because nobody was looking for a forgotten book in a forgotten collection.
In my internal narrative, this happens on the East German side, where Stalin continued to suppress homosexuality. The book is completely lost, except for that one accident of it getting put away in a box and forgotten about. It has a chance at a new life when society is ready for change.
The ability to be forgotten and anonymous carries significant power in the dissemination of dissenting opinions.
In the modern era, as information delivery systems have become more robust, we see the same destruction of knowledge taking place, though in a much more subtle manner. As the cost of distribution has been reduced, we have seen data become centralized: it is much easier to visit Wikipedia on your phone than it is to download the page and carry it around. Wikipedia also comes with an open edit history associated with its documents; not all websites are so open.
This leads to two risks:
- There is the risk of the lone copy, held in a single organization's archive, being removed from the library (webserver). In the example above, Hirschfeld had indicated that his library should be donated to the University in the event the Institute was closed. This never happened, and the forced closure was deemed legal, ensuring all copies were destroyed.
- This centralization means that edits to the content can occur with no historic copies being maintained. The edit history, being lost, can never track significant shifting of opinions. History can be changed.
The Internet Archive demonstrates the need for this: websites and content are removed from the internet regularly for reasons as innocuous as cost (part of the reason Git was developed was to protect OSS from being lost when public servers are shut down), and as nefarious as governments shutting down news stations to silence dissent. Central repositories like the Internet Archive help to protect knowledge by allowing us to observe changes, but they also put the knowledge at risk by being the only keepers of history.
By distributing the data across many bookshelves, it is protected from complete loss.
A story about storage
Many years ago, I heard a story. I don't know if it is true, but it carries a valuable lesson.
In the early '90s, an amazing product became accessible that allowed people to generate a lot more data than they ever had, and of a higher quality than ever before: Microsoft Word. What had previously been stored on paper could now be digitally encoded and stored on disk. The archivists loved it; they were stuffing data onto disks left, right, and centre.
In the late '90s, Microsoft upgraded Word.
Into an incompatible format.
There was no way to go back and recover all that long-term stored data. Legally they were not allowed to as it had to be stored exactly as it was placed into storage (and signed off on).
In another twist, magnetic storage degrades over time and is subject to very limited environmental conditions. It is very easy to damage the storage medium.
In the story I was told, archivists at the US Congressional library said “you know what doesn't degrade? Paper.” They just started printing everything to paper, bundling the paper, and storing it away in the existing vaults.
What if there was a way to have the best of both worlds? What if it is possible to have the fidelity of digital storage with the lifespan of paper; the volume of transmission available in Smart Devices, with the anonymity of in-person conversation?
Unfortunately, much of the data that is produced now is dynamic. By “dynamic”, I mean you can interact with the visualisation itself (scroll through a map; rotate a 3D model; filter, search, and aggregate massive datasets); and once it has been printed to paper, that is no longer possible.

Also, it's hard to transfer large tables of data from paper to digital media. Scanning the documents as images and using OCR to collect tables of information loses significant amounts of metadata:
- Data Types must be guessed from the content
- Alignment issues cause data to be considered out of context
- Character fidelity can cause incorrect values to be interpreted
While high-resolution photography and Artificial Intelligence have certainly improved the quality of scanned content, there is still an analogue transfer of data and it will result in some mistakes being made.
Defining the Problem
What if there was a way to have a compromise between the two worlds: the long-term storage of paper, with the high fidelity of digital; the anonymity of a private conversation, with the distribution capacity of a computer network?
What we are looking for is a means to store digital information on physical media such as paper or etched into stone. We might call this “visible” media.
Properties of Digital
Companies, governments, and individuals have a desire to store data for long periods for legal archival purposes. This is hard to do. Over the past 20 to 30 years, the cost of digital storage has fallen as we moved from paper to magnetic storage. This presents a problem for archivists who must store the resulting volumes of data: as it becomes cheaper for us to produce data, it becomes a greater challenge for archivists to store that data.
The data must have a means of simple interpretation: it must be stored in a format that is easily converted to something a human can read. Open Source standards are advantageous as they are unencumbered by intellectual ownership and are readily understood by a larger pool of experts.
Copying digital data is something we take for granted. When we make a copy of digital data, it is an exact copy. For example, music loses some fidelity when it is first recorded, even into a high-resolution format. From that point forward, however, every replication of the song is an exact copy (at the resolution of the bit).
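The "exact copy" property can be checked rather than assumed. The following is a minimal sketch, assuming a SHA-256 fingerprint is an acceptable way to compare copies; the sample payload and the helper name `fingerprint` are illustrative and not part of the prototype.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical payload standing in for a digitized recording.
original = b"An example payload standing in for a digitized recording."
copy_1 = bytes(original)   # first-generation copy
copy_2 = bytes(copy_1)     # a copy of the copy

# Every generation has the same fingerprint, so no fidelity was lost.
copies_match = fingerprint(original) == fingerprint(copy_1) == fingerprint(copy_2)

# Flipping a single bit changes the fingerprint, so damage is detectable.
damaged = bytearray(original)
damaged[0] ^= 0b00000001
damage_detected = fingerprint(bytes(damaged)) != fingerprint(original)
```

The same comparison could be used after a document has been printed and scanned back, to confirm that the round trip through paper returned the original bytes.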
Properties of the Storage Media
While etching into stone or carving into wood are viable options, the weight and volume of these media present a barrier to storage. Linen and cotton sheets represent lighter options but are expensive to produce. Mylar and projector film reduce the size, which offers good potential.
Modern archival paper represents a balance of permanence, weight, and volume. Each of these could (and should) be considered for various purposes; in fact, the solution should be adaptable to all of these media. We discuss paper as the primary medium because it has such a rich evolutionary history as a storage medium.
For a digital storage mechanism to be practical, it must offer a reasonable level of compression. By compression, we refer to the number of bits of information stored per square inch or per pound. This means the data should be recordable in a small physical space, though this must be balanced against the ability to read it back easily.
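As a rough worked example of "bits per square inch", the sketch below estimates the density of a single QR code. It assumes the largest QR symbol (version 40, 177 modules per side, 2,953 bytes in byte mode at error correction level L) with the standard 4-module quiet zone; the printed module size of 0.02 inch is purely an assumption and is not taken from the prototype.

```python
QR_V40_MODULES = 177    # modules per side of a version 40 QR code
QR_V40_L_BYTES = 2953   # byte-mode capacity at error correction level L
QUIET_ZONE = 4          # blank modules required on every side


def bits_per_square_inch(module_inches: float) -> float:
    """Estimate storage density for one printed version 40-L QR code."""
    if module_inches <= 0:
        raise ValueError("module size must be positive")
    side = (QR_V40_MODULES + 2 * QUIET_ZONE) * module_inches
    return QR_V40_L_BYTES * 8 / (side * side)


# Assumed module size: 0.02 inch (about half a millimetre).
density = bits_per_square_inch(0.02)   # roughly 1,700 bits per square inch
```

Shrinking the modules raises the density quadratically, which is exactly the trade-off against readability described above: smaller modules demand a better printer and a better camera.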
Solution
By combining the needs of both these mediums, we can put together existing technologies to create a unique solution. ePUB is an Open Source container format for Electronic Books which offers a standardized (ISO/IEC TS 30135-1:2014) and unencumbered format for a plethora of data. Further, 2D barcodes (in the form of QR Codes) have become ubiquitous as a means of transmitting URLs. Fundamentally, however, they are just binary buffers, capable of storing any encoded sequence of numbers.
ePUB
- Diverse data storage
- Compression
- Accessibility Conformance
- Widely Consumable
The transition from paper publishing to screen-based mediums brought some transitional challenges. PDF was popularized as a means of digitizing paper and acting as an intermediary between paper and digital formats. On the polar opposite end of the spectrum from paper, digitized standards (such as those developed by the W3C) have been optimized for delivery to an unknown display.
HTML introduced the idea of reformatting content to adjust to meet the needs of the consumer. This meant that the text could be read by a screen reader, could reflow for people reading on small screens, or the text made larger for people with poor eyesight. This accessibility of the format gave birth to a plethora of other standards now managed by the W3C. These standards ensure maximum availability to the greatest number of consumers.
ePUB takes advantage of these standards to encapsulate websites into a single document. They embed webpages into a ZIP file format to allow for the contained viewing of the entire website. Generally, the documents are organized into Chapters.
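Because an ePUB is a ZIP archive, it can be inspected with ordinary tools. The sketch below builds a tiny in-memory archive and checks the one convention that identifies the container: the first entry is named `mimetype`, is stored uncompressed, and contains `application/epub+zip`. The archive built here is deliberately incomplete (a real ePUB also needs `META-INF/container.xml` and a package document), and the chapter path is a placeholder.

```python
import io
import zipfile

EPUB_MIMETYPE = b"application/epub+zip"


def build_minimal_container() -> bytes:
    """Build an illustrative, NOT fully valid, ePUB-style ZIP in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry comes first and is stored without compression.
        archive.writestr("mimetype", EPUB_MIMETYPE,
                         compress_type=zipfile.ZIP_STORED)
        # Placeholder chapter; real books carry XHTML, CSS, images, and so on.
        archive.writestr("OEBPS/chapter1.xhtml", "<html><body>...</body></html>",
                         compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


def looks_like_epub(data: bytes) -> bool:
    """Check the mimetype convention only; this is not a full validation."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = archive.namelist()
        return (bool(names)
                and names[0] == "mimetype"
                and archive.read("mimetype") == EPUB_MIMETYPE)


container = build_minimal_container()
is_epub = looks_like_epub(container)
```

A decoder could run a check like this after reassembling the barcodes, as a quick sanity test before handing the file to a reader.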
By using the common ePUB format, anyone would be able to read a digital document and decode it. ePUBv3 allows for JavaScript to be embedded, meaning you could embed maps, interactive diagrams, etc. (like R-shiny, but self-contained). As a general W3C container, it is also possible to embed other file formats for consumption and preservation: datasets as CSV, or evidence in the form of video.
Barcodes
You can encode digital information into barcodes which can then be printed to paper for long-term archiving, and the barcodes can be read back to a digital device for reading.
Two-dimensional barcodes have been used for decades as a means of encoding specialized information. BRML, text, or other data formats have been appended to printed documents, such as driver's licenses and invoices, to supplement the text with digital information. This usually amounts to a unique document identifier or a digital record.

Encoding an ePUB should be trivial, but there are several issues:
- The encoding scheme must be identifiable by a reader (there must be sufficient information embedded in the data to allow a reader to reconstruct the correct form)
- The size of a single book will likely exceed a given 2D barcode's storage capacity. An encoding mechanism will have to be able to span multiple image tiles.
- There is a social issue that must be managed in that humans cannot read the codes directly. It is possible that they do not wish to view the material for legal, religious, or moral reasons. There must be sufficient metadata to allow the viewer to decide not to accept the message.
The issues are easily overcome once identified; adding metadata to the individual tiles in the form of application identifier, pagination, title, author, and subject should offer sufficient information to allow users to interact with individual tiles and reconstruct the data.
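The pagination idea can be sketched in a few lines: split the file into fixed-size pieces, label each piece with its position and the total, and rebuild the file regardless of the order in which the tiles were scanned. The `Tile` structure and function names below are hypothetical and are not the prototype's actual format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    index: int       # 1-based position of this tile
    total: int       # total number of tiles in the document
    payload: bytes   # slice of the document carried by this tile


def split_into_tiles(data: bytes, payload_size: int) -> list[Tile]:
    """Cut data into tiles of at most payload_size bytes each."""
    if payload_size <= 0:
        raise ValueError("payload_size must be positive")
    chunks = [data[i:i + payload_size]
              for i in range(0, len(data), payload_size)] or [b""]
    return [Tile(i + 1, len(chunks), chunk) for i, chunk in enumerate(chunks)]


def reassemble(tiles: list[Tile]) -> bytes:
    """Rebuild the document from tiles scanned in any order."""
    if not tiles:
        raise ValueError("no tiles scanned")
    total = tiles[0].total
    by_index = {tile.index: tile for tile in tiles}
    missing = [i for i in range(1, total + 1) if i not in by_index]
    if missing:
        raise ValueError(f"missing tiles: {missing}")
    return b"".join(by_index[i].payload for i in range(1, total + 1))
```

Because every tile carries the total, a reader can also report progress ("7 of 9 scanned") and tell the user which tiles are still missing.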
A prototype of the concept has been created to demonstrate the capability. The prototype's protocol consists of
- A URL: which points to the reader for either online use (browser only) or installation as a PWA, or just as a unique identifier that this is a compatible format
- A Protocol Version: as changes are made, it is important that the correct decoder be used
- Pagination: the current tile number and the total number of tiles to be converted. This allows for correct sequencing as well as a measure of progress
- Bibliographic: Title, Author, and subject allow a reader to decide if this is content that interests them, or is legal for them to interact with. Filters can be added to prevent accidental downloads from taking up space
- Parental Rating: not so much for parents, but generally for people that are not interested in certain types of content (filtering XXX content from a work device, for example)
- Relevance Date: some content is only valid up to a certain point, and should be ignored after that time (a poster for a concert, for example). This offers a hint to the reader that perhaps the content could be removed or ignored.
With this information in every tile, reading the first image gives the user enough information to decide whether they want to continue or block. If they decide to continue, the pagination can be used to determine the order in which the buffers should be arranged for reconstruction.
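To make the continue-or-block decision concrete, here is a minimal sketch of a per-tile header carrying the fields listed above, together with a filter that runs before any payload is decoded. The JSON encoding, the field names, the rating labels, and the `https://reader.example/` URL are all assumptions made for illustration; the prototype's real wire format may differ.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TileHeader:
    reader_url: str                 # where a compatible reader can be found
    protocol_version: int           # selects the correct decoder
    tile_number: int                # pagination: this tile
    tile_count: int                 # pagination: total tiles
    title: str
    author: str
    subject: str
    rating: str                     # hypothetical labels such as "general"
    relevant_until: Optional[str]   # ISO date, or None if it never expires


def encode_header(header: TileHeader) -> bytes:
    """Serialize the header compactly (JSON is an assumed encoding)."""
    return json.dumps(asdict(header), separators=(",", ":")).encode("utf-8")


def decode_header(raw: bytes) -> TileHeader:
    return TileHeader(**json.loads(raw.decode("utf-8")))


def should_continue(header: TileHeader, blocked_ratings: set[str],
                    today: date) -> bool:
    """Decide from the first tile alone whether to keep scanning."""
    if header.rating in blocked_ratings:
        return False
    if (header.relevant_until is not None
            and date.fromisoformat(header.relevant_until) < today):
        return False   # content has passed its relevance date
    return True
```

Repeating the header in every tile costs some capacity, but it means any single tile found on its own is enough to identify the document and decide what to do with it.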
A prototypic specification is available in more detail.
Various Uses
Secure Archives
Having access to an archive comes with permission issues. Controlling access to information in archives that store sensitive data can be difficult. Using this encoding mechanism acts as an envelope around the content.
In the description of meta-data, the content rating was suggested. It would be very easy to reuse this portion of the protocol to use classification ratings. Users offered access to a secure document could have their specialized reader first check the classification rating of the content before decoding it. If the individual only has sufficient clearance to view some related documents, but some of the documents in the area contain information that exceeds the individual's current clearance, it can act as a secondary filter for viewing it.
Obviously, this would be a tool to assist honest actors within the environment and not a way to interfere with malicious actors, but this is another layer of protection which assists the actors in managing the information in their possession.
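A minimal sketch of that secondary filter follows, assuming classification levels can be ordered from least to most sensitive. The level names and document titles are hypothetical, and, as noted above, this only helps honest readers; it does not protect the content itself.

```python
from enum import IntEnum


class Classification(IntEnum):
    # Hypothetical levels, ordered from least to most sensitive.
    PUBLIC = 0
    RESTRICTED = 1
    CONFIDENTIAL = 2
    SECRET = 3


def may_decode(document_level: Classification,
               reader_clearance: Classification) -> bool:
    """Allow decoding only when the clearance covers the document's level."""
    return reader_clearance >= document_level


def readable_titles(catalogue: dict[str, Classification],
                    clearance: Classification) -> list[str]:
    """List the documents in an area that this reader may decode."""
    return sorted(title for title, level in catalogue.items()
                  if may_decode(level, clearance))


# Hypothetical shelf of documents stored side by side.
shelf = {
    "Visitor guide": Classification.PUBLIC,
    "Staff handbook": Classification.RESTRICTED,
    "Audit findings": Classification.SECRET,
}
allowed = readable_titles(shelf, Classification.RESTRICTED)
```

The reader checks the level from the tile metadata first, so a document above the user's clearance is never reassembled on their device.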
Information Dissemination
Assuming you are in a place where information is controlled, you could print essays and newsletters to paper, which can then be scanned for reading later. For example, the barcodes could be printed in a pamphlet or posted on a bulletin board, and nobody would know who published them (beware of the identifying codes that some printers hide on printouts).
One of the advantages, in this case, is the high compression ratio. In an early test, a hundred-page novel was compressed to 9 pages of barcodes. While still requiring some effort to distribute, the entirety of the novel could be tacked to a corkboard.
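The page count for a given book can be estimated with simple arithmetic: divide the file size by the usable bytes per barcode, then divide the number of barcodes by how many fit on a sheet. The figures below are illustrative assumptions only; they are not the measurements from the early test mentioned above.

```python
def pages_of_barcodes(file_bytes: int, bytes_per_tile: int,
                      tiles_per_page: int) -> int:
    """Estimate how many printed pages a file needs, rounding up."""
    if min(file_bytes, bytes_per_tile, tiles_per_page) <= 0:
        raise ValueError("all arguments must be positive")
    tiles = -(-file_bytes // bytes_per_tile)   # ceiling division
    return -(-tiles // tiles_per_page)


# Assumed figures: a 150,000-byte ePUB, 2,500 usable bytes per barcode
# after metadata, and 6 barcodes per sheet.
pages = pages_of_barcodes(150_000, 2_500, 6)   # 60 tiles on 10 pages
```

Since an ePUB is already a compressed ZIP container, most of the saving comes from the text itself being small; images and embedded media would increase the page count quickly.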
The contents would then convert to something readable on your phone, like an inspirational poster.

Remote Interactive Media
Textbooks, posters, and advertising all have the common element of having to display content in physically contextual locations: a sign in a museum, a poster stapled to a lamp post. Access to network communications is not guaranteed and the audience misses out on an opportunity.
Take, for example, a sign at the top of a mountain congratulating a mountain climber for their successful journey. A digital experience message could be left at the top but would require a WiFi-based website to be configured and powered.
Alternately, storing the immersive experience on the poster itself would allow the digital content to be available, but not require any power for maintenance.
Etching the information into something more permanent, such as wood or stone, may be appropriate in a circumstance such as this.

Conclusion
Anonymous and long-term storage of data and information is necessary. The free dissemination of ideas, and the storage of them for future reference, is a fundamental need for the progress of society. While the digital age has made information access easier than ever, it has introduced a host of new problems in its wake.
The use of paper as a digital storage medium is a novel and useful approach to addressing some of the new problematic circumstances.
If you find this concept of interest, I invite you to review a prototype of the concept on GitLab. There is a mobile application available that can convert an ePUB to images and paper, and convert it back, right on your smartphone.
There are several ways you could contribute:
- Design a UI: I just grabbed an old one from another project to get up and running
- Write a Reader: currently, the app acts as a bookshelf. Some features would be well suited to a custom reader
- Custom Filters: users should be able to filter content by author and title if they come across something that does not interest them.
- Pass some notes around your school: Using a system highlights its problems. Post a club listing to a bulletin board at school using this encoding mechanism.
Submit an issue, post a merge request, or leave a comment below.

UPDATES
Since writing this, I have come across several other interesting and related reads:
2022-11-15
The Internet Archive, Digital Books wear out faster than Physical Books (November 15, 2022)
2023-07-19
NYU Law, The Anti-Ownership Ebook Economy