Andrew Louis

    Pass by reference; pass by copy

    Aug 13, 2017

    Ted Nelson came up with the term “hyperlink” in the 60s and imagined them working quite differently from today’s. Here’s how researcher Belinda Barnet describes how they would have worked in Nelson’s Project Xanadu:

    […] no links would ever be broken, no documents would ever be lost, and copyright and ownership would be scrupulously preserved. The Magical Place of Literary Memory: Xanadu. In this place, users would be able to mark and annotate any document, see and intercompare versions of documents side by side, follow visible hyperlinks from both ends (‘two-way links’) and reuse content pieces that stay connected to their original source document.

    Unfortunately, Project Xanadu always ran a bit better in Ted Nelson’s head than as working code.

    Tim Berners-Lee took an implementable subset of Nelson’s ideas and the World Wide Web was born. Links on the web are one-way and a linked object doesn’t know what links to it. The object is also the only place where this data is stored, which means that if that server goes down, the data goes with it.

    In programming terms, the web uses pass by reference instead of pass by value.

    Bret Victor talks about how the Library of Alexandria used pass by reference and things didn’t work out too well for it. On the other hand, the original Memex used pass by value. Same with nature:

    It’s interesting that life itself chose Bush’s approach. Every cell of every organism has a full copy of the genome. That works pretty well — DNA gets damaged, cells die, organisms die, the genome lives on. It’s been working pretty well for about 4 billion years.

    Obviously we shouldn’t oversimplify and declare one to be better than the other, but it’s a good exercise to imagine these alternative histories of how the internet could have gone.

    In last week’s newsletter, I talked about importing my high school MSN logs. One frustrating thing about reading through these old conversations is that the majority of the links don’t work anymore. Not having easy access to the content of the links makes a lot of conversations hard to follow (“thoughts on this? [link]”). In the short term, I think I’m going to generate archive.org links based on the time period of the conversation and add them to the Memex interface.
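
    As a rough sketch of that short-term plan (not the actual Memex code), the Wayback Machine accepts a YYYYMMDDhhmmss timestamp in the URL path and redirects to the closest capture it has. The function name and the sample message time below are made up for illustration, and the timestamp is assumed to be in UTC.

```python
from datetime import datetime

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, seen_at: datetime) -> str:
    """Build a Wayback Machine URL for the capture closest to `seen_at`.

    `seen_at` is assumed to already be in UTC, which is what the
    Wayback Machine uses for the timestamp in the path.
    """
    stamp = seen_at.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{stamp}/{original_url}"


# Hypothetical example: a link shared in a chat on the evening of 2005-03-14.
message_time = datetime(2005, 3, 14, 21, 30, 0)
print(wayback_url("http://example.com/page", message_time))
# → https://web.archive.org/web/20050314213000/http://example.com/page
```

    Whether a capture exists near that date is a separate question; the URL only asks for the nearest one.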

    But the more I work on this project, the more I play around with the idea of saving an archive of every link visited. It’s neither technically hard nor expensive to store. I’m pretty convinced I could save every link I look at as well as my full photo collection and the rest of the Memex data for less than $5/mo in Amazon S3 charges. 99% of these links will never be looked at again, but for the times it’s useful, it’s very useful.

    When was this photo taken??

    Photos are a big part of our personal data archives. In addition to the memories they capture, we also use them to record whiteboard notes or remember items on a shelf of a store. They should be one of the most important datasets in a Memex.

    But there’s one technical problem that’s made me hesitate to start importing all my photos: it’s very hard to tell exactly when a photo was taken.

    A JPEG photo generated by your phone or camera has three different types of timestamps available. Here is why each of them can’t be relied on:

    • EXIF timestamps. Photos store metadata in a special part of the file called an “EXIF tag.” There are three timestamps in here (Date and Time, Date and Time (Original), and Date and Time (digitized)) which are more or less the same thing. Unfortunately, none of these timestamps includes timezone information!
    • The photo file’s Modified At timestamp. It’s a proper unix timestamp which in theory makes it exactly what we want; however, depending on how photos are synced off your phone, it might actually get set to the time of the sync, not the original capture time.
    • GPS timestamp. GPS satellites rely on highly accurate atomic clocks to figure out position. A timestamp (with timezone) from this process is saved in the EXIF tag, but it represents the time of the last GPS sync, not the time that the photo was taken, and there’s almost always a difference. Also, it’s often not available (subways, airplanes, etc.).

    My general solution is using the EXIF timestamp (without timezone) and then using GPS latitude/longitude to determine which timezone the photo was taken in. For photos without geo data, we can either use a default timezone or try to figure it out based on earlier/later photos.
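
    Here is a minimal sketch of that approach, assuming Python 3.9+ with the standard zoneinfo module and system timezone data. The names photo_taken_at and toy_lookup are hypothetical, and the coordinate-to-timezone step is passed in as a function because a real lookup needs timezone boundary data (usually from a third-party library), which is not shown here.

```python
from datetime import datetime
from typing import Callable, Optional
from zoneinfo import ZoneInfo

# EXIF stores timestamps like "2017:08:13 14:05:09", with no timezone.
EXIF_FORMAT = "%Y:%m:%d %H:%M:%S"


def photo_taken_at(
    exif_datetime: str,
    lat: Optional[float],
    lon: Optional[float],
    lookup_timezone: Callable[[float, float], Optional[str]],
    default_timezone: str = "UTC",
) -> datetime:
    """Attach a timezone to a naive EXIF timestamp.

    Uses the GPS coordinates when present, otherwise the default timezone.
    """
    naive = datetime.strptime(exif_datetime, EXIF_FORMAT)
    tz_name = None
    if lat is not None and lon is not None:
        tz_name = lookup_timezone(lat, lon)
    return naive.replace(tzinfo=ZoneInfo(tz_name or default_timezone))


def toy_lookup(lat: float, lon: float) -> Optional[str]:
    # Hypothetical stand-in for a real boundary-based lookup: it only
    # distinguishes "west of Greenwich" from "east of Greenwich".
    return "America/Toronto" if lon < 0 else "Europe/Berlin"


print(photo_taken_at("2017:08:13 14:05:09", 43.65, -79.38, toy_lookup))
print(photo_taken_at("2017:08:13 14:05:09", None, None, toy_lookup))
```

    The second call shows the fallback for photos without geo data. Guessing from the timezone of earlier or later photos would slot in where default_timezone is used.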

    “Who cares about such extreme accuracy?” you might ask. Being able to build an accurate timeline of sequential events from diverse data sources is at the core of the magic of this Memex. It would still be possible to view photos in approximate order otherwise, but being able to collate them with other data sources lets me answer queries like “find photos taken while with Michal” or “find photos of notes taken inside a train.”
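
    To make that concrete, here is a small sketch of such a query, assuming every data source has already been normalized to timezone-aware timestamps. The Interval type, the labels, and the sample data are all invented for illustration and are not the Memex schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class Interval:
    """A labelled span of time from some other data source."""
    label: str
    start: datetime
    end: datetime


def photos_during(
    photo_times: Dict[str, datetime], intervals: List[Interval], label: str
) -> List[str]:
    """Return photos whose timestamp falls inside any interval with `label`."""
    matching = [iv for iv in intervals if iv.label == label]
    return [
        name
        for name, taken_at in photo_times.items()
        if any(iv.start <= taken_at <= iv.end for iv in matching)
    ]


def utc(hour: int, minute: int) -> datetime:
    # Helper for the sample data: a time on 2017-08-12, in UTC.
    return datetime(2017, 8, 12, hour, minute, tzinfo=timezone.utc)


photos = {"notes.jpg": utc(9, 15), "lunch.jpg": utc(12, 40), "sunset.jpg": utc(20, 5)}
events = [
    Interval("on a train", utc(9, 0), utc(10, 30)),
    Interval("with Michal", utc(12, 0), utc(14, 0)),
]

print(photos_during(photos, events, "on a train"))   # → ['notes.jpg']
print(photos_during(photos, events, "with Michal"))  # → ['lunch.jpg']
```

    If the photo timestamps were off by a timezone's worth of hours, both queries would silently return the wrong photos, which is why the accuracy matters.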

    I haven’t started working with video files from my phone yet but apparently they’re even more complicated.

    If you’re a developer working on an API, for the sake of all future obsessive archivists, please make sure to add accurate timestamps!

    Importer system update

    An update on the project to turn this Memex into an installable app: the data importing system can now be managed through the Electron app.

    Here’s a recap of how data gets into the Memex:

    • For each third-party provider, I have an Extractor class that knows how to access the API or read the type of files we’re after. This data is fed into a Transformer class which knows how to reorganize the data to fit the schema of the Memex. Finally, each object is passed to a Loader class which knows how to send it to the Memex API.
    • A job represents a combination of an Extractor, Transformer, and a Loader, along with options. For instance, one job might represent reading ~/.bash_history, transforming it into the right type of ‘commanded’ activity, and uploading each into my Memex.
    • The schedule controls how often these jobs are run.
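
    The recap above can be sketched roughly as follows. This is an illustration of the Extractor / Transformer / Loader shape in Python, not the Memex's actual code: the class names, the activity fields, and the in-memory loader (standing in for calls to the Memex API) are all assumptions, and scheduling is left out.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Iterator, List


class BashHistoryExtractor:
    """Yields raw shell commands from the lines of a history file."""

    def __init__(self, lines: Iterable[str]):
        self.lines = lines

    def extract(self) -> Iterator[str]:
        for line in self.lines:
            command = line.strip()
            if command:  # skip blank lines
                yield command


class CommandedTransformer:
    """Reshapes a raw command into a hypothetical 'commanded' activity."""

    def transform(self, command: str) -> Dict[str, Any]:
        return {"type": "commanded", "command": command}


class InMemoryLoader:
    """Stand-in for a loader that would send each object to the Memex API."""

    def __init__(self) -> None:
        self.loaded: List[Dict[str, Any]] = []

    def load(self, activity: Dict[str, Any]) -> None:
        self.loaded.append(activity)


@dataclass
class Job:
    """One Extractor + Transformer + Loader combination, plus options."""
    extractor: Any
    transformer: Any
    loader: Any
    options: Dict[str, Any] = field(default_factory=dict)

    def run(self) -> int:
        """Push every extracted item through the pipeline; return the count."""
        count = 0
        for raw in self.extractor.extract():
            self.loader.load(self.transformer.transform(raw))
            count += 1
        return count


history_lines = ["ls -la\n", "\n", "git status\n"]
job = Job(BashHistoryExtractor(history_lines), CommandedTransformer(), InMemoryLoader())
print(job.run(), job.loader.loaded)
```

    In a real job the extractor would be handed an open file such as ~/.bash_history instead of a list, and the loader would make API requests instead of appending to a list.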

    Here’s a screenshot from the dashboard:

    [Screenshot: importer dashboard]

    If you’re interested in beta testing the Memex and are willing to get your hands dirty a bit, please shoot me an email!