Holes in Web History
I am a great admirer of the Internet Archive. It represents the greatest single repository of dead Web-related matter that is publicly available today, and whenever I give a talk or send an e-mail about my own modest work in this field, I always mention it in glowing terms.
Still, one cannot fully serve the cause of digital history preservation without pointing out that there are flaws in many of the artifacts in Archive.org's collections, and that these flaws are already making it difficult for Web historians to glean much more than a surface understanding of what many of the early Web pioneers were up to.
What am I talking about here? Well, a picture is worth a thousand words, so here are a few screengrabs taken this week from exhibits in archive.org's collection. (Caution: these files are fairly large. I intend to put lower-resolution images up soon, but am not currently at a machine that has a copy of Photoshop aboard it.)
Exhibit A: IBM.com, circa 1999
As you can see, the page, as preserved, reveals very little about what the IBM webmasters intended to show back when they put this page up in 1999. It's not even clear that it's an IBM page, because even the famous "IBM 8-bar" logo is missing, and so is a date-stamp. One can presume that the large grey areas of the page contained some sort of graphical images, or perhaps Java elements (I very much doubt whether the very conservative IBM would have ever deigned to put Shockwave on its pages!). But in truth, the grey areas could have contained almost anything, and given that it is unlikely, although not impossible, that any other copies of this page exist anywhere, future historians will be left with far more questions than answers as they ponder this page.
Exhibit B: MSNBC.com, circa 1998
This page is a bit better preserved than IBM's, but not by much. One can readily identify the MSNBC logo, and see a date-stamp, but the most important story of the day - the one that would have appeared in the large blue area - is missing, as are what appears to have been a "scrolling headline" that would have appeared in an area immediately above the blue area, and the navigation bar to its left. What did MSNBC's editors think were the most compelling issues on the date of April 28th? We'll never know.
Exhibit C: TalkMagazine.com, circa 2002
Could any article about Bitrot be complete without referring at least once to Tina Brown's extraordinarily ill-conceived venture into cyberspace known as talkmagazine.com? Of course not! Unfortunately, future historians using archive.org to see exactly what Ms. Brown built with her millions of dollars of funding will be hard-pressed to say anything more than "we know that there was a site called talkmagazine.com and it seems to have used frames and a bunch of navigation buttons along the top". Of course, I have chosen this example not just to skewer the honorable Ms. Brown, who now has a spectacularly well-paying job at the Washington Post while other better writers are starving for a single paying assignment, but to point out the fact that a screenshot in my own collection happens to do a better job of illustrating what Ms. Brown's site was doing. I hope, but by no means expect, that future social critics of our celebrity-glutted age will thank me sometime after I've died.
All right - you've seen my three exhibits. So what's the "takeaway" from them? Well, it certainly is not to bash archive.org - which has done more to preserve our collective digital history than any single institution I'm aware of. In many, many cases, the Web sites in archive.org's collection are much better preserved than the three you've seen here - especially those that didn't use fancy home page elements such as Java, Shockwave, or other history-resistant dynamic elements.
This brief tour of inadvertently-induced "bitrot" does, however, show some of the limitations of the robotic spidering approach to compiling digital archival matter. Spiders and robots, however efficient, have not had the capability of recognizing the presence of dynamic elements as a condition warranting any special action by a human being. Perhaps some day they'll have this ability, but this doesn't help any of the three cases above. Whatever they looked like a few years ago, whatever content was served up - well, it's as much a mystery now as it will be in a thousand years.
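For the curious, here is a rough sketch of what "recognizing the presence of dynamic elements" might look like in a spider. It is purely illustrative and is not how archive.org's crawler works, as far as I know. The function name needs_human_review and the list of tags are my own assumptions. The idea is simply to scan a fetched page for the tags that usually pull in Java or plug-in content and to flag the page so that a person can take a screenshot before it is too late.

```python
from html.parser import HTMLParser

# Assumed list of tags that usually load Java or plug-in content
# which a crawler fetching plain HTML cannot capture.
DYNAMIC_TAGS = {"applet", "object", "embed"}


class DynamicElementFinder(HTMLParser):
    """Collects the names of dynamic-content tags seen in a page."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag names already lowercased.
        if tag in DYNAMIC_TAGS:
            self.found.append(tag)


def needs_human_review(html_text):
    """Return the sorted, de-duplicated dynamic tags found in html_text.

    An empty list means the page looks like plain HTML and images.
    (Hypothetical helper, named for this illustration only.)
    """
    finder = DynamicElementFinder()
    finder.feed(html_text)
    return sorted(set(finder.found))


if __name__ == "__main__":
    page = '<html><body><APPLET code="Ticker.class"></APPLET></body></html>'
    flagged = needs_human_review(page)
    if flagged:
        print("Flag for a human screenshot; found:", ", ".join(flagged))
```

A check like this says nothing about what the missing element actually showed. It only tells the archivist that a hole is likely to appear at that spot.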
Should we mourn and tear our hair out because archive.org's history collections aren't perfect? Of course not. Nobody in his/her right mind believes it's possible or desirable to save every last bit that's been churned through cyberspace over the last ten years. One might have hoped that IBM and MSNBC - major, well-funded sites that one might have thought "socially significant" entities - might have survived the historical mill with fewer broken parts. But both of these organizations clearly have the wherewithal to have performed internal archiving on their own, and if they're particularly bugged about this, well, they can simply FedEx a tape drive over to archive.org's offices in San Francisco's Presidio. And if they don't, well, they obviously don't give a damn about the problem, in which case history will give them what they deserve: obscurity.
One thing is clear, however, and it's a lesson worth noting by anyone building Web sites today: if you want people in the future to see your site the way you intended it, you really should eschew fancy, faddish "gimmick du jour" technologies and stick to good old-fashioned plain vanilla HTML, GIFs, and JPEGs. Unless you resist the temptation to load up your pages with Java, Shockwave, Flash, XML and other fancy-dancy presentational goodies, the only thing that people of the future may see when they key in your URL is a big grey hole in Web history.