ArchiveBox Is Software That Can Archive The Ever-Changing Internet
Trust is a rare commodity these days. Developers on Cyberpunk 2077 trusted their CD Projekt Red bosses not to force their faces onto a grindstone with six-day working weeks and horrendous working hours. You trusted CDPR to have a quality game available for download or purchase at some point last year.
You trust us to give you the nitty-gritty on what you need to know about cyberpunk culture, not to tell you lies and not to egregiously change articles to better suit our predictions when we mess up.
It would be easy enough for us to do. When Cyberpunk 2077 eventually hits the virtual shelves, it would be trivial for us to skim through our prior prognostications in ancient articles and force our forecasts to fit the facts. We could claim we were right all along.
Yet another perfect piece of prognostication. We should start a sweepstake syndicate.
Web Pages are Constantly Changing
You trust us to steer clear of such shenanigans and by-and-large, we respect that. Occasionally, we’ll dive in to fix the odd typo or dead link, but that’s it. The content of our articles will not be retrospectively changed to fit an agenda.
Do you trust us? Do you trust what you’ve just read? How do you know that all our awesome alliteration wasn’t added afterwards to alter and augment our ad-revenue?
A web page changes for a bunch of reasons. If, due to some unlikely chain of events, this page becomes wildly popular, people will link to it on their own sites, on social media and out in the wilds of the Fediverse. This page will gain what is called ‘link equity’, which is potentially worth actual money. With so many incoming links from reputable sources, it would make sound financial sense to swap it out for a full-page Viagra pharmacy, a malware-spewing repository or some other shit.
How do you know what was there before? You don’t. Not 100%.
Maybe there’s a copy of it in the Internet Archive’s Wayback Machine, a definitive copy which will last for eternity, guarded by an army of lawyers and the finances to keep it preserved forever.
But even the Internet Archive is vulnerable. Nation states periodically block access. Let’s face it: they make copyrighted work available to you for free and, regardless of the justification, they are essentially facilitating piracy on a massive scale. They occasionally lose things due to accidents.
The inside of the Internet Archive
This isn’t even taking into account the text, images, music and videos lost when publications close their doors and pull down the virtual blinds as they’re bought out, merged, fail or the owners die of blogspam-induced boredom.
The knowledge they hold, those original thoughts and brilliantly crafted long-form essays, are nothing but dead links from someone else’s site or a bookmark leading to nowhere.
When the Great Library at Alexandria burned, humanity lost, at most, 35 million pages. We’re guessing it was mostly Homeric fan-fic, but there was probably something worth reading in there. Since the dawn of the internet, hundreds of millions of domains have expired, been deleted or abandoned. Again, a fair portion of it was probably slash fiction, but at least a part should have been preserved.
In our opinion, it’s important to carry knowledge from the past safely into our glorious augmented-reality, total surveillance, flying car cyberpunk future and you can’t count on anyone else to do it for you.
Sure, screenshots are quick and easy to take. But they only capture a section of the screen, they don’t scale well and they’re super easy to edit for humorous purposes on Facebook or Instagram. PDFs present their own problems. Download the entire webpage as HTML? Get real. How are you going to organize it? How do you plan on searching through all the material? Do you actually want to make archiving your full-time occupation?
As it turns out, there’s some software out there which will do the heavy lifting for you.
What’s in the ArchiveBox?
There are a fistful of archiving solutions out there, all of varying degrees of quality and usefulness. Today we’re focusing on ArchiveBox due to its ease of use, focus on web media and compatibility with a variety of systems. Yes, ArchiveBox was built for Linux and can be installed via apt, but it will also run on macOS and Windows, as it can be run in Docker or as a Python script. Neato. We, and the creators, recommend running it as a Docker image.
If you’ve ever used the Internet Archive’s Wayback Machine, you’ll be able to get to grips with ArchiveBox almost straight away, not because it looks the same (it doesn’t), but because it’s so easy to understand the software’s purpose and how it goes about achieving that purpose.
The bottom line is this: point it at a webpage or a group of webpages and it will pull those pages down to your machine so that you can browse them locally. It doesn’t matter what’s on them so long as they are HTML-based. It will pull the text, images, embedded videos, MP3 files, cookie consent forms and whatever ads were showing at the time. So far, we’ve been able to make perfect reproductions of almost every page we’ve come across. If you have Pi-hole on your network, you’ll be saved from adverts taking up valuable space on your hard drive. You can feed it a list of URLs through the web interface and let rip.
If you don’t mind using the shell, you can feed ArchiveBox the URL of an XML sitemap and then tell it to pull everything down to whatever link depth you require. We considered pointing the software towards one of our own sitemaps and pulling the trigger on the whole damned thing. But consider that one of our ordinary WordPress pages is around 16 MB (including all assets), and that one with embedded videos can reach a gigabyte.
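If you’d like to see what a sitemap would let loose before you pull the trigger, a few lines of Python will list the URLs inside it. This is a minimal sketch of our own, not anything that ships with ArchiveBox: the `extract_urls` function and the `SAMPLE` sitemap are made up for illustration, and it assumes a plain sitemap in the standard sitemaps.org format rather than a sitemap index.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL found in a sitemap document, in order."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical two-page sitemap, purely for illustration.
SAMPLE = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/about</loc></url>"
    "</urlset>"
)

if __name__ == "__main__":
    # One URL per line, ready to count or to paste somewhere else.
    for url in extract_urls(SAMPLE):
        print(url)
```

Count the lines that come out, multiply by your average page size, and you’ll know whether to go shopping for hard drives first.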
We paused after realising that a single webpage with embedded media came to 357.2 MB
Storage Space, Download Speeds and Piracy with ArchiveBox
I don’t know how many videos we have embedded on cyberpunks.com, but I do know that we have limited drive space here. Fortunately, you can pick and choose what types of media you want to preserve. We plan to change the defaults so that video is ignored and then suck the entirety of cyberpunks.com from the net and down onto our own personal machines. It will take a while, so we’ll probably need to leave ArchiveBox running overnight.
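To get a feel for how much difference skipping video makes, here is a back-of-the-envelope sketch. The per-page sizes are the rough figures quoted earlier (about 16 MB for an ordinary page, up to a gigabyte for one with embedded video, so this is a worst case). The page counts are invented, and we’re assuming that a video page with its video ignored weighs about the same as an ordinary page.

```python
# Rough per-page sizes from the article; a gigabyte is taken as 1000 MB.
MB_PER_ORDINARY_PAGE = 16
MB_PER_VIDEO_PAGE = 1000  # worst case for a page with embedded video


def estimate_archive_mb(total_pages: int, video_pages: int, include_video: bool) -> int:
    """Estimate archive size in MB for a site, with or without video."""
    if not 0 <= video_pages <= total_pages:
        raise ValueError("video_pages must be between 0 and total_pages")
    if not include_video:
        # Assumption: without the video, every page costs an ordinary page.
        return total_pages * MB_PER_ORDINARY_PAGE
    ordinary_pages = total_pages - video_pages
    return ordinary_pages * MB_PER_ORDINARY_PAGE + video_pages * MB_PER_VIDEO_PAGE


if __name__ == "__main__":
    # Hypothetical site: 500 pages, 50 of which embed video.
    for include_video in (True, False):
        size_mb = estimate_archive_mb(500, 50, include_video)
        print(f"include_video={include_video}: {size_mb / 1000:.1f} GB")
```

For that imaginary 500-page site, the estimate is roughly 57 GB with video against 8 GB without, which is the difference between an overnight job and a new hard drive.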
Two of the technologies leveraged by ArchiveBox are youtube-dl and wget. You’ll probably be aware of recent youtube-dl controversies, as multiple DMCA requests from the RIAA resulted in its repository being temporarily taken down by GitHub. Yes, youtube-dl can be used for copyright infringement. Wget is a free software package for retrieving files using HTTP, HTTPS, FTP and FTPS and is, at least according to Herr Bischoff’s Bot Database, “usually up to no good unless you explicitly host downloads that are to be automatically retrieved”.
And here’s an important point. Archiving websites is piracy in the same way that home taping is killing the music industry. You’re taking copyrighted materials and you’re making an unauthorized copy of them. In our opinion, it’s not an especially big deal. It’s unlikely that you’ll have FBI agents banging on your door. But as we occasionally feel the need to restate, we’re not lawyers. While you technically could point ArchiveBox at a YouTube playlist page and tell it to get going, there are far more efficient ways of getting your illegally downloaded music fix, and that isn’t what the software was designed for.
The balance between the permanence and ephemeral nature of content on the internet is part of what makes it beautiful. I don’t think everything should be preserved in an automated fashion, making all content permanent and never removable, but I do think people should be able to decide for themselves and effectively archive specific content that they care about.
You don’t need to run dedicated server hardware to have a great experience with ArchiveBox. It will happily start up its own server with a web front end accessible through localhost. Or you can do everything via the command line. Being the kinds of people we are, we have ArchiveBox installed on a Raspberry Pi, open to the web at large and accessible through a normal URL, so we can check our saved pages wherever we are in the world. YMMV.
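For the command line crowd, the whole routine boils down to a handful of calls. The sketch below only builds the commands and prints them. The subcommands (`init`, `config --set`, `add`, `server`) and the `SAVE_MEDIA` option are the ones we know from the ArchiveBox documentation, but treat them as assumptions and check `archivebox help` against your installed version. The `plan_commands` and `run_all` helpers and the example URL are our own invention.

```python
import subprocess


def plan_commands(urls, save_media=False, bind="127.0.0.1:8000"):
    """Return the ArchiveBox CLI calls as argv lists, without running them."""
    commands = [
        # Set up a new archive in the current directory.
        ["archivebox", "init"],
        # Assumed config option: whether audio/video media is downloaded.
        ["archivebox", "config", "--set", f"SAVE_MEDIA={save_media}"],
    ]
    # One "add" call per URL to archive.
    commands += [["archivebox", "add", url] for url in urls]
    # Start the web front end on the given address and port.
    commands.append(["archivebox", "server", bind])
    return commands


def run_all(commands):
    """Run each command in order; the final server call blocks until stopped."""
    for argv in commands:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Print the plan only; call run_all(...) yourself once it looks right.
    for argv in plan_commands(["https://example.com/"]):
        print(" ".join(argv))
```

Printing the plan first is deliberate: you get to read what’s about to happen before anything touches your drive. If you do call `run_all`, do it from inside the folder you want the archive to live in.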
Content is stored in a variety of formats. In addition to the actual webpage and associated content you see in your browser, you’ll get a generated PDF, screenshots, an HTML dump, browsable media and a gzipped WARC (Web ARChive) file which is nice.
There’s really no limit to what you can do with ArchiveBox. You can start archiving now and never stop, filling up drive after drive in a ceaseless and hopeless quest to preserve the entire web. Room after room in your house will be taken up by server racks. Your power bill will shoot through the roof as you eventually attempt the impossible task of archiving your own archive, causing a paradox and tearing a rift in the space-time continuum.
Don’t blame us. Blame Nick Sweeting.