Archiving Stuff for Offline Use

Posted Jun 18, 2023

Despite the ‘everything on the internet is permanent’ discourse that has been drilled into our heads since childhood, the opposite is true: nothing is permanent on the internet.

Every now & then, I come across dead links on the web, in my bookmarks, in my wiki & so on.

I sometimes check back on saved links, say a certain blog post, & it comes as no surprise when the link is down, whether because the domain has expired or the content has been removed or censored. Unfortunately, there’s often nothing we can do about it.

Everything we see exists on a server. Even if the server is “virtual”, it exists on a real hard drive somewhere. That means anything can happen, from a natural disaster to a genuine mistake.

The content can end up locked behind a paywall, the owner might be breathing their last, or governments or copyright holders might take issue with the site. We can count ‘n’ number of reasons why a site won’t last on the internet.

Kiwix.org

Kiwix.org is a non-profit organization that provides free and open-source software for offline access to web content. Its software, Kiwix, is an offline reader for content like Wikipedia, Project Gutenberg, TED Talks, Crash Course, Wiktionary & so on.

It is available as an application for all mainstream operating systems. The archives Kiwix reads come in a highly compressed format, .zim.
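Since these archives are large downloads, a quick sanity check before importing one can save some confusion, for example when a server hands back an error page saved under a .zim name. The function below is only a minimal sketch: `is_zim` is a hypothetical helper name, and it assumes the ZIM header begins with the four magic bytes `5a 49 4d 04` ("ZIM" followed by 0x04) as given in the openZIM format description. It does not detect a truncated download.

```shell
#!/bin/sh
# is_zim is a hypothetical helper name, not part of Kiwix.
# It reads the first four bytes of a file and compares them with the
# ZIM magic number (bytes 5a 49 4d 04).
is_zim() {
    # od prints the bytes as hex; tr keeps only the hex digits.
    magic=$(od -An -tx1 -N4 "$1" | tr -dc '0-9a-f')
    [ "$magic" = "5a494d04" ]
}

# Example (not run here):
#   is_zim wikipedia_en_all_maxi_2023-05.zim && echo "looks like a ZIM archive"
```

The function returns success when the magic bytes match, so it can be chained with `&&` or used in an `if`.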

You can import zim archives into your Kiwix client easily. Simply launch your Kiwix app, click on open files, select your zim archive & you are done. Here are some zim archives that you might find useful.

library.kiwix.org - official Kiwix library

wikipedia_en_all_maxi_2023-05.zim - 93 GB
wiktionary_en_all_maxi_2023-04.zim - 8 GB
crashcourse_en_all_2023-06.zim - 45 GB
khanacademy_en_all_2023-03.zim - 168 GB
openstreetmap-wiki_en_all_maxi_2023-05.zim - 889 MB
zimgit-post-disaster_en_2023-06.zim - 614 MB
ncert-audiobooks_en_all_2022-05.zim - 41 GB
opentextbooks_en_all_2023-05.zim - 137 MB
gutenberg_en_all_2023-05.zim - 70 GB

Kiwix lets you browse all these collections offline, without any network connectivity. You can save the zims onto a portable SSD or a thumb drive & access an offline copy of the useful web on the go, anytime.
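Before picking a drive, it helps to total up the archives listed above. The snippet below is a rough sketch: the sizes are copied from the list in this post (newer releases will differ), `total_gb` is a hypothetical helper name, and it assumes decimal units (1 GB = 1000 MB).

```shell
#!/bin/sh
# Sizes copied from the list above, one archive per line.
zim_sizes='93 GB
8 GB
45 GB
168 GB
889 MB
614 MB
41 GB
137 MB
70 GB'

# total_gb is a hypothetical helper: it sums the sizes in GB,
# converting MB entries with the assumption 1 GB = 1000 MB.
total_gb() {
    printf '%s\n' "$zim_sizes" | awk '
        $2 == "GB" { sum += $1 }
        $2 == "MB" { sum += $1 / 1000 }
        END { printf "%.2f\n", sum }'
}

total_gb
```

With the sizes listed here, the total comes to roughly 427 GB, so a 500 GB drive would hold the whole set with a little room to spare.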

HTTrack

HTTrack is a free and open-source Web crawler and offline browser. It allows users to download a World Wide Web site from the Internet to a local directory, recursively building all directories and getting HTML, images, and other files from the server onto the user’s local computer.

What makes HTTrack stand out is its ability to preserve the original site’s relative link structure. You can browse a site dumped with HTTrack from link to link, as if you were viewing it online.

It’s extremely easy to save the offline version of a website using HTTrack, especially on a Unix-based OS.

On Ubuntu, run the commands below to install HTTrack & archive the given website for offline access. Preferably, create a separate directory for the dump & run the commands from inside it.

sudo apt-get install httrack 

httrack https://wiredtoolkit.netlify.app

On macOS, run

brew install httrack

httrack https://wiredtoolkit.netlify.app
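To keep each dump in its own directory, as suggested above, the mirror can be wrapped in a small shell function. This is only a sketch under a few assumptions: `mirror_dir_for` and `mirror_site` are hypothetical helper names, `$HOME/archives` is an assumed default location, and `-O` is HTTrack's command-line option for choosing the output path.

```shell
#!/bin/sh
# mirror_dir_for (hypothetical helper): print the directory a site
# should be dumped into, named after the host part of the URL.
#   $1 = URL, $2 = optional base directory (assumed default: $HOME/archives)
mirror_dir_for() {
    host=${1#*://}     # drop the scheme: https://a.b/c -> a.b/c
    host=${host%%/*}   # drop any path:   a.b/c -> a.b
    printf '%s/%s\n' "${2:-$HOME/archives}" "$host"
}

# mirror_site (hypothetical helper): create that directory and let
# httrack write the mirror into it. Assumes httrack is installed.
mirror_site() {
    dest=$(mirror_dir_for "$1" "$2")
    mkdir -p "$dest" && httrack "$1" -O "$dest"
}

# Example (needs network, not run here):
#   mirror_site https://wiredtoolkit.netlify.app
```

Naming the directory after the host keeps several mirrored sites from overwriting each other's files.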

Online Archiving

I don’t see a reason not to list tools that archive websites on the internet itself. So for online archiving, we have:

web.archive.org - famously known as the Wayback Machine. If you sign up for an account, you can track the sites you’ve archived. More importantly, signing up gives you the option to check “save outlinks”. This is very important because it means it will also attempt to save the pages linked from the one you submit, instead of just that one page.
archive.ph - a time capsule for web pages! It takes a ‘snapshot’ of a webpage that is meant to stay online even if the original page disappears. It saves a text and a graphical copy of the page for better accuracy.
archive.is - another online web archive, actually a mirror of the previous one.
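The Wayback Machine can also be used from the terminal. The sketch below only builds the two URLs involved: the availability lookup at `archive.org/wayback/available` and the “save page now” address under `web.archive.org/save/`. `wayback_check_url` and `wayback_save_url` are hypothetical helper names, the target is assumed to be a plain URL with no characters that need escaping, and the save address may be rate limited or behave differently without an account.

```shell
#!/bin/sh
# Both helpers are hypothetical and only build URLs; nothing is
# fetched until the result is handed to curl.

# URL that asks the Wayback Machine whether a snapshot already exists.
wayback_check_url() {
    printf 'https://archive.org/wayback/available?url=%s\n' "$1"
}

# URL that asks the Wayback Machine to capture the page now.
wayback_save_url() {
    printf 'https://web.archive.org/save/%s\n' "$1"
}

# Examples (need network, not run here):
#   curl -s "$(wayback_check_url wiredtoolkit.netlify.app)"
#   curl -s -o /dev/null -w '%{http_code}\n' \
#       "$(wayback_save_url https://wiredtoolkit.netlify.app)"
```

The lookup returns a small JSON document, so checking first avoids requesting a new capture for a page that was archived recently.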

That is all for this post. If you are familiar with better archiving tools or methods, please let me know.

You can reply via mail