I just discovered ArchiveBox today: https://archivebox.io/
From what I understand, it's a self-hosted web server that will automatically archive websites pulled from a variety of sources: RSS, XML, JSON, browser history or bookmarks, etc.
Again, I just found it today, so I have no idea how well it does its job. Here is their demo site: https://archive.sweeting.me/
Just looking at a few example pages, each page gets a screenshot, a PDF printout, and an HTML version of the page. So you will hopefully get at least one readable copy of the page.
If you want the Wayback Machine but local and with more flexibility, you could try ArchiveBox. The documentation has great tutorials on usage and installation so I won't go over it here.
If it doesn't work for you, take a look at the awesome-web-archiving list, primarily the acquisition section. It collects multiple tools for archiving and crawling web pages.
There is ArchiveBox (https://archivebox.io/). It does not crawl and create a full clone of a site; it's more like the Wayback Machine: it takes a snapshot of whatever URL you give it, for offline viewing/archiving.
Hang a couple of cheap USB drives on it and use it for:
A big ArchiveBox (https://archivebox.io); you can spread the crawling and indexing load across different Pis: some crawling, some running youtube-dl, etc.
You can also use it as a NAS to back up your desktops and mobile devices with tools like Time Machine, rdiff-backup, etc.
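To sketch the "spread the load" idea: ArchiveBox has a `schedule` subcommand for recurring imports, so each Pi could be given its own feed to pull on a timer. A minimal example (the feed URLs are placeholders, and this assumes ArchiveBox is installed and initialized on each Pi):

```shell
# On Pi #1: re-import a news feed every day, one link deep
archivebox schedule --every=day --depth=1 'https://example.com/news.rss'

# On Pi #2: pull a different feed, so the crawling load is split
archivebox schedule --every=day --depth=1 'https://example.org/videos.rss'
```

Each Pi then maintains its own collection; how you merge or browse them across machines is up to you.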
So, an interesting thing: since iOS 13, iPhones can run SSH commands remotely via Shortcuts.
This opens a lot of doors for doing cool things with iPhones. For instance, I use ArchiveBox to save and archive news articles and websites to my home server. When I find a page I want to archive, I hit the Share button (you can set Shortcuts to appear on the share sheet), tap Archive (the label I gave the shortcut), and it sends the URL off over SSH to be archived on my home server.
It's really quite a convenient solution.
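A sketch of what such a Shortcut effectively runs, assuming ArchiveBox is installed on the server (the user, hostname, and collection path here are placeholders, and the shared URL would be passed in by the Shortcut):

```shell
# Run remotely via the "Run Script Over SSH" action:
# cd into the ArchiveBox collection and snapshot the shared URL
ssh pi@homeserver "cd /opt/archive && archivebox add 'https://example.com/article'"
```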
Quick rundown for anyone who is interested in archival stuff:
You can host your own!
There are free-software options for hosting your own archiving server. ArchiveBox is one I am aware of, and with your own server you dictate what is or is not valid content to retain a copy of.
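For a sense of what "hosting your own" looks like with ArchiveBox, here is one possible setup sketch (assumes ArchiveBox is already installed, e.g. via pip; the directory and port are arbitrary choices):

```shell
mkdir ~/archive && cd ~/archive
archivebox init                        # create a new collection here
archivebox add 'https://example.com'   # snapshot a URL into the collection
archivebox server 0.0.0.0:8000         # serve a browsable web UI
```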
If you are paranoid about someone trying to take your server offline, go with a host that values free speech. NearlyFreeSpeech is in the US, so they may not be thrilled about defending you against copyright claims, but archiving web content is generally considered fair game since it's publicly accessible, and in the early days of the internet mirroring someone's site for them was a courtesy, not "theft". I'm not aware of any hosts outside the US that are known for outright ignoring copyright requests. The EU is probably not a good place to host right now, and in general I don't find the risks associated with hosting out of Russia to be worth the trouble anymore.
Remember to share it, regardless of how you download it. If it gets removed, be ready to spread it around: learn how to create a torrent, configure IPFS so others can grab and host their own copies, put it up on a website. But don't let it die.
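A sketch of those two sharing steps (assumes `mktorrent` and a running IPFS daemon are installed; the tracker URL and archive path are placeholders):

```shell
# Package the archive as a torrent you can seed and pass around
mktorrent -a 'udp://tracker.example.org:1337' -o archive.torrent ./archive

# Add the same directory to IPFS; this prints a CID that
# anyone else can pin to host their own copy
ipfs add -r ./archive
```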
Well the good news is that the individual requirements are all very simple with well established technologies. It would only take a smidge of technical skill and decent organizational thought to get something together.
I wonder if https://archivebox.io/ or another tool, maybe even Wallabag, might be of value. I haven't looked at either in a while, and not with this project in mind. It might be less work to just create something original from existing parts than to wrangle an existing tool.
The other idea I came across that is viable, if you're focused more on the audio/media side of things, would be Plex or another media server.
I'd recommend ArchiveBox: it takes care of extracting videos and media files using youtube-dl, and it also saves pages to Archive.org for redundancy.
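If you want to be explicit about the media extraction, a sketch (this assumes ArchiveBox exposes a `SAVE_MEDIA` config key for its youtube-dl extractor, and should be run inside an initialized collection):

```shell
# Make sure the youtube-dl media extractor is enabled,
# then snapshot a page with embedded video
archivebox config --set SAVE_MEDIA=True
archivebox add 'https://example.com/video-page'
```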
For particularly difficult pages I recommend https://ArchiveWeb.page and https://webrecorder.io, they have the best archival and replay tech for JS-heavy / media-heavy pages.
You could right-click on the page and choose "Save as", which will pull the whole web page down for you to view locally. It only takes a couple of clicks.
If there's no right-click option, try hitting Ctrl + S while viewing the page.
Or use something like ArchiveBox, which will do the same but requires some technical setup.
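A command-line equivalent of "Save as", for when you want to script it, is wget with its page-mirroring flags (the URL is a placeholder):

```shell
# Download the page plus the images/CSS/JS it needs,
# rewriting links so it works offline
wget --page-requisites --convert-links --adjust-extension \
     'https://example.com/article.html'
```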
You could give ArchiveBox[1] a try.
From their website:
> ArchiveBox takes a list of website URLs you want to archive, and creates a local, static, browsable HTML clone of the content from those websites (it saves HTML, JS, media files, PDFs, images and more).
I have a similar issue (~10K bookmarks, having started bookmarking 30 years ago), and I have been writing a Python script to deal with bookmark files that were copied and then diverged (same starting point, but updated and added to under different profiles).
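The core merge step can also be sketched in shell, assuming Netscape-format bookmark exports with `HREF` attributes (the file names here are hypothetical):

```shell
# Pull every HREF out of two diverged bookmark files and
# keep a single deduplicated, sorted list of URLs
grep -o 'HREF="[^"]*"' bookmarks_a.html bookmarks_b.html \
  | sed 's/.*HREF="//; s/"$//' \
  | sort -u > merged_urls.txt
```

This loses titles and folder structure, of course, which is why a real script that diffs the trees is worth writing.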
Anyways, during my due diligence process I came across this interesting little project you might like:
Let me know if you have any feature requests or questions! (I'm the ArchiveBox creator, @pirate on GitHub.)
It actually supports saving outside the install directory, just set the environment variable:
env OUTPUT_DIR=/some/other/path ./archive ...