>I'm looking for a self-hosted alternative to the Wayback Machine, where you can have the webpages saved along with all the attached elements like pictures, videos, and other stuff like that.
>But the main things I would like: updating of already-downloaded webpages, and the ability for links in saved webpages to point to other saved webpages, just like on the Internet Archive.
You want a WARC file. It's the only standardized web archiving format, and there are several programs to "play" the file. It's the same format the Internet Archive uses, and their software is open source, by the way. ;)
Oh, I would definitely recommend using grab-site (to download the site) and then ReplayWeb.page (the application, not the website!) to access it. It's almost as if you have a working internet connection, but it works completely offline.
A user script run with cron:
>#!/bin/bash
>
>docker exec grab-site sh -c "grab-site https://google.com --no-offsite-links --no-video --no-sitemaps"
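To run that on a schedule, a crontab entry along these lines should do it (the path /usr/local/bin/grab-google.sh is just a placeholder for wherever you save the script above):

```shell
# crontab -e, then add one line, e.g. weekly on Sunday at 03:00:
0 3 * * 0  /usr/local/bin/grab-google.sh >> /var/log/grab-site.log 2>&1
```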
If the command line doesn't scare you too much, you can use grab-site and tune the ignore regex to skip all URLs that don't match the right product-page syntax.
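For example, if (hypothetically) only URLs like /product/&lt;digits&gt; are the pages you want, you can sanity-check your pattern with grep against some sample links before handing it to grab-site's ignore settings:

```shell
# Test a URL pattern against sample links before using it as the basis
# for grab-site ignores. The /product/<id> site layout here is made up.
keep='^https://shop\.example\.com/product/[0-9]+'

printf '%s\n' \
  'https://shop.example.com/product/123' \
  'https://shop.example.com/cart' \
  'https://shop.example.com/product/999?ref=home' \
  | grep -E "$keep"
# prints only the two /product/ URLs; use grep -vE to preview what gets ignored
```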
If you wouldn't mind, it would be awesome if you could share the archives afterwards!
First, try downloading the links using old.reddit.com rather than reddit.com if possible. reddit.com is a lot more complicated.
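One quick way to convert a whole list of links is a sed pass (links.txt is a hypothetical file of URLs, one per line):

```shell
# Rewrite reddit.com / www.reddit.com links to old.reddit.com,
# which serves much simpler HTML that archives more cleanly.
# Lines already pointing at old.reddit.com pass through unchanged.
sed -E 's#^https?://(www\.)?reddit\.com#https://old.reddit.com#' links.txt
```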
Try using a tool such as grab-site. https://github.com/archiveteam/grab-site
You can feed it a list of URLs from a text file (newline-separated) with grab-site -i <file>
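A minimal sketch of that workflow (the URLs are placeholders):

```shell
# Write a newline-separated URL list...
cat > urls.txt <<'EOF'
https://example.com/page1
https://example.com/page2
EOF

wc -l < urls.txt    # should report 2

# ...then feed it to grab-site (run this where grab-site is installed):
# grab-site -i urls.txt
```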
It will record in WARC format which is better for preservation. To play back the WARC, use a tool such as https://replayweb.page
https://github.com/ArchiveTeam/grab-site is a very easy way; you can even put the URLs in a file and pass it to grab-site (with the -i option). It doesn't run JavaScript, unfortunately, so if the website needs JS it won't make a complete backup (it'll back up the JavaScript itself, but not the resources the JavaScript fetches). It saves into WARC, which can be ingested into the Wayback Machine or loaded at replayweb.page
I've never used this script to download a subreddit, but it worked for other types of websites, and there is even a section about downloading subreddits in the documentation.