Look at the ‘--post-file=file’ option.
This example shows how to log in to a server using POST and then proceed to download the desired pages, presumably only accessible to authorized users:
>wget --save-cookies cookies.txt \
>     --post-data 'user=foo&password=bar' \
>     http://server.com/auth.php
>wget --load-cookies cookies.txt \
>     -p http://server.com/interesting/article.php
If the server is using session cookies to track user authentication, the above will not work because ‘--save-cookies’ will not save them (and neither will browsers) and the cookies.txt file will be empty. In that case use ‘--keep-session-cookies’ along with ‘--save-cookies’ to force saving of session cookies.
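In that case the first command would become something like this (same placeholder credentials as above):

wget --save-cookies cookies.txt --keep-session-cookies \
     --post-data 'user=foo&password=bar' \
     http://server.com/auth.php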
wget -c -r -A flv,wmv,mp4 --no-parent http://admin.fetishbox.com/_content/
Edit: /u/BombTheFuckers comments (at some cost in karma): > Not cool. Bandwidth costs money, too.
He's not wrong. And anyway, downloading "all of this" would take all night (slower and slower as other leeches join) until the site owner realizes what's going on and plugs the leak before you can finish.
I didn't get an exact figure, but there's well over 200GB of content here, and most of it is low-resolution. The better-quality Kink.com stuff seems to be mostly 340x460, which is far from the best available from Kink.com. I'd only recommend siteripping if you have very poor vision, or need filler for your tube site.
Incidentally, there's a subreddit for open directories: /r/opendirectories
I think the easiest way would be not to use Firefox in this case. Instead you probably should use wget, a Unix/Linux tool that's also available for Windows.
It's the tool of choice used by the people over at /r/opendirectories to download files while preserving the folder structure. Easy tutorials on how to install and use wget are available there, and they're definitely able to help you should the need arise.
I'm not really sure what you need. Do you want the entire page to be accessible offline? Or do you just want the images? Or images + pesterlogs + narration?
And lastly, how comfortable are you with scripting?
There is a Unix utility called wget which will just fetch anything from a given URL. I'm not sure of the best way to fetch the whole comic. You can use wget to fetch a single page, but it will be missing all the external things that make the page look pretty (like the Flash animations, the actual comic panels, and most formatting). This would get you the pesterlogs, since they are hard-coded into the pages.
If you want the images, they seem to be hosted at pages with the following format:
http://cdn.mspaintadventures.com/storyfiles/hs2/#####.gif
Where "#####" is a sequential 5-digit number (going up from 00001). You can get it to download everything using a for-loop in your favorite language (a shell script would be easiest but you can do it in python too).
If you're on Windows and are scared of words like "Unix utility", then there should be Python packages that can do all this for you.
Hope this is helpful.
You need to alter the options you're using a bit. For a start, turn on recursive downloading (-r). Also, most of the pictures on the subreddit aren't actually hosted on reddit.com, so you'll need to use the Spanning Hosts options. I'm sure other people can fine-tune your command usage more precisely...
Also keep in mind that wget is mostly designed for sites with relatively simple, static designs - sites like Reddit can pose problems whether or not your wget usage is correct.
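As a rough starting point, something like this might do it (the subreddit name, image hosts, and extensions are placeholders; -H/-D restrict which foreign hosts wget is allowed to follow):

wget -r -l 1 -H -D i.imgur.com,i.redd.it -A jpg,jpeg,png,gif https://www.reddit.com/r/example/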
I'm certain there's a way to do it with Powershell (since you can do damn near anything with enough code), but choose the right tool for the job.
wget has a built-in switch to mirror a website/URL, including following links & retrieving images. Minimal additional scripting needed.
The site uses a header field, 'content-disposition', which when using a browser will download a file.
Luckily, wget has an experimental feature (wget docs) supporting that header type:
wget --content-disposition 'https://forums.alliedmods.net/attachment.php?attachmentid=112169&d=1352928476'
Which will save the desired file to your current directory.
I believe it's hosted on Amazon S3, so infrastructure speed shouldn't be an issue there.
How are you downloading it? Use a program like Wget that provides a --continue option so you don't need to redownload from the beginning if it fails.
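For example, something like this (placeholder URL); rerunning the same command after a failure picks up where the partial file left off:

wget --continue http://example.com/big-download.iso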
This might help... found it here.
http://www.gnu.org/software/wget/manual/wget.html#Recursive-Accept_002fReject-Options
2.12 Recursive Accept/Reject Options
‘-A acclist --accept acclist’
‘-R rejlist --reject rejlist’
Specify comma-separated lists of file name suffixes or patterns to accept or reject (see Types of Files). Note that if any of the wildcard characters, ‘*’, ‘?’, ‘[’ or ‘]’, appear in an element of acclist or rejlist, it will be treated as a pattern, rather than a suffix. In this case, you have to enclose the pattern into quotes to prevent your shell from expanding it, like in ‘-A "*.mp3"’ or ‘-A '*.mp3'’.
‘--accept-regex urlregex’
‘--reject-regex urlregex’
Specify a regular expression to accept or reject the complete URL.
‘--regex-type regextype’
Specify the regular expression type. Possible types are ‘posix’ or ‘pcre’. Note that to be able to use ‘pcre’ type, wget has to be compiled with libpcre support.
You'd have to specify the directories in a list then. Either output all the directories you currently have for the exclude option, or the ones left for the include option.
http://www.gnu.org/software/wget/manual/html_node/Recursive-Accept_002fReject-Options.html
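In other words, something along these lines (the paths are placeholders; -X excludes the listed directories, -I limits the crawl to them):

# skip the directories you already have...
wget -r --no-parent -X /archive/2010,/archive/2011 http://example.com/archive/
# ...or fetch only the ones you still need
wget -r --no-parent -I /archive/2012,/archive/2013 http://example.com/archive/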
You could use wget or curl to POST data to a URL like speedtest does. Check out "--post-file" in the HTTP Options section of the manual. http://www.gnu.org/software/wget/manual/html_node/HTTP-Options.html
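A rough sketch of that idea (the upload URL is a placeholder; the dummy payload just gives wget something of known size to POST):

dd if=/dev/zero of=payload.bin bs=1M count=10          # 10 MB of dummy data
time wget --post-file=payload.bin -O /dev/null http://example.com/upload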
The OP is supposedly using SSH to log in remotely to another, presumably a coworker's, computer. The screencap is of htop, a command-line process manager (similar to ctrl+alt+del, but for the command line). The OP is looking at/highlighting a running process called wget, which is a command-line downloader (i.e. wget www.whatever.com/file.avi will download the .avi to the current directory). It looks like not only is the owner of that box using wget to download pornography, he's also running it as root, which is an account that shouldn't be used unless you're administering the system.
Probably the quickest way to do it is with wget. You can specify your user name and password with the --post-data option. You can put this in a .sh file (os x/linux) or a .bat file (windows) so you just need to run the script when you want your IP.
In the simplest case, you would just need to run a single wget command that submits to the same URL as the form does and supplies your username/password.
That being said, there may be some other things that complicate this for you:
the form submission may require submitting a nonce that is provided on the login page, which would require you to do two wgets (the first to get the page + nonce, the second to submit it).
They may do some session-tracking which begins when you first get the empty form. You may need to use the wget options related to this (--save-cookies + --keep-session-cookies / --load-cookies)
There may be some client-side code which causes the page to submit something other than your password: for example, I've seen management pages for routers which base64-encode the password before submitting it.
It's also worth pointing out that at some point in this process, you're going to need to keep a plaintext (unencrypted) copy of your username/password around in a place that this script can access. So, if somebody found your laptop, this would be a Bad Thing.
edit: clarification
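Putting the simple case and the cookie caveat together, a rough sketch might look like this (the URL and form field names are guesses; check the login form's HTML for the real names, and for any nonce field you'd have to extract from login.html):

# step 1: fetch the login page so any session cookie (and nonce) gets set
wget --save-cookies cookies.txt --keep-session-cookies -O login.html http://example.com/login
# step 2: submit the form with the saved cookies
wget --load-cookies cookies.txt --save-cookies cookies.txt --keep-session-cookies \
     --post-data 'username=me&password=secret' -O result.html http://example.com/login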
No, this is a terrible approach for so many reasons I don't know where to begin... sorry, that sounds harsh, but it's true.
Ok, if you want to run a query every day, the simplest approach would be to use something like cron. You would set up your cron job to hit an endpoint (using wget, for example) that your plugin will recognize, and run the query at that point in time.
There is a WordPress way of doing this using wp_schedule_event, but this relies on your website receiving traffic for it to check whether the event needs to run, so basic cron jobs are more reliable.
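For example, a crontab entry along these lines (URL and token are placeholders; the plugin would run the query whenever it sees a request to that endpoint):

# run every day at 03:00, discard the output
0 3 * * * wget -q -O /dev/null "https://example.com/?run_daily_report=SECRET_TOKEN"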
There are loads of courses and documentation out there about all this stuff but hopefully I've pointed you in the right direction.
Take a look at the -A acclist / --accept acclist option.
I've only used it to download specific extensions, but from what I've read, you should be able to use wildcards to download specific patterns, e.g. -A '*ArthurDent*' to download all files with "ArthurDent" in the filename.
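So something like this (placeholder URL; the quotes keep the shell from expanding the wildcard, as the manual excerpt above mentions):

wget -r --no-parent -A '*ArthurDent*' http://example.com/files/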
Maybe just use this. Looking at the imgur page, I think using the album names as file or directory names will take some real scripting. There are some recursive-download options in wget that make me think that, with those options properly configured, you could at least get the same results as the above shell script.
If this is a one-time operation, I'd use a recursive wget to make a local copy of the article pages and then use something like hxselect to separate the article text from the HTML files.
If this is something I'd need to do regularly, then I'd just make a bash script of the above.
In either case, I don't think JavaScript is a practical tool for this job.
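A sketch of the one-time version (the site URL and the CSS selector are placeholders; hxnormalize/hxselect come from the html-xml-utils package):

wget -r -l 1 --no-parent -A '*.html' http://example.com/articles/
for f in example.com/articles/*.html; do
    hxnormalize -x "$f" | hxselect -c 'div.article-body' > "${f%.html}.txt"
done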
Use wget.
Options you'll need, off the top of my head:
--page-requisites
--convert-links
Check the man page for other options. Wget is perfect for getting readily available data/content without bothering the client. There are Win and *nix versions.
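Something along these lines, for instance (placeholder URL; --wait just spaces out the requests a little):

wget -r --page-requisites --convert-links --wait=1 http://example.com/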
Try wget.
You can use a command similar to the following:
wget -r -nH --cut-dirs=2 --no-parent --reject="index.html*" http://mysite.com/dir1/dir2/data
Oldschool people use wget.
EDIT: Oops, it was supposed to be a reply to this comment.
If you use things like wget or downthemall this wouldn't be an issue nearly as much, since they can continue after being interrupted.
From the image I can see, you are better off using DownThemAll to download the files. I would not recommend downloading the entire folder, since you will end up with a lot of duplicates (720p and 480p versions of the same video).
Also, unless you have a commercial internet line, most ISPs will have issues with you downloading 13 TB! I would suggest you stagger your downloads over several weeks. Check the fine print for fair-usage policy details. How fast is your line, anyway?
(Fun fact: per Wolfram Alpha, it would take me 6 years and 160 days to download the entire 13 TB over my 512 Kbps line!)
Edit: If you are really serious about this, look into a command called "wget". I've only used it to download single files, but the tool is pretty powerful and can download entire folders and re-create the same folder structure on your local machine. It is built into all Linux/OS X machines, but Windows versions are also available. I think you can also tell wget to look for specific patterns so you only download one type (480p or 720p) of file.
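If you do go the wget route, a starting point might look something like this (the URL and the '720p' pattern are guesses at the folder's naming scheme; -c lets you resume and --limit-rate keeps it from saturating your line):

wget -r -c --no-parent -A '*720p*' --limit-rate=500k http://example.com/videos/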
There are guides online you can find, here's one with examples:
https://www.labnol.org/software/wget-command-examples/28750/
And you can always consult the official manual:
>However, I end up with filename.1, then filename.2 if ran again.
Have you tried turning on time-stamping?
>time-stamping in GNU Wget is turned on using ‘--timestamping’ (‘-N’)
> (link)
Without looking into it too deeply, I'd say look at your referrer options in wget. This is the simplest of protections and won't work against more advanced systems that use cookies or IP.
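If it really is just referrer checking, something like this may be enough (both URLs are placeholders):

wget --referer='http://example.com/gallery.html' http://example.com/images/photo.jpg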
I had a look through http://www.gnu.org/software/wget/manual/wget.html and it looks like there is no equivalent for use in the .wgetrc file.
What is it you're trying to achieve? Why does it need to be done via the config file and not just via commandline?
How is it supposed to wget the file? wget isn't a Windows command...
All IF NOT EXIST does is check whether something exists and, if not, run whatever follows. In this case it launches your exe, but IF NOT EXIST shouldn't affect the way that exe operates.
You could add wget for Windows to the system...
Maybe he's going too far in the library, but it's amazing how many systems use streaming instead of paging. It should be possible to continue from many kinds of failure (including from a timeout). wget -c is a good example, when the server supports the Range: header.
It depends upon what options you specify.
From the download options page under -nc
"When running Wget without ‘-N’, ‘-nc’, ‘-r’, or ‘-p’, downloading the same file in the same directory will result in the original copy of file being preserved and the second copy being named ‘file.1’. If that file is downloaded yet again, the third copy will be named ‘file.2’, and so on. (This is also the behavior with ‘-nd’, even if ‘-r’ or ‘-p’ are in effect.) When ‘-nc’ is specified, this behavior is suppressed, and Wget will refuse to download newer copies of ‘file’. Therefore, “no-clobber” is actually a misnomer in this mode—it’s not clobbering that’s prevented (as the numeric suffixes were already preventing clobbering), but rather the multiple version saving that’s prevented.
When running Wget with ‘-r’ or ‘-p’, but without ‘-N’, ‘-nd’, or ‘-nc’, re-downloading a file will result in the new copy simply overwriting the old. Adding ‘-nc’ will prevent this behavior, instead causing the original version to be preserved and any newer copies on the server to be ignored.
When running Wget with ‘-N’, with or without ‘-r’ or ‘-p’, the decision as to whether or not to download a newer copy of a file depends on the local and remote timestamp and size of the file (see Time-Stamping). ‘-nc’ may not be specified at the same time as ‘-N’. "
If it's a public page and you need to log in, you may have to work with wget's options to specify the login information. Unfortunately I can't give you much more than the manpage and Google for help. There may be better/easier-to-understand resources than the manpage I linked.
Better to use -m/--mirror:
> Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to ‘-r -N -l inf --no-remove-listing’.
Use requests instead of urllib for opening pages etc: http://www.python-requests.org/en/latest/
Use beautifulsoup if you need to parse html: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
Or maybe you could use something like wget: http://www.gnu.org/software/wget/manual/wget.html
Wget is a utility to download stuff. The documentation sounds like you shouldn't need it. Just download the listed URL.
wget can work with cookies. On its man page[1], look for the section covering the --post-file switch; it illustrates an example authentication session that uses cookies.
[1] wget's man page (in case you're on a system without it atm).
Unfortunately, they're not mine. They're all ftp links. I found that using wget was the easiest way to download them.
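Roughly along these lines (hypothetical list file holding the FTP URLs, one per line):

wget -c -i ftp-links.txt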
Something like
datestring=$(date +%m-%d-%y-%H:%M)
wget -O "reddit-$datestring.html" reddit.com
will save reddit's front page to an html file. wget has an extensive manual documenting its many options. Cron is a program for scheduling scripts to run at regular intervals, but your Linux server may not supply you with cron jobs; you may need to ask.