I wish the uploaders spent a bit of time explaining how to consume the data. I often see interesting stuff uploaded to Academic Torrents with literally zero description of how to ingest or use the shared data, and no context about where it came from (date ranges, sources, reason for sharing).
As an example: http://academictorrents.com/details/0a853fdcc1d28c306d75e29195a5536087f6e2b4
I know it's street map data in the PBF format, but how would you use it? Do you need ArcGIS? Is there a freeware tool to convert it to other formats? Uploaders, take note: context is very helpful for getting people interested in your uploads.
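(For what it's worth, you don't need ArcGIS; free libraries can read PBF directly. A rough sketch with the pyosmium package, where the filename is just illustrative:)

    import osmium

    # Count the nodes and ways in an OpenStreetMap PBF extract.
    class Counter(osmium.SimpleHandler):
        def __init__(self):
            super().__init__()
            self.nodes = 0
            self.ways = 0

        def node(self, n):
            self.nodes += 1

        def way(self, w):
            self.ways += 1

    handler = Counter()
    handler.apply_file("extract.osm.pbf")  # illustrative filename
    print(handler.nodes, "nodes,", handler.ways, "ways")

There is also the osmium command-line tool, which can convert PBF to other OSM formats.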
That link requires some registration and karma bullshit.
I haven't tested this myself, but here is a quick link I found by googling the torrent name: http://academictorrents.com/details/34ebe49a48aa532deb9c0dd08a08a017aa04d810/tech&dllist=1
Also this torrent is about 5 years old...
Concert recordings of trade-friendly artists? http://bt.etree.org/
But if you mean you can only seed until the end of the month, maybe not much point.
edit: indie films: https://vodo.net/
academic research data: http://academictorrents.com/
> a torrent would likely be up quite soon after
Not sure if you're referring to ImageNet or Open Images there, but just in case people don't know, there is a 1.3 TB torrent of ImageNet. It had no seeders a few months ago, but it's back up to 2.
I agree. Smells to me like some overly-zealous person in their IT services department has gone overboard. This could backfire on them badly.
And what about someone running a BitTorrent client to download a Linux distro? Or how about something like Academic Torrents? Are they distinguishing use at that level?
There are musicians who release their music under Creative Commons licenses. They don't have to put it in a torrent, since other content-sharing sites exist, but with torrents you can download gigabytes' worth of free songs at once.
981 public domain movies right here.
This could be a money maker on its own: 8 terabytes of academic data are downloaded every day (see here). Basically, if a data scientist wants to get their data faster, they might spend BTT to increase bandwidth.
Other than that, there are the everyday freeware apps. Sure, you can go to the website, but not every website is going to let you download at your full bandwidth, so if you're trying to download a 1 GB or 5 GB app and it's going slowly, you can get it faster over BitTorrent.
I was recently thinking of emailing you to ask what size of donation would cover downloading the entire dataset, and to complain that you don't have torrents of it.
However I did find a partial torrent: http://academictorrents.com/browse.php?search=reddit
I really think you should look into releasing yearly torrents. That would be easier on everyone. Most people don't have download managers installed anymore.
>legit stuff which provide torrent download for cost and speed optimization reason always has HTTP download options
Besides the fact that this is not true (see https://xato.net/today-i-am-releasing-ten-million-passwords-b6278bbe7495#.us1jhf697 for a concrete example, or http://academictorrents.com/ for many more), I have a big issue with that kind of thinking. BitTorrent is just a distributed data transfer protocol with engineering properties that make it better suited to some applications than other protocols. The assumption that it's a network for piracy and a secondary channel for HTTP blobs is a bias that is only perpetuated by accepting the status quo, like a college banning the protocol entirely. This restricts human creativity and hurts our ability to communicate with each other freely.
As another data point, BitTorrent is my preferred way of sending a file to someone directly instead of gmail (size limitation), facebook (don't trust them at all) or any cloud based solution where I first have to transfer my file on a middleman machine so another person can transfer it to their computer from there. You just set up a private .torrent file, send the magnet link, and it usually starts transferring right away, especially if we are on the same network. This is taken away from you by a blanket ban, under the pretense that it's fighting crime.
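For anyone curious, creating such a one-off torrent only takes a few lines. A rough sketch with the python libtorrent bindings (the file name and tracker URL are placeholders):

    import os
    import libtorrent as lt

    path = "big_file.bin"  # placeholder: the file you want to send

    fs = lt.file_storage()
    lt.add_files(fs, path)
    t = lt.create_torrent(fs)
    t.set_priv(True)  # private flag: peers come only from the tracker, not DHT/PEX
    t.add_tracker("udp://tracker.example.org:6969/announce")  # placeholder tracker
    lt.set_piece_hashes(t, os.path.dirname(os.path.abspath(path)))

    with open("big_file.torrent", "wb") as f:
        f.write(lt.bencode(t.generate()))

    # Hand this magnet link to the other person, then seed the file.
    print(lt.make_magnet_uri(lt.torrent_info("big_file.torrent")))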
Think twice before you use this. Torrents are good for distributing but not archiving. Lots of old datasets have no seeds, which means you can't download them. I've tried :(
A better idea might be a requester-pays S3 bucket. You pay a few dollars a year to host, and people pay a few dollars to download.
Or donate and upload to archive.org.
Don't just torrent and forget.
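For reference, downloading from a requester-pays bucket looks roughly like this with boto3 (the bucket and key names are made up):

    import boto3

    s3 = boto3.client("s3")

    # With "Requester Pays" enabled on the bucket, the downloader is billed
    # for the transfer instead of the person hosting the dataset.
    resp = s3.get_object(
        Bucket="example-dataset-bucket",   # placeholder bucket
        Key="dumps/2016-05.json.bz2",      # placeholder key
        RequestPayer="requester",
    )
    with open("2016-05.json.bz2", "wb") as f:
        f.write(resp["Body"].read())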
A simple place to put the data to start with could be something like http://academictorrents.com
At least you could get the data out there and then see what people would do with it while you work to build out a usable interface.
Here is some high res data.
Massachusetts USGS 30cm Color Ortho Imagery (2013) - JPEG2000 Format http://academictorrents.com/details/82c64b111b07ff855b8966701a13a25512687521
And with labels:
Mnih Massachusetts Building Dataset http://academictorrents.com/details/630d2c7e265af1d957cbee270f4328c54ccef333
I'm a researcher, and I'm also currently working with reddit data. Personally, I ended up downloading the entire reddit dataset from pushshift.io. You can also save the site some bandwidth by using this torrent for the comments.
If you just want to work with the BigQuery data, you could run something like
    SELECT * FROM [fh-bigquery:reddit_posts.2016_05] WHERE subreddit LIKE 'politics' LIMIT 1000
to grab all posts (limited to 1,000 posts from May 2016 in this instance) from a particular subreddit.
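If you'd rather run that from a script than the BigQuery console, here is a rough sketch with the google-cloud-bigquery client (the bracketed table name is legacy SQL, so the legacy flag has to be set; credentials come from your own GCP project):

    from google.cloud import bigquery

    client = bigquery.Client()  # uses GOOGLE_APPLICATION_CREDENTIALS

    # The [project:dataset.table] syntax is legacy SQL, so opt into it.
    config = bigquery.QueryJobConfig(use_legacy_sql=True)
    query = """
    SELECT * FROM [fh-bigquery:reddit_posts.2016_05]
    WHERE subreddit LIKE 'politics' LIMIT 1000
    """
    for row in client.query(query, job_config=config).result():
        print(row)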
I've started a collection of tools for working with the PushShift reddit dump data; I've been working on it for the last week or so. It's very far from ready for scrutiny, but you might find something of value in what is there so far.
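For anyone who just wants to poke at the raw dumps without BigQuery: the monthly files are newline-delimited JSON, one object per line (the older comment dumps are bz2-compressed; the filename below is illustrative). A minimal sketch:

    import bz2
    import json

    counts = {}
    with bz2.open("RC_2016-05.bz2", "rt", encoding="utf-8") as f:
        for line in f:
            comment = json.loads(line)
            sub = comment.get("subreddit", "")
            counts[sub] = counts.get(sub, 0) + 1

    # Ten most active subreddits in this month's dump.
    print(sorted(counts.items(), key=lambda kv: -kv[1])[:10])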
Ask, and ye shall receive.
Here's the dataset: ChestX-ray14. It's about 40+ GB unzipped, so be aware.
Make sure you have a radiologist friend to help you, and be particularly careful about diagnoses of pneumonia. See chexnet-a-brief-evaluation, which I wrote and which links to many other concerned and informed deep learning practitioners.
I downloaded the dataset from here
Not a lot of seeds, and I'm not seeding the complete dataset, but I'm seeding the 2010 subset.
Do you mean the NYC Taxi Data?
http://academictorrents.com/details/6c594866904494b06aae51ad97ec7f985059b135 http://academictorrents.com/details/107a7d997f331ef4820cf5f7f654516e1704dccf
Thanks for the bug report! GUIs are hard! We will make sure zooming does not get rid of the menu options. On mobile, we decided to just present the same desktop view for most pages.
We have an issue tracker here: https://github.com/AcademicTorrents/academictorrents.com-feedback/issues
I'm still trying to track them all down (have been for about a week), but I've added some of the missing VOC dataset files to academictorrents.com: http://academictorrents.com/browse.php?search=voc
Note: x-post from /r/datasets
These are just two, the others aren't up yet
http://academictorrents.com/details/cf7efcf33370e24985ce883532c069cc43176d1b
http://academictorrents.com/details/8904671dffa9d296edcd095caca519c678c240f1
But to be honest, the most success I have had with finding PDFs has been either through scribd (lol) or a p2p sharing service like soulseek. You can find collections on soulseek that are simply amazing.
/u/-Archivist, your archive shows 41.06 GiB of data and 33049 files
However, there are 43.92 GB of archive files and 33053 files
http://academictorrents.com/details/30e27c1d63e8ee36d42457e700e4c1a268718885/tech&filelist=1
Academic Torrents mostly tracks papers and datasets, but they also have some courses available online. Probably not as robust as you're looking for, but it is a start:
That's the torrent I came across, it only has data up until March 2017.
http://academictorrents.com/details/85a5bd50e4c365f8df70240ffd4ecc7dec59912b
> Data Update: Due to the large amount of demand for the data-set, I will be publishing AWS links here tomorrow (Friday) around 2PM Central Time in 50GB chunks, this should enable everyone that wants the data access without compromising the utility of my machine. I plan on hosting the data for 3 months.
You might want to consider Academic Torrents or check out /r/pushshift (cc /u/Stuck_In_the_Matrix might be interested).
Here you go with the first: http://academictorrents.com/details/70ecab072b2792c9239ab8197d3f52cc1d075be1/tech
The admin kindly upgraded my account to uploader, but I had forgotten that the upload form requires the torrent file to already contain their own tracker. Luckily, torrent-file-editor lets you change such details without changing the info_hash; otherwise it would have fragmented the swarm.
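For anyone curious why that works: the info_hash is the SHA-1 of the bencoded info dict only, and the announce URL lives outside it, so swapping trackers leaves the hash (and the swarm) intact. A rough sketch with the third-party bencodepy package (filenames and tracker URL are placeholders):

    import hashlib
    import bencodepy  # assumption: the bencodepy package from PyPI

    with open("dataset.torrent", "rb") as f:
        meta = bencodepy.decode(f.read())

    # The info_hash depends only on the bencoded "info" dict.
    info_hash = hashlib.sha1(bencodepy.encode(meta[b"info"])).hexdigest()
    print("info_hash:", info_hash)

    # Swap the announce URL; the info dict (and info_hash) stays the same.
    meta[b"announce"] = b"http://tracker.example.org/announce"  # placeholder
    with open("dataset-edited.torrent", "wb") as f:
        f.write(bencodepy.encode(meta))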
The site has "collections" which provide RSS feeds that work with ruTorrent and uTorrent. We recommend these instead of blindly mirroring all the data. For example here is a list that I recommend to mirror: http://academictorrents.com/collection/joes-recommended-mirror-list
What would be a preferable format? A text file with just the magnet links?
PS: Hmm, I just found this statement: "We would like to avoid the blind mirroring of all data."
> For years all names of all Facebook users were listed at https://www.facebook.com/directory.
> Even without the links to the profiles themselves, it was a very useful dataset of diverse names and versions of the same name around the world.
> In 2010 a security researcher scraped it and made it available as a torrent.
> > 171 million names (100 million unique) ... limitation is that these are only users whose first characters are from the latin charset.
> It grew significantly in size and diversity between 2010 and 2018, when it was deleted (This page isn't available) without warning, presumably in connection with political issues around user data.
> Is there a relatively recent version available? It could be incomplete if randomly sampled.
I don't have it, but yesterday I was looking for something else and came across a torrent which had it. Here's the link. (It's not a Mega link, but it is a legit torrent site.)
Can't comment on the quality, I'm not sure if it's HQ.
What kind of data are you looking for?
Edit: you might be interested in this - it's pretty old, but it's a huge offline dataset of Dota match data from YASP (now OpenDota).
Yes. Here's a torrent of a data dump of ~3.5M parsed matches as of Dec 2015. Guess what? It was compiled by yasp! And since it's just straight, raw data with no analysis, it's excellent quality.
http://academictorrents.com/details/5c5deeb6cfe1c944044367d2e7465fd8bd2f4acf
For more recent data, you'll want to use Valve's match data API, or find another data dump from a similar aggregator site.
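A minimal sketch of pulling recent matches from that API with requests (you supply your own Steam Web API key; the endpoint and field names are from the public Dota 2 web API, so double-check them against the docs):

    import requests

    API_KEY = "YOUR_STEAM_API_KEY"  # placeholder
    resp = requests.get(
        "https://api.steampowered.com/IDOTA2Match_570/GetMatchHistory/v1/",
        params={"key": API_KEY, "matches_requested": 25},
    )
    resp.raise_for_status()

    # Print the id and start time of each returned match.
    for match in resp.json()["result"]["matches"]:
        print(match["match_id"], match["start_time"])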
You might want to check this thread. It used to have 175 GB of Dota 2 matches; I hope it's still alive somewhere.
edit: found yasp's dump, YASP 3.5 Million Data Dump, hf!
Because almost every research study costs money to access, even the ones you agree with. That's why they don't print them in the newspaper and write articles about them.
There is a torrent site for research papers. I just don't remember the name of it.
*edit here http://academictorrents.com/
1) The site http://academictorrents.com shows all the public torrents that can be downloaded. The tool will also download torrents that are marked private (a usage sketch follows below). You can browse the collections at this URL: http://academictorrents.com/collections.php
2) The wikipedia torrents are updated every few months. The collection I curate is here: http://academictorrents.com/collection/wikipedia
The 20130805 version of wikipedia is the most popular. I think this is because it has been mirrored in so many locations that are accessible to those who cannot reach wikipedia directly.
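A rough usage sketch of the academictorrents Python client, assuming that's the tool in question (the API details here are an assumption, so check the project's README; the hash is a placeholder for the info hash shown on a torrent's page):

    # Assumption: the "academictorrents" package from PyPI.
    import academictorrents as at

    # Placeholder info hash; copy the real one from the torrent's detail page.
    path = at.get("<torrent-info-hash>")
    print("downloaded to", path)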
>With the constant decrease in funding across the board, do you think a national repo would be a realistic idea?
One thing I have never been able to predict is how terribly the next research funding effort will fail. But I would argue that a prototype already exists. Research data repositories exist at SLAC, Fermilab, Brookhaven, basically all of the national labs. That data is often further duplicated at universities all over the place. We might charitably describe this as the beginnings of a CDN. These data stores could be mirrored and served up publicly. Those would make good stable boring civil service jobs for some people, or nice little IT contracts for some big-gov IT consultancy, so, maybe?
Or it may just happen on its own if efforts like this one gain any momentum http://academictorrents.com/