Alright. Here's the source code:
const fs = require("fs");
const path = require("path");

const runDir = process.cwd();
console.log(`Looking for fish files in ${runDir}...`);

const files = fs.readdirSync(runDir);
files.forEach(file => {
  // Match files named like [prefix]fish_[number].png
  const matches = file.match(/^(.+)fish_(.+)\.png$/);
  if (matches) {
    const prefix = matches[1];
    const number = matches[2];
    // Create a folder named after the prefix if it doesn't exist yet
    const folderPath = path.resolve(runDir, prefix);
    if (!fs.existsSync(folderPath)) {
      fs.mkdirSync(folderPath);
    }
    // Move the file into that folder as fish_[number].png
    const filepath = path.resolve(runDir, file);
    const newFilepath = path.resolve(runDir, prefix, `fish_${number}.png`);
    fs.renameSync(filepath, newFilepath);
  }
});
console.log("Files moved");
You can run it by installing NodeJS, copying it into a file called index.js (inside the directory you want to organize), and then running node index.js in that directory from the command prompt. I'll PM you the exe link.
Note on running: The only thing I'm uncertain of here is which folder this will run in. If no files match the pattern [x]fish_[y].png, it won't do anything, so I don't think there's too much to worry about; it's just a matter of getting it to work. I'd try running the exe in the folder you want sorted. If that doesn't work, go into the command prompt and run renamer-win.exe inside that directory.
https://repl.it/repls/FarawayGenerousCores
Used the link from the OP. Click the run button at the top, should show up on the right.
If you don't want it indented, remove the ', indent=4' part from the print line on line 14.
how are your programming skills?
a quick look shows that each email can be accessed by going to https://wikileaks.org/dnc-emails/emailid/X, where X is an integer between 1 and 19252 (or something close to that). If you know how to webscrape, that probably wouldn't be that hard to get to.
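If it helps, a rough sketch of that scrape in Python with requests might look like this (the URL pattern and the 19252 upper bound are just taken from the comment above, so treat both as assumptions):

import time
import requests

BASE_URL = "https://wikileaks.org/dnc-emails/emailid/{}"  # pattern described above

for email_id in range(1, 19253):  # upper bound is approximate, per the comment above
    resp = requests.get(BASE_URL.format(email_id))
    if resp.status_code == 200:
        # Save the raw page; parsing out the email body can happen later
        with open(f"email_{email_id}.html", "w", encoding="utf-8") as f:
            f.write(resp.text)
    time.sleep(1)  # be polite to the server

With a 1-second delay the full run takes around 5-6 hours, so start it somewhere it can sit undisturbed.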
Hi, I believe all of these are s3 buckets so this will help
https://stackoverflow.com/questions/8659382/downloading-an-entire-s3-bucket#18762381
So for example this dataset: https://aws.amazon.com/public-datasets/landsat/ has the s3 bucket s3://landsat-pds/, so you would just download that bucket.
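The linked answer covers the CLI route; if you'd rather script it, a minimal Python/boto3 sketch for a public bucket like landsat-pds could look like this (the prefix and local folder are placeholders):

import os
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client, since this is a public bucket
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

bucket = "landsat-pds"
prefix = "c1/L8/"  # placeholder; narrow this down or you'll download an enormous amount
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        local_path = os.path.join("landsat", key)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(bucket, key, local_path)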
There's a discussion underway at YCombinator: https://news.ycombinator.com/item?id=7026960
Edit: a poster at YC said that the dropbox downloads failed (presumably something related to the sudden heavy demand), so we'll need to watch that page for updates on ongoing availability.
I especially found Common Crawl interesting - web crawl data from over 5 billion pages (541 TB) or all of Wikipedia (66 GB)
You normally have to license this dataset from the Post Office for an extortionate fee. Wikileaks leaked the database in 2009, though I obviously wouldn't recommend using it for non-personal projects.
You could also create this dataset yourself from any set of map data or GIS system.
OpenStreetMap would be a possible data source. They actually encode roads as "ways" which are a series of "nodes" (GPS coordinates) which is pretty close to what you want. Example: https://www.openstreetmap.org/way/226041025
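If you want to pull a way and its nodes programmatically, the public OSM API makes that a single request; a quick sketch (the way ID is just the example linked above):

import requests
import xml.etree.ElementTree as ET

way_id = 226041025  # the example way linked above
api = "https://www.openstreetmap.org/api/0.6"

# /full returns the way plus all of the nodes it references
resp = requests.get(f"{api}/way/{way_id}/full")
resp.raise_for_status()
root = ET.fromstring(resp.content)

# node id -> (lat, lon), then walk the way's node references in order
nodes = {n.attrib["id"]: (float(n.attrib["lat"]), float(n.attrib["lon"]))
         for n in root.findall("node")}
coords = [nodes[nd.attrib["ref"]] for nd in root.find("way").findall("nd")]
print(coords)

For anything bigger than a handful of ways you'd want a planet extract or the Overpass API rather than hammering the editing API.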
While not an answer to your question, I suspect you'll find some level of relevant discussion over on the scrapy-users list. If nothing else, it might be worth your while to ask your question there, since that audience will be more scraping oriented than /r/datasets. Plus Scrapy is an amazing framework and I encourage every non-trivial scraper to learn it.
Alternatively, you could just ask the question (here or over there) that you would expect a hypothetical "/r/scraping" would discuss.
Selecting a handful of users is a good idea, but you will only be able to get their last 3,200 tweets. Thus, if the users post tweets frequently, going back that far might be difficult.
Another alternative is to look at the data from the Internet Archive's Twitter Stream Grab. This seems to have the months you are looking for, and could be subsampled for tweets containing (e.g.) #ACTA.
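The Stream Grab files are, as far as I remember, bz2-compressed JSON with one tweet object per line, so subsampling for a hashtag is basically a filter loop. Rough sketch (the filename is a placeholder for whichever archive file you grab):

import bz2
import json

hashtag = "acta"
matches = []

with bz2.open("twitter-stream-sample.json.bz2", "rt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        if "entities" not in tweet:  # skip delete notices and other non-tweet records
            continue
        tags = [h["text"].lower() for h in tweet["entities"].get("hashtags", [])]
        if hashtag in tags:
            matches.append(tweet)

print(f"Found {len(matches)} tweets tagged #{hashtag}")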
Just to clarify, I did not collect this data. The credits go to Halle Tecco, who explains it in this Quora answer.
I realized only now that it's not downloadable. I will try requesting her to make it exportable, but I am not sure if she'd be willing to.
h/t to the Data is Plural newsletter (run by a friend of mine):
http://tinyletter.com/data-is-plural/letters/data-is-plural-2016-08-10-edition
(the actual newsletter text contains links to the relevant stories; I'm too lazy to reformat it for the blockquote below)
> Pretrial inmates. Connecticut has begun publishing a daily census of every inmate held in jail while awaiting trial. Starting July 1, the database contains one row per inmate per day; each row includes basic demographic data (age, gender, race), as well as the inmate’s bond amount, main offense, and jail location. Read more at: The New Haven Independent and TrendCT. Question: This release seems unprecedented; does any other state or country publish such detailed data on pretrial inmates? [h/t Camille Seaberry]
I found a list at mmorpg.com. It claims to include all MMOs; unlikely, but it's still a comprehensive list.
Includes name, release date and publisher fields. Couldn't find info on platform I'm afraid.
Do you have mad tech chops?
Right off the bat, I'm thinking of a web-crawler. Just write a spider in Python (Scrapy framework): provide names of actors in an index array, iterate through the array and concat to the wiki URL, pass as a param/directive for your crawler, target the info you're looking for with a CSS selector, and dump into a DB or CSV (rough sketch below).
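Something along these lines, where the actor names, selectors, and output fields are placeholders you'd adapt:

import scrapy

class ActorSpider(scrapy.Spider):
    name = "actors"

    # Index array of actors, concatenated onto the wiki URL
    actors = ["Tom_Hanks", "Meryl_Streep"]  # placeholder names
    start_urls = [f"https://en.wikipedia.org/wiki/{a}" for a in actors]

    def parse(self, response):
        # Target whatever you're after with CSS selectors; these are illustrative
        yield {
            "name": response.css("h1#firstHeading ::text").get(),
            "infobox": response.css("table.infobox tr th ::text").getall(),
            "url": response.url,
        }

Run it with scrapy runspider actor_spider.py -o actors.csv and you get the CSV dump for free.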
Edit:
Alternative, slow solution: create an index of actors, loop through it in a bash script, and use wget to download the wiki pages. Parse the raw text/html for a CSS selector or a keyword, and dump to an empty text file.
Thanks a lot :)
I ended up settling on http://www.omdbapi.com/ if you donate to them you can download a dataset, I sent $50 and then got a full movie dump (1.1 million movies) and also a separate file which looks like it contains TV-Show episodes
After a bit of googling around I also found https://musicbrainz.org/ which looks like it might do for the music ...
Next up is books and Games and Software .. wish me luck!
Whoops, discovered a mistake in my scraper that caused some records to be missed. Here's the new file, with all 2 million records: https://mega.nz/#!CQIyWYCB!u1_h2ej4meRtuMlFjE_O0QaCJW8M1Go0uXhQwTl259Y
I coincidentally have been looking into how to collect email data as well. I wanted to analyze my own personal emails, so not sure if you have a personal collection of spam emails, but it looks like this tool might help you generate the data with email address and sender name. Not 100% sure, but might be worth a look!
You're going to want to use something like Ghost.py or Selenium so that you can execute the JavaScript as a browser would. You will also want to make sure that any bot you code up uses a popular browser's User Agent and accesses the page as though it were human, i.e. at a human-like pace with some random variability between each of your requests.
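A minimal Selenium sketch of the "look human" part (the user agent string and URLs are placeholders):

import random
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Spoof a popular browser's user agent (placeholder string; copy a current one)
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                     "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36")
driver = webdriver.Chrome(options=options)

urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholders
for url in urls:
    driver.get(url)
    html = driver.page_source  # the JavaScript has executed by this point
    # ... parse/save html here ...
    time.sleep(random.uniform(3, 8))  # human-like pace with random variability

driver.quit()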
And remember, "don't be evil." The last thing you want to do is wake up one day to realize you're as morally vacuous as Zuckerberg himself.
IFTTT is the key to mobile personal datasets imo. Recently I downloaded some Fitbit 'recipes' to automatically throw my walking/sleeping data into a Google Drive spreadsheet. I just started it last week but I plan to check out what days I don't get enough sleep, trends in the exercise I get over time, etc. Additionally, they have 'recipes' to track phone call durations, for example. Being that I use Google Hangouts, it isn't so great for me, but kinda cool. In general on IFTTT, the Google Drive channel is linked to a lot of recipes that stick data into a spreadsheet that can be queried. https://ifttt.com/recipes/popular?channel=google_drive
Another personal dataset: when I was heavier into the stock market, I would upload my results to tradervue.com. Plenty of juicy details in there about how you react as a trader to market trends, what you have the most/least success with, and unconscious patterns to fix/repeat as a trader. I haven't been as heavy in the market, but one day I will also upload a CSV of trades to SQL on my own to play around with it. I have a regular trading account as well as a self-directed Roth that I plan on analyzing.
You would be looking for an app on your phone that does time tracking. Make sure it exports to a CSV and you can then use it with whatever database you want.
This is one of many links: https://play.google.com/store/apps/details?id=com.aloggers.atimeloggerapp
Via openculture: http://www.openculture.com/2018/05/the-16000-artworks-the-nazis-censored-and-labeled-degenerate-art.html The public database is not complete yet: http://emuseum.campus.fu-berlin.de/eMuseumPlus?service=ExternalInterface&lang=en and I am not sure if they will release a dataset. Hopefully they do.
If anyone wants to help turn this into a dataset ourselves, message me.
This is far from optimized, but I built a scraper a while back that can be used to store data from both users and comment threads. It's not extremely stable for large, newer threads, but it works relatively well.
I was able to store 40 GB of data using it (again, not efficient), but you may want to occasionally dump data into a sql database instead, since the shelve module absolutely sucks for larger objects and objects that get overwritten frequently. The reason you may want to use this (or fork this) is that handling comment trees can be messy when Reddit decides to delete things and you get unexpected results using the API.
As for the nature of the analysis, I'd have to get behind that pesky paywall, but once you've collected all the data, you'd probably want to tokenize the words. (I did this manually in the code, but if I had to do it again, I would use nltk (the natural language toolkit) in Python.)
However, the limiting factor is likely going to be data collection. It takes months to get an appreciable amount of data in tree format if you are looking far down the thread due to the 2 second delay per request (and even more if you are looking at a user's comment history, which is 2 seconds per 100 comments/submissions). I think you can ask Reddit for permission to decrease this limit, but you'd need to modify your version of PRAW to reflect this.
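For reference, the basic PRAW pattern for pulling a full comment tree looks roughly like this in the newer (4.x+) API; credentials and the submission id are placeholders:

import praw

reddit = praw.Reddit(client_id="CLIENT_ID",
                     client_secret="CLIENT_SECRET",
                     user_agent="research script by /u/yourname")

submission = reddit.submission(id="abc123")    # placeholder submission id
submission.comments.replace_more(limit=None)   # expand every "load more comments" stub

rows = []
for comment in submission.comments.list():     # flattened tree
    rows.append({
        "id": comment.id,
        "parent": comment.parent_id,           # lets you rebuild the tree later
        "author": str(comment.author),         # None for deleted accounts
        "body": comment.body,                  # "[deleted]"/"[removed]" shows up here
        "score": comment.score,
    })

The rate limiting is handled inside PRAW, so the months-of-collection caveat above still applies.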
I don't know what your background is, but to get an idea of what can be done with this data, here is a writeup I did for a course project about 18 months ago: https://www.scribd.com/doc/250044180/Analyzing-Political-Discourse-on-Reddit
I didn't look at evolution as much as classifying users and doing some simple statistical tests/visualizations of the results. If you have any specific things you would want to analyze, I might be able to give you some ideas.
I've found this link
https://www.quora.com/Where-can-I-find-the-Netflix-Prize-dataset-as-of-June-2012 I think that you can pull it from there, if not I may have a copy.
MD5 SIGNATURES AND FILE SIZES
d2b86d3d9ba8b491d62a85c9cf6aea39 577547 movie_titles.txt
ed843ae92adbc70db64edbf825024514 10782692 probe.txt
88be8340ad7b3c31dfd7b6f87e7b9022 52452386 qualifying.txt
0e13d39f97b93e2534104afc3408c68c 567 rmse.pl
0098ee8997ffda361a59bc0dd1bdad8b 2081556480 training_set.tar
Not sure of any good tutorials; however, PostGIS is very 'easy' if you're comfortable with SQL and pretty hard if you're not. As a starting point, assuming you're on a Windows box, I'd do the following:
-Install virtualbox or vmware
-Install an ubuntu/other linux VM
-Install postgresql, postgis, pgsql2shp, ogr2ogr on the VM (use this guide for installation and basic tutorial for postgres, the rest should be simple 'sudo apt-get install ogr2ogr' etc)
-Install pgadmin or navicat in windows, connect to your postgres server on VM
-Get some sample data in there: I'd suggest TIGER counties, which you can use shp2pgsql or ogr2ogr to import (from linux command line). That, along with point data like populated places and polygon data like national park boundaries (both available from USGS).
At that point, you'll be able to try out basic joins using postGIS functions to get a feel for how it works.
Part of my job is importing and cleaning GIS data, and I can tell you that the import/setup/scripting steps for this kind of stuff are much more time-consuming than the actual work. It took me a few weeks to get the SF1 census data in a database (with blockpoint centroids and census tracts, rollup tables, etc) but if you want to know how many 18-25 year old females there are within 3 miles of every mcdonald's in the US, I could tell you in about 5 minutes.
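To give a flavor of that last example, the query side is just a spatial join once the data is loaded; a rough psycopg2 sketch with hypothetical table and column names:

import psycopg2

conn = psycopg2.connect("dbname=gis user=postgres password=secret host=localhost")
cur = conn.cursor()

# Hypothetical tables: mcdonalds(name, geom) and block_centroids(pop_f_18_25, geom),
# both stored in a projected CRS whose units are meters.
cur.execute("""
    SELECT m.name, SUM(b.pop_f_18_25) AS pop_within_3mi
    FROM mcdonalds m
    JOIN block_centroids b
      ON ST_DWithin(m.geom, b.geom, 3 * 1609.34)   -- 3 miles in meters
    GROUP BY m.name;
""")
for name, pop in cur.fetchall():
    print(name, pop)

cur.close()
conn.close()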
IFTTT has a trigger for photos in an area:
which would correspond to the following api:
https://instagram.com/developer/realtime/
But from quickly looking through the API pages, it doesn't seem like you can get all of the data coming in; what you can get is:
The Location/Geography Subscriptions only lets you look in a 5000 meter radius.
So for what you are doing I don't think you're going to be able to source this data yourself, maybe there is public data available, but I couldn't see any from a quick google search.
Good Luck.
If This Then That just released a couple of new apps that could do this. Check out Do Button: https://ifttt.com/products/do/button. You tell the app what to do when you hit the button, and it can log the time/whatever in a Google doc. Just set up the recipe however you want.
Most of this stuff is protected behind APIs for money, but I've heard that NOAA has a pretty good API. http://www.programmableweb.com/api/noaa-national-weather-service-nws
What do you mean "affected" by storms? Just a list of zip codes involved in various storm events, like Hurricane Sandy or something?
That might require some manual work.
You may want to look into Dark Sky's API (https://darksky.net/dev/docs#api-request-types).
You can make requests into past weather data using it for any location (they use latitude and longitude, so you'd simply need a way to map zip codes to lat/lon) and for any time by using a timestamp (not sure how far back this goes, but since you need recent data I think you'll be fine).
The API is free up to 1000 calls per day, which probably won't be enough to scrape the amount of data you need, but since you're willing to pay, this may not be an issue for you.
Then again, you say you looked into a bunch of APIs, so maybe you've seen this and it doesn't work for you...
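In case it helps, a Time Machine request there is just a GET with lat, lon, and a Unix timestamp in the path; sketch below, with the API key, coordinates, and date as placeholders (the zip-to-lat/lon mapping is the part you'd supply):

import datetime
import requests

api_key = "YOUR_DARKSKY_KEY"              # placeholder
lat, lon = 40.7128, -74.0060              # placeholder; comes from your zip -> lat/lon mapping
when = int(datetime.datetime(2017, 9, 1).timestamp())  # any past date as a Unix timestamp

url = f"https://api.darksky.net/forecast/{api_key}/{lat},{lon},{when}"
resp = requests.get(url, params={"exclude": "currently,minutely,flags"})
resp.raise_for_status()

day = resp.json()["daily"]["data"][0]     # the daily summary for that date
print(day.get("temperatureHigh"), day.get("precipType"))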
You could also look at scraping weather underground. https://www.wunderground.com/global/ZA.html
For the regions you want you may need to combine multiple scrapes but it's fairly easy with import.io
How long a timescale are you looking for?
It's neat, but I can't really see myself paying for a service like this, considering how dead simple Python is and how libraries like Scrapy basically do this for you already.
At that point the only other thing you're getting here is data/server space, which if you're going to pay for just get a server, which you can use for things like this in addition to anything else you ever wanted to run.
The only use I can see for this service is companies/etc who want to outsource all their data scraping/etc, but again any competent Python dev already working for said company can just read up on Scrapy and get this done with relative ease.
(And if your company doesn't have someone who knows Python or can at least learn enough to use Scrapy, you've gotta reevaluate your hiring practices, everyone and their mom can use or learn Python these days.)
http://www.statmt.org/moses/?n=Moses.Releases
Moses releases some pretrained models. It looks like it's trained on Europarl, which is a data set of all the European Parliament proceedings translated into multiple languages. It's a pretty standard data set in academic MT.
Unfortunately it is somewhat small and might not contain coverage of the words you need. Generally, though, you can expect Moses to perform near state-of-the-art among non-proprietary systems. (Google, Bing, etc. greatly benefit from having many orders of magnitude more training data in many more languages.) But it will give you a full MT system out of the box.
If you have a lot of money to burn, Mathematica or Matlab. If you don't have money to burn, SciPy. Though with any graphing tool that is based on a computer language (e.g. Octave with GNUPlot) you could hack an animation together with a for loop outputting images with an incrementing variable. Then you find software that will put those images together into a movie (which any of the open source movie editing software should be able to do).
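The for-loop trick is only a few lines in Python/matplotlib, for example:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 200)
for i in range(100):
    plt.figure()
    plt.plot(x, np.sin(x + i * 0.1))
    plt.ylim(-1.1, 1.1)
    plt.savefig(f"frame_{i:04d}.png")  # incrementing, zero-padded so the files sort correctly
    plt.close()
# Then stitch frame_0000.png ... frame_0099.png together with your movie tool of choice.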
It would probably be better to ask this in a postgres-specifc sub, but...
>PgAdmin keeps asking postgres user password when I'm trying to connect to the virtual server, which doesn't make sense to me.
Why doesn't it make sense to you? You need to provide login credentials. The server setup directions include a section on accessing the postgres server; they assume that you are using a shell/console on the virtual machine.
They also mention that you'll have to change the server configuration if you want remote access (ie access from outside the virtual machine, like pgAdmin on your Mac). They provide a link to a page that explains it in more detail.
>Is there a way to import the dumps files directly in pgAdmin without mounting a virtual server ?
pgAdmin is just a client. It can't do much without a Postgres server to connect to. There are various options for running Postgres directly under MacOS/OSX. Once you have it running, you should be able to use pgAdmin to load dump files into it.
Postgres.app is one of the easier ways to get Postgres running on a Mac
Ask, and ye shall receive.
Here's the dataset: ChestX-ray14. It's about 40+ GB unzipped, so be aware.
Make sure you have a radiologist friend to help you, and be particularly careful about diagnoses of pneumonia. See chexnet-a-brief-evaluation, written by me, which links to many other concerned and informed deep learning practitioners.
I downloaded the dataset from here
Not a lot of seeds, and I'm not seeding the complete dataset, but I'm seeding the 2010 subset.
post_id=url[-url[::-1].find('/'):]
Using a regular expression to find the id might be more stable. Being a regex noob myself I always go back to regexr to build the expression.
if not os.path.isdir(data_folder): os.mkdir(data_folder)  # is the same as os.makedirs(data_folder, exist_ok=True), but well..
Instead of using a "dones.txt" I would pickle dump a set (rather than a list). The lookup time in a set is insanely faster and with pickle you don't need to parse anything.
The "extract_post" function is too long. Either split it up or give it some headline comments on what will happen in the next 10ish lines.
Avoid magic variables
[('User-agent', 'Mozilla/5.0')] # line 26 and 180
Also all kinds of numbers and filenames. Declare them at the top, or in a config.py.
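To put the first two suggestions in one place, roughly (the URL shape is assumed from the snippet above):

import pickle
import re

# Regex alternative to the string-reversal trick: grab whatever follows the last slash.
# Assumes the post id is the final path segment of the URL.
post_id_re = re.compile(r"/([^/]+)/?$")

def extract_post_id(url):
    match = post_id_re.search(url)
    return match.group(1) if match else None

# Pickled set instead of dones.txt: O(1) membership checks, nothing to parse.
try:
    with open("dones.pickle", "rb") as f:
        dones = pickle.load(f)
except FileNotFoundError:
    dones = set()

dones.add(extract_post_id("https://example.com/posts/12345"))  # placeholder URL

with open("dones.pickle", "wb") as f:
    pickle.dump(dones, f)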
I had a little time and wrote a scraper in python to get the fan data. If you can manage to install python (https://www.anaconda.com/download/), you'll just need to download this .py script, open it to change the output directory, and run it (at the command line, type python 'the path to the script.py').
With ~13k pages it will take about 8 hours to run!
https://gist.github.com/seanchrismurphy/b5c2904bcea1efd2c228d2657ca38326
Hey data_junkie, thanks for the dataset, I just used this data to build this dashboard https://my.infocaptor.com/dash/mt.php?pa=inflation_50da569f84101
and here is the tutorial
http://www.infocaptor.com/dashboard/consumer-price-index-charts-and-dashboard
What is going on on Wikidata? I can't figure it out. What is an item?
It looks like an item has languages and statements, but what's the purpose?
How would one use this?
edit: I found and read this, but hmm, still not sure, don't see myself using this. Any other thoughts?
Thanks for this! I just came up with an Android app which runs on this dataset and, along with that, has a feedback system (users can verify new pictures). This can specifically help in expanding the dataset.
How can I contribute to the dataset?
Check out the app : https://play.google.com/store/apps/details?id=com.achandratre.doglens
Sure thing. Glad I could help :)
I have found that Khan Academy is a great place to start, as it has a lot of great material with regard to probability theory (statistics and maths too). The teacher explains very well and the videos are of decent quality. In addition, there are exercises available to practice the skills learned. Here is the link to the probability theory section.
Looks cool u/Karlpy though you should probably build in a 10 sec sleep between scrapes per the site's robots.txt file ... https://www.wunderground.com/robots.txt
If you’re interested in local data check out your government’s municipal or other institutions responsible for cartographic information.
Other than that check out these data sources:
OpenStreetMap Export: shp format
Good luck! 💪🏼
Is anyone able to shed some light on this for me?
https://www.spotify.com/us/legal/California-privacy-disclosure/
I am not a California resident so I assume I cannot request this additional information from them. Has anyone here submitted any right-to-know requests with them? Do they simply provide you with a better dictionary for the data to explain each column? Is this available anywhere to download (not specific to one person's data, just the structure in general)?
Thanks. I would love to get access to this additional detail for my own results, but if it's just a cut-and-paste response then I would love to be pointed to it online.
You also may want to look at something like Amazon's elevation tiles: https://aws.amazon.com/public-datasets/terrain/
It's a composite collection of the highest resolution open data for elevation (also bathymetry for underwater height data), available in tiled formats that correspond to WGS84 Z/X/Y map tiles (so you can get it at the resolution that best corresponds to your application).
It's the same data that's in Mapzen's Elevation API https://mapzen.com/documentation/elevation/elevation-service, but available in png, geotiff, and hgt.
Thanks for the link, I'm looking into it now.
Edit: Ahh, was finally able to find an answer as to whether Common Crawl has archived websites dating as far back as 2000. They do not :(
Here's an excerpt:
>According to Stephen Merity, there are six crawls:
>[ARC] Archived Crawl #1 - s3://aws-publicdatasets/common-crawl/crawl-001/ - crawl data from 2008/2010
>[ARC] Archived Crawl #2 - s3://aws-publicdatasets/common-crawl/crawl-002/ - crawl data from 2009/2010
> [ARC] Archived Crawl #3 - s3://aws-publicdatasets/common-crawl/parse-output/ - crawl data from 2012
> [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/
> [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-48/
> [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-10/
Don't know if you have seen this already: (Indeed's Cookies, Privacy and Terms of Service)
>15. Miscellaneous
>You understand and acknowledge that Indeed or its affiliates, or its or their licensors, owns all right title and interest to the Site and all proprietary rights associated therewith. Indeed reserves all rights not specifically granted herein. You shall not modify any copyright notices, proprietary legends, any trademark and service mark attributions, any patent markings, or other indicia of ownership on the materials accessed through the Site, other than your User Content. Any use of materials or descriptions, any derivative use of the Site or its materials, and any use of data mining, robots, or similar data gathering and extraction tools is strictly prohibited. In no event may you frame any portion of the Site or any materials contained therein.
I think you are going to need approval if you want to scrape their site legitimately.
I tried multiple EC2 instance sizes.
The last one I tried was the m4.16xlarge model w/ 64 vCPU & 256 GiB of memory.
https://aws.amazon.com/ec2/instance-types/
I'm a noob at this, so I was thinking maybe I was misreading the memory, or maybe there was a better way to get the comments into S3.
Great points. I would not use ckan to store such large datasets. I don't even think it supports large amounts of data.
Do look at the AWS Data Lake implementation. It is really inexpensive to get up and running and is also serverless. They expose an API to which you pass the data and metadata. It automatically indexes the metadata in ElasticSearch, and you place the data in S3. If you want a cluster, spin up an EMR cluster with EMRFS pointing directly at the data. Or use Athena/Presto to query the data directly. Again, serverless.
Then yes something like ckan is a bit redundant unless you want to use its UI for people to browse a categorized catalog of data.
https://aws.amazon.com/blogs/big-data/introducing-the-data-lake-solution-on-aws/
I guess I should have clarified. When I said test a "mainframe application" I meant that I was building a chunk of one, not that I am production testing one for a business. While the inclusion of the full end-to-end ERP would be great, I really just want to build out a mainframe app which could do some light MRP functions with a .net front end and have numbers that look plausible.
I used to do supply chain so I have a bag of tricks that I have implemented in higher level programming languages that I would like to cut over to Cobol while I learn the language. I totally understand where you are coming from with your comments. The below link seemed to give me the most interesting first task which would be schedule building.
This link had the first component that I needed here: http://download.cnet.com/MRP_Excel-zip/3000-2067_4-10057875.html
Awesome, thank you. I also found this https://www.xe.com/iso4217.php, it explicitly states what I'm looking for which is what I'll end up using. But I couldn't have found it without you!:)
What about the various lists employed by ad blockers such as uBlock Origin? The lists that I use personally include nearly 150k adware, malware, and tracking rules; extracting a list of sites/domains from those lists would be incredibly easy. Another good source for such lists is filterlists.com.
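For what it's worth, pulling the domains out of an Adblock-format list is only a few lines; this sketch assumes the common ||domain.tld^ network-rule syntax and uses EasyList as an example source:

import re
import requests

list_url = "https://easylist.to/easylist/easylist.txt"  # any Adblock-format list works the same way
text = requests.get(list_url).text

# ||example.com^ style rules; cosmetic rules and comments are ignored
domain_re = re.compile(r"^\|\|([a-z0-9.-]+\.[a-z]{2,})\^", re.IGNORECASE)

domains = set()
for line in text.splitlines():
    m = domain_re.match(line)
    if m:
        domains.add(m.group(1).lower())

print(len(domains), "domains extracted")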
Founder of Vectorspace AI here. You might want to take a look at free versions of NLP correlation matrix datasets we have available for the financial markets. We describe what can be done with them here: Generating and visualizing alpha with Vectorspace AI datasets and Canvas https://www.elastic.co/blog/generating-and-visualizing-alpha-with-vectorspace-ai-datasets-and-canvas
Our subreddit: r/VectorspaceAI
Hello guys, I tried Infograpia and it is an extremely useful tool for quickly building stunning presentations. You can read reviews here:
Hmmm... yeah, I don't know of a good dataset that'll have that. You could maybe get some insight from templates that are out there? I think most apartments only provide the lease when they're ready to get a signature; most don't post that info anywhere.
Common Crawl which can be accessed for free on AWS at s3://commoncrawl has several million PDF files indexed from websites they crawl.
https://commoncrawl.org/2013/08/a-look-inside-common-crawls-210tb-2012-web-corpus/
NOAA has really long series of climate data https://www.ncdc.noaa.gov/cdo-web/datasets, though not sure how good is the coverage for stations outside of US. The global hourly data under the legacy section might be close to what you are looking for.
Or you could try weather app API, like Dark Sky
Detailed tracking and info for tropical storms and hurricanes in the North Atlantic since 1851. Original source is from Weather Underground https://www.wunderground.com/hurricane/hurrarchive.asp?region=at
Bing offers some bird's eye / oblique imagery, but I'm not familiar enough with it to know if there's time-based access... https://www.bing.com/maps?osid=1d174188-928d-40a7-ac72-8d52b04cb4a4&cp=40.769182~-73.978031&lvl=20&style=g&imgid=bc7f6870-f6c4-491f-a2df-84fdc7738f5a&v=2&sV=2&form=S00027
>Good luck, the reviews are the reason people go on those sites so they fight against scraping tooth and nail, you probably will have to adapt your script every couple weeks.
I don't think so. I developed scrapers for TripAdvisor, hotels.com and Google Business which have been working fine for the last 3 months, with one minor change I did last week.
And yes I used Scrapy for that.
This repo has barcodes and qr codes in various distortions to test detection.
edit: more specifically look in false positives and upce folders for distorted barcodes.
They probably just downloaded a bunch of pictures from here: https://thispersondoesnotexist.com/
Some of the pictures in that dataset do not look right. When I first saw it I hoped that some quality check had been done on it, but probably not :/
you can find subtitles w/ character names here
you need to build a scraper to get the SRT files, but from a quick inspection in TextEdit it looks like they contain the character names. I might scrape it myself later but the website is pretty slow.
hope that helps
https://www.opensubtitles.org/en/search2/sublanguageid-all/searchonlytvseries-on/moviename-keeping+up+with+the+kardashians
here you can download .srt files, you only have to discard the timestamps
Maybe email the beets.io mailing list or something, ask for the sqlite3 `musiclibrary.db` files from people who are willing to share. I have a 23k track metadata library I could probably part with if you're willing to give info about your intended use case!
Depending on the project you're trying to do, you could also use an agent-based modeling approach to generate synthetic data- something like the NetLogo traffic grid
https://musicbrainz.org/ is top notch. I consulted for ClearChannel for a few years, worked with the guy who Paul Allen uses for his in-flight music libraries (boat/plane etc.), the dataset run by Aldous Huxley's grandson (I believe) called Muze, and CDDB. I can say for sure, as a guy who dealt with music metadata for ages: MusicBrainz is top notch, and the last.fm folksonomy plugin will give you all the tags you could ever want. Written in Python but has a web API.
I'm a developer with gephi and they are sending me a t-shirt since I am a plugin developer. (This is a change from before when I was just employed by my university) See http://gephi.org/about/people/ for my picture. If you have any questions about the software please ask me!
/u/vocabularian has already mentioned the data at OPISnet.com. I shall mention two more, but suspect that these data sets would also cost you considerable money. The first is GasBuddy.com, and the second is the AAA data set:
2) http://www.gasbuddy.com/Charts
Both AAA and GasBuddy have years worth of data.
While I haven't used this service before, I found it just now while searching for historical pricing datasets.
If you're interested in Amazon pricing records, these guys seem like they have it all. I still haven't made an account or looked for a reasonable way to request data from the site, though. Good luck!
If you have access to a terminal/grep this is a quick option.
$ grep "{search_word}" {file_name} > result.txt
Where: {search_word} is the "certain word" that you want to find, {file_name} is the 40 gig file that will be searched, and result.txt will have the printed lines that contain your {search_word}.
Look into using flags for a more refined search. For example, the following line will perform a case-insensitive search:
$ grep -i "{search_word}" {file_name} > result.txt
If you want to improve performance of your search look into installing silver searcher ( https://github.com/ggreer/the_silver_searcher/).
This one could be interesting for you: https://algorithmia.com/algorithms/algorithmiahq/DeepFashion
Even though it's the same name, this dataset is different from the publicly available one. And though it's not open source, it can be queried via their API. And it has object detection. :)
Would analyzing the https://libraries.io/data dataset help?
"Libraries.io gathers data from 36 package managers and 3 source code repositories. We track over 2.7m unique open source packages, 33m repositories and 235m interdependencies between them. This gives Libraries.io a unique understanding of open source software"
You get free credits on opening a new GCP account, so you could use that for the Google speech-to-text API (it's pretty good). Alternatively, use open source Python ASR libraries like:
Or maybe use pretrained deep learning ASR models (this one is easy to use if you are familiar with pytorch) - https://pytorch.org/hub/snakers4_silero-models_stt/
I don't have any experience with ASR so I think you should ask on Stack Overflow for resources
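That said, the Silero route from the link above is only a few lines via torch.hub; this is more or less the snippet from that page, so double-check the current version there (the audio filename is a placeholder):

import torch

device = torch.device("cpu")

# Pretrained Silero STT model via torch.hub (see the pytorch.org/hub link above)
model, decoder, utils = torch.hub.load(repo_or_dir="snakers4/silero-models",
                                       model="silero_stt",
                                       language="en",
                                       device=device)
read_batch, split_into_batches, read_audio, prepare_model_input = utils

files = ["speech_sample.wav"]  # placeholder audio file
batches = split_into_batches(files, batch_size=10)
model_input = prepare_model_input(read_batch(batches[0]), device=device)

for example in model(model_input):
    print(decoder(example.cpu()))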
I would use puppeteer to scrape the site as it appears the data you want is dynamically loaded. If you have any specific question about how you might use puppeteer, feel free to ask, though I am not an expert.
Depending on the volume of your data I would suggest OpenRefine (http://openrefine.org/), or if you have tons of data, Pentaho may be a more robust solution. If you have doubts, send me a message about either one; I have worked extensively with both.
Haven't got the actual dataset, but it's pretty easy to make one.
I built a tool that does something similar: http://seo-analyser.import.io/ - Just stick all the URLs in there, and download as a CSV. You will get all the outbound links, title tags, headings etc...
If you wanted more specific data from a larger number of sites (and want it up to date), you should use https://import.io, for free, to collect all the data you want and put it into a CSV or a live Google Sheet. It's stupidly easy and takes less than 5 minutes to set up a full crawl per site. You could easily do 50 websites, with all the data you want, in an evening.
Another 'paid for' option is to look at SEO-focused web index tools like Majestic SEO. I don't know if they include outlinks and whatnot, and you won't have any control over the data you get from the site (e.g. if you wanted to get the page text you would struggle), but it's worth a look.
I work at import and use it quite a bit, so i'd be happy to help. Just drop me a msg.
Good luck :)
Since I'm already using python this sounds like a good route to take.
Installing numpy, scipy and matplotlib only took a couple of minutes. Then another minute of pasting this anim.py example into a file and running it, and I had a nice little GTK animation.
Any recommendations on an effective way to capture this video and save it as a movie on linux?
And here's a bit more info on creating a movie from a sequence of images. And Matplotlib's FAQ on making a movie that describes how to convert a number of images into a movie.
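If you'd rather skip the screen-capture step entirely, matplotlib can write the movie itself via FuncAnimation (it needs ffmpeg installed); a minimal sketch:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

fig, ax = plt.subplots()
x = np.linspace(0, 2 * np.pi, 200)
line, = ax.plot(x, np.sin(x))
ax.set_ylim(-1.1, 1.1)

def update(frame):
    # Shift the sine wave a little each frame
    line.set_ydata(np.sin(x + frame * 0.1))
    return line,

anim = FuncAnimation(fig, update, frames=100, interval=50)
anim.save("animation.mp4", writer="ffmpeg")  # or writer="imagemagick" for a gif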
It has moved to https://sponsor.ajay.app/database, the sqlite db is an older copy kept there for backwards compatibility.
By the way, I only started collecting video duration recently, so not many videos have it (you'll see many are zero)
Mapping in the sense of geographic maps? I think they all use QGIS. https://twitter.com/tjukanov is good to follow on making maps.
I know R a bit and the related ggplot2 end of making maps, visualisations and mathematical art.
The Javascript stuff (or maybe Processing) might be better on a raspberry pi as it is kind of naturally web based and moving. https://d3js.org/
If you do want to model something like this, your best bet is probably an OLS (ordinary least squares) linear regression model. You use these models when your dependent variable (the one that you want to explain) is a continuous number (as opposed to a category variable). These are extremely simple and common models that allow you to describe your dependent variable as a combination of other variables (predictors). For instance, you might think that the amount of money a customer spends is a function like this:
$spent = intercept + a*age + b*sex + c*income + error
If you find a statistical software package (let me recommend R), you can run a linear model easily. By plugging in all of your values for $spent, age, sex, and income, the model will spit out the values for intercept, a, b, c, and error. If, for example, the value of a is 2, then for every one unit increase in age (which would probably be a year), your customer is expected to spend 2 extra dollars (assuming that $spent is measured in dollars).
So, if you do make this model and only use 90% of your data, you will know the formula that predicts how much money people will spend at the store. Then, you can take the values of age, sex, and income for the remaining 10% of people and plug them into that formula (forget about the error for now). The output is the predicted amount of money they will spend. You can then compare this to the true values to see how close you were.
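If you end up in Python rather than R, the same model is a one-liner with statsmodels; the column names and the CSV here are placeholders matching the formula above:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("customers.csv")   # assumed columns: spent, age, sex, income

# Fit on 90% of the data, hold out 10% for checking predictions
train = df.sample(frac=0.9, random_state=42)
test = df.drop(train.index)

model = smf.ols("spent ~ age + sex + income", data=train).fit()
print(model.summary())               # intercept, a, b, c and their standard errors

predicted = model.predict(test)      # plug the held-out 10% into the fitted formula
print((predicted - test["spent"]).abs().mean())  # average prediction error in dollars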
The problem with this approach is that they generally rate limit or restrict the origin of requests to APIs to prevent mass data scraping like this. I took a look at Western Union's site and they refer to an API on their own domain at this URL: https://location.westernunion.com/api/locations?country=US&q=08889
The parameters being country for the country, of course, and q being the zip code. I emulated a GET request using a REST client, Insomnia, and got an error. This is without messing around with the request headers, though. It's not something that'd be simple to do if you're not familiar with it, unfortunately.
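The same request in Python, in case a script is easier to poke at than a REST client; the headers are guesses at what the site might expect, which is exactly the part that needs experimenting:

import requests

url = "https://location.westernunion.com/api/locations"
params = {"country": "US", "q": "08889"}  # country code and zip, as described above

# Header values are guesses; copying the browser's own request headers from the
# network tab is usually what gets past the error.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "application/json",
    "Referer": "https://location.westernunion.com/",
}

resp = requests.get(url, params=params, headers=headers)
print(resp.status_code)
print(resp.text[:500])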
The example you posted wasn't so much that the guy found a raw data file, but that he found an unsecured API that just lets you make requests from anywhere. So he was able to scrape all the data into a nice file for that post.
I mean, you can probably pay for something like this if you want to.
Otherwise, it might take some doing. I might look at the Cook Index, which tracks political lean of all districts in the country. You can find 2017s here: https://www.docdroid.net/4vS5iWM/arranged-by-state-district-1.pdf#page=2 or you can google “cook index.”
Then you would just need a list of congressional districts and what zip codes they contain. I’m sure that’s easy to find (check the census website, failing that someone probably has this on google).
That gives you national politicians but I’m not sure of a good way to get it to be more granular than that without paying for it.
Just enter random values. The phone number is optional anyway. For the email, use getnada.com Basically any [email protected]. Open the link and find your way to the required email inbox.
Power Bi does this with ease, and it’s free. The only catch is that you can’t publish the reports anywhere without paying, but you CAN export to PDF or share the pbix file for anyone else with the desktop app to see.
Use the “Bing maps” visual.
Best part is it’s free for the desktop app and takes in a ton of different connectors (including excel and CSV).
My favorite report to show people is the Visual Vocabulary, which I use for reference all the time.
But the showcases on the PBI website are great.
They’re awesome for presenting PowerBI demos. I recently did one on the FEMA payouts from ‘89 to ‘15. Great stuff when you can use them for visuals.
Tableau is publishing a good quality set here: https://www.tableau.com/covid-19-coronavirus-data-resources
It is aggregated on location but looks pretty clean.
Dataclysm by Christian Rudder (OkCupid cofounder)
Presents insights from OkCupid user data and makes broader assertions about western culture. Can be a bit handwavy at times, but I found the book interesting. Contains a lot of good data visualizations if you’re looking for that.
I am not sure if you are looking for a static map image or the actual data to make your own. A google Search for LIDAR data or DEM images will get you a lot of results for both data and the Google Images section as well. You might also try something like ESRI Online maps. Amazon has this as well https://www.amazon.com/YellowMaps-Death-Valley-topo-map/dp/B07L2JJG8M
Please provide more details?
DataCluster is a Data Collection startup based out of India. We help researchers and companies collect large and diversified image/video datasets using our managed crowdsource platform DailyData.
We also help in new annotations and re-annotation of existing datasets.
Please let us know details of your requirements at
Regards Team DataCluster www.datacluster.in https://play.google.com/store/apps/details?id=com.daily.data
We can help you with data collection and annotation services.
The data should be coded by both DRGs (inpatient) and HCPCS (outpatient). DRG coded data has been previously released by CMS, so the HCPCS data is the big deal.
Both of those codesets come out of the box with hierarchies that are useful for classes of analysis. However, the core of HCPCS is the proprietary CPT codeset, which can make analysis a bother.
If your eyes are not bleeding yet, I will again shamelessly plug the book I wrote about this stuff, Hacking Healthcare.
-FT
This is really cool.
I'm wondering why even the most-mentioned thing only has 74 mentions. I know you only scraped comments from 2015 - 2017, but that still seems so low! Does your script only count it as a mention if the link is written out? Or does it capture hyperlinks like this as well?
I found this data after getting interested reading this book (which is great so far)
If your local library has a copy it might be worth checking out