Continuing last response:
I found this page that implements a rate limiter on a Flask server using Redis. Using it on Falcon follows basically the same principle.
http://flask.pocoo.org/snippets/70/
I still haven't tested it, but later today I can do some prototypes to see the results.
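Here's a minimal sketch of the same idea adapted as Falcon middleware: a fixed-window counter kept in Redis, just like the Flask snippet. The key format, limits, and Redis connection details are assumptions I haven't tested, and depending on your Falcon version the app object may be `falcon.API` instead of `falcon.App`.

```python
# Untested sketch: Redis fixed-window rate limiting as Falcon middleware.
import time

import falcon
import redis

r = redis.StrictRedis(host="localhost", port=6379)

class RateLimitMiddleware:
    def __init__(self, limit=30, window=60):
        self.limit = limit      # max requests allowed per window
        self.window = window    # window length in seconds

    def process_request(self, req, resp):
        # One counter per client IP per time window, e.g. "rl:1.2.3.4:28459301"
        window_id = int(time.time()) // self.window
        key = f"rl:{req.remote_addr}:{window_id}"

        pipe = r.pipeline()
        pipe.incr(key)                      # count this request
        pipe.expire(key, self.window + 1)   # drop the counter once the window passes
        count, _ = pipe.execute()

        if count > self.limit:
            raise falcon.HTTPTooManyRequests(
                description="Rate limit exceeded, try again later."
            )

app = falcon.App(middleware=[RateLimitMiddleware()])
```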
Took a while, but I finally got this done: https://academictorrents.com/details/8269758bdeab03a311829e52744e30aaa318d3e0
I have a network storage drive, like this one. It holds 30 terabytes.
There are two general paradigms for making services that perform at low latency: vertical scaling and horizontal scaling. Vertical scaling is using more powerful hardware, like upgrading your RAM or processor. Horizontal scaling is adding more machines. To achieve this, the processing of a request is often broken up into multiple jobs that are delegated to different machines, or nodes, to work on in parallel. Depending on the service, each machine might host a full copy of the dataset being queried, or it might only host a chunk of the data. This chunk is usually called a "partition" and corresponds to some way of dividing up the data.
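As a toy illustration of the partitioning idea (not from any particular system): each record is routed to one of N nodes by hashing its key, so work for that key only touches the node that owns the matching partition. The node names here are made up.

```python
# Toy hash partitioning: map each key to the node that owns its partition.
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]

def owner(key: str) -> str:
    """Return the node responsible for this key's partition."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    partition = int(digest, 16) % len(NODES)
    return NODES[partition]

for doc_id in ["user:42", "user:43", "post:9001"]:
    print(doc_id, "->", owner(doc_id))
```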
Elasticsearch "shards" are partitions of the data assigned to different nodes so queries can be parallelized. I believe shards can have replicas, i.e. the data partitioned to a particular shard isn't exclusive to that shard in case the shard is unavailable, which is why I was asking for clarification on interpreting the shard metadata.
Part of the Elasticsearch backend.
> An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone.
>
> To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully-functional and independent "index" that can be hosted on any node in the cluster.
https://www.elastic.co/guide/en/elasticsearch/reference/6.2/_basic_concepts.html
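A small sketch of what that looks like in practice: the shard and replica counts are set when the index is created, as the quoted docs describe. The index name and the numbers here are arbitrary, and this assumes the official elasticsearch-py client talking to a local 6.x node.

```python
# Sketch: create an index with explicit shard/replica counts (Elasticsearch 6.x).
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

es.indices.create(
    index="comments",
    body={
        "settings": {
            "number_of_shards": 5,    # how many pieces the index is split into
            "number_of_replicas": 1,  # extra copy of each shard for redundancy
        }
    },
)

# Each primary shard (and its replica) can live on a different node, so both
# indexing and search work are spread across the cluster.
```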
That's the torrent I came across; it only has data up until March 2017.
http://academictorrents.com/details/85a5bd50e4c365f8df70240ffd4ecc7dec59912b
I found this article about running pip in a conda environment. But it looks like conda manages its own package repository, which doesn't include lots of smaller libraries like PSAW.
But PSAW is also a very thin wrapper; it's very easy to replicate the functionality you want yourself. I have a Python script here that downloads posts and comments from Pushshift to a text file. You can just copy the relevant bits and download the data you want, no need for PSAW.
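For reference, a stripped-down version of that approach looks roughly like this: hit the Pushshift search endpoint directly with requests and append each result to a file, paging by `created_utc`. The subreddit, output path, and paging details here are placeholders, not the exact script mentioned above.

```python
# Sketch of a PSAW-free Pushshift downloader (submissions only).
import json
import time

import requests

URL = "https://api.pushshift.io/reddit/search/submission/"

def download(subreddit, out_path, batch_size=100):
    after = 0  # timestamp of the last post seen, used for paging
    with open(out_path, "a", encoding="utf-8") as out:
        while True:
            resp = requests.get(URL, params={
                "subreddit": subreddit,
                "size": batch_size,
                "after": after,
                "sort": "asc",
                "sort_type": "created_utc",
            })
            posts = resp.json().get("data", [])
            if not posts:
                break
            for post in posts:
                out.write(json.dumps(post) + "\n")
            after = posts[-1]["created_utc"]
            time.sleep(1)  # be polite to the API

download("learnpython", "submissions.jsonl")
```

Swap the URL for the comment endpoint if you want comments instead of submissions.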