What is Reddit's opinion of Apache Cassandra?

Hi - Jeff from the Apache Cassandra PMC. Apache Cassandra is a NoSQL, Big-data database, which seems to hit a handful of your tech/interests.

We're always looking for people to work on new stuff. It can be very, very basic, or more involved. If you've never used cassandra, check out some of our low-hanging-fruit JIRA tickets.

Low hanging JIRAs are here: https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+=+12310865+AND+labels+=+lhf+AND+status+!=+resolved

Instructions on how to contribute are here: http://cassandra.apache.org/doc/latest/development/patches.html

If you're wondering WHY you should work on cassandra, consider this: it's used by companies like Facebook, Apple, Netflix, Walmart, Uber, and Microsoft.

If you need help, #cassandra-dev on freenode is typically responsive during PST business hours. The dev mailing list is active, and you can email me directly (my reddit username @ gmail, or @ apache, or whatever you prefer). I'll help review+commit your contributions, just nudge me to get my attention.

This is the review checklist we point new contributors at for Apache Cassandra: http://cassandra.apache.org/doc/latest/development/how_to_review.html

The basics are there and apply to most software -

Are you conforming to style?
Are you using libraries properly?
Are you handling errors? Are you catching errors from third party libraries properly? Are you checking for null inputs? Do exceptions propagate?
Did you document? Are there edge cases you've described? Do you have //TODO hanging around?
Did you write tests? Is the code testable? Are the tests actually testing what you think? Are they passing?
Do you have appropriate logging?
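
To make a few of those checklist items concrete, here is a minimal sketch in Python (Cassandra itself is Java, so this only illustrates the idea). The `parse_port` helper is hypothetical and not taken from any project; it shows a null check, a wrapped third-party-style error that still propagates, documented edge cases, and a log line.

```python
import logging

logger = logging.getLogger(__name__)


def parse_port(raw):
    """Parse a TCP port from user input (hypothetical helper, for illustration).

    Edge cases: None, non-numeric input, and values outside 1..65535 all
    raise ValueError.
    """
    if raw is None:
        raise ValueError("port must not be None")
    try:
        port = int(raw)
    except (TypeError, ValueError) as exc:
        # Log with context, then let the error propagate to the caller.
        logger.warning("rejecting non-numeric port %r", raw)
        raise ValueError(f"port is not a number: {raw!r}") from exc
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port
```

A reviewer would then expect tests that cover the happy path and each documented edge case, not just one of them.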

No, clustering is not really something built into an operating system (or at least not into Windows/Linux); it's up to the database/application itself to do it. I'm pretty sure SQL Server Express doesn't support it, nor do most SQL-based databases like MySQL or Postgres.

A good example of a clustered database is Cassandra, originally built at Facebook. It is designed from the ground up with clustering in mind, so it works quite differently from any SQL-like database. You could definitely set up and run that on your two PCs, though it doesn't really have any advantage until you're running it on dozens or hundreds of machines.

Cassandra is absolutely overkill for any small or medium application. This stuff only becomes necessary when you're running a site with millions of users and many terabytes of data, and need to do hundreds of thousands of queries per second.

Game servers too. As others have said, anything being run on a server has a good chance of being in Java.

Data processing jobs, hadoop.

Graph databases, neo4j.

One of Facebook's databases, cassandra

Maybe relevant: https://github.com/reddit/reddit/blob/master/r2/r2/models/vote.py

Looks like they use Cassandra (http://cassandra.apache.org/) to store the comment votes as a Relation(Account, Comment), where the vote is either -1, 0 or 1.
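
A rough sketch of that shape, not reddit's actual code: the `VoteStore` class and its method names are made up for illustration, and a plain dict stands in for the Cassandra column family. The key is the (account, comment) pair and the value is the vote direction.

```python
VALID_VOTES = (-1, 0, 1)


class VoteStore:
    """In-memory stand-in for a store keyed by (account, comment)."""

    def __init__(self):
        self._votes = {}  # (account_id, comment_id) -> -1, 0 or 1

    def cast(self, account_id, comment_id, direction):
        if direction not in VALID_VOTES:
            raise ValueError(f"vote must be one of {VALID_VOTES}")
        # Writing the same key again overwrites the earlier vote,
        # so each account has at most one vote per comment.
        self._votes[(account_id, comment_id)] = direction

    def score(self, comment_id):
        return sum(v for (_, c), v in self._votes.items() if c == comment_id)
```

Changing your mind is just another write to the same key, which maps well onto Cassandra's upsert behaviour.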

Don't focus on how to interview, actually spend your time learning something. If you're good enough at something, the interview becomes a formality. My advice: Pick an OSS project that interests you, and work on it.

You'll gain real world experience, and more importantly, you can become an expert in software real companies use.

Particularly partial to Apache Cassandra - the docs for how to contribute are here: http://cassandra.apache.org/doc/latest/development/patches.html

(If you pick a JIRA you're interested in working on, PM me, I'll talk you through it, and review/commit it when it's ready).

I'll be up front, this isn't a great use case for Cassandra. What you're asking for is essentially a sorted set, something Redis is great at, since it stores 2 structures, one for the set look ups and one for the sorting. I've had this exact problem in the past and I used Redis for the sorted sets.
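
To illustrate the "two structures" idea, here is a toy sorted set in Python. It is only a sketch: Redis uses a hash table plus a skip list internally, while this version swaps in a dict plus a sorted list, and the class and method names are invented for the example.

```python
import bisect


class SortedSet:
    """Toy sorted set: one structure for lookups, one for ordering."""

    def __init__(self):
        self._score_by_member = {}  # member -> score (set lookups)
        self._ordered = []          # sorted list of (score, member)

    def add(self, member, score):
        old = self._score_by_member.get(member)
        if old is not None:
            # Remove the stale ordering entry before re-inserting.
            idx = bisect.bisect_left(self._ordered, (old, member))
            del self._ordered[idx]
        self._score_by_member[member] = score
        bisect.insort(self._ordered, (score, member))

    def newest(self, n):
        """Return up to n members, highest score first."""
        if n <= 0:
            return []
        return [member for _, member in reversed(self._ordered[-n:])]
```

For the inbox case, the member would be the contact id and the score the last-contact timestamp, so updating a contact moves it to the top without leaving a stale entry behind.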

That said, if you feel like jumping into dark territory, you could try it like this...

CREATE TABLE inbox (
    user int,
    last_contact timestamp,
    contact_id int,
    PRIMARY KEY (user, last_contact)
) WITH CLUSTERING ORDER BY (last_contact DESC)
  AND compression = {'sstable_compression': 'LZ4Compressor', 'chunk_length_kb': '4'}
  AND compaction = {'class': 'LeveledCompactionStrategy'};

When a user gets sent a message, you'll have to delete the old last_contact record and insert a new one. I'm not wild about this because a high churn on messages will generate a lot of tombstones, but since you're dealing with people you might only see a few hundred of these per week.
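
Here is a toy model of that delete-then-insert pattern for a single user's partition, just to show where the tombstones come from. It is a simulation in plain Python, not driver code; the `InboxPartition` class is hypothetical, and it assumes timestamps are unique within a user, as the primary key above requires.

```python
class InboxPartition:
    """Toy model of one user's partition in the inbox table."""

    def __init__(self):
        self.live = {}        # last_contact -> contact_id
        self.tombstones = []  # clustering keys that were deleted
        self._last_seen = {}  # contact_id -> last_contact of the current row

    def record_message(self, contact_id, ts):
        old_ts = self._last_seen.get(contact_id)
        if old_ts is not None:
            # The old row has a different clustering key, so it must be
            # deleted explicitly; every such delete leaves a tombstone.
            del self.live[old_ts]
            self.tombstones.append(old_ts)
        self.live[ts] = contact_id
        self._last_seen[contact_id] = ts

    def inbox(self):
        """Contacts ordered by last_contact, newest first."""
        return [self.live[ts] for ts in sorted(self.live, reverse=True)]
```

Note that the application has to know the previous last_contact value to delete the old row, which is one more reason a lookup table is useful.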

If you do hit a high tombstone count, my advice is to use LCS and run daily subrange repairs on this table using reaper: http://cassandra.apache.org/ which we (The Last Pickle) maintain and which is open source. Once you've got your repairs running regularly, you can drop gc_grace_seconds on the table down to a number close to your repair schedule, and let the tombstones drop out at a faster rate than they do by default.
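
As a back-of-the-envelope illustration of why lowering gc_grace_seconds helps, the sketch below checks when a tombstone becomes eligible for removal. The `purgeable` function is a simplification written for this example: in a real cluster the tombstone is only dropped when a compaction actually processes it, and other conditions can delay that.

```python
DEFAULT_GC_GRACE_SECONDS = 864000  # Cassandra's default, 10 days


def purgeable(deleted_at, now, gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """True once a tombstone written at deleted_at has outlived gc grace.

    Both timestamps are in seconds. Simplified: eligibility only, not
    a guarantee that compaction has removed the tombstone yet.
    """
    return now - deleted_at > gc_grace_seconds


DAY = 86400
# With the default, a 3-day-old tombstone is still kept around.
print(purgeable(0, 3 * DAY))
# With daily repairs and gc grace lowered to 2 days (an assumed value),
# the same tombstone is already eligible.
print(purgeable(0, 3 * DAY, gc_grace_seconds=2 * DAY))
```

The value has to stay above the time it takes to complete a repair cycle, otherwise deleted data can reappear on replicas that missed the delete.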

I think you'll also probably need a per-user lookup table to identify all the messages from a user:

CREATE TABLE inbox_by_user (
    user int,
    contact int,
    message_id timeuuid, // assumed type; any unique message id type works
    // other necessary message details here
    PRIMARY KEY ((user, contact), message_id)
);

Whenever you want to lookup all the messages in the inbox table from a specific user, you can consult inbox_by_user. It also gives you a per-user history, which might be helpful.
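
The lookup side can be sketched the same way. This is an in-memory stand-in, not driver code, and the `InboxByUser` class is hypothetical: each (user, contact) pair is one partition, and message ids are the clustering column within it.

```python
from collections import defaultdict


class InboxByUser:
    """Toy model of inbox_by_user: one partition per (user, contact)."""

    def __init__(self):
        self._partitions = defaultdict(set)  # (user, contact) -> message ids

    def add_message(self, user, contact, message_id):
        # Re-adding an existing id is a no-op, like an upsert.
        self._partitions[(user, contact)].add(message_id)

    def history(self, user, contact):
        """All message ids for this pair, in clustering (sorted) order."""
        return sorted(self._partitions.get((user, contact), ()))
```

Every new message is then written twice, once to inbox and once to inbox_by_user, which is the usual denormalization trade-off in Cassandra.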

cqlengine was mainly built to expose CQL, not be a replacement for the thrift interface, or as an alternative to something like mongo. It is expected that someone using this would understand the underlying data model of cassandra/CQL.

Check out the section on partition keys and clustering keys here: http://cassandra.apache.org/doc/cql3/CQL.html

You have the ability to define clustering keys alongside your primary key in cqlengine.

This week is pretty slow, but going to start learning Cassandra when I have some downtime. Starting an in-house course on it that'll take up most of next week and databases are a weak spot for me.

http://cassandra.apache.org/doc/latest/development/patches.html

> Not sure what to work on? Just pick an issue tagged with the low hanging fruit label in JIRA, which we use to flag issues that could turn out to be good starter tasks for beginners.

And

> And since it's not on github contributing to it doesn't have nearly as much value as other projects

Insert eye-roll here. Your measure of value is in ability to interact on a specific website. Every contributor uses github, the only difference is we don't use PRs and GH Issues, because. The contributions end up there, properly attributed, searchable. The commit itself isn't pushing a merge button, other than that it's github based.

Working with Apache Flink and Cassandra, using PhantomDSL. Current Flink connectors are really designed for java (really a pain in a pure-scala project).

If you are interested in Big Data / Data Analysis and want a challenge and something to put on your resume, you can look into learning Java and some of the more popular open source big data platforms like HBase, Cassandra, and Spark. These will be pretty difficult to learn without a lot of previous development experience, but if you do understand them and can develop applications that utilize them, it will open a lot of doors for you.

Might I suggest Cassandra DB. It scales horizontally by making certain nodes in the cluster authoritative for certain chunks of data, which makes it very fast for reads and writes. It also provides fault tolerance and intra- and inter-datacenter replication, and it's even rack-aware. I think this will get you past the performance issues.
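
The "certain nodes are authoritative for certain chunks" part is a token ring. Below is a simplified sketch: it gives each node a single token and hashes with MD5, whereas Cassandra defaults to Murmur3 and virtual nodes, and it ignores racks and datacenters. The `Ring` class and node names are invented for the example.

```python
import bisect
import hashlib


def token(key):
    """Map a string to a position on the ring (MD5 here, for illustration)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class Ring:
    def __init__(self, nodes, replication_factor=3):
        if not nodes:
            raise ValueError("ring needs at least one node")
        self.rf = min(replication_factor, len(nodes))
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, partition_key):
        """Owner of the key's token range, then the next nodes clockwise."""
        size = len(self._ring)
        start = bisect.bisect_left(self._tokens, token(partition_key)) % size
        return [self._ring[(start + i) % size][1] for i in range(self.rf)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
print(ring.replicas("user:42"))
```

Any client can compute the replicas for a key without asking a central coordinator, which is why adding nodes adds capacity instead of contention.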

Actually yes. Facebook is a pretty large supporter of open source software projects and has released a lot of its core infrastructure as OSS, such as Apache Cassandra, Apache Hive, and Apache Thrift.

Along with running one of the largest Linux deployments on the planet.

Feel free to not use or like Facebook's service (I don't use it either), but "tragic affair in the Linux community"? Please.

You're not wrong. There can be some tough decisions. But I think you're not quite appreciating the purpose of the itemization system as it is designed. Or why the inventory space you have is limited. It's not because a few more rows in the database are prohibitively expensive. All of your inventory on all three character slots will not amount to one megabyte of stored data in a Cassandra cluster, or similar denormalized database.

Rather, the limitations are there specifically to extend the lifetime of the game for its most ardent fans. This is accomplished by generating a large pool of potential affixes, making the 'God Roll' diminishingly unlikely. Giving players a way to hoard the most useful affixes runs contrary to that goal.

I'd, personally, seriously consider Cassandra

File size is not overly important in Cassandra, but you are going to need a decent number of nodes, and think long-and-hard about where single-points-of-failure exist.

How are you going to use your data? Do you really need a distributed NoSQL database for your purposes, or will a locally hosted RDBMS suffice?

I'm pretty new to the db as well, but right now I'm going through the Cassandra getting started documentation: http://cassandra.apache.org/doc/latest/getting_started/index.html

Here's a tutorial for Cassandra's Python driver: https://datastax.github.io/python-driver/

And a DataStax Academy tutorial for setting it up: https://academy.datastax.com/resources/getting-started-apache-cassandra-and-python-part-i You need an account for this though.

> Simply untrue. I'm a committer for Apache Cassandra, and we have a ton of low-hanging fruit that any entry level Java programmer should be able to do.

Cool! Now can you show me where I can find what issues I could work on? The main project isn't on github (the repo is a clone) and that repo isn't very inviting to beginners at all: https://github.com/apache/cassandra

Your community page has the exact same issue. No help at all for beginners. Nowhere to just check which issues are open and suitable. Your backlog has a pile of 1500 open issues.

You disagree with me on your experience as a Cassandra committer. I was not talking about your project specifically but about projects in general that are on GitHub. Your project isn't. And since it's not on github contributing to it doesn't have nearly as much value as other projects. Those other projects will in general have the easy stuff cherry picked already.

> This just isn't true at any job I've ever worked at. Most employers already have projects in flight.

You're misunderstanding. I meant that you can build new features from scratch. Go from user requirements to technical design and implementation. That's something you can easily demonstrate by building stuff yourself.

I'm not saying that contributing to an OS project like Cassandra is useless (far from it; I use it daily in my job and it's a popular skill with recruiters). I'm saying that contributing to an OS project on github is not as easy as people make it sound and also doesn't give huge benefits that easily: no one is going to be impressed with you just fixing a few simple bugs and typos, for example.

> To interface Java with C it uses JNI. JNI code comes with two complementing parts - Java and native C code

Aww, such a joykiller :(

Anyway, could you guys highlight the database features a bit more? Like, how does it compare to Cassandra (as in, the column database) or HSQLDB (as in, the pure java sql db)?

Would a high throughput NoSQL solution fit your needs -- such as Cassandra or Aerospike? Aerospike has a hybrid mode where indexes are stored in memory but data on disk.

So, rkt is a container technology that is being developed by the makers of CoreOS (I guess they didn't like some of the directions that Docker was going, and I can't say I disagree), but it is relatively new compared to Docker AND they seem to be doing a lot of collaboration lately with each other to develop the Open Container Project (OCP). Containers are a fairly new technology in the Open-Source world, so there are lots of constant changes and growth. :-)

As far as distributed storage goes, I currently have 3 Cassandra nodes running in my cluster. I am using DataStax Cassandra, since that is what I am currently employed to work on. I have not done a lot with Cassandra on my home cluster other than get it working, because I have plenty of test nodes at work (no, they are not containerized at work).

I like that CoreOS makes administration simple. I simply pick the release I want to use (Alpha, Beta, or Stable), and CoreOS does the updates automatically and reboots the physical machines in a rolling fashion when needed. Fleet (or Kubernetes, which I am working on learning now) then restarts the containers on another node, and they keep on rolling. THAT is why you would run containers in a CoreOS cluster on multiple nodes, rather than just one node: you get to keep the availability of those containers at essentially 100% uptime.

The real advantage of containerizing applications is that it simplifies and separates them to make administration easier. In a way, it takes away a lot of the work from the admins and puts it on the developers.

Wow. In short, yes.

Apache Cassandra

Scribe

Open Compute Project

Just to name a few.

Reddit is basically a ton of javascript, with some pylons mixed in for good measure with a cassandra backend.

cassandra.apache.org

What you're talking about is creating a hash or some unique way to identify an image. That would need to be calculated and stored for every image on reddit. For it to be used as a dupe check, the identifier for each image would need to be indexed. Indexes in a database help organize things so they can be retrieved more quickly. In the interest of speed, it would have to be indexed, or people would just stop submitting pictures altogether. Since there are multiple instances of the reddit database running concurrently on multiple servers, building indexes is very non-trivial. They have to be calculated and stored in real time across all of the various nodes.

Cassandra sorta sucks at this (or it did; according to them, they've improved it greatly).
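
On a single machine the idea looks like this sketch, where a dict plays the role of the index. The `ImageIndex` class is hypothetical, and a SHA-256 of the raw bytes only catches byte-identical files; resized or re-encoded copies would need some kind of perceptual hash instead.

```python
import hashlib


class ImageIndex:
    """Toy dupe check: index submissions by a hash of the image bytes."""

    def __init__(self):
        self._by_digest = {}  # hex digest -> id of the first submission

    def submit(self, submission_id, image_bytes):
        """Return the id of the first submission with identical bytes.

        If the returned id equals submission_id, the image is new.
        """
        digest = hashlib.sha256(image_bytes).hexdigest()
        return self._by_digest.setdefault(digest, submission_id)
```

The hard part described above is not this lookup, but keeping such an index consistent and fast when it is spread across many database nodes.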

There are many benefits to using Cassandra in a place like reddit, but that is one of its bigger drawbacks.

What is Reddit's opinion of Apache Cassandra?
From 3.5 billion Reddit comments

24 reviews of this app found across Reddit:
