The app basically listens to an audio sample and builds a fingerprint from the highs and lows of the waveform. Here is a shitty visual aid I drew. In this example, the fingerprint would be something like CBDAEBDBDBC.
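To make that concrete, here is a toy sketch in Python. It is not the real fingerprinting algorithm, just an illustration of the simplified description above: every local peak or trough in the signal gets a letter from A (lowest) to E (highest). The function name `toy_fingerprint`, the five letter bands, and the sample values are all made up for the example.

```python
def toy_fingerprint(samples, levels="ABCDE"):
    """Turn the peaks and troughs of a signal into a string of letters.

    Assumption for illustration only: amplitude is split into equal bands,
    one per letter in `levels`, lowest band first.
    """
    lo, hi = min(samples), max(samples)
    span = (hi - lo) or 1.0  # avoid dividing by zero on a flat signal
    letters = []
    # Look at each interior sample together with its two neighbours.
    for prev, cur, nxt in zip(samples, samples[1:], samples[2:]):
        is_peak = cur > prev and cur > nxt
        is_trough = cur < prev and cur < nxt
        if is_peak or is_trough:
            band = int((cur - lo) / span * len(levels))
            letters.append(levels[min(band, len(levels) - 1)])
    return "".join(letters)


# Made-up samples: peaks at 4, 3, 2 and troughs at 1, 0.
print(toy_fingerprint([0, 4, 1, 3, 0, 2, 0]))
```

The first and last samples are skipped because they have no neighbour on one side, so they can't be classified as a peak or a trough.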
This fingerprint gets sent to our server, where we have a database of over 50 million songs. Each song in the database is stored with the fingerprint for its full length. We basically just search the database for any fingerprints containing CBDAEBDBDBC.
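As a rough sketch of that lookup, assuming the database is just a dictionary that maps a song title to its full-length fingerprint string (the titles, fingerprints, and the name `find_matches` are invented for the example, and a real system would use an index rather than scanning every song):

```python
def find_matches(sample_fp, song_db):
    """Return the titles whose full fingerprint contains the sample fingerprint."""
    return [title for title, full_fp in song_db.items() if sample_fp in full_fp]


# Hypothetical two-song database, for illustration only.
song_db = {
    "Song One": "AABCBDAEBDBDBCCA",
    "Song Two": "EEDCBAABCDE",
}

print(find_matches("CBDAEBDBDBC", song_db))
```

Only "Song One" contains the sample fingerprint somewhere in its full-length fingerprint, so it is the only title returned.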
There is actually not much heavy calculation required, which is why it is able to do this so quickly. It is essentially the same thing as doing a ctrl+f search on a web page for a particular word. Obviously, the longer the audio sample, the longer the fingerprint, and the more accurately we can search.
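Here is a small Python sketch of why a longer fingerprint narrows things down. It counts how many songs in a made-up database contain the first n letters of a sample fingerprint, for growing n. The database contents and the name `candidates_by_length` are assumptions for illustration.

```python
def candidates_by_length(sample_fp, song_db):
    """Map each prefix length n to the number of songs containing sample_fp[:n]."""
    return {
        n: sum(sample_fp[:n] in full_fp for full_fp in song_db.values())
        for n in range(1, len(sample_fp) + 1)
    }


# Hypothetical fingerprints that share a common beginning.
song_db = {
    "Song One": "ABCDE",
    "Song Two": "ABCEE",
    "Song Three": "ABDDD",
}

print(candidates_by_length("ABCD", song_db))
```

With one or two letters all three songs still match, with three letters two are left, and with four letters only one song remains.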
This is the same technology iTunes uses for the "Get Track Names" feature, except that iTunes is able to get a pure audio sample from the actual file and can do it much more quickly. On your phone, we have to use additional algorithms to filter out background noise, which is way more complicated and beyond my level of expertise.