A quick dive into a SQL query and the nuances of filters and limits.
Recently I worked with @peckjon on a Transposit connector for Algorithmia and an app that uses it. We ended up with a pretty interesting demonstration not just of the Algorithmia connector, but also the power of SQL for APIs.
We wanted to create a SQL query that gets commit messages from GitHub for a given repository and sends them to a sentiment analysis algorithm in Algorithmia. This is easy to do with Transposit and the final query looks like this:
SELECT *
FROM (SELECT C.commit.message, S.result[0].sentiment
FROM github.list_commits AS C
JOIN algorithmia.call_algorithm AS S
ON S.$body.document = C.commit.message
WHERE C.owner='seattlerb' AND C.repo='heckle'
AND S.user='nlp' AND S.algorithm='SentimentAnalysis'
LIMIT 100)
WHERE sentiment > 0.2 OR sentiment < -0.2
ORDER BY sentiment
This query has few interesting nuances, but first let's start with the basics.
To get the commit messages from GitHub for a given repository we can use this query:
SELECT *
FROM github.list_commits
WHERE owner='seattlerb' AND repo='heckle'
To send text to Algorithmia for sentiment analysis we can use this query:
SELECT *
FROM algorithmia.call_algorithm
WHERE user='nlp' AND algorithm='SentimentAnalysis'
AND $body.document = 'text to send to the sentiment analysis algorithm'
We can use JOIN
to combine these queries together:
SELECT C.commit.message, S.result[0].sentiment
FROM github.list_commits AS C
JOIN algorithmia.call_algorithm AS S
ON S.$body.document = C.commit.message
WHERE C.owner='seattlerb' AND C.repo='heckle'
AND S.user='nlp' AND S.algorithm='SentimentAnalysis'
As expected, this query will get the list of commits from GitHub and for each one will call the sentiment analysis algorithm with the commit message.
When you run a similar query in a relational database the DB engine will try to match all the rows in both tables based on the condition in the ON clause.
Transposit will basically do the same with the underlying APIs, except for two differences:
S.$body.document
is an input parameter for the Algorithmia API, so instead of trying to "match" the values Transposit will send the commit message as a parameter to the Algorithmia API.Another interesting part in this query is in the last WHERE
clause:
WHERE sentiment > 0.2 OR sentiment < -0.2
as you can see this WHERE
is in the outer SELECT
, and the JOIN
is in a subquery.
We can reduce this type of query to the following template:
SELECT *
FROM (SELECT *
FROM source
WHERE [subquery filters]
LIMIT 100)
WHERE [outer query filters]
At first look this query looks the same as:
SELECT *
FROM source
WHERE ([subquery filters]) AND ([outer query filters])
LIMIT 100
But these queries are quite different: the first query tells the SQL engine to do the subquery filter
and the limit 100
before the outer query filters
and the second query applies the limit after both queries.
These two queries mean different things and can yield quite different results. The power of SQL is that you can do a lot by saying a little. As with any high level language, it’s important to mean what you say. Do you want to find “interesting” messages among the first 100 commits or do you want to find the first 100 polarizing commit messages? The former produces a well—understood amount of work; on a bland repository, the latter could churn until you hit a rate limit. Just like SQL for traditional databases, with SQL for APIs it’s important to say what you mean (and go crazy in those commit messages).