SQL functions in WHERE clauses are evil
Once we get up an running with the basic SQL syntax, doing inserts, updates, deletes and simple selects we start to learn about the SQL functions, the default ones like LOWER, COUNT, AVG and then the functions that are specific for the database you’re using. We learn them and start to feel that your fingers are itching to try them, to use them in the real world. Why would you learn them if you can’t use them anyway?
Well, I can’t tell you that you should never use them, but listen to my advice, do not ever use these SQL functions in WHERE clauses to filter data. They’re evil and they’ll try to kill your database and prevent you from working by having to discover crazy performance bottlenecks and slow queries that don’t look slow at all. At least until you run an “EXPLAIN” on them.
Let’s start with a simple example, we want to know, from our USERS table, which ones where born today so we can send them a beautiful and unpersonal e-mail to remember that they are getting older today. In your USERS table there is a column called “date_of_birth” which is a DATE and obviously it stores the date that the user said he was born at, so we know the day, month and year, we already have all information that we need, now it’s time to write the SQL code to find them.
In our first attempt, we write the following simple SQL:
SELECT * FROM users u where DAYOFMONTH(u.date_of_birth) = 12 AND MONTH(u.date_of_birth) = 1
Pretty simple and does the job perfectly, running it in your test database returns the correct users and it’s definitely fast. Now we get to try it in our production database so you can figure out which users are going to receive the e-mail. We’re running a big website, with a gazillion of users, so there should be some of them that are getting older today, we type in the query at your production database console and…
Wait.
Waiting.
Still waiting and looking at that blank console screen.
Well, you should find yourself something else to read, it’s going to take a while. After a long time waiting you get a list with 10 users with birthdays today. WTF? Why so long just to find 10 users?
A light shines in our heads and we remember that there is no index at the “date_of_birth” column, we never thought about using it in queries so we, as good database guys, did not create an index when it wasn’t needed. But now it is and you just type in the command to create the index for the “date_of_birth” column.
After waiting a little bit to have the index created, we type again our beautiful query and we wait again. This time it seems that’s it’s taking even longer to finish. This is clearly wrong, we have created an index at that database field and queries against that field should use that index. Now we have to bring the most important tool of anyone that has ever used a database, the “explain query” feature, which explains how a query is going to be executed by the database. At your database console, we type (this is for MySQL):
explain select * from users u where DAYOFMONTH(u.date_of_birth) = 12 and MONTH(u.date_of_birth) = 1\G
And here’s our output:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: users
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 4558
Extra: Using where
1 row in set (0.00 sec)
The most important lines are “possible_keys” and “key”, both of them are NULL, which means that our beautiful query isn’t using any indexes. But another information is even more alarming, the database is looking at 4558 rows to retrieve my results (and this is exactly the count of rows available at the users table). The database is scanning the WHOLE users table to fetch some 8 rows. Can you feel it’s pain?
We’ve created the index and we’re trying to filter just on that column, why is it not being used?
Because we’re using SQL functions, that’s the reason. The DAYOFMONTH function is a transforming function, it takes an argument and generates a value based on that argument and our query is performing it’s filtering based on this generated value. But here’s the problem, the database optimizer doesn’t know the generated value, it has no index for it nor a way to infer which value could be generated because it doesn’t know what this SQL function does. The optimizer can’t perform any optimizations at all.
When faced with such a complex query (from it’s point of view) the optimizer has no option but let the query run against the whole database, selecting every row, applying the function to every column and then finally filtering the result. Every time you use a SQL function that is not native to your database, like DAYOFMONTH, LEFT, RIGHT or MONTH, you might be leading yourself to such a bad query and future bottleneck. When you’re at your development database with a bunch of records, it might not yield any perceivable performance problems but once you reach the production environment with hundreds of rows, your problems will start to rise.
You should avoid filtering based on calculated or transformed data in you queries, as your database optimizer will not be able to give you the best “query plan”. If you’re faced with such a problem, you should create a separate column at your table and generate the value beforehand. In our case, we would need to create two new columns, “day_of_birth” and “month_of_birth”, create an index for both of them and every time a row has it’s “date_of_birth” updated, it should automatically update the “day_of_birth” and “month_of_birth” columns.
From now on, learn the mantra, SQL functions in where clauses are evil ![]()
Why learning HTTP does matter
It’s interesting to notice that there’s so many people working with web applications that don’t understand the basics of the Internet and the HTTP protocol. You might find applications that exibit bizarre behaviors anywhere, people just forget to read the specs or sleep during the HTTP protocol classes at college.
One of the most harmful exhibitions of this lack of knowledge is the “POST fever”. Every form in the application performs a POST, no matter what it’s doing or the side effects involved in it, it just works that way and people just don’t have a reason not to go like that, usually, if you ask them, they’ll probably say “oh, someone told me that the GET method has size limit in it’s parameters size”.
But, what’s so bad about it?
If you take a look at the HTTP RFC, you will find that the GET method is described as a “safe” method. Safe, in the HTTP context, means that you should be able to perform GETs to a web application and this should have no side effects on it, it should not change the resource you are requesting, because the whole idea of the GET method is that you should just GET a copy of the resource at that specific URL, you’re not doing anything funny with it, you should just receive it anywhere and anytime you want to.
But if you look at the POST method description, it’s defined as an “unsafe” method. If you send a POST to a URL you might be definitely changing something and generating an evil side effect that might render the whole application useless and bring Skynet and the Terminators to lay Armageddon on Earth. Or you might just be creating a new resource, as a blog posting like this one.
The obvious difference is that POSTs can (and usually should) change the state of something at the server side, while a GET should never do something like that. If you’re keen to SQL databases, GETs are just like “select” commands and POSTs like “insert” commands. Have you ever seen an “insert” returning a result set or a “select” inserting data? Neither me
But bear with me, it GETs even worse. Imagine that you’re the owner of that evil website that I said that just uses POSTs in it’s forms and one of those forms is a search form. Users will use it to search for your products and add them to their shopping carts. A user wants to buy the new AC/DC records but he’s not sure about it’s name, so he just types AC/DC and hits enter.
Voila!
There, at the top of the list, is “Black Ice”, their new record (Have you already bought yours?). He clicks on the link and while he’s viewing the CD page he remembers that he hasn’t bought the “Stiff Upper Lip” album. “Let me hit the back button and look for it too”, thinks the poor user and when he hit the button, the browser shows an interesting message:
“The browser will need to send data to the server to perform this action. Are you sure you want to do this?”
The user looks terrified to the message. “What have I done? Will they bill me for this? Are they going to send me the new Britney Spears album ‘cos I’m trying to hit the back button?”
As the HTTP protocol mandates, POSTs are not safe and the tools (usually, our browsers) should tell the user that something bad might happen if they try to POST by accident and that’s exactly what happens if you try to hit the back button after a POST. In this example, the user wouldn’t be doing anything wrong, but instead of coming back to a search page, he could be at a “add client” page and a “back” would make him re-create the last client he sent to the database, which isn’t really interesting.
Worse, if you’re using POST in a search form, they aren’t going to be able to use the back button (and the usability gurus say that it’s the most used feature in browsers) and they aren’t going to be able to bookmark the search results! Can you imagine something worse than that? You are keeping people from expressing their love for you website by posting it in their del.icio.us favorites!
Now, the reasoning is simple, if you’re not changing anything at the server side, you should always perform GETs. They don’t break the back button, they let the users bookmark their pages and they aren’t going to make the browser show the user any funny messages. If you’re changing state at the server side you should definitely use POST (and the other HTTP methods that are designed to change state, like PUT and DELETE), GET requests should NEVER change any state at the server side.
And before I forget, after every successful POST you should REDIRECT the user to a new page and not just render the page for him in response for the POST. Redirecting the user to the “response” page keeps the user from hitting the “back” button and re-entering the data they have already sent during the last POST.
What’s wrong with facebook apps
Facebook apps suck, they suck because I don’t want to change my facebook interface.
Sure I wouldn’t mind joining the tv show trivia game utilising my facebook info with some friends, provided I didn’t have to add a stupid application to my facebook. Why not let the developers just pull the data to their own webpages without requiring an app download(and a complicated screen to select how exactly I want to add the app). Facebook should just request consent and then redirect you to the other webpage utilising the facebook api.
They lose advertising minutes, but people already spend enough time looking at facebook and that isn’t the long term answer. If facebook locks developers in (because clients need facebook integration at the moment) then they get developer mindshare, which is the real value creator and the thing that the big boys are after(why do you think microsoft invested so much in .NET? and google has supposedly the coolest work environment on the planet for developers).
Powered by ScribeFire.