Over the years I’ve seen all sorts of SQL mistakes made by developers on my teams, and I’ve made quite a few of them myself. I’d like to share what I’ve learned with you, starting with security.
Security
Writing Raw SQL Queries
Even if everyone on your team is careful, sooner or later someone will inadvertently pass unsanitized user input to a query, opening the door to SQL injection attacks. Use an SQL query builder like Knex instead. Query builders let you build queries programmatically and automatically sanitize all inputs using prepared statements, while still allowing raw SQL for those rare occasions when you need it.
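For example, here is a minimal sketch with Knex, assuming a PostgreSQL connection and a hypothetical “customers” table; the user-supplied value is bound as a parameter instead of being concatenated into the SQL string:

```js
// A minimal sketch, assuming a configured Knex instance and a hypothetical "customers" table.
const knex = require('knex')({
  client: 'pg',
  connection: process.env.DATABASE_URL,
});

async function findCustomerByEmail(email) {
  // Knex generates "select * from customers where email = ?" and binds the value,
  // so the user input is never concatenated into the SQL string.
  return knex('customers').where({ email }).first();
}
```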
Not Using Prepared Statements
If you really have to run raw SQL queries, make sure you use prepared statements. Yes, it’s more work, but that’s one of the reasons we use SQL query builders in the first place.
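With Knex, for instance, you can pass bindings to knex.raw instead of interpolating values yourself. The table and columns below are hypothetical; the “?” placeholders are filled in by the driver, not by string concatenation:

```js
// A sketch of a raw query with parameter bindings (assumes a configured Knex instance).
async function findRecentOrders(knex, customerId, since) {
  return knex.raw(
    'select * from orders where customer_id = ? and created_at >= ?',
    [customerId, since]
  );
}
```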
Data Integrity
Missing Primary Keys
Primary keys serve as unique identifiers for each row in a database table. For example, each customer must have a unique, usually numeric, id. Other fields that uniquely identify a customer, such as e-mail addresses and phone numbers, might change, but a customer’s id never changes. If nothing relies on a table’s primary key for relationships, as with some many-to-many intermediary tables, you may be able to omit it, but those cases are few and far between. When in doubt, use primary keys.
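With Knex’s schema builder, for example, an auto-incrementing primary key is a single call (the “customers” table here is hypothetical):

```js
// A sketch of a table with an auto-incrementing integer primary key.
await knex.schema.createTable('customers', (table) => {
  table.increments('id'); // auto-incrementing primary key
  table.string('email').notNullable().unique();
  table.string('phone');
});
```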
Missing Foreign Keys
Foreign keys link data in one table to data in another table. Their primary purpose is to enforce relationships between tables. For example, a foreign key in an “Order” table may reference the primary key in a “Customer” table. When such a foreign key is in place, it’s not possible to delete a customer and leave their orders orphaned. The opposite is true as well: it’s not possible to add an order for a non-existent customer.
When you have a good set of foreign keys in your database, you know that your database has your back when your application attempts to do something it shouldn’t.
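As a sketch, an “orders” table could reference the hypothetical “customers” table from the previous example like this:

```js
// A sketch of a foreign key from orders.customer_id to customers.id.
await knex.schema.createTable('orders', (table) => {
  table.increments('id');
  table.integer('customer_id').unsigned().notNullable();
  table.foreign('customer_id').references('id').inTable('customers'); // no orphaned orders
  table.timestamp('created_at').defaultTo(knex.fn.now());
});
```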
Not Using Transactions
Transactions allow you to group multiple operations into a single logical unit of work. They are essential in dealing with complex operations that involve multiple steps. Either all steps in the transaction are completed successfully, or none of them are. For example, when placing an order, you may add an entry to the “Order” table and update the stock values of the products purchased in the “Product” table. Without transactions, if an error occurs after the order is created in the database, you end up with incorrect stock numbers.
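A minimal sketch with Knex, using hypothetical “orders” and “products” tables:

```js
// Either both statements commit or neither does.
await knex.transaction(async (trx) => {
  await trx('orders').insert({ customer_id: 42, product_id: 7, quantity: 2 });
  await trx('products').where({ id: 7 }).decrement('stock', 2);
  // If either statement throws, Knex rolls the whole transaction back.
});
```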
No ORDER BY
This one is relatively innocuous because it doesn’t directly affect data integrity, but queries without ORDER BY might return results in an unpredictable order. Developers often assume that query results always come back in the same order, but that isn’t guaranteed unless you explicitly use ORDER BY.
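A small sketch with Knex; the explicit orderBy makes the result order predictable:

```js
// Without orderBy, the row order is whatever the database happens to return.
const orders = await knex('orders')
  .where({ customer_id: 42 })
  .orderBy('created_at', 'desc'); // explicit, predictable ordering
```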
Performance
Inadequate Indexes
Having just a primary key in a table isn’t enough. As soon as you start using columns other than the primary key in WHERE conditions, the database has no index to fall back on, and performance might suffer greatly as a result. Queries that take milliseconds to complete might take seconds or even minutes. If you are having performance issues, the first place to look is indexes. Adding the right index can significantly improve an application’s performance, often by thousands of times.
It’s tempting to add indexes to every single column in a table, but there are drawbacks to that approach. Databases can’t always use multiple indexes at the same time. Also, they need to update every single index after every write operation (INSERT, UPDATE, DELETE). Finally, your data storage requirements can easily double or more, so it pays to be selective when adding indexes to a table.
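A targeted index on a column that actually shows up in WHERE conditions is usually the right trade-off. A sketch with Knex, again using the hypothetical “orders” table:

```js
// Add an index only for columns that appear in WHERE conditions.
await knex.schema.alterTable('orders', (table) => {
  table.index(['customer_id']); // speeds up queries filtering by customer_id
});
```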
Overusing SELECT *
Retrieving all columns using “SELECT *” may be convenient, but it can negatively impact performance. The database needs to retrieve and transfer entire rows, and your application needs to convert all incoming data to an internal format. These operations consume more CPU time, memory, and network bandwidth when there are many unnecessary columns included in the result set.
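For example, selecting only the columns the code actually uses:

```js
// Fetch just the needed columns instead of select *.
const customers = await knex('customers').select('id', 'email');
```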
Not Using LIMIT Clauses
When you need only a small subset of a result set, using LIMIT can greatly improve performance compared to retrieving the whole result and discarding the unnecessary parts in code.
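For example, fetching a single page of the newest orders instead of the whole table (the page size here is arbitrary):

```js
// Let the database return only the first 20 rows.
const page = await knex('orders')
  .orderBy('created_at', 'desc')
  .limit(20);
```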
Running SQL Queries Inside a Loop
It may not always be obvious which functions access the database to return a result. Database operations are inherently slow, and accidentally running SQL queries inside a loop can bring your application to a crawl. Converting the function to fetch all the rows it needs in a single query, using WHERE IN or some other method, will likely be much faster.
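A sketch of the difference, assuming a list of customer ids:

```js
// Slow: one round trip to the database per id.
// for (const id of customerIds) {
//   results.push(await knex('customers').where({ id }).first());
// }

// Faster: a single query for all ids.
const customers = await knex('customers').whereIn('id', customerIds);
```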
Using WHERE IN with a Large List of Values
While there are good use cases for WHERE IN, it’s a good idea to keep its limits in mind. When the list of values is large, some databases might not use indexes properly, and some also impose a limit on the number of values that can be passed.
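One common workaround is to split a very large list into smaller chunks; a sketch (the chunk size is an arbitrary example, not a universal limit):

```js
const CHUNK_SIZE = 1000; // arbitrary example value

async function findCustomersByIds(knex, ids) {
  const results = [];
  for (let i = 0; i < ids.length; i += CHUNK_SIZE) {
    const chunk = ids.slice(i, i + CHUNK_SIZE);
    results.push(...(await knex('customers').whereIn('id', chunk)));
  }
  return results;
}
```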
Running SQL Queries One by One
SQL queries are slow. Whenever queries don’t depend on each other’s results, run them in parallel. For example, in Node.js you can use Promise.all to run multiple SQL queries at the same time.
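A sketch of two independent queries running concurrently:

```js
// The queries don't depend on each other, so start both and wait for both.
const [customer, orders] = await Promise.all([
  knex('customers').where({ id: 42 }).first(),
  knex('orders').where({ customer_id: 42 }),
]);
```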
Too Many UPDATEs
Many relational databases don’t normally update rows in place; they create new rows instead and mark the old ones as “dead”. Dead rows are usually reclaimed by the database after some time. If you need to make a large number of updates, for example to increment product view counters, database performance might start to suffer. In cases like this, it’s a good idea to augment your relational database with a data store like Redis that can handle frequent updates.
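A sketch using the node “redis” client; the key name is hypothetical, and the counter can be flushed back to the relational database periodically by a batch job:

```js
const { createClient } = require('redis');

const redis = createClient();
await redis.connect();

// Redis handles high-frequency increments cheaply.
await redis.incr('product:42:views');
```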
Testing with Small Data Sets
A query may run just fine when there are 1K rows in a table, but what happens when there are 100K, 1M, or 10M rows? Critical parts of the code must be tested with realistic data sets.
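A sketch of seeding a table with a realistic number of rows before profiling, using Knex’s batchInsert and generating the rows in chunks to keep memory usage in check (the row counts and table name are hypothetical):

```js
// Seed one million rows, 10,000 at a time, inserting in batches of 1,000.
for (let start = 0; start < 1_000_000; start += 10_000) {
  const rows = Array.from({ length: 10_000 }, (_, i) => ({
    email: `customer${start + i}@example.com`,
  }));
  await knex.batchInsert('customers', rows, 1000);
}
```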
Using ORMs
Unlike SQL query builders, Object-Relational Mapping (ORM) frameworks, such as Hibernate in Java, generate SQL queries for you automatically to populate objects, and that convenience can add unacceptable overhead for some use cases. A seemingly simple query in an ORM can easily trigger dozens of side queries. It’s also difficult to optimize or customize the generated queries to achieve better performance.
Change Management
Not Using Migrations
Changes are made to the database schema over the lifetime of an application. Store every schema change in an immutable migration file so that changes can be applied and tracked automatically. Ad-hoc changes without explicit change management might lead to a lot of headaches during deployment.
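As a sketch, a Knex migration file pairs an up step with a down step so a change can be applied and rolled back consistently (the file name and column are hypothetical):

```js
// e.g. migrations/20240101120000_add_phone_to_customers.js
exports.up = function (knex) {
  return knex.schema.alterTable('customers', (table) => {
    table.string('phone');
  });
};

exports.down = function (knex) {
  return knex.schema.alterTable('customers', (table) => {
    table.dropColumn('phone');
  });
};
```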
That’s a wrap. If there are other common SQL mistakes I’ve missed, please let me know, and I will update the article.