Indexing and Performance Optimization

This article will dive deep into the world of database indexing and performance optimization. It will also look at how to ensure that your database stays quick as you scale, using the performance tips in this article.

Don't worry if you find some concepts challenging at first. Database optimization is a complex topic that even experienced developers continue to learn about throughout their careers. We'll take it slow, build your knowledge gradually. Revisit this chapter whenever you need to review it to optimize your own database.

In this section we'll use an expanded version of our bookstore database to illustrate more complex scenarios:

CREATE TABLE Authors (
    AuthorID INT PRIMARY KEY,
    Name VARCHAR(100),
    BirthDate DATE,
    Nationality VARCHAR(50),
    Biography TEXT
);

CREATE TABLE Books (
    BookID INT PRIMARY KEY,
    Title VARCHAR(200),
    AuthorID INT,
    ISBN VARCHAR(13),
    Price DECIMAL(10, 2),
    PublicationYear INT,
    Genre VARCHAR(50),
    Description TEXT,
    CoverImageURL VARCHAR(255),
    FOREIGN KEY (AuthorID) REFERENCES Authors(AuthorID)
);

CREATE TABLE Sales (
    SaleID INT PRIMARY KEY,
    BookID INT,
    CustomerID INT,
    SaleDate DATE,
    Quantity INT,
    TotalAmount DECIMAL(10, 2),
    FOREIGN KEY (BookID) REFERENCES Books(BookID)
);

CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    Name VARCHAR(100),
    Email VARCHAR(100),
    RegistrationDate DATE
);

-- Let's add some sample data
INSERT INTO Authors (AuthorID, Name, BirthDate, Nationality) VALUES 
(1, 'J.K. Rowling', '1965-07-31', 'British'),
(2, 'George Orwell', '1903-06-25', 'British'),
(3, 'Jane Austen', '1775-12-16', 'British');

INSERT INTO Books (BookID, Title, AuthorID, ISBN, Price, PublicationYear, Genre) VALUES
(1, 'Harry Potter and the Philosopher''s Stone', 1, '9780747532743', 15.99, 1997, 'Fantasy'),
(2, '1984', 2, '9780451524935', 12.99, 1949, 'Dystopian'),
(3, 'Pride and Prejudice', 3, '9780141439518', 9.99, 1813, 'Romance'),
(4, 'Animal Farm', 2, '9780451526342', 11.99, 1945, 'Satire');

INSERT INTO Customers (CustomerID, Name, Email, RegistrationDate) VALUES
(1, 'Alice Johnson', 'alice@email.com', '2023-01-15'),
(2, 'Bob Smith', 'bob@email.com', '2023-02-20'),
(3, 'Charlie Brown', 'charlie@email.com', '2023-03-10');

INSERT INTO Sales (SaleID, BookID, CustomerID, SaleDate, Quantity, TotalAmount) VALUES
(1, 1, 1, '2023-04-01', 1, 15.99),
(2, 2, 2, '2023-04-15', 2, 25.98),
(3, 3, 3, '2023-05-01', 1, 9.99),
(4, 1, 2, '2023-05-15', 1, 15.99);

This setup gives us an excellent foundation to work with as we explore indexing and optimization. Now, let's dive in!

1. Understanding Indexes: The Library Catalog Analogy

What is an Index?

Imagine you're in a vast library with millions of books. You're looking for a specific book about SQL databases. How would you find it? You have two options:

a) Start at one end of the library and look at every book until you find the one you want. b) Use the library's catalog system to locate the exact shelf where your book is stored quickly.

In database terms, the first method is like a "full table scan," while the second method is like using an index. An index in a database is very similar to the index in a book or a library's catalog system. It's a separate data structure that allows the database to find specific rows quickly without looking at every row in the table.

Let's break this down further:

  • Without an index: The database has to perform a "full table scan". It starts at the table's beginning and reads every row, checking if it matches your criteria. This is like searching for a book by walking through the entire library and reading the title of every book.

  • With an index: The database can use the index to locate the relevant rows quickly. This is like using the library's catalog to find the section and shelf where your book is located and then going directly to that shelf.

How Do Indexes Work?

When you create an index on a column (or set of columns), the database creates a separate structure that contains two things:

  1. The values from the indexed column(s)
  2. Pointers to the corresponding rows in the table

This structure is sorted, which allows for fast searching, usually through an algorithm like Binary Search.

Let's use our Books table as an example. Say we frequently search for books by their publication year. Without an index, if we wanted to find all books published in 1997, the database would need to check the PublicationYear of every single book in the Books table.

SELECT * FROM Books WHERE PublicationYear = 1997;

With millions of books, this could take a long time.

Now, let's create an index on the PublicationYear column:

CREATE INDEX idx_publication_year ON Books (PublicationYear);

After creating this index, the database builds a sorted structure that might look something like this (simplified for illustration):

PublicationYear | Pointer
-----------------------------
1813            | (pointer to 'Pride and Prejudice' row)
1945            | (pointer to 'Animal Farm' row)
1949            | (pointer to '1984' row)
1997            | (pointer to 'Harry Potter' row)

Now, when you run the same query to find books from 1997, the database can:

First it will quickly find 1997 in this sorted list (using binary search) second, it will then use the pointer to go directly to the correct row in the Books table

This is much faster than checking every row, especially as your table grows.

Types of Indexes

There are several types of indexes, each with its use cases. Let's explore them:

Single-Column Indexes: These are created on just one column. For example, our index on PublicationYear is a single-column index.

CREATE INDEX idx_publication_year ON Books (PublicationYear);

This is useful when you frequently search or sort by this one column.

Multi-Column Indexes: These are created on two or more columns. They're useful for queries that frequently filter or sort by multiple columns.

CREATE INDEX idx_author_year ON Books (AuthorID, PublicationYear);

This index would be useful for queries like:

SELECT * FROM Books WHERE AuthorID = 1 AND PublicationYear > 2000;

The database can use this index to quickly find books by a specific author from a specific year range.

Unique Indexes: These ensure that the indexed columns do not contain any duplicate values. The primary key of a table automatically has a unique index.

CREATE UNIQUE INDEX idx_unique_isbn ON Books (ISBN);

This prevents any two books from having the same ISBN, which makes sense as ISBN is supposed to be unique for each book.

Partial Indexes: These index only a subset of a table's data. They're useful when you frequently query for a specific subset of your data.

CREATE INDEX idx_recent_books ON Books (PublicationYear) WHERE PublicationYear > 2000;

This index only includes books published after 2000. It's useful if you often query for recent books but rarely for older ones.

Clustered vs. Non-Clustered Indexes: In some database systems, you can specify whether an index is clustered or non-clustered.

  • A clustered index determines the physical order of data in a table. There can be only one clustered index per table.
  • Non-clustered indexes have a structure separate from the data.

In PostgreSQL, all indexes are non-clustered, but you can use the CLUSTER command to physically reorder a table based on an index.

The Cost of Indexes

While indexes can dramatically speed up data retrieval, they come with some costs:

Storage: Each index requires additional storage space. It's essentially creating a copy of the indexed data plus the row pointers. For large tables, this can add up to a significant amount of extra storage.

Write Performance: When you insert, update, or delete data in an indexed column, the database also needs to update the index. This means write operations (INSERT, UPDATE, DELETE) can be slower on tables with many indexes.

For example, suppose you have indexes on Title, AuthorID, PublicationYear, and Price, every time you insert a new book. In that case, the database needs to update all these indexes in addition to the main table.

Maintenance: As your data changes over time, indexes may become fragmented or unbalanced, which can reduce their effectiveness. Periodically, indexes may need to be rebuilt or reorganized to maintain their efficiency.

Query Optimizer Overhead: The database's query optimizer must consider all available indexes when planning to execute a query. Having too many indexes can actually slow down this planning phase.

Therefore, it's important to create indexes judiciously, focusing on columns that are frequently used in search conditions, joins, or sorting operations. Don't just create indexes on every column "just in case" – each index should serve a specific purpose based on your common query patterns.

Creating and Using Indexes

We've seen the syntax in examples previously but it wasn't explained. So now that we understand what indexes are and how they work, let's look at how to create and use them effectively.

Creating an Index

You create an index in SQL using the CREATE INDEX statement. Here's the basic syntax:

CREATE INDEX index_name ON table_name (column1, column2, ...);

Let's create some indexes on our bookstore database:

-- Single-column index on PublicationYear
CREATE INDEX idx_books_year ON Books (PublicationYear);

-- Multi-column index on AuthorID and PublicationYear
CREATE INDEX idx_books_author_year ON Books (AuthorID, PublicationYear);

-- Unique index on ISBN
CREATE UNIQUE INDEX idx_books_isbn ON Books (ISBN);

-- Partial index on recent books
CREATE INDEX idx_recent_books ON Books (PublicationYear) WHERE PublicationYear > 2000;

After creating these indexes, let's break down what each one does:

  1. idx_books_year: This index will speed up any queries that search or sort by PublicationYear.

  2. idx_books_author_year: This index will be useful for queries that filter or sort by both AuthorID and PublicationYear. The order of columns in a multi-column index matters. This index will be most effective for queries that filter by AuthorID first, then PublicationYear.

  3. idx_books_isbn: This unique index ensures that no two books can have the same ISBN. It will also speed up searches by ISBN.

  4. idx_recent_books: This partial index only includes books published after the year 2000. It will be smaller than a full index on PublicationYear and can be more efficient for queries that only care about recent books.

When to Use Indexes

Knowing when to create an index is as important as knowing how. Here are some guidelines:

Columns used in WHERE clauses: If you often search for books by their publication year, indexing the PublicationYear column can help. For example:

SELECT * FROM Books WHERE PublicationYear = 2023;

With the idx_books_year index, this query can quickly find the relevant rows without scanning the entire table.

Columns used in JOIN conditions: If you frequently join the Books and Authors tables on AuthorID, indexing this column in both tables can speed up these operations. For example:

SELECT Books.Title, Authors.Name 
FROM Books 
JOIN Authors ON Books.AuthorID = Authors.AuthorID;

Having indexes on Books.AuthorID and Authors.AuthorID can significantly speed up this join operation.

Columns used in ORDER BY clauses: If you often sort books by price, an index on the Price column can help. For example:

SELECT * FROM Books ORDER BY Price DESC;

An index on Price would allow the database to retrieve the rows in sorted order without having to perform a separate sorting operation.

Columns with a high degree of uniqueness: Indexing a column like ISBN, or any unique IDs for each book, can be very effective. Searches on this column can quickly narrow down to a single row:

SELECT * FROM Books WHERE ISBN = '9780747532743';

Foreign key columns: These are often used in joins, so indexing them can improve performance. In our schema, Books.AuthorID is a foreign key to Authors.AuthorID, so both columns are good candidates for indexing.

However, remember:

  • Don't create indexes on columns that are updated frequently. The overhead of updating the index might outweigh the benefits.
  • Small tables (less than a few hundred rows) might not benefit much from indexing, as a full table scan might be just as fast.
  • An index might not be helpful if a column has very low selectivity (i.e., most queries retrieve a large percentage of the rows). For example, if you have a Gender column with only 'M' and 'F' values, an index probably won't help much.

Using Indexes Effectively

Creating indexes is only part of the story. To get the most benefit from your indexes, you need to write your queries in a way that can utilize them effectively. Here are some tips:

Be specific in your WHERE clauses: The more specific your WHERE clause, the more an index can help. For example:

-- Less efficient
SELECT * FROM Books WHERE PublicationYear > 2000;

-- More efficient
SELECT * FROM Books WHERE PublicationYear = 2023;

The second query can use an index on PublicationYear more effectively because it's looking for an exact match.

Use covering indexes when possible: A covering index is one that includes all the columns needed by a query. This allows the database to satisfy the query using only the index, without having to access the table itself. For example:

CREATE INDEX idx_books_year_price ON Books (PublicationYear, Price);

SELECT PublicationYear, Price FROM Books WHERE PublicationYear = 2023;

This query can be satisfied entirely from the index, without needing to access the Books table.

Be aware of the order of columns in multi-column indexes: In a multi-column index, the leftmost column(s) can be used independently, but the same is not true for the rightmost column(s). For example, with an index on (AuthorID, PublicationYear):

-- Can use the index
SELECT * FROM Books WHERE AuthorID = 1;

-- Can use the index
SELECT * FROM Books WHERE AuthorID = 1 AND PublicationYear = 2023;

-- Cannot use the index effectively
SELECT * FROM Books WHERE PublicationYear = 2023;

In the last query, the database can't use the index effectively because PublicationYear is not the leftmost column in the index.

Avoid using functions on indexed columns in WHERE clauses: This can prevent the use of indexes. For example:

-- Can use an index on PublicationYear
SELECT * FROM Books WHERE PublicationYear = 2023;

-- Cannot use an index on PublicationYear
SELECT * FROM Books WHERE YEAR(PublicationDate) = 2023;

In the second query, the function YEAR() prevents the use of an index on PublicationDate.

Be cautious with wildcard characters at the beginning of LIKE patterns: Using a wildcard at the beginning of a LIKE pattern (e.g., '%Smith') prevents the use of indexes. For example:

-- Can use an index on LastName
SELECT * FROM Authors WHERE LastName LIKE 'Smith%';

-- Cannot use an index on LastName
SELECT * FROM Authors WHERE LastName LIKE '%Smith';

The second query would require a full table scan, regardless of whether an index exists on LastName.

Query Optimization Techniques

While indexes are a powerful tool for optimization, there are many other techniques you can use to improve query performance. Let's explore some of these:

Use Specific Column Names

Instead of using SELECT *, specify only the columns you need. This reduces the amount of data that needs to be retrieved and transmitted.

-- Less efficient
SELECT * FROM Books WHERE AuthorID = 1;

-- More efficient
SELECT Title, Price FROM Books WHERE AuthorID = 1;

In the second query, if there's an index that covers both Title and Price, the database might be able to satisfy the query using only the index, without having to access the table data at all.

Optimize JOINs

Joins can be expensive operations, especially when dealing with large tables. Here are some tips for optimizing joins:

a) Use appropriate join types: Make sure you're using the right type of join for your query. INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN all have different use cases.

b) Join on indexed columns: Ensure that the columns you're joining on are indexed in both tables.

c) Consider the order of joins: In complex queries with multiple joins, the order of the joins can affect performance. Generally, start with the largest tables and join to progressively smaller ones.

Here's an example of an optimized join:

SELECT Books.Title, Authors.Name, Sales.SaleDate
FROM Books
INNER JOIN Authors ON Books.AuthorID = Authors.AuthorID
LEFT JOIN Sales ON Books.BookID = Sales.BookID
WHERE Books.PublicationYear > 2000
ORDER BY Sales.SaleDate DESC
LIMIT 10;

In this query:

  • We're using INNER JOIN between Books and Authors because we only want books with known authors.
  • We're using LEFT JOIN with Sales because we want all books, even those without sales.
  • We've added indexes on Books.AuthorID, Authors.AuthorID, Books.BookID, and Sales.BookID to speed up the join operations.
  • We've added an index on Books.PublicationYear for the WHERE clause.
  • We've added an index on Sales.SaleDate for the ORDER BY clause.

Use LIMIT to Restrict Results

If you don't need all results, use LIMIT to restrict the number of rows returned. This can significantly reduce the amount of data processed and returned.

SELECT * FROM Books ORDER BY Price DESC LIMIT 10;

This query will stop when it finds the top 10 most expensive books rather than sorting the entire table.

Use EXISTS Instead of IN for Subqueries

Generally, EXISTS is more efficient than IN for large datasets. EXISTS stops when it finds a match, while IN evaluates the entire subquery.

-- Less efficient for large datasets
SELECT * FROM Authors WHERE AuthorID IN (SELECT AuthorID FROM Books WHERE Price > 20);

-- More efficient for large datasets
SELECT * FROM Authors WHERE EXISTS (SELECT 1 FROM Books WHERE Books.AuthorID = Authors.AuthorID AND Price > 20);

Avoid Correlated Subqueries

Correlated subqueries can be slow because they must be re-executed for each row in the main query. Often, they can be rewritten as joins.

-- Correlated subquery (less efficient)
SELECT Title FROM Books b
WHERE Price > (SELECT AVG(Price) FROM Books WHERE AuthorID = b.AuthorID);

-- Join (more efficient)
SELECT b.Title
FROM Books b
JOIN (SELECT AuthorID, AVG(Price) as AvgPrice FROM Books GROUP BY AuthorID) a
ON b.AuthorID = a.AuthorID
WHERE b.Price > a.AvgPrice;

Optimize GROUP BY Queries

When using GROUP BY, ensure you have indexes on the grouped columns. Also, if you're grouping by multiple columns, the order of columns in your index should match the order in your GROUP BY clause for optimal performance.

CREATE INDEX idx_author_genre ON Books (AuthorID, Genre);

SELECT AuthorID, Genre, AVG(Price) as AvgPrice
FROM Books
GROUP BY AuthorID, Genre;

Explaining and Analyzing Queries

Understanding how your database executes queries is crucial for optimization. Most database systems provide tools to analyze query performance. In PostgreSQL, you can use the EXPLAIN command to see how a query will be executed.

Basic EXPLAIN

Let's start with a simple query:

EXPLAIN SELECT * FROM Books WHERE PublicationYear = 2023;

This might produce output like:

                          QUERY PLAN
---------------------------------------------------------------
 Seq Scan on books  (cost=0.00..17.50 rows=4 width=160)
   Filter: (publicationyear = 2023)

Let's break this down:

  • Seq Scan on books: This means the database scans the entire books table sequentially.
  • cost=0.00..17.50: This estimates how expensive the operation is. The first number (0.00) is the startup cost, and the second (17.50) is the total cost.
  • rows=4: The database estimates it will return 4 rows.
  • width=160: This is the estimated average width of each row in bytes.
  • Filter: (publicationyear = 2023): This is the condition being applied to each row.

EXPLAIN ANALYZE

For more detailed information, including actual execution times, you can use EXPLAIN ANALYZE:

EXPLAIN ANALYZE SELECT * FROM Books WHERE PublicationYear = 2023;

This might produce:

                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Seq Scan on books  (cost=0.00..17.50 rows=4 width=160) (actual time=0.019..0.021 rows=3 loops=1)
   Filter: (publicationyear = 2023)
   Rows Removed by Filter: 7
 Planning Time: 0.083 ms
 Execution Time: 0.045 ms

This gives us additional information:

  • actual time=0.019..0.021: The actual time (in milliseconds) it took to start returning rows and to finish.
  • rows=3: The actual number of rows returned.
  • Rows Removed by Filter: 7: How many rows were eliminated by the WHERE condition.
  • Planning Time and Execution Time: How long it took to create and execute the query plan.

Interpreting Query Plans

When looking at query plans, here are some things to watch out for:

  1. Seq Scan: If you see this for queries that should be using an index, you might need to create an index or optimize your query.

  2. Index Scan or Index Only Scan: These mean the database is using an index, which is generally faster for selective queries.

  3. Bitmap Heap Scan: This is often used for OR conditions or when the database expects to return a larger portion of the table.

  4. Nested Loop, Hash Join, Merge Join: These are different methods of joining tables. The choice depends on the size of the tables and the join conditions.

  5. Cost and Rows: Your query might need optimization if these numbers are much higher than you expect.

Remember, query optimization often involves trade-offs. A change that speeds up one query might slow down others. For every optimization there is potential pitfalls so try to think what is optimal for the situation.

Practice Exercises

Now that we've covered these concepts in depth let's put them into practice:

  1. Create an index on the Price column of the Books table. Then, write a query to find all books priced above $15, and use EXPLAIN to see if your index is being used.

  2. Write a query to find the top 5 most expensive books along with their authors' names. Use EXPLAIN ANALYZE to examine its performance. Then, try to optimize it by adding appropriate indexes.

  3. Create a multi-column index on (Genre, Price) in the Books table. Then, write a query to find the average price of 'Mystery' books, and use EXPLAIN to see if your index is being used.

  4. Write a query to find all authors who have written books in multiple genres. Use EXPLAIN ANALYZE to examine its performance and then try to optimize it.

  5. Create a query that joins the Books, Sales, and Customers tables to find the top 3 customers who have spent the most on books. Use EXPLAIN ANALYZE to examine its performance and optimize it as much as possible.

Remember, query optimization is often a process of trial and error. Don't be afraid to experiment with different indexes and query structures to see what works best for your specific use case. The key is to understand the tools at your disposal and how to interpret the results of your experiments.

In our next article, we'll explore another extremely useful feature of SQL databases: "Transactions."

PostgresqlSqlDatabase
Avatar for Niall Maher

Written by Niall Maher

Founder of Codú - The web developer community! I've worked in nearly every corner of technology businesses: Lead Developer, Software Architect, Product Manager, CTO, and now happily a Founder.

Loading

Fetching comments

Hey! 👋

Got something to say?

or to leave a comment.