Index
Index is a data structure, which is separate from the heap, and it has pointers to the heap. It contains part of our actual data, allowing us to search for something. A single table might index one or more columns. Thanks to the index, we might avoid the SCAN operation on the heap. So, index is like a list of pointers to pages on the heap. When an Index Scan is performed, the DB goes to the index, and then to the heap. The data from the heap (pages) might contain more rows than the ones that satisfy our query. Therefore, the retrieved pages need to be evaluated to discard the wrong rows.
Index is most popularly implemented as a B-Tree, although there are also other data structures possible.
Small index is better for performance, because such index might live in memory.
Primary key usually has an index created automatically for it. It also usuablly becomes a clustered index. A clustered index is a special case of indexing where the index is actually stored within the table itself. It’s done by making the table ordered with the primary index. So, anytime you want to use the clusted index, you actually go directly into the table, since it’s already sorted by that index. Anytime you insert some data into the table, it gets inserted in the right place for the clustered index to maintain the order.
Having an index does not mean that it will always be used in queries that involve the indexed column. It really depends on the cost analysis that the database planner prepares. Sometimes, it might be cheaper to actually go directly to the table and scan. An example of that would be indexing a “Name” column, and then invoking something like:
In this case, the index would have to be scanned as a whole, and since we also need other columns, it’s probably more efficient to go to the table directly, skipping the index data structure.
Multiple Column Indexes
Index might contain data of other columns than the indexed one(s), and it might help avoid going to the table when all we need in the result are the columns that are part of that index (Index Only Scan). Here’s how to create a multi-column index in Postgres:
We can also ceate a composed index that is built from multiple columns. We define such indices from left to right, e.g.:
Such an index will be useful when we have a condition on column a, or on both a and b (a OR b
will
not work). It will not be useful when we have a condition just on b. In such case, the planner
will probably decide to just scan the table. This is because, the index is ordered by
a first, and only then by b.
Scans
Index Scan
When we execute a query that includes some condition on an indexed column, and the DB’s analysis engine decides to use an index, there’s going to be an Index Scan. The DB will look into the index, and as soon as the value satisfies the condition, it will go to the heap to acquire the appropriate page(s).
Bitmap Index Scan
When the query may use the index, and we’re looking for some condition that wil
potentially draw multiple rows (e.g., SELECT * WHERE age > 30
, assuming that
age
is indexed), a bitmap will be used to store the pages that we’ll have to
look at. DB will not go into the heap per row. Instead, it will first scan the
index, collect the pages (in bitmap), and then collect all the pages it needs.
There could be queries that involve multiple indices. In such case, it’d build two bitmaps. Then, it’d just AND them together, creating a single bitmap.
Index Only Scan
When the query asks only for the data that is contained within the index, the DB does not have to even go to the heap. It will just scan the index, and build the response based on that. It’s the fastest possible query kind.
Sequential Table Scan
When index is not used for the query (e.g., when there’s no index, or the query would not benefit from indexing), the DB is going to scan the table for the results. It’s going to look at each row one by one. It’s the worst kind of query.
Planner Heuristics
The existence of an index does not guarantee that it wil be used for the queries. Even when the index is created on the column that we’ve included in our condition, in some cases it might be “ignored”. It could be that the planner incurrs that the result will probably require it to go and fetch most of the pages anyway, so it will not even bother going to the index, expecially if it’d require to load the index from the disk. A similar case would be when we have very small amount of data. Any query would require the DB engine to fetch the whole page, so it doesn’t make sense to consult the index first.