Full-text Search (FTS) with PostgreSQL and SQLAlchemy
If you are building a text-heavy web platform such as a blog or a documentation platform, your users expect a search functionality. One of the simplest and effective way to build search experiences is to leverage the full-text search (FTS) capabilities exposed by your database.
In this blog, we will look at building a search system for a web app that indexes various videos. Each video has a title, description, and associated metadata. We want our system to be searchable over video title and description. We will be using SQLAlchemy as our ORM layer for the underlying PostgreSQL database.
PostgreSQL and FTS
PostgreSQL is a very capable RDBMS with batteries included. It comes with full-text functionality baked-in with tsquery
and tsvector
constructs.
Let’s build a table to store our videos.
CREATE TABLE video (
id PRIMARY KEY BIGSERIAL,
title VARCHAR(100) NOT NULL,
description TEXT NOT NULL,
published_at TIMESTAMP DEFAULT NOW()
);
The Video
model will look like the following snippet.
class Video(db.Model):
id = db.Column(db.Integer(), primary_key=True,
autoincrement=True)
title = db.Column(db.String(255), nullable=False)
description = db.Column(db.Text(), nullable=False)
thumbnail_url = db.Column(db.String(2048), nullable=False)
published_at = db.Column(db.DateTime(), nullable=False)
To implement a basic search, we can use the SQL LIKE
operator.
select id, title, description, published_at FROM video WHERE title LIKE %term% OR description LIKE %term%
To do that in SQLAlchemy,
results = Video.query.filter(Video.title.like(term) | Video.description.like(f'%{term}%')).all()
It’s a simple pattern matching. However, this is a start. For many purposes like username autocomplete, this is enough.
However, they have several shortcomings when we try to build a practical search solution.
- There is no language awareness. It is not practically feasible to search based on synonyms and derived words using patterns.
- It is difficult to provide a ranking of search results, which renders them ineffective when thousands of matches are found.
- They do not support indexes and hence are slow.
Enter tsquery
and tsvector
A tsvector
is a sorted list of distinct lexemes, which have been normalized to merge different variants of the same word. It is a form optimized for search. It is used to store the target contents in a search system.
A tsquery
is a set of lexemes meant to be searched against a group of vectors. They are used to represent search queries and can use boolean operators to combine the match effect.
Let’s make the above search a little more efficient.
select id, title, description, published_at FROM video WHERE to_tsvector(title || ' ' || description) @@ to_tsquery(term)
This does not have a direct SQLAlchemy equivalent, but you can perform full-text searches for a single column using match
function.
results = Videos.query.filter(Video.description.match(term)).all()
First, this query concatenates the title
and description
columns. It then converts the result to a tsvector
and matches it against the search term. This gives good results, but given an average-sized database, such ad-hoc searches will not provide acceptable results.
Inverted Indices, TSVector Columns, and Generated Columns
To speed up the search, we need to pre-compute the tsvectors
and store them in a gin
index. A GiN or Generalized Inverted Index is a type of index used for full-text search. Like the “glossary” pages at the end of the book, an inverted index says which term is present in which documents.
PostgreSQL provides us with a column type called tsvector
which we will be using to store our pre-computed search document. We will use generated columns which will automatically populate this column whenever we insert a new row. The contents will be to_tsvector('english', title || ' ' || description)
The english
is essential as it enables PostgreSQL to generate a predictable tsvector
every time. However, this also means that you will be limited in handling English documents only. To work around this limitation, you will need to build your own indexing mechanism, which is out of the scope of this article.
So our final alter table looks like this:
ALTER TABLE video ADD COLUMN __ts_vector__ tsvector GENERATED ALWAYS AS (to_tsvector('english', title || ' ' || description)) STORED;
SQLAlchemy natively does not support tsvector
so we need to create a custom type called TSVector
, as shown in the following snippet.
import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import TSVECTORclass TSVector(sa.types.TypeDecorator):
impl = TSVECTOR
TypeDecorator
is an abstract class with few abstract methods, your IDE and/or linter will warn about unimplemented methods, but you can safely ignore them. Since we will not be manipulating TSVector
columns directly from Python, we can leave those undefined.
Now we will use the TSVector
type in your Video
model.
from sqlalchemy import desc, Indexfrom .ts_vector import TSVector
from ..db import dbclass Video(db.Model):
id = db.Column(db.Integer(), primary_key=True,
autoincrement=True)
title = db.Column(db.String(255), nullable=False)
description = db.Column(db.Text(), nullable=False)
thumbnail_url = db.Column(db.String(2048), nullable=False)
published_at = db.Column(db.DateTime(), nullable=False) __ts_vector__ = db.Column(TSVector(),db.Computed(
"to_tsvector('english', title || ' ' || description)",
persisted=True)) __table_args__ = (Index('ix_video___ts_vector__',
__ts_vector__, postgresql_using='gin'),)
We have defined a new column __ts_vector__
, auto-generated from the title
and description
.
We can now use this with the match
function to perform full-text searches. By default, it will be ordered by rank (relevance). However, you can add additional ORDER BY
clauses as well.
results = Video.query.filter(Video.__ts_vector__.match(term)).all()
PostgreSQL full-text search is quite powerful and is sufficient for many use cases. It also allows further customization, such as weighted columns and keywords. Head over to the documentation to learn more. For more advanced use-cases such as high-concurrency or analytics, you may need to invest in a dedicated search setup such as Elasticsearch or a SaaS such as Algolia.
You can find the demo project using PostgreSQL, SQLAlchemy and implementing full-text search at