Full-text Search (FTS) with PostgreSQL and SQLAlchemy

Amitosh Swain Mahapatra
4 min readApr 26, 2021

--

If you are building a text-heavy web platform such as a blog or a documentation platform, your users expect a search functionality. One of the simplest and effective way to build search experiences is to leverage the full-text search (FTS) capabilities exposed by your database.

Photo by Loverna Journey on Unsplash

In this blog, we will look at building a search system for a web app that indexes various videos. Each video has a title, description, and associated metadata. We want our system to be searchable over video title and description. We will be using SQLAlchemy as our ORM layer for the underlying PostgreSQL database.

PostgreSQL and FTS

PostgreSQL is a very capable RDBMS with batteries included. It comes with full-text functionality baked-in with tsquery and tsvector constructs.

Let’s build a table to store our videos.

CREATE TABLE video (
id PRIMARY KEY BIGSERIAL,
title VARCHAR(100) NOT NULL,
description TEXT NOT NULL,
published_at TIMESTAMP DEFAULT NOW()
);

The Video model will look like the following snippet.

class Video(db.Model):
id = db.Column(db.Integer(), primary_key=True,
autoincrement=True)
title = db.Column(db.String(255), nullable=False)
description = db.Column(db.Text(), nullable=False)
thumbnail_url = db.Column(db.String(2048), nullable=False)
published_at = db.Column(db.DateTime(), nullable=False)

To implement a basic search, we can use the SQL LIKE operator.

select id, title, description, published_at FROM video WHERE title LIKE %term% OR description LIKE %term%

To do that in SQLAlchemy,

results = Video.query.filter(Video.title.like(term) | Video.description.like(f'%{term}%')).all()

It’s a simple pattern matching. However, this is a start. For many purposes like username autocomplete, this is enough.

However, they have several shortcomings when we try to build a practical search solution.

  • There is no language awareness. It is not practically feasible to search based on synonyms and derived words using patterns.
  • It is difficult to provide a ranking of search results, which renders them ineffective when thousands of matches are found.
  • They do not support indexes and hence are slow.

Enter tsquery and tsvector

A tsvector is a sorted list of distinct lexemes, which have been normalized to merge different variants of the same word. It is a form optimized for search. It is used to store the target contents in a search system.

A tsquery is a set of lexemes meant to be searched against a group of vectors. They are used to represent search queries and can use boolean operators to combine the match effect.

Let’s make the above search a little more efficient.

select id, title, description, published_at FROM video WHERE to_tsvector(title || ' ' || description) @@ to_tsquery(term)

This does not have a direct SQLAlchemy equivalent, but you can perform full-text searches for a single column using matchfunction.

results = Videos.query.filter(Video.description.match(term)).all()

First, this query concatenates the title and description columns. It then converts the result to a tsvector and matches it against the search term. This gives good results, but given an average-sized database, such ad-hoc searches will not provide acceptable results.

Inverted Indices, TSVector Columns, and Generated Columns

To speed up the search, we need to pre-compute the tsvectors and store them in a gin index. A GiN or Generalized Inverted Index is a type of index used for full-text search. Like the “glossary” pages at the end of the book, an inverted index says which term is present in which documents.

PostgreSQL provides us with a column type called tsvector which we will be using to store our pre-computed search document. We will use generated columns which will automatically populate this column whenever we insert a new row. The contents will be to_tsvector('english', title || ' ' || description)

The english is essential as it enables PostgreSQL to generate a predictable tsvector every time. However, this also means that you will be limited in handling English documents only. To work around this limitation, you will need to build your own indexing mechanism, which is out of the scope of this article.

So our final alter table looks like this:

ALTER TABLE video ADD COLUMN __ts_vector__ tsvector GENERATED ALWAYS AS (to_tsvector('english', title || ' ' || description)) STORED;

SQLAlchemy natively does not support tsvector so we need to create a custom type called TSVector, as shown in the following snippet.

import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import TSVECTOR
class TSVector(sa.types.TypeDecorator):
impl = TSVECTOR

TypeDecorator is an abstract class with few abstract methods, your IDE and/or linter will warn about unimplemented methods, but you can safely ignore them. Since we will not be manipulating TSVector columns directly from Python, we can leave those undefined.

Now we will use the TSVector type in your Video model.

from sqlalchemy import desc, Indexfrom .ts_vector import TSVector
from ..db import db
class Video(db.Model):
id = db.Column(db.Integer(), primary_key=True,
autoincrement=True)
title = db.Column(db.String(255), nullable=False)
description = db.Column(db.Text(), nullable=False)
thumbnail_url = db.Column(db.String(2048), nullable=False)
published_at = db.Column(db.DateTime(), nullable=False)
__ts_vector__ = db.Column(TSVector(),db.Computed(
"to_tsvector('english', title || ' ' || description)",
persisted=True))
__table_args__ = (Index('ix_video___ts_vector__',
__ts_vector__, postgresql_using='gin'),)

We have defined a new column __ts_vector__, auto-generated from the title and description.

We can now use this with the match function to perform full-text searches. By default, it will be ordered by rank (relevance). However, you can add additional ORDER BY clauses as well.

results = Video.query.filter(Video.__ts_vector__.match(term)).all()

PostgreSQL full-text search is quite powerful and is sufficient for many use cases. It also allows further customization, such as weighted columns and keywords. Head over to the documentation to learn more. For more advanced use-cases such as high-concurrency or analytics, you may need to invest in a dedicated search setup such as Elasticsearch or a SaaS such as Algolia.

You can find the demo project using PostgreSQL, SQLAlchemy and implementing full-text search at

--

--