

PostgreSQL and Rails, sitting in a tree
source link: https://evilmartians.com/chronicles/postgresql-and-rails-sitting-in-a-tree

The Closure Table pattern has something that other tree storage methods don’t—a good design. Let’s see why it rocks.
Some time ago I was involved in a project aiming to systemize a pandemonium of databases. Especially hard were the toponym ones. The source data was a deep XML tree with regions and cities at the top and streets and houses at the bottom.
| id | parent_id | name       |
|----|-----------|------------|
| 1  | 2         | Kremlin    |
| 2  |           | Moscow     |
| 3  | 2         | Red Square |
| …  | …         | …          |

*(1M+ unsorted records)*
I had to put this tree into PostgreSQL to allow queries. A typical query was, for example, to fetch all the streets of the cities within a specific region or to fetch the whole path from a particular street at the bottom up to the root node, etc.
Since there were quite a few queries to make, it was a must to process them as fast as possible. Therefore, I needed a hierarchical storage pattern to speed them up.
So I did the following:
- Imported the table;
- Chose the suitable hierarchical pattern;
- Calculated the extra reference data used by the chosen pattern.
At each of the three stages, speed was what mattered. I wanted everything done in next to no time: few things are worse than a server that stops responding, and downtime was unacceptable.
Importing the initial data
In PostgreSQL, a table is best populated with the `COPY FROM` statement: it works significantly faster than multiple `INSERT` statements. Make sure to prepare the input data for `COPY FROM`; it accepts either CSV or binary input (a file or a string), and binary is faster.
The pg_data_encoder gem can be used to convert data into the binary format PostgreSQL expects. It is designed to work with postgres-copy, but I prefer Sequel: it is very lightweight compared to ActiveRecord and lets you work with tables directly, without going through the application's model layer.
The import script might look like this:
```ruby
# import.rb
require "sequel"
require "pg_data_encoder"

# Connects to the database configured in Rails' database.yml
db = Sequel.connect(ActiveRecord::Base.connection_config)
encoder = PgDataEncoder::EncodeForCopy.new

# Suppose the data have already been read.
# In my case, I had a set of DBFs (doh! in 2015!).
[
  [1, 2, "Kremlin"],
  [2, nil, "Moscow"],
  [3, 2, "Red Square"]
  # …
].each { |line| encoder.add(line) }

db.copy_into(
  "tax_authorities",
  columns: %i[id parent_id name],
  format: :binary,
  data: encoder.get_io
)

# Run it with something like:
# > bundle exec rails runner import.rb
```
If your data set does not fit in memory, you have two options:
- Import incrementally, part by part (a read-copy loop), as sketched below.
- Pass `use_tempfile: true` as an option to `PgDataEncoder::EncodeForCopy.new` to enable file buffering.

Incremental import might be faster.
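Here is a minimal sketch of the read-copy loop, reusing `db` from the import script above and assuming a hypothetical `read_batch` helper that returns up to 10 000 source rows at a time (and an empty array when the source is exhausted):

```ruby
# Read-copy loop: encode and COPY one batch at a time, so memory use stays bounded.
# `read_batch` is a hypothetical helper that reads the source data in chunks.
loop do
  batch = read_batch(10_000)
  break if batch.empty?

  encoder = PgDataEncoder::EncodeForCopy.new
  batch.each { |line| encoder.add(line) }

  db.copy_into(
    "tax_authorities",
    columns: %i[id parent_id name],
    format: :binary,
    data: encoder.get_io
  )
end
```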
Choosing a tree pattern
There are only a few ready-to-use ActiveRecord hierarchy plugins:
- awesome_nested_set implements Nested Set;
- ancestry implements Materialized Path;
- and closure_tree is for Closure Table.
Let’s try to choose one.
Nested Set is the worst option: every insert or move forces the left/right boundary values to be recalculated across a large part of the table, so writes are slow and fragile. Even though I am going to read the tree way more often than update it, such problems still scare me. Anything but this.
Materialized Path is much better:
- The fastest one.
- Can be used natively with PostgreSQL.
- However, the internals are still complicated.
Reference columns are required

Two additional columns have to be added and calculated to make everything work: `ancestry` and `ancestry_depth`. The gem provides the `#build_ancestry_from_parent_ids!` method, which calculates the initial values.
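To make that concrete: ancestry stores each node's path of ancestor ids as a slash-delimited string (`nil` for roots), plus a cached depth. For the sample data above, the calculated values would look roughly like this:

```ruby
# Moscow     (id 2, a root):      ancestry = nil,  ancestry_depth = 0
# Kremlin    (id 1, child of 2):  ancestry = "2",  ancestry_depth = 1
# Red Square (id 3, child of 2):  ancestry = "2",  ancestry_depth = 1
```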
It took about 44 minutes to calculate 1 188 483 records and save the values into the two columns. The production server won't be as write-friendly as my local machine, and the source database needs to be updated and will keep growing in size. The more it grows, the slower this gets. Much better, but still unacceptable.
Closure Table is something in-between:
- Slower than Materialized Path but competitive.
- The simplest algorithm, and the one that fits the relational model best.
- Can use foreign keys to maintain referential integrity.
- No need to modify existing table schema.
In this case, an extra table has to be created. This table contains a link between each parent and every one of its descendants. Call the `#rebuild!` method to fill this table with the initial values.
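For reference, here is roughly what that extra table looks like as a migration. This is a hedged sketch: in practice closure_tree generates the migration for you, and the names below assume the model is called `TaxAuthority`:

```ruby
# Sketch of the hierarchies table closure_tree relies on (normally generated
# by the gem; names assume a TaxAuthority model backed by tax_authorities).
class CreateTaxAuthorityHierarchies < ActiveRecord::Migration
  def change
    create_table :tax_authority_hierarchies, id: false do |t|
      t.integer :ancestor_id,   null: false # an ancestor node (or the node itself)
      t.integer :descendant_id, null: false # a node at or below the ancestor
      t.integer :generations,   null: false # distance between the two; 0 = self
    end

    add_index :tax_authority_hierarchies,
              [:ancestor_id, :descendant_id, :generations],
              unique: true,
              name: "tax_authority_anc_desc_idx"
  end
end
```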
The default `#rebuild!` method is slow

By default, it enumerates every record in the source table and creates the corresponding records in the hierarchies table with an `INSERT INTO … SELECT` statement. That means one round-trip per source row, millions of statements in total; it is never fast. Rebuilding the hierarchy from scratch takes hours. Unacceptable.
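To make the cost concrete, here is a rough illustration of that per-row pattern. It is a simplification, not the gem's actual code, and it assumes parents are processed before their children:

```ruby
# One round-trip per node: a self-referencing row, then one INSERT … SELECT
# that copies the parent's ancestor rows with generations shifted by one.
db[:tax_authorities].select(:id, :parent_id).each do |node|
  db[:tax_authority_hierarchies].insert(
    ancestor_id: node[:id], descendant_id: node[:id], generations: 0
  )
  next unless node[:parent_id]

  db.run <<-SQL
    INSERT INTO tax_authority_hierarchies (ancestor_id, descendant_id, generations)
    SELECT ancestor_id, #{node[:id]}, generations + 1
    FROM tax_authority_hierarchies
    WHERE descendant_id = #{node[:parent_id]}
  SQL
end
```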
Boosting up the closure_tree `#rebuild!`

The reference table can be populated much faster if it is calculated in memory and stored with the `COPY FROM` statement.
The data for your reference table might look like this:
| ancestor_id | descendant_id | generations |                   |
|-------------|---------------|-------------|-------------------|
| 1           | 1             | 0           |                   |
| 2           | 2             | 0           | Moscow itself     |
| 3           | 3             | 0           |                   |
| 2           | 1             | 1           | Moscow/Kremlin    |
| 2           | 3             | 1           | Moscow/R. Sq.     |
| …           | …             | …           | (long long array) |
Eventually, we need to:
- Fetch the source table's `id` and `parent_id` columns via a single `SELECT`.
- Find all the descendants of each fetched `id` (see the sketch after this list).
- Make an in-memory array with the resulting values, as shown above.
- Save it to the reference table using the `COPY FROM` statement.
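A minimal sketch of the middle two steps, assuming `parent_of` is a hash mapping each `id` to its `parent_id` (built from the single `SELECT`). This is my own illustration, not the gem's code; walking up from every node yields the same ancestor–descendant pairs as finding descendants:

```ruby
# Walk from every node up to the root, emitting one
# [ancestor_id, descendant_id, generations] row per ancestor (self included).
rows = []
parent_of.each_key do |id|
  ancestor = id
  generations = 0
  while ancestor
    rows << [ancestor, id, generations]
    ancestor = parent_of[ancestor]
    generations += 1
  end
end
# `rows` can now be fed to PgDataEncoder and COPY FROM, as in the import script.
```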
I have encapsulated these steps into a reusable library.
My pg_closure_tree_rebuild gem will do the job!
You can use it from the command line:
```
> bundle exec rake closure_tree:rebuild DATABASE_URL=postgres://localhost/tax_authorities TABLE=tax_authorities
```
or from your code:
```ruby
# seeds.rb
PgClosureTreeRebuild::Runner.new(
  Sequel.connect(ActiveRecord::Base.connection_config),
  "tax_authorities"
).run
```
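With the hierarchy table populated, the typical queries from the beginning of the article become cheap joins against it. A hedged sketch using closure_tree's query API (the model declaration below is my assumption, not something from the original project):

```ruby
class TaxAuthority < ActiveRecord::Base
  has_closure_tree
end

moscow = TaxAuthority.find_by(name: "Moscow")
moscow.descendants          # everything below Moscow, at any depth

kremlin = TaxAuthority.find_by(name: "Kremlin")
kremlin.self_and_ancestors  # the whole path from Kremlin up to the root
```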
Result
It only took 3.5 minutes to deal with 1 188 483 records, which otherwise would have taken hours:
```
138.100000   4.060000 142.160000 (203.340859)
```
12.5 times faster compared to Materialized Path. Awesome.
The result was indeed worth the time and effort spent. It was exactly the solution I wanted. Of course, the fact that it did pay off in my case does not mean that you should stick to it too. If your task is different from the one I had, then another pattern or a set of patterns might do a better job. Anyway, whatever option you prefer, remember to avoid Nested Set. Below is a summary of what I’ve learned. I hope you will find it useful.
- Use `COPY FROM` to populate tables whenever possible.
- Use the Closure Table pattern (and the closure_tree gem): it is the simplest tree pattern, especially if you do little besides reading.
- Use the pg_closure_tree_rebuild gem instead of closure_tree's standard `#rebuild!` method, as it is faster.