Case study: using a recursive CTE to traverse an employee hierarchy
A hierarchy is a specialization of the general notion of a graph—and, as such, it's the simplest kind of graph that still deserves that name. The taxonomy of successive specializations starts with the most general (the undirected cyclic graph) and successively descends to the most restricted, a hierarchy. The taxonomy refers to a hierarchy as a rooted tree. All this is explained in the section Using a recursive CTE to traverse graphs of all kinds.
The representation of a general graph requires an explicit, distinct, representation of the nodes and the edges. Of course, a hierarchy can be represented in this way. But because of how it's restricted, it allows a simpler representation in a SQL database where only the nodes are explicitly represented, in a single table, and where the edges are inferred using a self-referential foreign key:
- A "parent ID" column [list] references the table's primary key—the "ID" column [list] enforced by a foreign key constraint.
This is referred to as a one-to-many recursive relationship, or one-to-many "pig's ear", in the jargon of entity-relationship modeling. The ultimate, unique, root of the hierarchy has the "parent ID" set to NULL
.
The SQL that this section presents uses the simpler representation. But the section Using a recursive CTE to traverse graphs of all kinds shows, for completeness, that the general SQL that you need to follow edges in a general graph works when the general representation happens to describe a hierarchy. See the section Finding the paths in a rooted tree.
All of the .sql
scripts that this case study presents for copy-and-paste at the ysqlsh
prompt are included for download in a zip-file.
Download recursive-cte-code-examples.zip
.
After unzipping it on a convenient new directory, you'll see a README.txt
. It tells you how to start the master-script. Simply start it in ysqlsh
. You can run it time and again. It always finishes silently. You can see the report that it produces on a dedicated spool directory and confirm that your report is identical to the reference copy that is delivered in the zip-file.
Create and populate the "emps" table
The following code creates a table of employees that records each employee's manager using a self-referential foreign key as explained above. This is a stylized example where "name" serves as the primary key column, "mgr_name" serves as the self-referential foreign key column, and there are no other columns.
cr-table.sql
set client_min_messages = warning;
drop table if exists emps cascade;
create table emps(
name text primary key,
mgr_name text);
-- The order of insertion is arbitrary
insert into emps(name, mgr_name) values
('mary', null),
('fred', 'mary'),
('susan', 'mary'),
('john', 'mary'),
('doris', 'fred'),
('alice', 'john'),
('bill', 'john'),
('joan', 'bill'),
('george', 'mary'),
('edgar', 'john'),
('alfie', 'fred'),
('dick', 'fred');
-- Implement the one-to-many "pig's ear".
alter table emps
add constraint emps_mgr_name_fk
foreign key(mgr_name) references emps(name)
on delete restrict;
-- The ultimate manager has no manager.
-- Enforce the business rule "Maximum one ultimate manager".
create unique index t_mgr_name on emps((mgr_name is null)) where mgr_name is null;
Inspect the contents:
select name, mgr_name from emps order by 2 nulls first, 1;
This is the result:
name | mgr_name
--------+----------
mary |
joan | bill
alfie | fred
dick | fred
doris | fred
alice | john
bill | john
edgar | john
fred | mary
george | mary
john | mary
susan | mary
The blank lines were added by hand to make the results easier to read.
Stress the constraints with this attempt to insert a second ultimate manager:
insert into emps(name, mgr_name) values ('bad', null);
It causes this error:
23505: duplicate key value violates unique constraint "t_mgr_name"
Stress the constraints with this attempt to delete an employee with reports:
delete from emps where name = 'fred';
It causes this error:
23503: update or delete on table "emps" violates foreign key constraint "emps_mgr_name_fk" on table "emps"
Key (name)=(fred) is still referenced from table "emps".
This constraint, together with the fact that the "mgr_name" column is nullable, guarantees the invariant that there is exactly one ultimate manager—in other words that the employee graph is a rooted tree (a.k.a. hierarchy). Check it thus:
do-assert-is-hierarchy.sql
do $body$
begin
assert (select count(*) from emps where mgr_name is null) = 1,
'Rule "Employee graph is a hierarchy" does not hold';
end;
$body$;
List the employees top-down with their immediate managers in breadth first order
The naïve formulation of the query to list the employees with their immediate managers uses a WITH
clause that has a recursive CTE like this:
cr-view-top-down-simple.sql
drop view if exists top_down_simple cascade;
create view top_down_simple(depth, mgr_name, name) as
with
recursive hierarchy_of_emps(depth, mgr_name, name) as (
-- Non-recursive term.
-- Select the exactly one ultimate manager.
-- Define this emp to be at depth 1.
(
select
1,
'-',
name
from emps
where mgr_name is null
)
union all
-- Recursive term.
-- Treat the emps from the previous iteration as managers.
-- Join these with their reports, if they have any.
-- Increase the emergent depth by 1 with each step.
-- Stop when none of the current putative managers has a report.
-- Each successive iteration goes one level deeper in the hierarchy.
(
select
h.depth + 1,
e.mgr_name,
e.name
from
emps as e
inner join
hierarchy_of_emps as h on e.mgr_name = h.name
)
)
select
depth,
mgr_name,
name
from hierarchy_of_emps;
Each successive iteration of the recursive term accumulates the set of direct reports, where such a report exists, of the employees produced first by the non-recursive term (i.e. the ultimate manager) and then by the employees produced by the previous iteration of the recursive term. The iteration stops when none of the employees produced by the previous iteration of the recursive term has a report.
Notice that the choice to define the ultimate manager to be at depth 1 is just that: a design choice. You might prefer to define the ultimate manager to be at depth 0, arguing that it's better to interpret this measure as the number of managers up the chain that the present employee has.
The result set that this view represents is usually ordered first by the calculated "depth" column and then by usefully orderable actual columns—the manager name and then the employee name in the case of the "emps" example table:
do-breadth-first-query.sql
select
depth,
mgr_name,
name
from top_down_simple
order by
depth,
mgr_name nulls first,
name;
This produces a so-called "breadth first" order. This is the result:
depth | mgr_name | name
-------+----------+--------
1 | - | mary
2 | mary | fred
2 | mary | george
2 | mary | john
2 | mary | susan
3 | fred | alfie
3 | fred | dick
3 | fred | doris
3 | john | alice
3 | john | bill
3 | john | edgar
4 | bill | joan
The blank lines were added by hand to make the results easier to read.
List the path top-down from the ultimate manager to each employee in breadth first order
The term of art "path" denotes the list of managers from the ultimate manager through each next direct report down to the current employee. It is easily calculated by using array concatenation as described in the || operator subsection of the Array data types and functionality major section. Yet again, "array" functionality comes to the rescue.
cr-view-top-down-paths.sql
drop view if exists top_down_paths cascade;
create view top_down_paths(path) as
with
recursive paths(path) as (
select array[name]
from emps
where mgr_name is null
union all
select p.path||e.name
from emps as e
inner join paths as p on e.mgr_name = p.path[cardinality(path)]
)
select path from paths;
The cardinality of the path represents the depth. The result set that this view represents can be easily ordered first by the emergent "depth" and then by the employee name:
do-breadth-first-path-query.sql
select cardinality(path) as depth, path
from top_down_paths
order by
depth,
path[2] asc nulls first,
path[3] asc nulls first,
path[4] asc nulls first,
path[5] asc nulls first;
The design of the ORDER BY
clause relies on the following fact—documented in the Array data types and functionality major section:
If you read a within-array value with a tuple of index values that puts it outside of the array bounds, then you silently get
NULL
.
This is the result:
depth | path
-------+-----------------------
1 | {mary}
2 | {mary,fred}
2 | {mary,george}
2 | {mary,john}
2 | {mary,susan}
3 | {mary,fred,alfie}
3 | {mary,fred,dick}
3 | {mary,fred,doris}
3 | {mary,john,alice}
3 | {mary,john,bill}
3 | {mary,john,edgar}
4 | {mary,john,bill,joan}
The blank lines were added by hand to make the results easier to read.
Notice that the "top_down_paths" view has the same information content as the "top_down_simple" view. And the queries that were used against each listed the information in the same order. The results from querying the "top_down_paths" view show, in the last-but-one and last array elements, the identical values to the "mgr_name" and "name" columns in the output from querying the "top_down_simple" view. But it's easier to read the results from the "top_down_paths" view because you don't need mentally to construct the path by looking, recursively, for the row that has the "mgr_name" of the row of interest as its "name" until you reach the ultimate manager.
This assert
confirms the conclusion:
do-assert-top-down-path-view-and-top-down-simple-view-same-info.sql
do $body$
declare
c constant int not null :=
(
with
a1(depth, mgr_name, name) as (
select depth, mgr_name, name from top_down_simple),
a2(depth, mgr_name, name) as (
select
cardinality(path),
case cardinality(path)
when 1 then '-'
else path[cardinality(path) - 1]
end,
path[cardinality(path)]
from top_down_paths),
a1_except_a2(depth, mgr_name, name) as (
select depth, mgr_name, name from a1
except
select depth, mgr_name, name from a2),
a2_except_a1(depth, mgr_name, name) as (
select depth, mgr_name, name from a2
except
select depth, mgr_name, name from a1)
select count(*) from
(
select depth, mgr_name, name from a1_except_a2
union all
select depth, mgr_name, name from a2_except_a1
) as n
);
begin
assert c = 0, 'Unexpected';
end;
$body$;
This "UNION ALL
of two complementary EXCEPT
queries" is the standard pattern for checking if two different relations have the same content. Notice how the use of a WITH
clause with ordinary (non-recursive) CTEs lets you express the logic in a maximally readable way.
List the path top-down from the ultimate manager to each employee in depth-first order
Do this:
do-depth-first-query.sql
select cardinality(path) as depth, path
from top_down_paths
order by
path[2] asc nulls first,
path[3] asc nulls first,
path[4] asc nulls first,
path[5] asc nulls first;
This query is almost identical to the query shown at do-breadth-first-path-query.sql
. The only difference is that the first ORDER BY
rule (to order by the depth) has been omitted.
This is the result:
depth | path
-------+-----------------------
1 | {mary}
2 | {mary,fred}
3 | {mary,fred,alfie}
3 | {mary,fred,dick}
3 | {mary,fred,doris}
2 | {mary,george}
2 | {mary,john}
3 | {mary,john,alice}
3 | {mary,john,bill}
4 | {mary,john,bill,joan}
3 | {mary,john,edgar}
2 | {mary,susan}
You can see that the results are sorted in depth-first order with the employees at the same depth ordered by "name".
Notice that this:
select max(cardinality(path)) from top_down_paths;
returns the value 4 for the present example. This means that path[5]
returns NULL
—as would, for example, path[6]
, path[17]
, and path[42]
. When such a query is issued programmatically, you can determine the maximum path cardinality and build the ORDER BY
clause to have just the necessary and sufficient number of terms. Alternatively, for simpler code, you could write it with a number of terms that exceeds your best estimate of the maximum cardinality of the arrays that your program will have to deal with, ensuring safety with a straightforward test of the actual maximum cardinality.
Pretty-printing the top-down depth-first report of paths
Users often like to use indentation to visualize hierarchical depth. This is easy to achieve, thus:
do-indented-depth-first-query.sql
select
rpad(' ', 2*cardinality(path) - 2, ' ')||path[cardinality(path)] as "emps hierarchy"
from top_down_paths
order by
path[2] asc nulls first,
path[3] asc nulls first,
path[4] asc nulls first,
path[5] asc nulls first;
This is the result:
emps hierarchy
----------------
mary
fred
alfie
dick
doris
george
john
alice
bill
joan
edgar
susan
You've probably seen how the Unix tree
command presents the hierarchy that it computes using the long vertical bar, the long horizontal bar, the sideways "T", and the "L" symbols: │
, ─
, ├
, and └
. It's easy to output an approximation to this, that omits the long vertical bar, with a single SQL statement. The trick is to use the lead()
window function to calculate the "next_depth" for each row as well as the "depth". If the "next_depth" value is equal to the "depth" value: then output the sideways "T"; else output the "L" (at the appropriate indentation level).
do-unix-tree-query.sql
with
a1 as (
select
cardinality(path) as depth,
path
from top_down_paths),
a2 as (
select
depth,
lead(depth, 1, 0) over w as next_depth,
path
from a1
window w as (
order by
path[2] asc nulls first,
path[3] asc nulls first,
path[4] asc nulls first,
path[5] asc nulls first))
select
case depth = next_depth
when true then
lpad(' ├── ', (depth - 1)*5, ' ')
else
lpad(' └── ', (depth - 1)*5, ' ')
end
||
path[depth] as "Approx. 'Unix tree'"
from a2
order by
path[2] asc nulls first,
path[3] asc nulls first,
path[4] asc nulls first,
path[5] asc nulls first;
Using 0 as the actual for the third (optional) lag()
parameter means that the function will return 0 for the last row in the sorted order rather than NULL
. This allows a simple equality test to be used reliably.
This is the result:
mary
└── fred
├── alfie
├── dick
└── doris
├── george
└── john
├── alice
└── bill
└── joan
└── edgar
└── susan
It's easy to transform this result manually into the format that Unix uses by filling in vertical bars as appropriate, thus:
mary
├── fred
│ ├── alfie
│ ├── dick
│ └── doris
├── george
├── john
│ ├── alice
│ ├── bill
│ │ └── joan
│ └── edgar
└── susan
Of course, when you do this, you're intuitively following a rule. So it could certainly be implemented in procedural code, using this harness:
cr-function-unix-tree.sql
drop function if exists unix_tree() cascade;
create function unix_tree()
returns table(t text)
language plpgsql
as $body$
<<b>>declare
results text[] not null := '{}';
begin
with
a1 as (
select
cardinality(path) as depth,
path
from top_down_paths),
a2 as (
select
depth,
lead(depth, 1, 99) over w as next_depth,
path
from a1
window w as (
order by
path[2] asc nulls first,
path[3] asc nulls first,
path[4] asc nulls first,
path[5] asc nulls first)),
results as (
select
case depth = next_depth
when true then
lpad(' ├── ', (depth - 1)*5, ' ')
else
lpad(' └── ', (depth - 1)*5, ' ')
end
||
path[depth] as result
from a2
order by
path[2] asc nulls first,
path[3] asc nulls first,
path[4] asc nulls first,
path[5] asc nulls first)
select array_agg(r.result)
into b.results
from results as r;
foreach t in array results loop
-- Implement the logic to transform the present results
-- into the Unix "tree" format by adding vertical bars
-- as approapriate.
end loop;
foreach t in array results loop
return next;
end loop;
end b;
$body$;
select t from unix_tree();
This is a sketch of the required logic, expressed crudely and somewhat approximately:
- Scan each result row looking for an "L" in any of the character positions that it might occur.
- When an "L" is found, look forward over as many result rows as it takes until you find the first non-space character in the character position where the "L" was found.
- If this is found on the immediately next row, then do nothing; move to the next row, and start again.
- Otherwise, and when the next non-space character is an "L" or a "T", substitute a "T" for the starting "L" or "T" and substitute a vertical bar for the remaining spaces. When the next non-space is not an "L" or "T", leave the spaces as is.
- Repeat this for each character position where an "L" is found.
Implementing, and testing, the logic is left as an exercise for the reader.
List the path bottom-up from a chosen employee to the ultimate manager
The essential property of a hierarchy is that each successive upwards step defines exactly one parent (in this example, exactly one manager) by traversing the foreign key reference in the "many" to "one" direction. It's this property that distinguishes a hierarchy from a more general graph. This means that there is no distinction between breadth-first and depth-first when traversing a hierarchy upwards.
Here is the query. It's presented using the prepare-and-execute paradigm so that you can supply the starting employee of interest as a parameter. Notice that "path" is not yet defined in the non-recursive term. This means that the only practical design for the "depth" notion here is different from what was used in the top-down approach: the design chosen has it starting at 0 and increasing by 1 with each step up the hierarchy.
do-bottom-up-simple-query.sql
deallocate all;
prepare bottom_up_simple(text) as
with
recursive hierarchy_of_emps(depth, name, mgr_name) as (
-- Non-recursive term.
-- Select the exactly one employee of interest.
-- Define the depth to be zero.
(
select
0,
name,
mgr_name
from emps
where name = $1
)
union all
-- Recursive term.
-- Treat the employee from the previous iteration as a report.
-- Join this with its manager.
-- Increase the depth with each step upwards.
-- Stop when the current putative report has no manager, i.e. is
-- the ultimate manager.
-- Each successive iteration goes one level higher in the hierarchy.
(
select
h.depth + 1,
e.name,
e.mgr_name
from
emps as e
inner join
hierarchy_of_emps as h on h.mgr_name = e.name
)
)
select
depth,
name,
coalesce(mgr_name, null, '-') as mgr_name
from hierarchy_of_emps;
execute bottom_up_simple('joan');
This is the result:
depth | name | mgr_name
-------+------+----------
0 | joan | bill
1 | bill | john
2 | john | mary
3 | mary | -
Alternatively, you could encapsulate the query in a function to deliver the same benefit. In this scheme, the result is a single array that represents the path from bottom to top.
cr-function-bottom-up-path.sql
drop function if exists bottom_up_path(text) cascade;
create function bottom_up_path(start_name in text)
returns text[]
language sql
as $body$
with
recursive hierarchy_of_emps(mgr_name, path) as (
(
select mgr_name, array[name]
from emps
where name = start_name
)
union all
(
select e.mgr_name, h.path||e.name
from
emps as e
inner join
hierarchy_of_emps as h on e.name = h.mgr_name
)
)
select path
from hierarchy_of_emps
order by cardinality(path) desc
limit 1;
$body$;
select bottom_up_path('joan');
This is the result:
bottom_up_path
-----------------------
{joan,bill,john,mary}
You can produce a prettier output like this:
cr-function-bottom-up-path-display.sql
drop function if exists bottom_up_path_display(text) cascade;
create function bottom_up_path_display(start_name in text)
returns text
language plpgsql
as $body$
declare
path constant text[] not null := (bottom_up_path(start_name));
t text not null := path[1];
begin
for j in 2..cardinality(path) loop
t := t||' > '||path[j];
end loop;
return t;
end;
$body$;
select bottom_up_path_display('joan');
This is the result:
bottom_up_path_display
---------------------------
joan > bill > john > mary