This article explains why union types are essential, how they are supported in PTS, and which benefits they offer.
Note
This is part 4 in a series of articles titled How to Design a Practical Type System to Maximize Reliability, Maintainability, and Productivity in Software Development Projects.
It is recommended (but not required for experienced programmers) to read the articles in their order of publication, starting with Part 1: What? Why? How?.
For a quick summary of previous articles, you can read Summary of the Practical Type System (PTS) Article Series.
Introduction
Union types (aka sum types, variants, choice types) are a prime feature in a modern type system.
As you'll see, union types are surprisingly useful and versatile, despite their simplicity. For example, they provide an elegant foundation for two critical, recurring, but often problematic aspects of software development: null- and error-handling.
Why Do We Need Union Types?
Consider a function that reads text stored in a file. The function takes a file path as input and returns one of the following:
- a string representing the text in the file
- null if the file is empty
- an error if the file doesn't exist or if there was any other I/O error
Note
In most software libraries, a function like this would return an empty string if the text file is empty.
However, our example function does not return an empty string if the file is empty; it returns null, for reasons explained in a subsequent article.
The above specification evokes a compelling question: How should the three output alternatives ("text", "no text", and "error") be expressed in the function signature?
Let's see!
Existing Solutions
To get an overview of different approaches used in popular programming languages, let's have a look at the signature of this function in JavaScript, Java, Kotlin, and Rust.
Note
Readers only interested in the PTS solution can skip the following sections.
JavaScript
Here's the JavaScript code of our example function:
function readTextFile ( filePath ) {
return "dummy";
}
Note
We're not interested in the function body, just its signature, which is why "dummy" is returned.
In JavaScript, the return type of a function is not specified in the function signature.
Every JavaScript function can return anything (including null and undefined).
The section Return value on MDN states:
Quote:
By default, if a function's execution doesn't end at a return statement, or if the return keyword doesn't have an expression after it, then the return value is undefined. The return statement allows you to return an arbitrary value from the function.
This means that the only reliable way to know what the function returns is to look at its body — if we have access to it. Worse, if the function calls other functions, we might also need to inspect the body of all these other functions involved in its call tree.
If we're lucky, the developer(s) left behind a comment annotating the function and its return type. Furthermore, if the function underwent changes later on, the developer(s) hopefully were kind enough to also update its comment.
If the return type is changed later on, and we (or other developers) forget to update one or more function calls, then the application is at risk of breaking in undefined and unanticipated ways.
Surely, we want a better solution.
Let's move on.
Java
Our example function looks like this in Java:
public static String readTextFile ( Path filePath ) throws IOException {
return "dummy";
}
The method clearly states that it returns a String or throws an IOException. However, we don't know if null might be returned, because reference types in Java are all nullable.
Note
To state that the function might return null (in case of an empty file), we could add a Nullable annotation (metadata added to Java source code):
public static @Nullable String readTextFile2 ( Path filePath ) throws IOException {
return "dummy";
}
However, Nullable is a non-standard Java annotation. We need to create it ourselves, or use a third-party library that provides it. Therefore a Nullable annotation results in non-idiomatic Java code.
Moreover, the Java compiler doesn't take this annotation into account, and doesn't check for potential null pointer errors (because Java is not a null-safe language). There are, however, very useful tools and IDE plugins that report potential null pointer errors by leveraging annotations.
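Unfortunately, we would now be using three different techniques for the three outcomes: the return type (String) for the text, an annotation (Nullable) for the absence of text, and a throws clause (IOException) for the error.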
What's more, the Nullable annotation is just a workaround for an important concept (i.e. the absence of a value) that should be supported natively in the language.
Conclusion: A function signature in idiomatic Java doesn't tell us if null might be returned.
Kotlin
This is the code written in Kotlin, a modern JVM language:
fun readTextFile(filePath: Path): String? {
return "dummy"
}
While reference types in Java are nullable, they are non-null in Kotlin. The ? suffix after a type name must be used to state that a type is nullable. Hence the String? return type clearly states that the function returns a String object or null. Moreover, Kotlin is null-safe (no null-pointer errors in idiomatic Kotlin code), which is possibly the main reason why some people prefer Kotlin over Java.
Java uses checked exceptions for anticipated runtime errors. On the other hand, Kotlin doesn't support checked exceptions — it only supports unchecked exceptions (for a quick explanation of the differences, read What are checked vs. unchecked exceptions in Java?). Therefore, a Kotlin function signature doesn't tell us if an exception might be thrown. To understand why the creators of Kotlin decided against checked exceptions, you can read section Checked exceptions in the official Kotlin documentation.
Note
Kotlin provides a Throws annotation in its standard library to state exceptions that might be thrown:
@Throws(IOException::class)
fun readTextFile2(filePath: Path): String? {
return "dummy"
}
However, this annotation targets the Java environment and is not used in idiomatic Kotlin code. The official Kotlin documentation states: "This annotation indicates what exceptions should be declared by a function when compiled to a JVM method."
Conclusion: A function signature in idiomatic Kotlin doesn't tell us if a function call might fail.
Note
The same is true in C# and a few other languages: exceptions are not declared in function signatures.
Rust
In Rust, our function looks like this:
fn read_text_file(_file_path: String) -> Result<Option<String>, io::Error> {
Ok(Some(String::from("dummy")))
}
Rust doesn't support null. To handle the absence of a value, Rust uses the Option type. We can think of an Option instance as a container that is either empty or contains a value: it is either an instance of Some, containing a value, or an instance of None, which has no content. A few other languages adopt a similar concept: for example, F# uses an Option monad, and Haskell uses a Maybe monad.
Furthermore, Rust doesn't throw exceptions if functions fail. The Rust Programming Language states:
Quote:
Rust doesn't have exceptions. Instead, it has the type Result<T, E> for recoverable errors and the panic! macro that stops execution when the program encounters an unrecoverable error.
Thus, the function returns a Result type, which is either an instance of Ok, containing a valid return value, or an instance of Err, containing an error object. This construct is similar to the Result monad in F#, or the Either monad in Haskell and other languages.
It's nice to see that:
- All three outcomes are clearly stated in the function signature: the function returns Some, None, or Err.
- A single technique is used for all outcomes: a return type (Result<Option<String>, io::Error>) that expresses the three alternatives.
Wrap-up
The following table summarizes the outcomes stated in the function signatures:
| Language   | Text | No text (empty file) | Error |
| JavaScript | ⨯    | ⨯                    | ⨯     |
| Java       | ✔    | ⨯                    | ✔     |
| Kotlin     | ✔    | ✔                    | ⨯     |
| Rust       | ✔    | ✔                    | ✔     |
Rust is clearly the winner, because all three possible outcomes are covered in its function signature.
The Rust compiler also ensures that all three outcomes are being handled by the code that calls the function. If we forget to handle a case, the compiler gently reminds us to do so. Even better, if the return type of the function is changed later on, the compiler also checks that all function calls are updated accordingly. The advantages are obvious: more reliable and maintainable code; less time wasted finding and fixing bugs.
Here is an example of calling the function in Rust, using a match expression to handle the three outcomes:
match read_text_file(String::from("file.txt")) {
Ok(Some(string)) => println! ( "{}", string ),
Ok(None) => println! ( "Empty" ),
Err(_) => println! ( "Error" ),
};
A Better Solution
Rust Option and Result types are wrappers. Instances of these types contain a value, except None (an instance of Option), which doesn't contain a value. For example, if the function returns a string (the most common case for this function), then the string is wrapped in an Option instance, which is itself wrapped in a Result instance. The wrapping becomes obvious when we look at the dummy body of the function: Ok(Some(String::from("dummy"))). We can't simply write "dummy", or return "dummy", as in other languages.
Note
The need to write String::from("dummy") instead of just "dummy" is irrelevant to the topic at hand (but you can read Rust - String for an explanation).
Looking at these examples in different languages raises the question: Why can't we simply state what we want, i.e., a function that returns a string or null or an error?
Well, we could — if the type system supported union types.
And that's one of the reasons why union types are supported in PTS. Here's a preview of the function in PTS:
fn read_text_file ( file file_path ) -> string or null or file_error
return "dummy"
.
The output type string or null or file_error is a key point. Null-handling (or, more generally, handling the absence of a value) and error-handling are both crucial aspects in pretty much all software development projects. And now, we have a straightforward and elegant solution that utilizes a single, simple concept (union types) for both aspects. This is a solid foundation that allows us to simplify null- and error-handling, ultimately leading to increased reliability, maintainability, and productivity.
What's more, union types have other interesting use cases, as we'll see soon.
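For comparison, here is a minimal sketch of the same idea in TypeScript, a language that also supports union types. It is not PTS code: the names FileReadFailure, fakeFiles, and readTextFile are hypothetical, and the in-memory table merely stands in for a real file system.

```typescript
// Hypothetical error type, standing in for PTS's file_error.
class FileReadFailure {
  constructor(readonly message: string) {}
}

// In-memory stand-in for a file system (an assumption that keeps the sketch self-contained).
const fakeFiles: { [path: string]: string } = {
  "notes.txt": "hello",
  "empty.txt": "",
};

// All three outcomes are stated in the return type, without any wrapper types.
function readTextFile(filePath: string): string | null | FileReadFailure {
  if (!Object.prototype.hasOwnProperty.call(fakeFiles, filePath)) {
    return new FileReadFailure("File not found: " + filePath);
  }
  const content = fakeFiles[filePath];
  // As in the article's specification, an empty file yields null, not "".
  return content === "" ? null : content;
}
```

Note that the function body returns a plain string, null, or an error object directly; nothing has to be wrapped in Ok(Some(...)).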
Note
Given the prevalence of null- and error-handling, these topics will be covered extensively in the next two PTS articles.
How Does It Work?
In this section, we'll explore PTS union types, and illustrate each point via simple source code examples.
Note
The basic idea of union types (aka sum types, variants, choice types) is roughly the same in different programming languages. However, the implementation and usage of union types vary considerably. The following description only applies to union types in PTS.
Basic Idea
In a statically typed programming language, an object reference (variable, input parameter, etc.) is restricted to a single type. For example, an input parameter name might be of type string.
The basic idea of a union type is amazingly simple: Instead of restricting an object reference to a single type, any type among a defined set of types is allowed: type_1 or type_2 or type_3 or ... We are all familiar with this concept in daily life: Your birthday present will be a dog, a bicycle, or a violin (i.e., a dog or a bicycle or a violin). Next weekend, we'll go to the beach, the mountains, or the city.
Here is an example of a PTS function that uses union types:
fn foo ( item string or character or number ) -> boolean or null or error
// function body
.
This function has a single input parameter named item, whose type can be string, character, or number. Hence, the following function calls are all valid: foo ( "abc" ), foo ( 'a' ), and foo ( 123 ).
Moreover, the function can return a boolean, null, or an error. Thus, the following statements in the function body are all valid: return true, return null, and return error.create(...). In a subsequent section we'll see how to handle the returned value in the code that called the function.
The individual types declared in a union type are its member types. Thus, union type string or character or number has three member types: string, character, and number.
Note
A PTS union type is conceptually similar to a sum type, and a PTS record type is similar to a product type. The terms sum and product (predominant in functional programming) are rooted in the cardinality of these types. (Remember: the cardinality of a type is the number of allowed values.)
The cardinality of a union type is the sum of the cardinalities of its member types. For example, type boolean or null has a cardinality of 3, because type boolean has a cardinality of 2 (true, false), and type null has a cardinality of 1 (null). Hence, boolean or null has a cardinality of 2 + 1 = 3.
The cardinality of a record type is the product of the cardinalities of its attribute/field types. For example, a record type with an attribute of type boolean or null and another attribute of type boolean has a cardinality of 3 * 2 = 6.
PTS adopts the term union, borrowed from set theory, a branch of mathematical logic.
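The two cardinality calculations can be checked by simply enumerating the values. The following TypeScript sketch does that; the names (FlagPair, flagPairs, etc.) are made up for illustration.

```typescript
// Every value of each member type, written out explicitly.
const booleanValues: boolean[] = [true, false];
const nullValues: null[] = [null];

// Union type: a value of either member type is allowed, so the counts add up.
const booleanOrNullValues: (boolean | null)[] =
  ([] as (boolean | null)[]).concat(booleanValues, nullValues);

// Record type with one attribute of each type: every combination is allowed,
// so the counts multiply.
interface FlagPair {
  first: boolean | null;
  second: boolean;
}
const flagPairs: FlagPair[] = [];
for (const first of booleanOrNullValues) {
  for (const second of booleanValues) {
    flagPairs.push({ first: first, second: second });
  }
}

const unionCardinality = booleanOrNullValues.length; // 2 + 1
const recordCardinality = flagPairs.length;          // 3 * 2
```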
Helpful Syntax Constructs
PTS provides three syntax constructs to handle union types in source code:
- operator is, to check the type of a value
- a case type of statement, to execute code that depends on the type of a given value
- a case type of expression, to compute a value that depends on the type of another value
Let's look at examples.
Operator is
In section A Better Solution, we introduced the following function, which returns a string, null, or a file_error:
fn read_text_file ( file file_path ) -> string or null or file_error
// function body
.
After calling read_text_file, we first need to check the type of the value returned by this function, and then execute code that depends on this type.
To check the type of a value, PTS provides the infix operator is. The syntax is:
<expression> "is" <type>
Operator is evaluates to a boolean value, which is true if the type of the expression on the left-hand side is equal to the type specified on the right-hand side.
Suppose we call read_text_file, and store its returned value in constant result:
const result string or null or file_error = read_text_file ( file_path.create ( "example.txt" ) )
We can then use the expression result is string to check whether the function returned a value of type string.
The is operator can be used in a classic if then else statement to handle the returned value:
const result = read_text_file ( file_path.create ( "example.txt" ) )
if result is string then
write_line ( "Content of file:" )
write_line ( result )
else if result is null then
write_line ( "The file is empty." )
else
write_line ( """The following error occurred: {{result.message}}""" )
.
In this code, we used operator is to check the type of the value stored in constant result. For example, result is string evaluates to true if the function returns a string.
Note how the above code benefits from two helpful features:
- Type inference
  The type of constant result is inferred by the compiler to be string or null or file_error, since that's the return type of function read_text_file.
- Flow-sensitive typing
  Within the three if branches, the compiler adapts the type of result as follows:
  - In the first then branch, the compiler deduces the type of result to be string, because this branch is only executed if result is string evaluates to true.
  - In the second branch (result is null), the type of result is deduced to be null. Thus a compile-time error would occur if we accidentally used an expression like result.message in this branch.
  - In the final else branch, the type of result is deduced to be file_error, because this is the remaining member type not yet covered in the previous branches. That's the reason why we can write result.message without first casting result to type file_error. The expression result.message is valid because here result is guaranteed to be of type file_error, and message is an attribute defined in type file_error (inherited from type error, as will be explained in a subsequent article).
Flow-sensitive typing (also called flow typing or occurrence typing) is practical because it allows us to write succinct code that remains type-safe. We'll see more examples in subsequent articles.
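Flow-sensitive typing is not unique to PTS. As an illustration, here is a minimal sketch in TypeScript, where the same technique is called narrowing. The names FileReadFailure and describeOutcome are hypothetical.

```typescript
// Hypothetical error type, standing in for PTS's file_error.
class FileReadFailure {
  constructor(readonly message: string) {}
}

function describeOutcome(outcome: string | null | FileReadFailure): string {
  if (typeof outcome === "string") {
    // In this branch the compiler treats 'outcome' as a string,
    // so string methods can be called without a cast.
    return "Content of file: " + outcome.toUpperCase();
  } else if (outcome === null) {
    // Here 'outcome' is null; with strict null checks enabled,
    // writing 'outcome.message' would be a compile-time error.
    return "The file is empty.";
  } else {
    // Only FileReadFailure remains, so 'message' is accessible without a cast.
    return "The following error occurred: " + outcome.message;
  }
}
```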
Note
Type inference should not be overused, because there is a risk of hiding information that would be useful to keep in the source code, especially for developers who didn't write the code but need to understand and maintain it (e.g., in case of open source libraries, large code bases, etc.).
Obviously, a statement like this:
const name string = "Albert"
... can be shortened to:
const name = "Albert"
... without reducing readability.
However, look at this code:
const price = get_product_price ( "123" )
What does the function return? An integer? A decimal? Something else? We simply can't know by just looking at this line of code. Ambiguities like this disappear, and readability increases, if the type of price is stated explicitly:
const price money_amount or inexistent_product_id_error = get_product_price ( "123" )
Yes, the code is more verbose, but it's also more expressive. Now it clearly states that get_product_price returns the union type money_amount or inexistent_product_id_error.
case type of Statement
Instead of using the is operator in an if then else statement for conditional execution, there is a better way to execute type-dependent code: pattern matching.
The idiomatic way to check the type returned by a function is to use pattern matching via a case type of statement, as follows:
case type of read_text_file ( file_path.create ( "example.txt" ) )
is string as text // the string is stored in constant 'text'
write_line ( "Content of file:" )
write_line ( text ) // the previously defined constant 'text' is now used
is null
write_line ( "The file is empty." )
is file_error as error
write_line ( """The following error occurred: {{error.message}}""" )
.
While the above code is semantically equivalent to the previous version that uses an if then else statement, it has the following advantages:
- The code is shorter and easier to read.
- The compiler ensures that all members of the union type are covered in the branches of the case type of statement: Leaving out any of the three is branches of the above example results in a compile-time error.
  This feature is also invaluable when working with third-party libraries. Furthermore, if the members of a union type change later on (e.g., a member is added or removed), the compiler ensures that all case type of statements have been updated in the code, a most welcome aid in complex, multi-developer projects.
- The compiler can optimize the generated binary code to render it smaller and faster (depending on implementation details not covered here).
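To give a feel for the exhaustiveness check in a language that can be run today, here is a sketch in TypeScript. TypeScript has no case type of statement, so the sketch uses a common substitute: a helper that only accepts the type never. The names ReadOutcome, assertNever, and summarize are hypothetical.

```typescript
// Hypothetical error type, standing in for PTS's file_error.
class FileReadFailure {
  constructor(readonly message: string) {}
}

type ReadOutcome = string | null | FileReadFailure;

// Accepts only values of type 'never', i.e. values that cannot exist.
function assertNever(value: never): never {
  throw new Error("Unhandled member type: " + String(value));
}

function summarize(outcome: ReadOutcome): string {
  if (typeof outcome === "string") return "text";
  if (outcome === null) return "empty";
  if (outcome instanceof FileReadFailure) return "error: " + outcome.message;
  // All members are handled, so 'outcome' has type 'never' here.
  // If a member were added to ReadOutcome (or a branch above removed),
  // a compiler with strict checks enabled would reject this call.
  return assertNever(outcome);
}
```

This is more verbose than the PTS version because the check has to be requested explicitly, whereas a case type of statement performs it by default.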
Instead of handling each member type individually (as shown above), sometimes only a few member types need individual handling, while all remaining member types can be handled in the same way. In such situations, the last branch of a case type of statement can be an otherwise branch, which covers all remaining member types not yet handled in preceding is branches:
case type of read_text_file ( file_path.create ( "example.txt" ) )
is file_error
write_line ( "An error occurred!" )
otherwise
write_line ( "Ok" )
.
In the above code, member type file_error is handled individually, and member types string and null are handled the same way in the otherwise branch.
Instead of using an otherwise branch, a better approach is to use a union type in an is branch:
case type of read_text_file ( file_path.create ( "example.txt" ) )
is file_error
write_line ( "An error occurred!" )
is string or null
write_line ( "Ok" )
.
An advantage of this style is that the code explicitly and reliably mentions all types possibly returned by read_text_file (which is not the case if an otherwise branch is used).
Plus, if the members of the union type change later on, the compiler reminds us to adapt any case type of statement if we forget to do so. This eliminates the risk of handling a new member type in the otherwise branch, when it actually needs to be handled individually.
As a general rule, the otherwise branch should be used sparingly: we should think twice and anticipate potential maintenance problems.
case type of Expression
Besides a case type of statement, there is also a case type of expression available:
const message = case type of read_text_file ( file_path.create ( "example.txt" ) )
is string: "a string"
is null: "null"
is error: "an error"
write_line ( "The result is " + message )
In this code, the value "a string", "null", or "an error" is assigned to constant message (inferred to be of type string), depending on the type returned by function read_text_file.
if type of Expression
PTS also provides an if type of expression that can be used as follows:
const message = if type of read_text_file ( file_path.create ( "example.txt" ) ) is string \
then "a string" \
else "null or an error"
write_line ( "The result is " + message )
Behind the Scenes
Hopefully, the previous sections demonstrated that union types are simple to understand and easy to use.
That doesn't mean, however, that the compiler has an easy job too. On the contrary, the compiler needs to ensure type compatibility, infer types, deduce types in the branches of control flow statements, take into account type inheritance and type parameters, etc.
The compiler must prevent any misuses of types, and display helpful error messages whenever rules are violated.
Note
Readers not interested in this excursion can skip this section.
In the previous section, we already saw how the compiler ensures that all members of a union type are covered in pattern matching statements and expressions.
Now let's have a quick look at a few additional compiler tasks related to union types.
In the following examples, we assume that types fruit and vegetable are child-types of product.
Union Type Declaration
The compiler checks the coherence of members declared in a union type. Suppose we declare union type product or fruit or null. This declaration is invalid, and the compiler reports it with a comprehensible error message like this:
Union type 'product or fruit or null' is invalid.
Reason: 'fruit' is a child-type of 'product', and therefore 'fruit' is already covered
by member 'product' in the union type 'product or fruit or null'.
Possible solution: Change to 'product or null'.
Type Compatibility Checks
Type string is compatible with type string or null.
But the inverse is not true: string or null is not compatible with string, because null is valid for type string or null, but not for type string.
More generally:
Type compatibility checks can get complex when several factors need to be taken into account.
For example, fruit or vegetable or null is compatible with product or null. However, product or null is compatible with fruit or vegetable or null only if the following two conditions are fulfilled:
- product is an abstract type, which means that no instances of product can be created, i.e., only instances of child-types are allowed.
- fruit and vegetable are the only direct child-types of product.
Note
The above two conditions would be defined as follows in PTS code:
type product \
factories: none \
child_types: fruit, vegetable
// more code
.
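The compatibility rules described in this section can be sketched as a small algorithm. The following TypeScript code is only an illustration under simplifying assumptions (types are identified by name, each type has at most one parent, and a list of child-types is either complete or empty). It is not how a PTS compiler is actually implemented, and all names (TypeEntry, Hierarchy, isCompatible, etc.) are hypothetical.

```typescript
// Hypothetical description of one type in the hierarchy.
interface TypeEntry {
  parent: string | null;    // direct parent type, if any
  isAbstract: boolean;      // true if no direct instances can be created
  sealedChildren: string[]; // complete list of direct child-types, or [] if the list is open
}
type Hierarchy = { [typeName: string]: TypeEntry };

// True if 'child' is 'ancestor' itself or inherits from it.
function isSubtype(h: Hierarchy, child: string, ancestor: string): boolean {
  let current: string | null = child;
  while (current !== null) {
    if (current === ancestor) return true;
    current = h[current].parent;
  }
  return false;
}

// True if every value of 'member' is accepted by some member of 'target'.
function memberFits(h: Hierarchy, member: string, target: string[]): boolean {
  if (target.some(t => isSubtype(h, member, t))) return true;
  // An abstract type with a closed list of children has no values of its own,
  // so it fits if all of its children fit.
  const entry = h[member];
  return entry.isAbstract
    && entry.sealedChildren.length > 0
    && entry.sealedChildren.every(c => memberFits(h, c, target));
}

// A union 'source' is compatible with a union 'target' if each source member fits.
function isCompatible(h: Hierarchy, source: string[], target: string[]): boolean {
  return source.every(m => memberFits(h, m, target));
}

const leaf = (parent: string | null): TypeEntry =>
  ({ parent: parent, isAbstract: false, sealedChildren: [] });

// 'product' is abstract, and fruit/vegetable are its only child-types.
const closedHierarchy: Hierarchy = {
  product: { parent: null, isAbstract: true, sealedChildren: ["fruit", "vegetable"] },
  fruit: leaf("product"), vegetable: leaf("product"),
  string: leaf(null), "null": leaf(null),
};

// Same types, but 'product' is concrete and its child list is open.
const openHierarchy: Hierarchy = {
  product: leaf(null),
  fruit: leaf("product"), vegetable: leaf("product"),
  string: leaf(null), "null": leaf(null),
};
```

With these definitions, isCompatible ( closedHierarchy, ["product", "null"], ["fruit", "vegetable", "null"] ) evaluates to true, while the same call with openHierarchy evaluates to false, which mirrors the two conditions listed above.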
Type Inference
Suppose that:
Now consider the following code that uses an if then else expression:
const c = if condition then foo else bar
In this case the compiler infers the type of constant c to be string or number or null: the compiler merges the possible return types of foo and bar.
Now suppose that:
The expression if condition then foo else bar is then inferred to be of type product or null, because fruit and vegetable are covered already by product. The compiler first merges the output types of foo and bar to product or null or fruit or vegetable, and then normalizes the result to product or null.
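The merge-and-normalize step can be sketched in a few lines. The TypeScript code below is a simplified illustration, assuming types are identified by name and each type has at most one parent; the names parentOf, hasAncestor, and normalizeUnion are hypothetical, and a real compiler would have to consider more factors (e.g., type parameters).

```typescript
// Direct parent of each type (null if there is none).
const parentOf: { [typeName: string]: string | null } = {
  product: null, fruit: "product", vegetable: "product",
  string: null, number: null, "null": null,
};

// True if 'ancestor' is a (direct or indirect) parent of 'child'.
function hasAncestor(child: string, ancestor: string): boolean {
  let current: string | null = parentOf[child];
  while (current !== null) {
    if (current === ancestor) return true;
    current = parentOf[current];
  }
  return false;
}

// Normalizes a merged list of member types:
// 1. duplicates are removed (the first occurrence is kept),
// 2. members already covered by an ancestor in the list are dropped.
function normalizeUnion(members: string[]): string[] {
  const unique = members.filter((m, i) => members.indexOf(m) === i);
  return unique.filter(m => !unique.some(other => hasAncestor(m, other)));
}

// The merged type from the text: product or null or fruit or vegetable
const normalized = normalizeUnion(["product", "null", "fruit", "vegetable"]);
```

Here normalized contains product and null, matching the result described above.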
Further Examples and Benefits
Besides being essential for null- and error-handling, union types have other interesting use cases. For instance, they can help to simplify APIs, provide type safety that couldn't be achieved without union types, and simplify eager/lazy evaluation.
Let's look at a few examples.
Simpler APIs
Let's say we have a function that checks text. It should be possible to provide the text directly as a string or indirectly via a file path or a URL pointing to text content. Here is the function signature:
fn check_text ( source string or file_path or URL ) -> text_error or null
// function body
.
If union types weren't supported for input parameters, we would need three functions to cover the three types for input parameter source:
fn check_text ( text string ) -> text_error or null
// function body
.
fn check_text_file ( file_path file_path ) -> text_error or null
// function body
.
fn check_text_URL ( URL URL ) -> text_error or null
// function body
.
Note
Various languages support function overloading (e.g. C++, C#, and Java), which allows all three functions to have the same name, differing only in their parameter type.
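As an illustration of what the body of such a function might look like, here is a sketch in TypeScript. Everything in it is hypothetical: the types FilePathRef, WebAddress, and TextProblem, the placeholder loaders (which don't access any file or network), and the check itself (only blank text is reported).

```typescript
// Hypothetical stand-ins for PTS's file_path, URL, and text_error types.
class FilePathRef { constructor(readonly path: string) {} }
class WebAddress { constructor(readonly address: string) {} }
class TextProblem { constructor(readonly message: string) {} }

// Placeholder loaders: a real implementation would read the file or fetch the address.
function loadFromFile(source: FilePathRef): string { return "text from " + source.path; }
function loadFromWeb(source: WebAddress): string { return "text from " + source.address; }

// A single function covers all three kinds of input.
function checkText(source: string | FilePathRef | WebAddress): TextProblem | null {
  const text =
    typeof source === "string" ? source
    : source instanceof FilePathRef ? loadFromFile(source)
    : loadFromWeb(source);
  // Placeholder check: only blank text is reported as a problem.
  return text.trim() === "" ? new TextProblem("Text is blank") : null;
}
```

The type-dependent code is confined to the first statement of the body; the rest of the function works with a plain string.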
Union Types in Record Types
Since union types can be used wherever single (non-union) types can be used, they can also be used for attributes in record types. Here is a record type with two attributes using union types:
record type text_source
att name string or null default:null
att source string or file_path or URL
.
Now we could improve function check_text to also accept a text_source record as input:
fn check_text ( source string or file_path or URL or text_source ) -> text_error or null
...
.
Note
Union types should not be overused, because they require type-dependent code (e.g., case type of statements) in function bodies. Besides increasing cyclomatic complexity, there is also a tiny performance penalty involved, which could be an issue in performance-critical parts of an application.
Hence, instead of having a single function check_text that accepts four types as its input parameter (string or file_path or URL or text_source), it might be better (depending on the context) to use four individual functions to cover each case individually.
Type-Safety
Suppose we need a type-safe list that contains only strings and characters, such as: ["abc", 'a', "hi", '!']. In this context, type-safe means that only objects of type string or character can be added to the list and retrieved from it. A compile-time error occurs whenever some code violates this rule.
Without union types, we typically have two options:
The first solution is simple, but not type-safe, since objects of any type can be added — neither a compile- nor a run-time error is generated if we accidentally add a number or a pink elephant.
The second solution requires boilerplate code to be written, tested, and maintained. Moreover, compile-time type safety is only guaranteed when elements are added, but not when they are retrieved (e.g., looped over), because they must be cast to string or character, and the compiler doesn't report an error if we accidentally cast to the wrong type (e.g., number).
In practice, most developers (including myself) would therefore opt for the first solution, sacrificing type-safety at the altar of convenience.
A union type removes the quandary. We can use the standard syntax to declare the element type of the list, while keeping the list type-safe: list<string or character>.
A type-safe list literal looks like this:
[list<string or character> "abc" 'a' "hi" '!']
Here is an example of how we could create a list programmatically and then iterate over its elements:
const list = mutable_list<string or character>.create
list.add ( "abc" ) // OK
list.add ( 'a' ) // OK
// list.add ( 123 ) <- compile-time error !!!
// Loop without type check
repeat for each element in list
write_line ( element.to_string )
.
// Loop with type check
repeat for each element in list
case type of element
is string
write_line ( """String: {{element}}""" )
is character
write_line ( """Char: {{element.to_string}}""" )
.
.
Output:
abc
a
String: abc
Char: a
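A comparable list can be sketched in TypeScript. Since TypeScript has no character type, the sketch assumes a small hypothetical Char class; the names items and labels are also made up for illustration.

```typescript
// Hypothetical character type (TypeScript has no built-in one).
class Char {
  constructor(readonly value: string) {
    if (value.length !== 1) throw new Error("A Char holds exactly one character");
  }
}

// The element type is a union, so the list stays type-safe.
const items: (string | Char)[] = [];
items.push("abc");          // OK
items.push(new Char("a"));  // OK
// items.push(123);         // compile-time error: number is not string | Char

// No casts are needed when elements are retrieved:
// the type check narrows 'element' in each branch.
const labels: string[] = items.map(element =>
  typeof element === "string"
    ? "String: " + element
    : "Char: " + element.value);
```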
Eager vs Lazy Evaluation
Union types allow either eager or lazy (i.e., immediate or delayed) evaluation of input parameters.
Consider the error_message input parameter in the following function, which defines a specific error message to be used whenever the function fails and returns a file_read_error object:
fn read_text_file (
file file_path
error_message string ) -> string or null or file_read_error
The point is that input parameter error_message is used only when execution of the function fails.
Now consider an application serving an international audience that retrieves locale-dependent error messages from a database. The application executes function calls like this:
const result = read_text_file (
file = "example.txt"
error_message = get_error_message_from_DB ( error_id = "123" ) )
Obviously, function calls like this can cause serious performance penalties, because each time read_text_file is called, the error message is retrieved from the database, although it is needed only if something goes wrong.
A simple solution is to use a union type for input parameter error_message, as follows:
fn read_text_file (
file file_path
error_message string or string_supplier ) -> string or null or file_read_error
Type string_supplier is defined as follows:
type string_supplier
fn get -> string
.
As we can see, type string_supplier has a single function, named get, which returns a string.
Here's a simplified excerpt of the read_text_file body:
fn read_text_file (
file file_path
error_message string or string_supplier ) -> string or null or file_read_error
...
if something_went_wrong then
const message string = if error_message is string then error_message else error_message.get
// create and return file_read_error
.
.
The advantage is that error_message.get is now evaluated only if something goes wrong.
To benefit from lazy evaluation, the application now provides a string_supplier instead of a string. This can easily be done with a closure:
const result = read_text_file (
file = "example.txt"
error_message = { get_error_message_from_DB ( error_id = "123" ) } )
Explaining PTS closures is beyond the scope of this article. However, as shown in the above code, eager evaluation can now easily be turned into lazy evaluation by just embedding an expression in a pair of curly braces ( {...} ). For general information about closures, you can read the Wikipedia article Closure (computer programming).
The performance bottleneck has been eliminated, since the error message is now retrieved from the database only if a file read error occurs.
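The effect can be observed in a small TypeScript sketch that counts how often the (simulated) database is queried. All names are hypothetical, the failure condition is a placeholder, and getErrorMessageFromDb merely increments a counter instead of accessing a database.

```typescript
class FileReadFailure {
  constructor(readonly message: string) {}
}

type StringSupplier = () => string;

// Simulated database access: counts how often it is called.
let lookups = 0;
function getErrorMessageFromDb(errorId: string): string {
  lookups++;
  return "Localized message for error " + errorId;
}

function readTextFile(
  filePath: string,
  errorMessage: string | StringSupplier
): string | null | FileReadFailure {
  const somethingWentWrong = filePath === "missing.txt"; // placeholder failure condition
  if (somethingWentWrong) {
    // The supplier is called only here, i.e. only when the message is needed.
    const message = typeof errorMessage === "string" ? errorMessage : errorMessage();
    return new FileReadFailure(message);
  }
  return "dummy";
}

// Eager: the message is looked up before the call, although it is not needed.
readTextFile("example.txt", getErrorMessageFromDb("123"));
const afterEagerCall = lookups;

// Lazy: the closure is passed, but never called because the read succeeds.
readTextFile("example.txt", () => getErrorMessageFromDb("123"));
const afterLazySuccess = lookups;

// Lazy: the read fails, so the closure is called exactly once.
const failed = readTextFile("missing.txt", () => getErrorMessageFromDb("123"));
const afterLazyFailure = lookups;
```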
Summary
Union types are simple to understand, easy to use, and they provide an elegant solution for frequent programming tasks.
They provide a sound foundation for uniform null- and error-handling, two critical aspects of a practical type system.
Moreover, union types help to simplify APIs, increase type-safety (by minimizing cardinality and thus supporting the PTS Coding Rule), facilitate eager/lazy evaluation, and provide additional benefits.
What's Next?
The next two PTS articles will be dedicated to null-handling and error-handling.
Acknowledgment
Many thanks to Tristano Ajmone for his useful feedback to improve this article.
History
- 30th November, 2023: Initial version