Skip to main content

PhD Research - Koka Part 1

· 8 min read

As I'm starting my I'm going to be writing mostly just thoughts that I want to get down in writing. Not everything will be coherent or polished. In the future I might go back to a post and update it and polish it up. But for now, I'm just going to write. I'll clearly mark if a post is polished, and if it's been updated since the posting date with some sort of tag probably updated or finished.

For my PhD, I'm happy to say that I'll be working with the Koka programming language and static analysis.

My Advisor Dr. Kimball Germane is a computer science faculty at BYU where I'm at. https://kimball.germane.net/

He has a lot of experience with static analysis, and especially with making control flow analysis practical for real world use.

Koka is a wonderful language and I've started using it more and more for my personal projects. It's a research language, and as such is not recommended for general use at this point. But I hope to help change that.

Static Control Flow Analysis

Static analysis is a way to analyze properties about a program without running it. Some properties of a program are restricted by the type system, but often you want to know a bit more about the program than a type system can tell you. For example, you might want to know if an integer is always positive, or if a function never throws an exception.

While some of these properties can be determined by a more restrictive type system, there is always a tradeoff when you make a type system more restrictive. The more restrictive a type system is, the more difficult it is to write programs that may be valid, but don't fit the type system. On the other hand, some type systems are super expressive allowing the programmer to express rich types where types can depend on values (see Idris for an example). Both by making a type system more restrictive and by making it more expressive, you completely change the developer experience which can make it much easier or much harder to write programs.

Even with an expressive type system, there are still properties that are interesting to a compiler that the type system is unlikely to help you with. A common example is inlining. Inlining functions can be very beneficial to runtime performance – it can save you the overhead of a function call (including all the low level details such as saving registers, etc). Additionally, in the case of higher order functions such as closures, it can save you de/allocation of the closure. Another example of a property that is useful, but not typically expressed in type systems are security properties such as ensuring that secrets within a program are not leaked, and that the program doesn't accept any input from the user that could compromise the program.

Flow analysis is one way of solving many static analysis questions. Essentially the common questions that we ask is about what values can flow to which points in the program. Due to recursion, infinite loops, and arbitrary input, we cannot just run the program and put in some logging statements. Instead we really want to know all possible ways the program can execute given any input or path through the program. This is a very difficult problem, and in general is undecidable. However, there are many cases where we can get a good approximation of the answer, and that is what we are interested in. There are many ways of approximating the solution to this problem, but what we are interested in is a sound solution. For the solution to be sound it means that we need to consider all possible executions of the program. Because it is an approximation we can either underestimate all possible executions (leaving out some possibilities) or overestimate all possible executions (including some possibilities that are in reality not possible). We are interested in the latter case, because we want to be sure that we are not missing any possible execution paths. If we don't consider all possible executions that could mean that our analysis is unsound, for example it might say that a value is always positive, when in reality it could be negative. You could imagine that it might be useful to know when something has to happen, in which case you would want a complete analysis. A complete analysis guarantees that if it says something happens then it really does. Soundness and completeness are not opposites, but they are both useful properties to think about when discussing formal reasoning systems.

However just following where values flow – such as into and out of functions or into branches of a switch or if statement – is not typically enough. Many languages have higher order properties such as closures. With closures, functions are first class values and can be passed around the program. This means that the flow of a value can depend on the flow of a function and vice versa. This is what we call Control Flow Analysis.

In the past control flow analyses have been really expensive, so in practice, they are not often used in compilers. Our aim is to change that. While there are theoretic barriers to making control flow analyses as fast as reasoning about local properties such as types, we believe that there is a lot of room for improvement in the practicality of control flow analyses, and that they can solve many of the problems that compiler would like to know about.

A great paper by my advisor on the topic is Demand Control-Flow Analysis. It is a great read, especially for understanding the approach we are planning to take with my PhD research.

Koka

Koka is a functional language, with many of the common features of such a language – algebraic datatypes, closures, aggressive type inference. It differs from other functional languages in a few important ways though. First it is designed to be extremely practical, while still being a pure language. Its notion of purity is even stronger than Haskell's notion – even divergence or exceptions are explicit. This means that the language is designed to be easy to reason about, and easy to analyze. At the same time, however, it makes typically verbose or unwieldy features of functional languages much more ergonomic. For example, Koka has a very powerful effect system that allows you to express many of the common effects that you might want to express in a program. This includes things like exceptions, divergence, mutable state, and IO. However, unlike Haskell, you don't have to explicitly thread effects through your program. Instead, the compiler will automatically insert the necessary code to thread the effects through your program. This means that you can write code that looks like it is impure, but the compiler will automatically insert the necessary code to make it pure. This is a very powerful feature, and it makes it much easier to write code that is pure, but still has side effects. It also makes it much easier to reason about the code, because you can see exactly where the effects are happening. It also has syntax that is very friendly to both Python and C programmers, which is a flexibility that is highly appreciated.

Koka's main focus other than a robust algebraic effect system is speed. It is designed to be fast, and it is. It is much faster than most other functional languages, and for many benchmarks it approaches the speed of C, but with much better readability. Due to the pure nature of the language, there are still some places where it lacks performance, and there are many performance critical libraries that need to be developed for it, but in general it is more than fast enough for common use cases. Underneath the hood, Koka uses reference counting, however, it uses powerful local analyses to eliminate most of the overhead of reference counting. The compiler and runtime attempts to reuse as much of the memory that would be deallocated for new allocations within the same stack frame. Due to common patterns in functional programming this is very powerful. Additionally it makes persistent data structures perform as quickly as their mutable counterparts when used in a non-persistent context, while adapting gracefully and performantly to persistent contexts. While this strategy tends to outperform C programs in many use cases, the approach cannot cover all use cases. Eventually there will likely be some sort of support for more mutable patterns for when the reference counting approach is not sufficient.

Although we could use static analysis to try to directly improve some reference counting, we are interested into looking into a variety of ways to improve Koka's speed.

A few ideas that we came up with include the following:

  • Determining whether int values could be specialized to a more performant type such as int32 or int64
    • This would likely involve using the Z3 Theorem Prover
  • Determining whether effects are always used linearly, and if so, using a more performant representation
    • Previous experiments have shown this to not make a difference, and sometimes cause a performance hit, but it does improve the code size
  • Determining whether a local variable could be stack allocated instead of heap allocated
  • Better closure allocations or inlines