A Deep Dive Into How to Implement a Custom Load Balancer in gRPC Go

Preliminary

In this post, I want to talk about a great feature of the gRPC project: custom load balancing. gRPC, developed by Google, is a general-purpose RPC framework that achieves high performance through various means, some of which I will touch on in this post. My focus, however, is on load balancing and, specifically, on how to develop a custom load balancer for this project.

Before getting our hands dirty with details, let me explain why I decided to write this post. Despite having a great ecosystem, gRPC notoriously suffers from a lack of comprehensive documentation, especially on the internals of the project and its advanced features, forcing people to dig into the code if they want to understand how those features work. This happened to me when I tried to learn how to implement a custom load balancer for a project of mine: I couldn't find any guide on how to do it, just an example showcasing a custom load balancer with no documentation at all.

After putting the pieces together through code examination and some experimentation, I finally figured out the parts of gRPC's internals that relate to load balancing (the balancer package). By writing this guide, I hope other folks will find their way around load balancing in gRPC more easily.

In my last post about Raft implementation, we saw an example of how Remote Procedure Calls (RPC) are indispensable for distributed systems, which are everywhere these days. However, in our simple Raft implementation, there was no need for load balancing. Let’s first review load balancing and why it plays a critical role in distributed systems.

Load Balancing

As the name suggests, load balancing means balancing the load between different instances (or replicas) of a single service or functionality. Note that the target functionality can take a variety of forms: networking, storage, and computing. In this post, I'm focusing on computing resources because that's what we are in charge of when designing a load balancer for gRPC. The important question to ask ourselves is: why do we even bother evening out the load between replicas?

From an operational point of view, not evening out the load across a set of replicas increases the chances of overloading an instance (sending it more load than it can handle), which in turn increases the chances of that instance failing. Failures reduce the available computing capacity, which directly translates to not using the computing hardware efficiently, or in other words, wasting the invested money.

Although controlling overload is one of the goals for load balancing, it’s certainly not the most important one. To understand this, let’s consider different sources of delay in a simple computer network. Packets in the network are exposed to four sources of delay:

  1. Transmission delay: The time it takes for the transmitter on the client side to push all the bytes of the packet onto the transmission link.
  2. Propagation delay: The time it takes for a single bit to traverse the distance between the client and the server on the transmission link. The propagation speed is usually a fraction of the speed of light.
  3. Processing delay: The time it takes for a router, a switch, or even the destination server to process the packet. For example, a switch needs to determine the next hop for the packet.
  4. Queuing delay: At any point in the network, or even at the hosts, packets might have to wait in a queue to receive their requested service.

Out of all of these sources of delay, only one is not bounded: the queuing delay. The other delays are bounded by the underlying hardware performing the work, but the queuing delay depends on other factors, including the number of packets competing for the same service.

The queuing delay was so important that it encouraged people to study it in a dedicated discipline: queueing theory. As a result, people have tried to model the nature of this delay and quantify it. For example, let's assume a single server with one queue attached to it. Requests arrive at this server with exponentially distributed inter-arrival times at rate $\lambda$, and each request takes a variable amount of time to complete its processing at the server (after waiting in the queue), modeled by an exponential distribution with rate $\mu$. According to queueing theory, the average end-to-end response time of a request (defined as the difference between when the request leaves the system and when it arrives at the system) is:

$$T = \frac{1}{\mu - \lambda}$$

According to this result, we can say that regardless of the scheduling policy at this system, as the arrival rate of new requests gets closer to the processing rate of the server, the end-to-end delay grows without bound.
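
To put a number on this, here is an illustrative calculation (the rates are made up, not measured): suppose a server processes requests at a rate of $\mu = 1000$ requests per second. Then

$$\lambda = 500 \Rightarrow T = \frac{1}{1000 - 500}\,\text{s} = 2\,\text{ms}, \qquad \lambda = 990 \Rightarrow T = \frac{1}{1000 - 990}\,\text{s} = 100\,\text{ms}.$$

Pushing utilization from 50% to 99% makes the average response time fifty times worse on the very same hardware, which is exactly the kind of blow-up load balancing tries to prevent on any single replica.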

It is really important in a cluster to be careful about the load balancing of all replicated services, because a poor load balancing scheme for even a small subset of services in a cluster or datacenter can lead to a performance issue known as tail latency. To understand this issue, assume that a client requires the results of three different services (A, B, and C) in the cluster to compute its output (the client issues requests to these services in parallel). Thanks to proper load balancing, services A and B take up to 30ms to respond, but because of load imbalance across the instances of C, some instances can take up to 900ms to respond (these instances have large queues). Note that on average, instances of service C respond within 20ms (not all replicas have large queues). In this situation, the average response time for the client is 30ms (dominated by the delays of A and B). However, the worst-case delay is 900ms (dominated by service C).
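
To get a feel for how a small number of slow replicas turns into a widespread problem, here is a rough, illustrative calculation (the numbers are mine, not measurements): suppose a request to service C lands on one of the overloaded instances 5% of the time, and a client fans out 20 parallel requests to C for a single operation. The probability that at least one of those requests hits a slow instance is

$$1 - (1 - 0.05)^{20} \approx 0.64,$$

so roughly two out of three client operations are delayed by the 900ms tail, even though 95% of individual requests are fast.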

Having a high tail latency is a serious issue because customers who experience it will not be satisfied with their experience and may not use the service again, which translates to less profit.

Load Balancing in gRPC

Now that we have covered why load balancing is important, let's talk about how gRPC implements it. Although load balancing can be implemented in many forms (read this post on different variants of load balancing), gRPC opts for client-side load balancing. I guess that since gRPC itself is independent of service mesh projects (Istio, Cilium, Linkerd, etc.), its designers wanted gRPC to have its own load balancing mechanism without depending on other projects, and client-side load balancing fits the bill here. In client-side load balancing, the client is responsible for selecting the destination to which a new request is transmitted for processing. As I stated before, gRPC has a lot of functionality, and one piece of it is service discovery, which is responsible for providing a list of backend addresses (IP addresses) for the load balancer to choose from (we will get to this later).

As of now, gRPC Go has some load balancing schemes included in the project (a sketch of how a client selects one follows the list). These schemes are:

  1. pick first: The load balancer picks the first backend in the backend list (essentially, it does no load balancing).
  2. round robin: This policy iterates over all backends, sending each successive RPC to the next backend in the list and wrapping around to the start of the list when needed (see here for more info on round robin scheduling in general).
  3. weighted round robin: This policy, which is formally defined here, is a more advanced form of the round robin policy; it uses metrics sent directly from backends to the client to calculate a weight for each backend and then prioritizes certain backends over others based on their weight.
  4. least request: This policy picks the backend with the least number of pending requests sent to it.
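
To give a sense of how a client opts into one of these built-in policies, here is a minimal sketch that selects a policy through the service config; the target address, the insecure credentials, and the choice of round robin are purely illustrative:

package main

import (
    "log"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
)

func main() {
    // Ask gRPC to use the built-in round_robin policy for this client connection.
    conn, err := grpc.Dial(
        "dns:///my-service.example.com:50051", // placeholder target
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
    )
    if err != nil {
        log.Fatalf("failed to dial: %v", err)
    }
    defer conn.Close()
    // ... create service stubs on conn and issue RPCs as usual.
}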

Before starting to criticize gRPC's decision about its out-of-the-box load balancers, we should note that gRPC's transport is different from most other RPC implementations because it uses HTTP/2. I'm not going to get into the details of this protocol (you can read this and this for more info on HTTP/2 and why gRPC chose it), but the main difference between HTTP/2 and HTTP/1.1 is that HTTP/2 multiplexes multiple requests on top of a single end-to-end connection. This provides a lot of benefits in terms of efficiency and performance, but it also poses a challenge for load balancing: gRPC connections are sticky, and the client has to resort to request-level load balancing across existing connections instead of connection-level load balancing, which has fewer constraints.

This is exactly why gRPC has invested a lot in round robin policies: they fit pretty well with request-level load balancing over sticky connections. However, round robin policies exacerbate fan-out and fan-in in clusters and datacenters, which can decrease efficiency when there are a lot of instances available.

Performance of Load Balancing Policies

At first glance, you might think the out-of-the-box policies are the best that research and industry have to offer, but unfortunately, they are far from optimal in many cases. I certainly don't want to get into state-of-the-art load balancing policies in this post, but I will mention some starting points for people who are interested in this line of research:

  1. R2P2: Making RPCs first-class datacenter citizens: This paper explains why optimal load balancing policy might not be operationally feasible to implement and proposes an approximation to the optimal policy through a novel design.
  2. When to Hedge in Interactive Services: This paper discusses the best approaches to request hedging and how hedging is related to load balancing.
  3. The Power of Two Choices in Randomized Load Balancing: This paper discusses random-based policies and how to significantly improve them by relying on multiple random choices.

By reading these papers, you will quickly understand the need for a custom load balancing scheme in gRPC that might be more suitable for your specific workload and cluster.

Terminology

  1. Connection (ClientConn in the code) — A logical connection between a client and a set of instances providing a service. A connection contains a set of sub-connections.
  2. Sub-connection (SubConn in the code) — A single physical connection between a client and a backend.
  3. Resolver — It is the component responsible for resolving a target service URL to a list of backend addresses implementing the target service.

Main Involved Types

First, let me introduce the main types that we are going to deal with.

Balancer

This is an interface that should be implemented as part of our custom load balancer. Formally, the Balancer interface is defined as follows:

type Balancer interface {
    UpdateClientConnState(ClientConnState) error
    ResolverError(error)
    UpdateSubConnState(SubConn, SubConnState)
    Close()
}

The responsibilities of this implementation are:

  1. Managing Connections: It maintains and manages the connections to the servers. This involves creating, updating, and closing connections as needed.
  2. Handling Updates: It processes updates from the Resolver. The UpdateClientConnState method is crucial here, as it updates the internal state of the balancer with the new addresses.
  3. State Management: It keeps track of the state of each sub-connection (i.e., the connections to the servers). The UpdateSubConnState method is responsible for updating the state of these connections and ensuring they are healthy.
  4. Error Handling: It handles errors from the resolver through the ResolverError method, ensuring that the balancer can respond appropriately to issues.
  5. Resource Cleanup: When the balancer is no longer needed, the Close method is called to clean up resources and gracefully shut down the balancer.

Let’s consider when each of Balancer’s methods is called and why:

  1. UpdateClientConnState — This method is called by the gRPC internals (not by us) whenever the Resolver updates the list of backend addresses. The new list is passed in the ClientConnState object in the argument.
  2. UpdateSubConnState — This method is called by gRPC internals whenever the state of a sub-connection changes (This can happen due to various events, such as the connection becoming ready, encountering a transient failure, or transitioning to an idle state).
  3. ResolverError — This method is called by the gRPC internal library whenever the Resolver returns an error.
  4. Close — Cleans up resources when the balancer is closed (Again, it is called by the gRPC internal library).

If you have been following the discussion carefully, you will have noticed that the Balancer, despite its name, does not do any load balancing at all. If you think this way, you are 100% right! Hang in a bit longer to see the actual load balancing!

To summarize, the balancer receives a list of backend addresses from the Resolver, and maintains the lifecycle of the connection and sub-connections. It delegates the actual load balancing decision to Picker.

Picker

As the name suggests, the Picker is responsible for picking a sub-connection on which to send a new RPC. Consequently, the actual load balancing policy is implemented here.

Let’s see the Picker interface to understand it better:

type Picker interface {
    Pick(info PickInfo) (PickResult, error)
}

As you can see, Picker is pretty simple and it has only one method. Let’s see the definition for PickInfo and PickResult:

type PickInfo struct {
    // FullMethodName is the method name that NewClientStream() is called
    // with. The canonical format is /service/Method.
    FullMethodName string
    // Ctx is the RPC's context, and may contain relevant RPC-level information
    // like the outgoing header metadata.
    Ctx context.Context
}


type PickResult struct {
    // SubConn is the connection to use for this pick if its state is Ready.
    // If the state is not Ready, gRPC will block the RPC until a new Picker is
    // provided by the balancer (using ClientConn.UpdateState).  The SubConn
    // must be one returned by ClientConn.NewSubConn.
    SubConn SubConn

    // Done is called when the RPC is completed.  If the SubConn is not ready,
    // this will be called with a nil parameter.  If the SubConn is not a valid
    // type, Done may not be called.  May be nil if the balancer does not wish
    // to be notified when the RPC completes.
    Done func(DoneInfo)

    // Metadata provides a way for LB policies to inject arbitrary per-call
    // metadata. Any metadata returned here will be merged with existing
    // metadata added by the client application.
    //
    // LB policies with child policies are responsible for propagating metadata
    // injected by their children to the ClientConn, as part of Pick().
    Metadata metadata.MD
}

PickInfo contains the name of the method being invoked and the RPC context. These can be used by advanced policies that behave differently across RPC methods. PickResult is more interesting. It has a SubConn field, which is essentially the result of the load balancing policy. Done is a callback that is invoked after the completion of the RPC; it can be used to update the internal state of our customized load balancer (for example, to track in-flight or completed requests per backend). Finally, Metadata lets the load balancing policy inject arbitrary per-call metadata into the outgoing RPC; it is not something sent back by the server.
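
As an illustration of how Done can feed state back into the policy, here is a hedged sketch of a least-request-style picker that keeps an in-flight counter per SubConn; the type and field names are mine, and it assumes the sync/atomic and google.golang.org/grpc/balancer packages are imported:

// leastPendingPicker tracks in-flight RPCs per sub-connection and always
// picks the sub-connection with the fewest pending requests.
type leastPendingPicker struct {
    subConns []balancer.SubConn
    pending  []int64 // one in-flight counter per sub-connection
}

func (p *leastPendingPicker) Pick(info balancer.PickInfo) (balancer.PickResult, error) {
    if len(p.subConns) == 0 {
        return balancer.PickResult{}, balancer.ErrNoSubConnAvailable
    }
    // Find the sub-connection with the fewest in-flight RPCs.
    best := 0
    for i := range p.subConns {
        if atomic.LoadInt64(&p.pending[i]) < atomic.LoadInt64(&p.pending[best]) {
            best = i
        }
    }
    atomic.AddInt64(&p.pending[best], 1)
    return balancer.PickResult{
        SubConn: p.subConns[best],
        // Done runs when the RPC completes, so we release the slot here.
        Done: func(di balancer.DoneInfo) {
            atomic.AddInt64(&p.pending[best], -1)
        },
    }, nil
}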

Up until now, we have seen where the actual load balancing is happening (Picker) and its supportive lifecycle management, which is performed by Balancer. However, it’s not obvious how the Picker is introduced to the internals of gRPC.

To understand where Picker is introduced to gRPC, I have to explain a third interface named ClientConn, which is defined like this:

type ClientConn interface {
    // NewSubConn is called by balancer to create a new SubConn.
    // It doesn't block and wait for the connections to be established.
    // Behaviors of the SubConn can be controlled by options.
    //
    // Deprecated: please be aware that in a future version, SubConns will only
    // support one address per SubConn.
    NewSubConn([]resolver.Address, NewSubConnOptions) (SubConn, error)

    // UpdateState notifies gRPC that the balancer's internal state has
    // changed.
    //
    // gRPC will update the connectivity state of the ClientConn, and will call
    // Pick on the new Picker to pick new SubConns.
    UpdateState(State)

    // ResolveNow is called by balancer to notify gRPC to do a name resolving.
    ResolveNow(resolver.ResolveNowOptions)
}

Here, UpdateState is the one we are looking for. Let’s see the State definition:

type State struct {
    // State contains the connectivity state of the balancer, which is used to
    // determine the state of the ClientConn.
    ConnectivityState connectivity.State
    // Picker is used to choose connections (SubConns) for RPCs.
    Picker Picker
}

As you can see, the Balancer communicates its latest state and Picker to the ClientConn using the UpdateState method. Note that we don't implement ClientConn ourselves: gRPC provides an implementation of it to our balancer when the balancer is built, and our balancer just has to hold on to it and call UpdateState on it whenever it has a new Picker. So our custom balancer implements Balancer (to perform state and lifecycle management) and uses the ClientConn it is given (to communicate the up-to-date Picker to gRPC).
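
For example, a balancer that has just built a fresh picker (say, inside UpdateClientConnState) hands it over to gRPC roughly like this; b.cc here is assumed to be the ClientConn that gRPC gave the balancer when it was built:

// Publish the new picker; from now on, gRPC calls newPicker.Pick for every
// new RPC issued on this client connection.
b.cc.UpdateState(balancer.State{
    ConnectivityState: connectivity.Ready,
    Picker:            newPicker,
})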

You should always be careful about which operations are on the critical path. Critical-path operations are performed before sending a new RPC, so they should be fast because their delay contributes to the end-to-end latency of an RPC. Based on our discussion, the Pick method of the Picker interface is on the critical path, so our implementation of it should be fast.

I should say that other interfaces like Builder might be required depending on your implementation style, but the ones that I’ve explained so far are the core of the load balancer in gRPC. I will explain other types when necessary.
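
For reference, the Builder interface looks roughly like the following; it is what ties a policy name to the code that constructs a Balancer and hands it its ClientConn:

type Builder interface {
    // Build creates a new balancer and gives it the ClientConn it should
    // report state changes (and new Pickers) to.
    Build(cc ClientConn, opts BuildOptions) Balancer
    // Name returns the name of balancers built by this builder, which is also
    // the name used to select the policy in the service config.
    Name() string
}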

A Simple Random Load Balancer

Now, let’s put together a simple random load balancer to not finish this post empty-handed. To implement this custom load balancer, I’m going to use the base sub-package, which is provided in the balancer package. base package takes care of connection and sub-connection level lifecycle management and lets us to focus on the implementation of the Picker object. The reason for avoiding implementing Balancer methods is that they are not as simple as you might think and I will show exactly why later in this post.

To use the base package, we have to implement PickerBuilder instead of Balancer:

type PickerBuilder interface {
    // Build returns a picker that will be used by gRPC to pick a SubConn.
    Build(info PickerBuildInfo) balancer.Picker
}

Let’s implement Picker and PickerBuilder:

// logger is assumed to be a component logger created elsewhere, for example
// with grpclog.Component("simple_balancer"), matching the log prefix that
// shows up later in this post.

type simplePickerBuilder struct {
  base.PickerBuilder
}

func (b *simplePickerBuilder) Build(info base.PickerBuildInfo) balancer.Picker {
  logger.Infof("simplePickerBuilder: Build called with info: %+v", info)

  // Collect the ready sub-connections handed to us by the base balancer.
  scs := make([]balancer.SubConn, 0, len(info.ReadySCs))
  for sc := range info.ReadySCs {
    scs = append(scs, sc)
  }
  return &simplePicker{
    subConns: scs,
    length:   len(scs),
  }
}

type simplePicker struct {
  balancer.Picker
  subConns []balancer.SubConn
  length   int
}

func (p *simplePicker) Pick(info balancer.PickInfo) (balancer.PickResult, error) {
  logger.Infof("Picking a new address. len: %v", p.length)
  // Pick a uniformly random ready sub-connection.
  index := rand.Intn(p.length)
  return balancer.PickResult{SubConn: p.subConns[index]}, nil
}
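
One piece not shown above is registration: the picker builder has to be wrapped into a Builder and registered with gRPC so that a client can select the policy by name. Here is a minimal sketch of how that could look; the policy name simple_random, the init function, and the health-check setting are my choices for illustration:

func init() {
    // base.NewBalancerBuilder wraps our PickerBuilder in a complete Balancer
    // that manages the connection lifecycle for us; balancer.Register makes
    // the policy selectable by name.
    balancer.Register(base.NewBalancerBuilder(
        "simple_random",
        &simplePickerBuilder{},
        base.Config{HealthCheck: true},
    ))
}

A client can then select this policy through its service config, for example with grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"simple_random":{}}]}`).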

In the Build method, we receive a map of SubConns (provided in base.PickerBuildInfo). We then create a Picker that holds the list of sub-connections along with the length of this list. In the Pick method, we simply pick a random index and return the corresponding SubConn. Let's test this load balancer in a setup with two backends. Here is the result:

panic: invalid argument to Intn

So, the test failed because of a panic. The panic stems from the Intn function used to generate a random index, but why? Let’s turn on the logs and see the argument of Intn, which is p.length. Here is the log showing exactly that:

INFO: [simple_balancer] Picking a new address. len: 0

This is weird because a few lines before, the log says this:

INFO: [balancer] base.baseBalancer: got new ClientConn state:  {{[{Addr: "localhost:50051", ServerName: "", } {Addr: "localhost:50052", ServerName: "", }] [{[{Addr: "localhost:50051", ServerName: "", }] <nil>} {[{Addr: "localhost:50052", ServerName: "", }] <nil>}] <nil> <nil>} <nil>}

It basically says that the ClientConn has received a new state with two backend addresses named localhost:50051 and localhost:50052.

Before explaining the source of this issue, let me first introduce the fix in the Build method:

func (b *simplePickerBuilder) Build(info base.PickerBuildInfo) balancer.Picker {
  logger.Infof("simplePickerBuilder: Build called with info: %+v", info)

  // BEGINNING OF FIX \\
  if len(info.ReadySCs) == 0 {
    return base.NewErrPicker(balancer.ErrNoSubConnAvailable)
  }
  // END OF FIX \\

  scs := make([]balancer.SubConn, 0, len(info.ReadySCs))
  for sc := range info.ReadySCs {
    scs = append(scs, sc)
  }
  return &simplePicker{
    subConns: scs,
    length:   len(scs),
  }
}

As you can see, the fix just checks the number of given sub-connections and returns the ErrNoSubConnAvailable error if none are given. You might ask: how does returning an error fix this issue? The answer is that we are returning this error to the gRPC internals, not to the gRPC client, and gRPC does something special when it receives it. Let's see the official documentation for this error:

ErrNoSubConnAvailable indicates no SubConn is available for pick(). gRPC will block the RPC until a new picker is available via UpdateState().

It says that gRPC will hold new RPCs until it receives a new picker update, which should include some READY sub-connections. Therefore, the underlying reason for our previous failure was that although the ClientConn had received the correct backend list, the sub-connections to these two backends weren't READY yet, so Build was called with an empty ReadySCs map and our Picker had nothing to pick from.

This is exactly why I said that implementing the Balancer methods for connection lifecycle management is not that easy: these methods interact with the internals of gRPC, and the developer has to be aware of the behavior gRPC expects from them. For example, they need to know exactly when and where to return a specific error, as we saw in our simple random load balancer.

Final Notes

  1. The source code for this post is publicly available on GitHub.
  2. There are a lot of features that I haven't talked about in this post. For example, a gRPC client can receive custom metrics from backends in several ways to influence its load balancing decisions, as is done in the weighted round robin load balancer.
  3. gRPC can be configured by the control plane to do a lot of cool things using the xDS APIs. Specifically, the load balancer can also communicate with the control plane during load balancing using the xDS API.

I will probably write a more advanced post on these cool features and many more in the near future.

Farzad Mohammadi
Computer Science PhD student

My research interests include cloud computing, computer networks, and large-scale distributed systems.