At Locai, we host a diverse set of deep learning models for both internal and external users. The recent boom of large language models (LLMs) brought a set of new technical challenges, and pod cold start is one of the most important. By reducing cold start time, we were able to reduce cost while maintaining a stable latency SLA. We’d like to share our learnings on why cold start is an important problem and how we reduced the cold start time of LLMs with LLM Engine.
Why Cold Start Time Is Important
Without a fast enough cold start, users will often permanently provision GPUs for peak traffic. GPU hosting is significantly more expensive than the average CPU-based microservice. To formalize this idea in equations:
If we can cold start pods and make predictions within the latency SLA, we won’t need any warm pods; otherwise, we need to keep a number of warm pods determined by the maximum amount of traffic we want to keep within the SLA and the throughput per node.
We can then calculate compute seconds per wall-clock second by dividing the total number of requests by the per-pod throughput and adding the number of warm pods. Lastly, we get the cost of serving all requests by multiplying the compute seconds per wall-clock second by the cost per second and the duration.
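Concretely, the relationships above can be sketched as follows; the symbols here are our own illustrative shorthand for the quantities described in the prose, not notation from a formal model.

```latex
% Sketch of the cost model described above; all symbols are illustrative shorthand.
%   N_warm : number of warm pods we keep
%   R_max  : peak request rate we want to serve within the SLA
%   R      : total request rate
%   T      : per-pod throughput
%   S      : compute seconds consumed per wall-clock second
%   C      : cost per compute-second
%   D      : duration of the serving window
N_{\text{warm}} =
  \begin{cases}
    0 & \text{if a cold-started pod can serve a request within the latency SLA} \\
    R_{\max} / T & \text{otherwise}
  \end{cases}
\qquad
S = \frac{R}{T} + N_{\text{warm}},
\qquad
\text{Cost} = S \cdot C \cdot D
```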
As a concrete example, the chart below illustrates the hosting costs of one of our products as a function of user traffic. The X-axis shows the number of requests and the Y-axis shows cost. The blue line is the cost curve for a configuration with 3x the cold start time of the red line, with all other configuration values the same. When cold start time is high (blue line), we need to keep more warm pods to maintain the latency SLA as requests increase; when cold start time is short enough (red line), we can spin up pods right when requests come in and only keep a small number of warm pods for safety. This difference has a huge impact on cost.
Measurements
To understand where time is spent during cold start, we measured LLM endpoint cold start time for a Llama 2 7B model by looking at Kubernetes events and the duration of each step that runs in the container. Here’s the time breakdown:
As shown in the chart, the actual model loading time is quite small and most of the time is spent on pulling docker images and downloading model weights.
Faster Pod Initialization
Since pulling images from repositories takes the majority of the time, we focused our efforts here first. We had already encountered similar problems while optimizing inference for other deep learning models, such as image generation models, but with LLMs the docker images are bigger and the problem is more prominent. We tried a few ideas and eventually optimized away this portion of time by caching images onto the nodes using Kubernetes daemonsets, and we applied the same technique to LLM images. The following chart describes the process:
A cacher periodically scans the database for all “high priority” models, together with the full set of (GPU, docker image) pairs. We then construct and create/replace one daemonset for each GPU type, running each image with a sleep command like /bin/sh -ec 'while : ; do sleep 30 ; done'. This way we can dynamically maintain the set of cached images, preload them onto the nodes, and effectively eliminate docker image pulling time.
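To make this concrete, here is a minimal sketch of the kind of daemonset the cacher generates. The name, node-selector label, image, and resource values below are placeholders; the real specs are constructed programmatically from the database.

```bash
# Illustrative sketch: one daemonset per GPU type that pre-pulls a model image
# onto every matching node by running it with a sleep loop. The daemonset name,
# node selector label, and image are placeholders.
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-cacher-a100
spec:
  selector:
    matchLabels:
      app: image-cacher-a100
  template:
    metadata:
      labels:
        app: image-cacher-a100
    spec:
      nodeSelector:
        gpu-type: a100          # one daemonset per GPU node group
      containers:
      - name: llm-engine-inference
        image: registry.example.com/llm-engine/inference:latest
        command: ["/bin/sh", "-ec", "while : ; do sleep 30 ; done"]
        resources:
          requests:
            cpu: 10m            # keep the cache pods essentially free
            memory: 32Mi
EOF
```

Because the container only sleeps, the pod costs almost nothing, but the image stays in the node’s local cache, so subsequent inference pods scheduled onto that node skip the pull entirely.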
In addition to image caching, we reduced the time to provision new nodes by creating balloon deployments per GPU type to prewarm nodes. These pods run at low priority and are preempted when actual workloads are created.
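A minimal sketch of such a balloon deployment is shown below; the priority class, labels, replica count, and pause image are illustrative rather than our exact configuration.

```bash
# Illustrative balloon deployment: low-priority placeholder pods that hold a GPU
# node warm and yield to real workloads. All names and values are placeholders.
cat <<'EOF' | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: balloon-low-priority
value: -100                      # lower than any real workload
globalDefault: false
preemptionPolicy: Never          # balloons never preempt anything themselves
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: balloon-a100
spec:
  replicas: 2                    # number of nodes to keep prewarmed
  selector:
    matchLabels:
      app: balloon-a100
  template:
    metadata:
      labels:
        app: balloon-a100
    spec:
      priorityClassName: balloon-low-priority
      nodeSelector:
        gpu-type: a100
      containers:
      - name: balloon
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            nvidia.com/gpu: 1    # hold the GPU so the node stays provisioned
          limits:
            nvidia.com/gpu: 1
EOF
```

When a real inference pod needs the capacity, the scheduler evicts the low-priority balloon pod and the node is already up and warm.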
Faster Model Weights Loading
We use s5cmd, which is much faster than the aws-cli, to download model weights. We used to put all files into a tarball for simplicity, but found that this hurts concurrent downloads. We instead store model files in 2GB chunks and have s5cmd download all the files in parallel. We also ran some quick benchmarks on s5cmd parameters and chose 512 for --numworkers and 10 for --concurrency. With those changes we pushed the download time of the Llama 2 7B model (12.6GB) from 85s down to 7s, achieving 14.4Gbps, which is close to the EBS volume bandwidth limit (16Gbps) for our host. Here is an example of how we invoke s5cmd (the bucket and destination paths below are illustrative):
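```bash
# Illustrative invocation; the bucket and destination paths are placeholders.
# --numworkers sizes the global worker pool and --concurrency sets the number of
# parts downloaded in parallel per file, matching the values we benchmarked above.
s5cmd --numworkers 512 cp --concurrency 10 "s3://<model-bucket>/llama-2-7b/*" /model-cache/llama-2-7b/
```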