When to Use This
A zero downtime update is useful any time you need to change the backend of a live endpoint, for example:
- Upgrading to a newer version of your template.
- Switching to a different model (e.g., moving from Qwen/Qwen3-8B to Qwen/Qwen3-14B).
- Adjusting vLLM launch arguments or other environment variables in the template.
- Adding or changing the search filter on your worker group.
How to Trigger an Update
The process requires two steps:
1. Update your template
Modify the template that your endpoint uses. This could involve changing the MODEL_NAME, updating VLLM_ARGS, or selecting an entirely new template version.
2. Update the worker group configuration
Once the template is saved, update your worker group to reference the new template. This signals Vast to begin the rolling update.
After you complete these two steps, Vast handles the rest automatically. No additional action is required on your part.
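The ordering of the two steps can be sketched in Python. This is only an illustration of the sequence, not Vast's actual SDK: `ServerlessClient`, `update_template`, `update_workergroup`, and the returned template reference are hypothetical names standing in for whichever interface you use (console, CLI, or API). The in-memory `FakeClient` exists only so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


class ServerlessClient(Protocol):
    """Hypothetical interface; these are NOT real Vast SDK methods."""

    def update_template(self, template_id: str, env: Dict[str, str]) -> str: ...

    def update_workergroup(self, workergroup_id: str, template_ref: str) -> None: ...


def trigger_rolling_update(
    client: ServerlessClient,
    template_id: str,
    workergroup_id: str,
    env_changes: Dict[str, str],
) -> str:
    # Step 1: save the template change first.
    template_ref = client.update_template(template_id, env_changes)
    # Step 2: point the worker group at the saved template.
    # Per the text above, this second step is what starts the rolling update.
    client.update_workergroup(workergroup_id, template_ref)
    return template_ref


@dataclass
class FakeClient:
    """In-memory stand-in so the sketch runs without network access."""

    templates: Dict[str, Dict[str, str]] = field(default_factory=dict)
    workergroups: Dict[str, str] = field(default_factory=dict)
    calls: List[str] = field(default_factory=list)
    revision: int = 0

    def update_template(self, template_id: str, env: Dict[str, str]) -> str:
        self.templates.setdefault(template_id, {}).update(env)
        self.revision += 1
        self.calls.append("update_template")
        return f"{template_id}:rev{self.revision}"  # made-up reference format

    def update_workergroup(self, workergroup_id: str, template_ref: str) -> None:
        self.workergroups[workergroup_id] = template_ref
        self.calls.append("update_workergroup")


client = FakeClient()
ref = trigger_rolling_update(
    client, "tmpl-1", "wg-1", {"MODEL_NAME": "Qwen/Qwen3-14B"}
)
```

Keeping both steps in one function makes it harder to save a template change and forget to update the worker group, which would leave the endpoint on the old configuration.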
What Happens During the Update
Vast orchestrates the transition across your worker group in the following sequence:
- Inactive workers become active and update: Any inactive workers are brought into an active state, updated to the new template and model configuration, and made available for requests.
- Active workers finish existing tasks first: Workers that are currently active and handling requests are allowed to complete all of their in-flight tasks before updating. Once an active worker finishes its current work, it updates to the new configuration and rejoins the pool.
- New requests route to updated workers: As updated workers come online, incoming requests are directed to them. This continues until every worker in the group is running the new configuration.
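The sequence above can be modeled with a small toy simulation. This is a sketch for building intuition, not Vast's implementation: the `Worker` fields, the one-request-per-tick drain rate, and the fallback to not-yet-updated workers when no updated worker exists are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Worker:
    name: str
    active: bool
    in_flight: int = 0  # requests currently being handled
    version: str = "old"


def rollout_tick(workers: List[Worker], new_version: str) -> None:
    """Advance the toy rollout by one step."""
    for w in workers:
        if w.version == new_version:
            continue  # already updated
        if not w.active:
            # Inactive workers are activated and updated right away.
            w.active = True
            w.version = new_version
        elif w.in_flight == 0:
            # Active workers update only once they have drained.
            w.version = new_version
        else:
            # Assumption: one in-flight request completes per tick.
            w.in_flight -= 1


def route(workers: List[Worker], new_version: str) -> Optional[Worker]:
    """Pick a worker for a new request, preferring updated workers."""
    updated = [w for w in workers if w.active and w.version == new_version]
    # Assumption: fall back to any active worker if none is updated yet.
    candidates = updated or [w for w in workers if w.active]
    if not candidates:
        return None
    target = min(candidates, key=lambda w: w.in_flight)
    target.in_flight += 1
    return target


workers = [Worker("a", active=False), Worker("b", active=True, in_flight=2)]
rollout_tick(workers, "new")       # "a" activates and updates; "b" keeps draining
first_target = route(workers, "new")  # goes to "a", the only updated worker
rollout_tick(workers, "new")       # "b" finishes its last request
rollout_tick(workers, "new")       # "b" is idle, so it updates
```

In this model no request is ever dropped: a worker changes version only when its `in_flight` count is zero, which is the property that makes the update zero downtime.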
Best Practices
- Schedule updates during low-traffic periods — While the update process is designed to be seamless, performing it during a period of stable, low traffic reduces the number of in-flight requests that need to drain and shortens the overall transition window.
- Verify the new template independently — Before triggering a rolling update on a production endpoint, consider testing the new template on a separate endpoint to confirm that the model loads correctly and produces the expected output.
- Monitor during the rollout — Keep an eye on your endpoint’s request latency and error rate while the update is in progress. A brief increase in latency is normal as the worker pool transitions, but errors may indicate a problem with the new configuration.
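The monitoring advice above can be turned into a simple check. The sketch below assumes you already collect per-request latency and success data from your own client or logs; the `Sample` type, the function name, and both thresholds (1% errors, 2x baseline latency) are illustrative assumptions, not values recommended by Vast.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class Sample:
    latency_s: float  # observed request latency in seconds
    ok: bool          # True if the request succeeded


def assess_rollout(
    samples: List[Sample],
    baseline_latency_s: float,
    max_error_rate: float = 0.01,     # assumed threshold
    max_latency_factor: float = 2.0,  # assumed threshold
) -> str:
    """Classify a window of recent requests observed during a rollout."""
    if not samples:
        return "no-data"
    error_rate = sum(1 for s in samples if not s.ok) / len(samples)
    if error_rate > max_error_rate:
        # Errors, unlike a brief latency bump, may point at the new configuration.
        return "investigate-errors"
    typical_latency = median([s.latency_s for s in samples])
    if typical_latency > baseline_latency_s * max_latency_factor:
        return "elevated-latency"
    return "healthy"


window = [Sample(0.5, True)] * 100
status = assess_rollout(window, baseline_latency_s=0.4)
```

Errors are checked before latency because the text treats a short latency increase as normal during the transition, while errors are the signal worth acting on.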