[Submitted on 20 Dec 2024 (v1), last revised 13 Apr 2026 (this version, v2)]

Title: WebLLM: A High-Performance In-Browser LLM Inference Engine

Authors: Charlie F. Ruan, Yucheng Qin, Akaash R. Parthasarathy, Xun Zhou, Ruihang Lai, Hongyi Jin, Yixin Dong, Bohan Hou, Meng-Shiun Yu, Yiyan Zhai, Sudeep Agarwal, Hangrui Cao, Siyuan Feng, Tianqi Chen


Abstract: Advancements in large language models (LLMs) have unlocked remarkable capabilities. While deploying these models typically requires server-grade GPUs and cloud-based inference, the recent emergence of smaller open-source models, together with increasingly powerful consumer devices, has made on-device deployment practical. The web browser as a platform for on-device deployment is universally accessible, provides a natural agentic environment, and conveniently abstracts away the different backends of diverse device vendors. To seize this opportunity, we introduce WebLLM, an open-source JavaScript framework that enables high-performance LLM inference entirely within web browsers. WebLLM provides an OpenAI-style API for seamless integration into web applications, leveraging WebGPU for efficient local GPU acceleration and WebAssembly for performant CPU computation. Through the machine learning compilers MLC-LLM and Apache TVM, WebLLM uses optimized WebGPU kernels, overcoming the absence of performant WebGPU kernel libraries. Evaluations show that WebLLM can retain up to 80% of native performance on the same device, with room to further close the gap. WebLLM paves the way for universally accessible, privacy-preserving, personalized, and locally powered LLM applications in web browsers. The code is available at: this https URL.
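The OpenAI-style API claimed in the abstract is concrete enough to illustrate. Below is a minimal sketch of how a web application might call WebLLM, assuming the CreateMLCEngine factory, the initProgressCallback option, and the prebuilt model id from the project's public documentation; none of these names appear in the abstract itself.

import * as webllm from "@mlc-ai/web-llm";

async function main() {
  // Fetch the model weights and compiled WebGPU kernels, then load them
  // onto the local GPU. The model id is illustrative; any model in the
  // WebLLM prebuilt list would work here.
  const engine = await webllm.CreateMLCEngine(
    "Llama-3.1-8B-Instruct-q4f32_1-MLC",
    { initProgressCallback: (report) => console.log(report.text) },
  );

  // OpenAI-style chat completion, served entirely on-device in the browser.
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Explain WebGPU in one sentence." }],
  });
  console.log(reply.choices[0].message.content);
}

main();

Because the request and response shapes mirror the OpenAI chat-completions API, an application written against a cloud endpoint can, in principle, switch to local in-browser inference by swapping the client object rather than rewriting its request logic.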

Submission history

From: Charlie Ruan

[v1] Fri, 20 Dec 2024 11:24:13 UTC (278 KB)
[v2] Mon, 13 Apr 2026 04:55:03 UTC (149 KB)