Hands-On: Building an Offline-Capable Local Browser Extension That Uses an On-Device LLM
Step-by-step guide to building a Manifest V3 extension with an on-device LLM for offline summarization, autofill and safe command execution.
If your team struggles with inconsistent AI outputs, fragile cloud dependencies, and slow onboarding for automation scripts, an offline on-device LLM inside a browser extension can change the game: private, fast, and tightly integrated with your dev workflows. This guide walks technology professionals, step by step, toward a production-ready extension that performs offline summarization, secure autofill, and safe DOM command execution, inspired by the local AI direction popularized by apps like Puma and the edge AI hardware trend from late 2025.
Executive summary (what you'll get)
Start with a Manifest V3 extension scaffold and learn two supported deployment patterns: (A) pure in-browser inference using WebAssembly / WebGPU–backed runtimes running small quantized models (best for full offline on desktops and mobile where supported), and (B) a local companion native process accessed via native messaging (for heavier models, GPU access or specialized hardware like the Raspberry Pi AI HAT+). You’ll see concrete code snippets for model loading, prompt templates for summarization, autofill, and command execution, plus a security checklist for safe execution and production hardening.
Why build an offline on-device LLM extension in 2026?
- Privacy-first tooling: Local inference eliminates sending sensitive page content to a cloud API — a must-have for internal tools and regulated environments.
- Lower latency and offline availability: On-device models boot instantly for short tasks (summaries, form autofill), which improves UX for mobile or flaky networks.
- New runtimes and formats: By 2025–2026, GGUF adoption, WebGPU/WebNN availability across major browsers, and optimized wasm runtimes made in-browser LLM inference practical for small quantized models. For design patterns around edge microapps and composable UX, see Composable UX Pipelines for Edge‑Ready Microapps.
- Edge hardware improvements: Devices like the Raspberry Pi 5 + AI HAT series offer affordable local inferencing for developer labs and kiosks — useful context is available in a mobile studio essentials field guide.
High-level architecture: two options (pick one)
Option A — Pure in-browser (single-binary extension)
Best for: full offline operation, simple installs, and platforms where WebAssembly + WebGPU are available. This approach downloads a quantized model (7B-class Q4 variants) into IndexedDB and runs inference with a wasm/WebGPU runtime in a dedicated Web Worker.
Option B — Extension + native helper (native messaging)
Best for: running larger models or harnessing local GPU drivers, or integrating with edge hardware like Raspberry Pi HAT+ where a system-level runtime (llama.cpp, GGML, ONNX Runtime) is optimal. The extension communicates with a signed native binary over the browser's native messaging API. For low‑latency capture and helper‑style architectures in production AV workflows, see Hybrid Studio Ops 2026.
Prerequisites
- Developer environment: Node.js (recommended), modern browser with MV3 support (Chrome/Edge/Firefox as of 2026), and local test devices (desktop + Android/iOS where applicable).
- Model & runtime choices: a quantized GGUF or similar model (7B recommended) for in-browser; a larger GGML/ONNX model for native helper.
- Knowledge: JavaScript/TypeScript, browser extension APIs (MV3), basic cryptography (WebCrypto), and prompt engineering basics.
Step 1 — Create the extension scaffold (Manifest V3)
Manifest V3 is the standard for Chromium-based browsers and widely supported in Firefox by 2026; it uses a service worker (background) instead of background pages. Keep permissions minimal to reduce security risk.
{
"manifest_version": 3,
"name": "Local LLM Toolkit",
"version": "1.0.0",
"description": "Offline summarization, autofill and safe commands via on-device LLM",
"permissions": ["storage", "contextMenus", "activeTab", "scripting"],
"host_permissions": ["https://*/*", "http://*/*"],
"background": { "service_worker": "background.js" },
"action": { "default_popup": "popup.html" },
"icons": { "48": "icons/48.png" }
}
Security tip: avoid broad host_permissions; prefer allowlists for enterprise deployment. Also define a strict Content Security Policy (CSP) in your HTML files to prevent script injection.
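MV3 lets you declare this CSP in the manifest itself rather than per-HTML-file, and for the in-browser runtime pattern Chromium requires 'wasm-unsafe-eval' in script-src before it will compile WebAssembly in extension pages. A minimal manifest fragment:

```json
"content_security_policy": {
  "extension_pages": "script-src 'self' 'wasm-unsafe-eval'; object-src 'self'"
}
```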
Step 2 — Decide runtime: wasm/WebGPU vs native
In-browser runtime pattern (wasm + WebGPU)
- Runtime: a wasm-compiled inference runtime (many open-source runtimes added WebGPU/WASM targets since 2024–2026).
- Model format: GGUF or wasm-friendly quantized artifacts — store in IndexedDB and keep the model shard sizes small for fast downloads.
- Threading: run inference in a WebWorker; request cross-origin isolation if you need SharedArrayBuffer for faster multithreading (COOP/COEP headers required).
Native helper pattern
- Helper runs llama.cpp / ggml / ONNX Runtime on the host; uses local GPU when available (CUDA/Vulkan) or specialized HAT drivers on Raspberry Pi.
- Communicates over native messaging — the extension sends JSON messages and receives streamed tokens.
- This option can support much larger models and faster throughput but needs installer and OS signing for enterprise deployment.
Step 3 — Model selection and quantization
Quantization is the core performance lever for in-browser inference:
- For pure-browser offline: aim at 4-bit quantized 7B models (Q4 variants) to balance quality and memory footprint. Many 2025–2026 releases standardize GGUF quant files for portability.
- For native helper: you can use 13B–70B models quantized with GGML/GGUF to Q4/Q5 with GPU acceleration.
- Test end-to-end: smaller models often outperform larger ones for short tasks (summaries, form Q&A) because latency matters more than absolute perplexity in UI flows.
Step 4 — Load the model in the extension (example: wasm worker)
Keep the inference off the main thread to avoid jank.
// background.js (service worker)
// Note: Chrome's extension service workers may not support spawning Web
// Workers directly; a common workaround is to host the worker in an
// offscreen document or the popup and keep a handle (modelWorker) to it.
chrome.runtime.onMessage.addListener((msg, sender, sendResponse) => {
if (msg.type === 'init-model') {
// forward the shard-fetch task to the worker host
modelWorker.postMessage({ type: 'load-model', url: msg.url });
sendResponse({ status: 'loading' });
}
});
// model-worker.js (web worker)
self.onmessage = async (evt) => {
if (evt.data.type === 'load-model') {
const resp = await fetch(evt.data.url);
const bytes = await resp.arrayBuffer();
// runtime.loadModel stands in for your wasm/WebGPU runtime's API (the name varies by runtime)
await runtime.loadModel(bytes);
postMessage({ type: 'ready' });
}
};
In the native-helper pattern, the service worker opens a native messaging port and streams the prompt to the helper which returns chunked tokens.
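The streaming side stays easy to test if you separate the port wiring from the token-assembly logic. A minimal sketch, where the host name `com.example.llm_helper` and the `{type, text}` message shape are assumptions rather than a fixed protocol:

```javascript
// Port wiring (extension side), shown as comments because it needs a
// registered native host:
//   const port = chrome.runtime.connectNative('com.example.llm_helper');
//   const asm = createStreamAssembler(renderToken, showFinalAnswer);
//   port.onMessage.addListener((msg) => asm.push(msg));
//   port.postMessage({ type: 'prompt', text: prompt });

// Pure token-assembly logic: collects streamed chunks and emits the final text.
function createStreamAssembler(onToken, onDone) {
  const parts = [];
  return {
    push(msg) {
      if (msg.type === 'token') {
        parts.push(msg.text);
        onToken(msg.text); // update the UI progressively
      } else if (msg.type === 'done') {
        onDone(parts.join(''));
      }
    },
  };
}
```

Keeping the assembler pure means the same code runs unchanged whether tokens arrive from a native port or an in-browser worker.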
Step 5 — Prompting patterns (summarization, autofill, commands)
Design strict prompt templates to reduce hallucinations and to make parsing deterministic. Use an explicit instruction-to-json output pattern for command execution so your extension can safely parse and validate actions.
Summarization prompt (template)
System: You are a concise summarizer. Output JSON: {"summary":"","bullets":[...]}.
User: Summarize the following page content for a developer team with actionable items:
[PAGE_CONTENT]
Constraints: max 280 characters for summary; produce up to 5 bullet action items.
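The template above can be assembled programmatically. A sketch that truncates page content so the prompt fits a small model's context window (the 6000-character cap is an assumption to tune per model):

```javascript
// Builds the summarization prompt from the template above, truncating long
// page content so the prompt stays inside a small model's context window.
function buildSummaryPrompt(pageContent, maxChars = 6000) {
  const content = pageContent.length > maxChars
    ? pageContent.slice(0, maxChars) + '\n[TRUNCATED]'
    : pageContent;
  return [
    'System: You are a concise summarizer. Output JSON: {"summary":"","bullets":[...]}.',
    'User: Summarize the following page content for a developer team with actionable items:',
    content,
    'Constraints: max 280 characters for summary; produce up to 5 bullet action items.',
  ].join('\n');
}
```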
Autofill prompt (template)
System: Extract structured field values for a form. Output JSON like: {"email":"...","firstName":"...","company":"..."}.
User: Given the page text and the target form fields below, fill values if confidently present; otherwise return null for unknowns.
Fields: [email, firstName, lastName, company]
[PAGE_TEXT]
Command execution prompt -> strict JSON command response
For DOM/autonomy tasks, instruct the model to respond only with a JSON array of safe commands. Your extension validates the commands and asks for user confirmation — never auto-execute destructive commands.
System: Output a JSON array of commands, each with an "action" and a "targetSelector", where action is one of [click, setValue, navigate].
User: Provide commands to perform the task: "create a new issue with title X" given the page DOM content.
Output example:
[ {"action":"setValue","targetSelector":"#title","value":"BUG: foo"}, {"action":"click","targetSelector":"#submit"} ]
Step 6 — Executing commands safely
Never let the model drive the extension to do dangerous operations without human consent. Implement a validation and allowlist layer:
- Parse and schema-validate the JSON response using a strict JSON schema.
- Check selectors against a domain allowlist and ensure they match visible elements.
- Present a summarized preview to the user in the popup with a single-click confirm button.
- Log the command with timestamps, source prompt hash, and user decision for auditability.
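The validation layer can be a small pure function, which keeps it easy to unit-test. A sketch under the three-action schema above (the exact error reporting is illustrative):

```javascript
const ALLOWED_ACTIONS = new Set(['click', 'setValue', 'navigate']);

// Schema-validates a model-proposed command list before anything is shown
// to the user. Returns { ok, errors } rather than throwing, so the caller
// can surface problems in the confirm UI.
function validateCommands(commands) {
  const errors = [];
  if (!Array.isArray(commands)) return { ok: false, errors: ['not an array'] };
  commands.forEach((cmd, i) => {
    if (!cmd || typeof cmd !== 'object') { errors.push(`#${i}: not an object`); return; }
    if (!ALLOWED_ACTIONS.has(cmd.action)) errors.push(`#${i}: action "${cmd.action}" not allowed`);
    if ((cmd.action === 'click' || cmd.action === 'setValue') &&
        (typeof cmd.targetSelector !== 'string' || cmd.targetSelector.length === 0)) {
      errors.push(`#${i}: missing targetSelector`);
    }
    if (cmd.action === 'setValue' && typeof cmd.value !== 'string') {
      errors.push(`#${i}: setValue requires a string value`);
    }
  });
  return { ok: errors.length === 0, errors };
}
```

Domain-allowlist and visible-element checks then run against the DOM in the content script, after this structural pass succeeds.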
Step 7 — Autofill wiring (example content script)
// content-script.js
chrome.runtime.onMessage.addListener((msg, sender, sendResponse) => {
// keep this listener synchronous: Chrome closes the response channel
// unless the listener returns true for a deferred reply
if (msg.type === 'apply-autofill') {
const data = msg.payload; // {email, firstName, ...}
if (data.email) {
const el = document.querySelector('input[type="email"]') || document.querySelector('#email');
if (el) { el.focus(); el.value = data.email; el.dispatchEvent(new Event('input', { bubbles: true })); }
}
// other fields...
sendResponse({ status: 'ok' });
}
});
Step 8 — Performance tactics
- Model size: pick smallest model that meets quality thresholds — measure latency (p50, p95) for your users’ devices.
- Quantization: 4-bit is a pragmatic sweet spot for in-browser; experiment with Q4_0, Q4_K_S for quality/throughput trade-offs.
- Streaming: stream tokens to the UI so users see progressive answers.
- Cache embeddings & results: use IndexedDB for per-page caches (summaries, autofill mappings).
- Offload large tasks: for heavy summarization (long documents), fallback to a local helper or an opt-in cloud API.
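Per-page caching reduces repeat inference to a lookup. A sketch with an in-memory Map standing in for IndexedDB so the eviction logic stays runnable anywhere (the 50-entry cap is an arbitrary assumption):

```javascript
// Simple per-page result cache with insertion-order eviction. In the
// extension, back this with IndexedDB; a Map stands in here so the logic
// is testable outside the browser.
function createPageCache(maxEntries = 50) {
  const store = new Map();
  return {
    get(url) { return store.get(url); },
    set(url, result) {
      if (store.has(url)) store.delete(url); // refresh insertion order
      store.set(url, result);
      if (store.size > maxEntries) {
        store.delete(store.keys().next().value); // evict oldest entry
      }
    },
  };
}
```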
Step 9 — Security, privacy & hardening checklist
- Least privilege: restrict permissions and host access — use enterprise-managed policies for deployments.
- Model integrity: verify model checksums and sign model releases; store signature with the model and verify on load. Enterprise procurement and compliance considerations (for example, FedRAMP/enterprise approval) can affect how you ship models.
- Local encryption: encrypt model files and user secrets at rest using WebCrypto keys stored in IndexedDB with a user passphrase (or OS keychain in native helper). For concrete security controls and guidance when giving AI agents access to machines, see a security checklist.
- Consent & transparency: show when local AI is active; offer easy opt-out and logs showing which page content was processed.
- Audit logs: store user-accepted command histories and prompt inputs for traceability; redact PII when exporting logs. Operational dashboards and audit views are covered in the resilient dashboards playbook.
- Safety: require explicit user interaction for any action outside a short allowlist; never auto-submit forms that change state without confirmation.
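Redaction before log export can start from simple pattern rules. A sketch covering emails and long digit runs (the patterns are illustrative, not a complete PII taxonomy):

```javascript
// Redacts common PII patterns from audit-log text before export.
// Illustrative only: real deployments should redact structured fields at
// the source and use a vetted PII library, not rely solely on regexes.
function redactForExport(text) {
  return text
    .replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, '[EMAIL]') // email addresses
    .replace(/\b\d{9,16}\b/g, '[NUMBER]');          // long IDs / card-like digit runs
}
```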
Step 10 — Testing, CI/CD and packaging
Automate unit tests for parsing and validation, and include E2E tests of extension flows using Playwright. For the native helper, sign and notarize binaries for macOS and sign packages for Windows. Version model artifacts alongside code: CI should produce reproducible builds of quantized models (or pull them from a signed artifact repository) so releases and rollbacks stay traceable.
2026 trends & operational considerations
Recent shifts to watch in late 2025–2026:
- Wider WebGPU + WebNN availability across Chrome, Edge and Safari broadened the set of devices able to run in-browser LLMs with acceptable latency. See patterns for building edge‑ready microapps.
- GGUF and standardized quant formats simplified cross-runtime model reuse — many runtimes now support GGUF out of the box.
- Edge AI hardware (Raspberry Pi 5 + AI HAT+ and other modules) matured into reliable dev-lab equipment, enabling local native helper deployments affordable for POCs; a practical field guide to mobile studio & edge workspaces is useful when planning labs.
- Privacy regulation and enterprise policy pressure increased demand for local inference alternatives to cloud-only solutions — ideal timing for this pattern. If you operate in regulated organisations, evaluate procurement gates such as FedRAMP/enterprise approvals.
Common pitfalls and how to avoid them
- Overestimating model sizes: avoid shipping 13B+ models for in-browser — latency kills UX. Use native helper for those cases.
- Insufficient validation for command execution: always validate and require user confirmation for state-changing actions. For a concrete security checklist on agent access, see security guidance.
- Too many permissions: administrators will reject extensions that over-request host permissions — design for allowlists and per-origin enablement.
- Assuming deterministic outputs: models can still hallucinate; pair model outputs with deterministic heuristics (regex extraction, keyword checks) for critical values like emails/IDs.
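Pairing model output with deterministic heuristics can look like this: accept a suggested value only if it passes a per-field check. A sketch (the field names and patterns are illustrative):

```javascript
// Per-field deterministic checks for model-suggested values.
const FIELD_CHECKS = {
  email: (v) => /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(v),
  firstName: (v) => /^[\p{L}'\- ]{1,60}$/u.test(v),
};

// Returns the value only if it passes the check for that field type;
// otherwise null, so the UI leaves the field blank for manual entry.
function acceptValue(field, modelValue) {
  if (typeof modelValue !== 'string') return null;
  const check = FIELD_CHECKS[field];
  return check && check(modelValue) ? modelValue : null;
}
```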
Real-world example: summarizer + autofill flow (concise implementation plan)
- Popup: "Summarize page" button. When clicked, content script extracts main text (readability algorithm) and sends it to the background worker.
- Background worker passes text to the in-browser model with a summarization template, streams token output to popup UI.
- Popup displays summary and action bullets. User selects "Autofill" for detected fields.
- Popup requests permission to run autofill on the current origin; user grants it once. Extension sends validated field data to content script which populates visible form fields and prompts user for final submit.
Metrics to track post-deployment
- Latency: median and tail (p50/p95) inference times across a representative range of prompt sizes.
- Success rates: autofill matches tested fields vs manual input rate.
- Human confirmation rate: percentage of model-suggested commands that users accept (high indicates usefulness; low suggests overreach).
- Resource usage: memory and CPU on target devices — key for tuning quantization.
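Latency percentiles are cheap to compute client-side before shipping metrics. A minimal nearest-rank sketch:

```javascript
// Nearest-rank percentile over a list of latency samples (milliseconds).
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Summarize p50/p95 for a batch of inference timings.
function latencySummary(samples) {
  return { p50: percentile(samples, 50), p95: percentile(samples, 95) };
}
```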
Advanced strategies & future-proofing
- Hybrid augmentation: combine a small local model for intent parsing with a secure retrieval system (local vector DB) for augmented answers — keeps hallucinations down while offline. See composable UX and edge microapp patterns at Composable UX Pipelines.
- Model shipping policies: decouple model from extension binary and implement signed model updates with explicit user consent and staged rollouts.
- Edge fleet management: for enterprise deployment, integrate with MDM for allowlisting, model updates, and audit collection. Consider enterprise procurement gates such as FedRAMP/approval.
- Local prompt tuning: experiment with lightweight instruction tuning or LoRA-style adapters stored locally to personalize behavior without sending data to the cloud.
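The retrieval half of the hybrid pattern reduces to nearest-neighbor search over stored embedding vectors. A brute-force cosine-similarity sketch, adequate for the few thousand chunks a per-user local store typically holds (the chunk shape `{id, vec}` is an assumption):

```javascript
// Cosine similarity between two equal-length embedding vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Returns the top-k stored chunks most similar to the query embedding.
function topK(queryVec, chunks, k = 3) {
  return chunks
    .map((c) => ({ ...c, score: cosine(queryVec, c.vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

An approximate index (e.g. HNSW) only becomes worthwhile once the local store grows well beyond what a linear scan handles in a few milliseconds.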
References & resources (practical links to explore)
- Look for recent wasm/WebGPU builds of open-source runtimes and GGUF model releases in public repositories (2025–2026 saw many projects adopt GGUF for portability).
- Puma and other local-AI browsers popularized privacy-first browser-level LLMs on mobile — study their UX patterns for user consent and controls.
- Edge hardware updates (Raspberry Pi 5 + AI HAT+ in 2025) removed some cost barriers for local GPU inference in developer labs.
Pro tip: in many developer tools, a 90–120 character summary plus 3 action bullets is far more useful than a long narrative — tune your prompt to concise outputs for higher adoption.
Conclusion: deploy, measure, iterate
By 2026, building an offline-capable extension with an on-device LLM is a practical, high-value proposition for dev teams wanting privacy, low latency and tighter automation integration. Start small: a 7B quant model for summarization and autofill provides immediate ROI and a safe path to evolve into hybrid or native-helper topologies as needs grow.
Actionable next steps (start now)
- Clone an MV3 extension starter and add a WebWorker-based wasm runtime (or set up a signed native helper if you need larger models).
- Quantize a 7B GGUF model and test summary latency on a target device — measure p50/p95 and adjust quant settings.
- Implement the strict JSON command schema and the confirm UI — ship a 1.0 with only read-and-suggest capabilities; add exec after monitoring adoption.
Call to action
Ready to prototype? Grab our starter repo (MV3 scaffold + wasm worker template + prompt library) or sign up for a trial workspace at myscript.cloud to manage model artifacts, share prompt templates, and track audit logs across your team. Build locally, iterate quickly, and keep your workflows private and reproducible.
Related Reading
- Composable UX Pipelines for Edge‑Ready Microapps (2026)
- Security Checklist for Granting AI Desktop Agents Access to Company Machines
- Hybrid Studio Ops 2026: low‑latency capture & edge encoding
- Mobile Studio Essentials: Edge‑resilient creator workspace