TL;DR: DuckDB compiles to WebAssembly and runs a full columnar SQL engine inside a browser tab. It handles CSV, Parquet, Excel, and JSON files at near-native speed, with query performance that's surprisingly competitive with server-side databases for datasets under a few hundred MB. This post covers the architecture, performance characteristics, memory management, and practical API usage.
Why this matters
Browser-based data tools have historically been limited to JavaScript-based solutions: small-scale filtering, D3 visualizations, maybe a lightweight SQLite port. Running real analytical SQL — window functions, aggregations over millions of rows, joins across tables — was server territory.
DuckDB-WASM changes this. It's a full-featured columnar database compiled to WebAssembly, which means it runs inside the browser's sandboxed JavaScript engine. No plugins, no native installs, no server. Open a URL and you have an analytical SQL database.
This enables a new category of data tools: browser-native analytics that process data locally on the user's machine, with zero network dependency for the compute layer.
WebAssembly: a quick primer
WebAssembly (WASM) is a binary instruction format that runs in the browser's virtual machine alongside JavaScript. Key properties:
Near-native speed. WASM code runs at 70-90% of native C/C++ performance, depending on the workload. For CPU-bound tasks like scanning columnar data and computing aggregations, this is fast enough to be practical.
Memory-safe sandbox. WASM runs in the browser's security sandbox. It can't access the filesystem, network, or other browser tabs. All I/O must go through JavaScript APIs. This is a feature, not a bug — it means a WASM module can't exfiltrate data even if it's compromised.
Language-agnostic. WASM is a compilation target. DuckDB is written in C++ and compiles to WASM via Emscripten. The C++ code doesn't change — the compiler handles the translation to WASM bytecode.
Linear memory model. WASM allocates a single contiguous block of memory (a WebAssembly.Memory object). DuckDB manages its own memory allocation within this block, similar to how it manages memory on native platforms.
The relevant implication: a browser tab running DuckDB-WASM is, for practical purposes, running a native-speed C++ database in a sandbox. The data stays in that sandbox.
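The linear memory model is easy to see from plain JavaScript. A minimal sketch (the page counts are arbitrary, chosen only for illustration):

```javascript
// WASM linear memory is one contiguous, resizable buffer, sized in
// 64 KiB pages. DuckDB-WASM does all of its allocation inside it.
const memory = new WebAssembly.Memory({ initial: 1, maximum: 16 });
const before = memory.buffer.byteLength; // 1 page = 65536 bytes
memory.grow(2); // request two more pages; the old ArrayBuffer is detached
const after = memory.buffer.byteLength;  // 3 pages = 196608 bytes
```

Everything DuckDB allocates — tables, hash tables, sort buffers — lives inside that single block, which is also why the memory limits discussed later are per-tab, not per-query.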
Why DuckDB compiles well to WASM
Not every database would work well as a WASM module. DuckDB has several properties that make it a good fit:
1. Embeddable by design
DuckDB was built as an in-process database — it runs inside your application, not as a separate server. This means no socket connections, no client-server protocol, no daemon to manage. The entire database is a library you link against. For WASM, this is ideal: the database runs in the same JavaScript context as the rest of the application.
2. Columnar execution engine
DuckDB uses a columnar, vectorized execution engine. Data is processed in batches (called "vectors") of up to 2048 values. Each batch is a contiguous array of a single data type.
Why this matters for WASM: columnar batch processing maps well to CPU cache behavior and SIMD-style operations. DuckDB can process an entire column of integers with tight loops over contiguous memory — exactly the pattern where WASM performs closest to native speed.
Row-by-row processing (like traditional row-store databases) involves more pointer chasing and branch prediction misses, which amplify the overhead of running in WASM.
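A toy JavaScript contrast makes the difference concrete (illustrative only — DuckDB's vectors live in WASM linear memory, not in JS arrays):

```javascript
// Columnar layout: one contiguous typed array per column. Summing it
// is a tight loop over adjacent memory -- the access pattern where
// WASM (and CPU caches) perform closest to native.
const amounts = new Float64Array([12.5, 7.0, 30.25, 4.75]);
let columnarSum = 0;
for (let i = 0; i < amounts.length; i++) columnarSum += amounts[i];

// Row layout: an array of heap objects. Each iteration chases a
// pointer and does a property lookup -- the overhead pattern that
// running inside WASM amplifies.
const rows = [
  { id: 1, amount: 12.5 },
  { id: 2, amount: 7.0 },
  { id: 3, amount: 30.25 },
  { id: 4, amount: 4.75 },
];
let rowSum = 0;
for (const r of rows) rowSum += r.amount;
// Both loops compute 54.5; only the columnar one is cache-friendly.
```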
3. Zero external dependencies for core engine
DuckDB's core SQL engine has no hard dependencies on operating system services, file systems, or network stacks. File I/O, HTTP access, and extension loading are abstracted behind interfaces that can be swapped out. The WASM build replaces the native file system interface with one that works through the browser's File API.
4. Extension system
DuckDB's extension architecture lets the WASM build include only what's needed. The base WASM bundle includes CSV, Parquet, and JSON support. Extensions like spatial (which provides Excel reading via the st_read function) can be loaded on demand.
The base WASM bundle is approximately 10-12MB gzipped, which is large by web standards but acceptable for an application that replaces a database server.
Architecture of DuckDB-WASM
Here's how the pieces fit together in the browser:
```
+---------------------------------------------------------------+
| Browser Tab                                                    |
|                                                                |
|  +------------------+       +-----------------------------+   |
|  |   Main Thread    |       |         Web Worker          |   |
|  |                  |       |                             |   |
|  | - UI rendering   |       |  +-----------------------+  |   |
|  | - User input     |       |  |  DuckDB-WASM Module   |  |   |
|  | - Result display |       |  |                       |  |   |
|  |                  |       |  |  - SQL parser         |  |   |
|  |   JavaScript     |       |  |  - Query optimizer    |  |   |
|  |   glue code      |       |  |  - Columnar executor  |  |   |
|  |                  |       |  |  - Buffer manager     |  |   |
|  +--------+---------+       |  |  - CSV/Parquet readers|  |   |
|           |                 |  +-----------------------+  |   |
|           |  postMessage    |                             |   |
|           +<--------------->+ WebAssembly.Memory          |   |
|                             | (linear memory block)       |   |
|                             +-----------------------------+   |
|                                                                |
|  +---------------------------------------------------------+  |
|  |                  File API (no network)                  |  |
|  |  - FileReader for user-selected files                   |  |
|  |  - Blob URLs for in-memory data                         |  |
|  +---------------------------------------------------------+  |
+---------------------------------------------------------------+
```
Key design decisions:
Web Worker isolation. DuckDB-WASM runs in a Web Worker, not the main thread. This prevents long-running queries from blocking the UI. Communication between the main thread and the worker uses postMessage with structured cloning or SharedArrayBuffer (where available).
File registration. Files aren't "uploaded" to DuckDB. Instead, the browser's File object is registered with DuckDB-WASM, which reads it on demand through the File API. This is a pull model: DuckDB requests bytes as needed during query execution, rather than loading the entire file into WASM memory upfront.
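The pull model can be sketched in a few lines. Here a Uint8Array stands in for a browser File, and `readRange` is a hypothetical helper, not the DuckDB-WASM API:

```javascript
// Simulated file: in the browser this would be a File/Blob the user
// selected, read lazily via slice() + FileReader.
const file = new Uint8Array(1024 * 1024); // pretend 1 MB file
let bytesRead = 0;

// Hypothetical pull-style reader: the engine asks for byte ranges
// only as query execution needs them.
function readRange(offset, length) {
  bytesRead += length;
  return file.subarray(offset, offset + length);
}

// A Parquet-style access pattern: footer first, then one column chunk.
const footer = readRange(file.length - 8, 8);
const columnChunk = readRange(4096, 65536);
// Only 65544 of 1048576 bytes ever crossed into the engine.
```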
Buffer management. DuckDB manages its own buffer pool within the WASM linear memory. When processing a query, it reads data in chunks, processes them, and releases the buffers. This is critical for keeping memory usage reasonable in the browser.
File format support
DuckDB-WASM reads multiple formats, each with different performance characteristics:
CSV
```javascript
await conn.query(`
  CREATE TABLE data AS SELECT * FROM read_csv_auto('file.csv')
`);
```
DuckDB parses the CSV, infers types by sampling rows, and creates a columnar in-memory table. The entire file must be read and parsed, which makes CSVs the slowest format to load but universally compatible.
Type inference samples an initial block of rows (tunable via the `sample_size` parameter). You can override it with explicit column types if inference gets it wrong:
```javascript
await conn.query(`
  CREATE TABLE data AS SELECT * FROM read_csv('file.csv',
    columns={'id': 'VARCHAR', 'amount': 'DOUBLE', 'date': 'DATE'}
  )
`);
```
Parquet
```javascript
await conn.query(`
  CREATE TABLE data AS SELECT * FROM 'file.parquet'
`);
```
Parquet is where DuckDB-WASM really shines. Parquet files are columnar and self-describing (schema is embedded in the file), which means:
- Column pruning: A query that only accesses 3 of 50 columns only reads those 3 columns from the file.
- Row group filtering: Parquet stores min/max statistics per row group. A `WHERE amount > 1000` filter can skip entire row groups without reading them.
- No type inference needed: The schema is in the file metadata.
For a 200MB Parquet file with 50 columns, a query accessing 3 columns might only read 12MB from the file. This is transformative for browser-based analytics.
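Row-group filtering is simple to model: compare each group's stored min/max against the predicate before touching the file. A toy sketch (statistics invented for illustration, not read from a real Parquet footer):

```javascript
// Per-row-group statistics, as stored in a Parquet footer (toy values).
const rowGroups = [
  { id: 0, min: 0.5,    max: 980.0,  rows: 100000 },
  { id: 1, min: 120.0,  max: 4300.0, rows: 100000 },
  { id: 2, min: 1500.0, max: 9100.0, rows: 100000 },
];

// For `WHERE amount > 1000`, any group whose max <= 1000 cannot
// contain a match, so its pages are never read.
const predicate = (g) => g.max > 1000;
const toScan = rowGroups.filter(predicate);
// Groups 1 and 2 are scanned; group 0 is skipped entirely.
```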
Excel
Excel support is available through DuckDB's spatial extension:
```javascript
await conn.query(`
  CREATE TABLE data AS SELECT * FROM st_read('file.xlsx')
`);
```
Excel parsing is slower than CSV (the .xlsx format is a compressed XML archive), but it handles multiple sheets, named ranges, and typed cells.
JSON
```javascript
await conn.query(`
  CREATE TABLE data AS SELECT * FROM read_json_auto('file.json')
`);
```
DuckDB handles both regular JSON and newline-delimited JSON (NDJSON). Nested objects are flattened into columns.
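A toy illustration of the two shapes (the `flatten` helper is hypothetical; DuckDB's own nested-JSON handling is considerably richer):

```javascript
// NDJSON: one JSON object per line, parseable record by record.
const ndjson = '{"user":{"id":7},"amount":3.5}\n{"user":{"id":8},"amount":1.25}';
const records = ndjson.split('\n').map((line) => JSON.parse(line));

// Minimal flattener turning nested objects into dotted column names.
function flatten(obj, prefix = '') {
  const out = {};
  for (const [key, value] of Object.entries(obj)) {
    if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      Object.assign(out, flatten(value, prefix + key + '.'));
    } else {
      out[prefix + key] = value;
    }
  }
  return out;
}
const flatRows = records.map((r) => flatten(r));
// flatRows[0] -> { "user.id": 7, "amount": 3.5 }
```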
Performance benchmarks
These benchmarks were run on a 2023 MacBook Pro (M3, 16GB RAM) using Chrome 130. The dataset is the NYC Taxi Trip Records (January 2023, Parquet format, ~3M rows, ~200MB).
Query 1: Simple aggregation
```sql
SELECT COUNT(*) FROM trips;
```
| Environment | Time |
|---|---|
| DuckDB native (CLI) | 45ms |
| DuckDB-WASM (browser) | 82ms |
| pandas (Python) | 210ms |
Query 2: Group by with aggregation
```sql
SELECT
  PULocationID,
  COUNT(*) as trip_count,
  AVG(trip_distance) as avg_distance,
  SUM(total_amount) as total_revenue
FROM trips
GROUP BY PULocationID
ORDER BY total_revenue DESC
LIMIT 10;
```
| Environment | Time |
|---|---|
| DuckDB native | 180ms |
| DuckDB-WASM | 340ms |
| pandas equivalent | 890ms |
Query 3: Window function
```sql
SELECT
  PULocationID,
  tpep_pickup_datetime::DATE as pickup_date,
  SUM(total_amount) as daily_revenue,
  SUM(SUM(total_amount)) OVER (
    PARTITION BY PULocationID
    ORDER BY tpep_pickup_datetime::DATE
  ) as cumulative_revenue
FROM trips
GROUP BY PULocationID, pickup_date
ORDER BY PULocationID, pickup_date;
```
| Environment | Time |
|---|---|
| DuckDB native | 520ms |
| DuckDB-WASM | 980ms |
| pandas equivalent | 2.4s |
Key observations
WASM overhead is roughly 1.5-2x native. This is consistent with general WASM benchmarks. For analytical queries, the columnar execution engine minimizes the impact because the tight inner loops (scanning arrays, computing aggregates) are exactly where WASM performs best.
DuckDB-WASM beats pandas for most analytical queries. This surprises people, but it makes sense. DuckDB's vectorized engine and query optimizer are built specifically for analytical workloads, while pandas eagerly materializes full intermediate results for each operation and carries Python interpreter overhead.
Parquet files are 3-5x faster to query than CSVs at scale because DuckDB can skip irrelevant columns and row groups. If you plan to analyze the same dataset repeatedly, converting to Parquet first is worth it.
Memory management in the browser
Browser tabs typically get 1-4GB of memory depending on the browser and system. DuckDB-WASM needs to work within this constraint.
WASM linear memory grows but doesn't shrink. Once WebAssembly.Memory is expanded, it stays at that size until the page is unloaded. DuckDB manages its own buffer pool within this memory, reusing buffers across queries, but the total WASM memory high-water mark persists.
Practical file size limits:
| File format | Comfortable limit | Possible with care | Beyond this, use CLI |
|---|---|---|---|
| CSV | ~200MB | ~500MB | >500MB |
| Parquet | ~500MB | ~1GB | >1GB |
| Excel | ~100MB | ~200MB | >200MB |
For Parquet, the comfortable limit is higher because column pruning means DuckDB doesn't need to materialize the entire file in memory.
What happens when you hit the limit: The browser kills the tab. There's no graceful degradation — the tab crashes. Good DuckDB-WASM applications should estimate memory requirements before running a query and warn the user if it's likely to exceed available memory.
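A pre-flight check can be as simple as comparing file size times an expansion factor against a memory budget. The factors and the `checkMemory` helper below are assumptions for illustration, not DuckDB constants:

```javascript
// Rough per-format expansion factors: how much WASM memory a fully
// materialized table tends to need relative to file size (assumed values;
// Parquet expands most because the file itself is compressed).
const EXPANSION = { csv: 2.0, parquet: 3.0, xlsx: 4.0 };

// Hypothetical guard to run before a CREATE TABLE ... AS SELECT.
function checkMemory(fileSizeBytes, format, budgetBytes = 1024 ** 3) {
  const estimated = fileSizeBytes * (EXPANSION[format] ?? 2.0);
  return { estimated, ok: estimated <= budgetBytes };
}

const small = checkMemory(200 * 1024 ** 2, 'csv');   // 400 MB estimate -> ok
const big = checkMemory(600 * 1024 ** 2, 'parquet'); // 1.8 GB estimate -> warn
```

In a real application the "warn" path would surface a dialog before running the query, since the failure mode is a crashed tab rather than an error message.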
API usage: building with DuckDB-WASM
If you're building a tool on top of DuckDB-WASM, here's the initialization pattern:
```javascript
import * as duckdb from '@duckdb/duckdb-wasm';
import duckdb_wasm from '@duckdb/duckdb-wasm/dist/duckdb-mvp.wasm?url';
import duckdb_worker from '@duckdb/duckdb-wasm/dist/duckdb-browser-mvp.worker.js?url';

// Initialize with Web Worker
const bundle = {
  mainModule: duckdb_wasm,
  mainWorker: duckdb_worker,
};
const logger = new duckdb.ConsoleLogger();
const worker = new Worker(bundle.mainWorker);
const db = new duckdb.AsyncDuckDB(logger, worker);
await db.instantiate(bundle.mainModule);

// Open connection
const conn = await db.connect();

// Register a file from user input
const fileInput = document.getElementById('file-input');
const file = fileInput.files[0];
await db.registerFileHandle(file.name, file, duckdb.DuckDBDataProtocol.BROWSER_FILEREADER, true);

// Create table from file
await conn.query(`CREATE TABLE data AS SELECT * FROM read_csv_auto('${file.name}')`);

// Run queries
const result = await conn.query(`
  SELECT column_name, column_type
  FROM information_schema.columns
  WHERE table_name = 'data'
`);

// Result is an Arrow Table
console.log(result.toArray());

// Clean up
await conn.close();
await db.terminate();
```
Important API details:
- Results are returned as Apache Arrow tables; `result.toArray()` converts the rows to JavaScript objects.
- `DuckDBDataProtocol.BROWSER_FILEREADER` tells DuckDB to read the file via the File API, not by copying it into WASM memory.
- Passing `true` as the last argument to `registerFileHandle` keeps the file handle open for lazy reading.
Limitations and honest tradeoffs
No persistent storage. DuckDB-WASM databases exist in memory. Close the tab and the data is gone. You can export results, but there's no "save database" button (OPFS — Origin Private File System — is an emerging solution for this, but browser support is still inconsistent).
No multi-tab sharing. Each tab runs its own DuckDB instance. Two tabs analyzing the same file maintain separate copies in memory. SharedArrayBuffer can help, but it requires specific CORS headers.
Extension loading is slower. Loading extensions (like spatial for Excel support) requires downloading additional WASM modules. This adds startup latency and bundle size.
No SIMD everywhere. WASM SIMD instructions (which DuckDB can use for vectorized operations) are supported in Chrome and Firefox but not in all browsers. Without SIMD, performance drops ~20-30% for numeric workloads.
String-heavy workloads are slower. WASM's linear memory model means string operations (comparison, hashing, regex) incur overhead from crossing the WASM/JS boundary. Numeric aggregations are the sweet spot.
What this enables
DuckDB-WASM makes a specific category of applications possible:
- Privacy-first data tools. Data stays in the browser sandbox. No server-side processing means no data exfiltration risk from the infrastructure layer.
- Zero-install analytics. Send someone a URL and they have a SQL database. No Python, no Docker, no database server. Onboarding friction drops to zero.
- Offline-capable analysis. Once the WASM module is cached, DuckDB-WASM works without internet. Analysts on planes, in secure facilities, or on unreliable connections can still work.
- Serverless architecture. No compute servers to provision, scale, or pay for. The user's machine does the work. This fundamentally changes the economics of building data tools.
The gap between "browser toy" and "real database" has effectively closed for datasets under a few hundred MB. That covers a surprising range of real-world analytical workloads.
QueryVeil is built on DuckDB-WASM with a schema-only AI layer for natural language queries. If you want to see the performance firsthand, the live demo runs DuckDB-WASM in your browser — open DevTools and watch it work. No signup required.