DuckDB: The Awesome Tiny & Powerful Analytics Database

Most people assume that analytical databases, or better known as online analytical processing (OLAP) databases are super massive, powerful beasts — and they are correct about assuming that. Systems like Snowflake, Redshift, or Postgres involve a lot of setup and maintenance from systems administrators – this is even in their cloud-hosted incarnations. The question is what if all you want is just enough analytics for a dataset on your desktop? In that case, DuckDB is definitely worth exploring as an option for a database that could fit your needs.

DuckDB Can Run on a Laptop

DuckDB is a super tiny but very powerful analytics database engine — a single, self-contained executable – which can even run as a standalone instance, and or as a loadable library inside a host process. There's honestly very little you need to set up or maintain with DuckDB. If you really want to think about it, then it's more comparable to using SQLite than the bigger analytical databases.

DuckDB is designed for column-oriented data querying. It ingests data from sources like cvs, json, and Apache Parquet. This allows you to enable fast querying using familiar SQL syntax. DuckDB supports libraries for all the major programming languages, so you can work with it programmatically using the language of your choice. You can also use DuckDB's command-line interface. This is either on its own or as part of a shell pipeline.

Do you like what you are reading so far? When you are done with this article, we recommend reading Simple Implementation to Understand worker_threads in NodeJS as the next article you read.

Loading Data into DuckDB

When a developer chooses to work with data in DuckDB, there are two modes you can use for that data. The first is Persistent Mode, which writes the data to disk so it can handle workloads bigger than system memory would allow. This approach comes at the cost of some speed and or a small performance loss. Then you have In-memory mode, which keeps the data set entirely in memory, which is a lot faster but retains nothing once the program has ended. Please note that SQLite can be used the same way also.

DuckDB can ingest data from a variety of formats. CSV, JSON, and Apache Parquet files are three of the most common. With CSV and JSON, DuckDB by default attempts to figure out the columns and data types on its own, but you can override that process as needed—for instance, to specify a format for a date column.

Other databases, like MySQL or Postgres, can also be used as data sources. You'll need to load a DuckDB extension (more information on this later in the article) and provide a connection string to the database server; DuckDB doesn't read the files for those databases directly. With using SQLite, though, you connect to the SQLite database file as though it were just another data file.

To load data into DuckDB from an external source, you can use an SQL string, passed directly into DuckDB like in the following SQL example.

SELECT * FROM read_csv('data.csv');

You can also use methods in the DuckDB interface library for a given language. With the Python library for DuckDB, ingesting looks like this example.

import duckdb
duckdb.read_csv("data.csv")

You can also query certain file formats directly, like Parquet, if you would like.

SELECT * FROM 'test.parquet';

You can also issue file queries to create a persistent data view, which is usable as a table for multiple queries.

CREATE VIEW test_data AS SELECT * FROM read_parquet('test.parquet');

DuckDB has optimizations for working with Parquet files, so that it reads only what it needs from the file, instead of all data presented in the files.

Other interfaces like ADBC and ODBC can also be used. ODBC serves as a connector for data visualization tools like Tableau.

Data imported into DuckDB can also be re-exported in many common formats: CSV, JSON, Parquet, Microsoft Excel, and others. This makes DuckDB useful as a data-conversion tool in a processing pipeline.

Querying Data in DuckDB

Once you've loaded some data into DuckDB, you can query it using SQL expressions. The format for such expressions is no different from regular SQL queries:

SELECT * FROM users WHERE ID>1000 ORDER BY Name DESC LIMIT 5;

If you're using a client API to query DuckDB, you can pass SQL strings through the API, or you can use the client's relational API to build up queries programmatically. In Python, reading from a JSON file and querying it could look like the following code example.

import duckdb
file = duckdb.read_json("users.json")
file.select("*").filter("ID>1000").order("Name").limit(5)

If you are using Python, you can use the PySpark API to query DuckDB directly, although DuckDB's implementation of PySpark doesn't yet support the full feature set.

DuckDB's dialect of SQL closely follows most common SQL dialects, although it comes with a few gratuitous additions for the sake of analytics. For instance, placing the SAMPLE clause in a query lets you run a query using only a subset of the data in a table. The resulting query runs faster but it may be less accurate. DuckDB also supports the PIVOT keyword (for creating pivot tables), window functions and QUALIFY clauses to filter them, and many other analytics functions in its SQL dialect.

What About Extensions?

DuckDB isn't just limited to the data formats and behaviors baked into it. It also has a good extension API makes it possible to write third-party addons for DuckDB to support new data formats or other behaviors.

Some of the functionality included with DuckDB is implemented through first-party addons, like support for Parquet files. Others, like MySQL or Postgres connectivity, or vector similarity search, are also maintained by DuckDB's team but provided separately.

Do you like what you're reading from the CoderOasis Technology Blog? We recommend reading our Implementing RSA in Python from Scratch series next.

The CoderOasis Community

Did you know we have a Community Forums and Discord Server? which we invite everyone to join us? Want to discuss this article with other members of our community? Want to join a laid back place to chill and discuss topics like programming, cybersecurity, web development, and Linux? Consider joining us today!

DuckDB: The Awesome Tiny & Powerful Analytics Database

DuckDB Can Run on a Laptop

Loading Data into DuckDB

Querying Data in DuckDB

What About Extensions?

The CoderOasis Community

Article Comments

Read next

Capturing Garbage Collection Traces in NodeJS Applications

Deno 1.43 Released: Language Server Performance Boosts

What is Docker? Containerize Applications to Run on Any System