
Scan data from various sources

There are two ways to work with external data in Ladybug:

  1. Ingest - Copy data into Ladybug for storage and querying
  2. Query in place - Query external data sources directly without importing

Ingest data into Ladybug

Use COPY FROM to bulk import data from files or external databases into Ladybug node or relationship tables.

COPY Person FROM "person.csv";

Supported formats:

  • Parquet (recommended for performance)
  • CSV (slower but commonly used)
  • JSON (slower but commonly used)
  • NumPy
  • Pandas/Polars DataFrames
  • PyArrow Tables

LOAD FROM (scan in memory)

Use LOAD FROM to scan data from various sources into memory without inserting it into the database. This is useful for inspecting data structure and applying simple transformations before importing.

When scanning with LOAD FROM, only the WHERE clause is supported for filtering:

LOAD FROM "user.csv" (header = true)
WHERE age > 25
RETURN COUNT(*);

Example: CSV file

Say you have a user.csv file that looks like this:

user.csv
name,age
Adam,30
Karissa,40
Zhang,50
Noura,25
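
If you want to follow along, you can create this file yourself. A minimal plain-Python sketch (independent of Ladybug, using only the standard library) is:

```python
import csv

# Sample data matching the user.csv shown above
rows = [("Adam", 30), ("Karissa", 40), ("Zhang", 50), ("Noura", 25)]

with open("user.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "age"])  # header row
    writer.writerows(rows)
```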

You can scan the file and count the number of rows:

LOAD FROM "user.csv" (header = true)
RETURN COUNT(*);

The result:

┌──────────────┐
│ COUNT_STAR() │
│ INT64        │
├──────────────┤
│ 4            │
└──────────────┘

You can also apply filter predicates via the WHERE clause, like this:

LOAD FROM "user.csv" (header = true)
WHERE age > 25
RETURN COUNT(*);

The above query counts only the rows where the age column is greater than 25.

┌──────────────┐
│ COUNT_STAR() │
│ INT64        │
├──────────────┤
│ 3            │
└──────────────┘
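
As a sanity check, the same predicate can be reproduced in plain Python (an illustrative sketch, independent of Ladybug, with the sample data inlined):

```python
import csv
import io

# Same contents as the user.csv example above
csv_text = "name,age\nAdam,30\nKarissa,40\nZhang,50\nNoura,25\n"

reader = csv.DictReader(io.StringIO(csv_text))  # first row treated as header
# WHERE age > 25: raw CSV fields are strings, so cast before comparing
count = sum(1 for row in reader if int(row["age"]) > 25)
print(count)  # prints 3
```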

Note that when scanning from CSV files, Ladybug will attempt to auto-cast the data to the correct type when possible. For example, the age column is cast to an INT64 type.
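The auto-casting can be pictured as a per-value inference pass that tries progressively wider types. A rough Python analogue (a hypothetical helper for illustration, not Ladybug's actual implementation) is:

```python
def infer(value: str):
    """Try numeric casts in order, falling back to string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

# "30" becomes an integer, "Adam" stays a string
```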

You can reorder columns by returning them in the desired order, and use the LIMIT keyword to cap the number of rows returned. The example below returns the first two rows, with the age and name columns in that order.

LOAD FROM "user.csv" (header = true)
RETURN age, name
LIMIT 2;

┌───────┬─────────┐
│ age   │ name    │
│ INT64 │ STRING  │
├───────┼─────────┤
│ 30    │ Adam    │
│ 40    │ Karissa │
└───────┴─────────┘
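
The projection and LIMIT behave like this plain-Python sketch (illustrative only, with the sample data inlined):

```python
import csv
import io
from itertools import islice

# Same contents as the user.csv example above
csv_text = "name,age\nAdam,30\nKarissa,40\nZhang,50\nNoura,25\n"

reader = csv.DictReader(io.StringIO(csv_text))
# RETURN age, name LIMIT 2: project two columns, keep only the first two rows
first_two = [(int(row["age"]), row["name"]) for row in islice(reader, 2)]
print(first_two)  # prints [(30, 'Adam'), (40, 'Karissa')]
```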

Query external data in place

Query data directly from external sources without importing. Unlike LOAD FROM, this supports any Cypher query that works on native node/rel tables.

Supported sources

  • Parquet files
  • Arrow memory
  • DuckDB tables
  • PostgreSQL (coming soon)
  • SQLite (coming soon)

Syntax

Create a table that references an external data source using the WITH (storage="...") option:

CREATE NODE TABLE Employee (id INT64, name STRING, PRIMARY KEY (id))
WITH (storage='path/to/employee.parquet');
CREATE REL TABLE WorksIn (from Employee, to Company, since INT32)
WITH (storage='path/to/works_in.parquet');

Arrow memory

You can query Arrow memory directly by registering it with the database. The identifier in the arrow:// URI (my_arrow_table_id below) is obtained when you register the Arrow memory (e.g., via the Python API):

CREATE NODE TABLE Employee (id INT64, name STRING, PRIMARY KEY (id))
WITH (storage='arrow://my_arrow_table_id');

DuckDB

First, attach a DuckDB database:

ATTACH 'university.db' AS uw (dbtype duckdb);

Then create tables that reference the DuckDB tables:

NODE tables:

CREATE NODE TABLE Employee (id INT64, name STRING, PRIMARY KEY (id))
WITH (storage='uw.employee');
CREATE NODE TABLE Company (id INT64, name STRING, PRIMARY KEY (id))
WITH (storage='uw.company');

REL table:

CREATE REL TABLE WorksIn (from Employee, to Company, since INT32)
WITH (storage='uw.worksIn');

Once created, you can query the tables just like any other Ladybug table:

MATCH (e:Employee)-[w:WorksIn]->(c:Company)
RETURN e.name, c.name, w.since
LIMIT 5;