This function opens a dataset from a variety of sources, including Parquet, CSV, and others, using local file system paths, URLs, or S3 bucket URI notation.

Usage

open_dataset(
  sources,
  schema = NULL,
  hive_style = TRUE,
  unify_schemas = FALSE,
  format = c("parquet", "csv", "tsv", "sf"),
  conn = cached_connection(),
  tblname = tbl_name(sources),
  mode = "VIEW",
  filename = FALSE,
  recursive = TRUE,
  ...
)

Arguments

sources

A character vector of paths to the dataset files.

schema

The schema for the dataset. If NULL, the schema will be inferred from the dataset files.

hive_style

A logical value indicating whether the dataset uses Hive-style partitioning.

unify_schemas

A logical value indicating whether to unify the schemas of the dataset files (union_by_name). If TRUE, executes a UNION by column name across all files. Note: this can add considerably to the initial execution time.

format

The format of the dataset files. One of "parquet", "csv", "tsv", or "sf" (spatial vector files supported by the sf package / GDAL). If no value is provided, the function will try to guess the type based on minimal heuristics.

conn

A connection to a database.

tblname

The name of the table to create in the database.

mode

The mode in which to create the table. One of "VIEW" or "TABLE". Creating a VIEW (the default) executes more quickly because it does not create a local copy of the dataset. TABLE creates a local copy in duckdb's native format, downloading the full dataset if necessary. When using TABLE mode with large data, be sure to use a conn with disk-based storage, e.g. cached_connection("storage_path"); otherwise the full dataset must fit into RAM. Using TABLE assumes familiarity with R's DBI-based interface.
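
A minimal sketch of TABLE mode with a disk-backed connection, assuming the duckdbfs package is attached; the Parquet file written here from mtcars and the tempfile() storage path are illustrative, not part of this function's defaults:

```r
library(duckdbfs)

# Write a small example Parquet file to open (illustrative data)
path <- tempfile(fileext = ".parquet")
write_dataset(mtcars, path)

# A connection backed by on-disk storage, so the table need not fit in RAM
conn <- cached_connection(tempfile())

# TABLE mode materialises a local copy in duckdb's native format
cars <- open_dataset(path, conn = conn, mode = "TABLE", recursive = FALSE)
```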

filename

A logical value indicating whether to include the filename in the table name.

recursive

A logical value indicating whether paths should be traversed recursively (default TRUE). Set to FALSE when opening a single, un-partitioned file.

...

Optional additional arguments passed to duckdb_s3_config(). Note that these apply after any settings given via the URI notation, and thus may be used to override those settings or to provide settings not supported in that format.

Value

A lazy dplyr::tbl object representing the opened dataset backed by a duckdb SQL connection. Most dplyr (and some tidyr) verbs can be used directly on this object, as they can be translated into SQL commands automatically via dbplyr. Generic R commands require using dplyr::collect() on the table, which forces evaluation and reading the resulting data into memory.
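
A short sketch of this lazy workflow, assuming duckdbfs and dplyr are attached; the Parquet file written from mtcars is illustrative so the example is self-contained:

```r
library(duckdbfs)
library(dplyr)

# Write a small example Parquet file (illustrative data)
path <- tempfile(fileext = ".parquet")
write_dataset(mtcars, path)

ds <- open_dataset(path, recursive = FALSE)

# filter() and summarise() are translated to SQL and run inside duckdb;
# collect() forces evaluation and reads the result into memory
result <- ds |>
  filter(cyl == 6) |>
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) |>
  collect()
```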

Examples

if (FALSE) { # interactive()
# A remote, hive-partitioned Parquet dataset
base <- paste0("https://github.com/duckdb/duckdb/raw/main/",
             "data/parquet-testing/hive-partitioning/union_by_name/")
f1 <- paste0(base, "x=1/f1.parquet")
f2 <- paste0(base, "x=1/f2.parquet")
f3 <- paste0(base, "x=2/f2.parquet")

open_dataset(c(f1,f2,f3), unify_schemas = TRUE)

# Access an S3 database specifying an independently-hosted (MINIO) endpoint
efi <- open_dataset("s3://neon4cast-scores/parquet/aquatics",
                    s3_access_key_id="",
                    s3_endpoint="data.ecoforecast.org")
}