There are numerous packages that already interface with the AWS S3 protocol for object storage. Most rely directly on calls to the low-level S3 REST API through R packages such as httr, an approach that requires significant amounts of code to provide high-level functionality (e.g. handling authentication, paging over results, parsing returned XML) and is thus prone to inefficiency and bugs. Many also implicitly assume that Amazon is the underlying provider, making it difficult or impossible to work with the substantial and growing number of object stores that now conform to the AWS S3 standard. These include NSF’s OpenStorageNetwork, Jetstream2 (both based on open source Red Hat Ceph), NCAR’s Stratus (based on Western Digital S3), and MinIO servers (another open source implementation popular with companies and developers), as well as Google Cloud Storage’s S3 compatibility mode.
In contrast, the MinIO Client, open-source (AGPL-v3) software developed in Go by the MinIO team, provides a high-performance utility with an intuitive design for working across multiple cloud-based object stores as well as local filesystems. This package provides a thin R wrapper around that client, maximizing performance and minimizing the maintenance burden and the potential for bugs. A helper utility provides a convenient way to install and update the Go binary across operating systems and architectures. The client supports parallel threads by default, intuitive handling of bucket permissions such as granting or revoking anonymous access, and persistent configurations across multiple clouds. After struggling against the limitations of many different R wrappers for S3 object stores, this is now my go-to.
You can install the development version of minioclient from GitHub with:
# install.packages("devtools")
devtools::install_github("cboettig/minioclient")
At first use, all operations will attempt to install the client (after prompting) if it is not already installed. Users can also install the latest version of the MinIO client explicitly using install_mc().
The MinIO client is designed to support multiple endpoints for cloud storage, including AWS, Google Cloud Storage (via S3 compatibility), and other S3-compatible systems such as open source MinIO or Red Hat Ceph storage. MinIO uses a syntax based around aliases to allow access across multiple platforms. Aliases can be configured with access key pairs to allow authenticated access.
By default, the client comes pre-configured with credentials for the MinIO play platform, designed for public experimental storage and examples. We can use mc_alias_ls() to see all configured aliases, or to show only the alias we specify.
Some S3 object storage systems allow access without credentials. Confusingly, attempting to access public data with invalid credentials will still fail, so we need to specify an anonymous endpoint with no credentials. By default, mc_alias_set() will seek to use AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from your environment, if set. This allows minioclient to be used in scripts with authentication keys passed in securely as environment variables. To set up anonymous access, simply pass empty credentials, like so:
mc_alias_set("anon", "s3.amazonaws.com", access_key = "", secret_key = "")
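The fallback from explicit arguments to environment variables can be sketched in base R. This is only an illustration of the pattern, not the package's implementation: resolve_credentials() is a hypothetical helper, and AWS_ACCESS_KEY_ID is assumed here as the conventional companion to AWS_SECRET_ACCESS_KEY.

```r
# Hypothetical helper (not part of minioclient): default to environment
# variables, but let explicit arguments take precedence.
# AWS_ACCESS_KEY_ID is an assumed variable name (the usual AWS convention).
resolve_credentials <- function(
    access_key = Sys.getenv("AWS_ACCESS_KEY_ID"),
    secret_key = Sys.getenv("AWS_SECRET_ACCESS_KEY")) {
  list(
    access_key = access_key,
    secret_key = secret_key,
    # Two empty strings mean "no credentials", i.e. anonymous access.
    anonymous = !nzchar(access_key) && !nzchar(secret_key)
  )
}

# Explicit empty strings override whatever is set in the environment.
anon <- resolve_credentials(access_key = "", secret_key = "")
anon$anonymous
```

Because the defaults are evaluated at call time, keys exported in the shell that launches a script are picked up without ever appearing in the script itself.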
Aliases are stored in a persistent configuration file, so each alias needs to be created only once on a given machine. All mc functions identify the cloud provider using a file-path notation, <ALIAS>/<BUCKET>/<PATH>. For instance, we can list all objects found in the bucket gbif-open-data-us-east-1, which is a public bucket included in the AWS Open Data Registry:
mc_ls("anon/gbif-open-data-us-east-1")
#> "index.html" "occurrence/"
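To make the <ALIAS>/<BUCKET>/<PATH> notation concrete, here is a minimal base R sketch that splits such a target into its three parts. The helper split_mc_path() is hypothetical and exists only for illustration; the package does not need you to do this yourself.

```r
# Hypothetical helper, for illustration only (not part of minioclient).
split_mc_path <- function(target) {
  parts <- strsplit(target, "/", fixed = TRUE)[[1]]
  list(
    alias  = parts[1],
    bucket = if (length(parts) >= 2) parts[2] else NA_character_,
    # Everything after the bucket is the object key (or prefix); may be empty.
    # Note that strsplit() drops a trailing "/", so "a/b/c/" yields path "c".
    path   = if (length(parts) >= 3) paste(parts[-(1:2)], collapse = "/") else ""
  )
}

target <- split_mc_path("anon/gbif-open-data-us-east-1/index.html")
str(target)
```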
mc functions also understand local filesystem paths. Any absolute path (a path starting with /), or any relative path not recognized as a registered alias, will be interpreted as a local path. (Note: be careful not to give local folders the same name as a remote alias!) For instance, we can list the contents of the local R directory:
mc_ls("R")
#> "install_mc.R" "mc.R" "mc_alias.R" "mc_anonymous.R"
#> "mc_cat.R" "mc_config_set.R" "mc_cp.R" "mc_diff.R"
#> "mc_du.R" "mc_head.R" "mc_ls.R" "mc_mb.R"
#> "mc_mirror.R" "mc_mv.R" "mc_rb.R" "mc_rm.R"
#> "mc_sql.R" "mc_stat.R"
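The rule for telling local paths from remote ones can be written down in a few lines of base R. This is a sketch of the rule as described above, not the client's actual logic; is_local_path() and the aliases vector are hypothetical.

```r
# Hypothetical helper (not part of minioclient). `aliases` is a character
# vector of registered alias names.
is_local_path <- function(path, aliases) {
  # Absolute paths are always local.
  if (startsWith(path, "/")) return(TRUE)
  # Otherwise the first path component decides: a registered alias is remote.
  first <- strsplit(path, "/", fixed = TRUE)[[1]][1]
  !(first %in% aliases)
}

aliases <- c("play", "anon")
is_local_path("R", aliases)                              # relative, no alias
is_local_path("anon/gbif-open-data-us-east-1", aliases)  # starts with an alias
is_local_path("/home/user/data", aliases)                # absolute
is_local_path("play", aliases)                           # a local "play" folder is shadowed
```

The last call shows why name collisions matter: a local folder called play would be treated as the remote alias.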
This notation makes it easy to move data between local and remote systems, or even between two remote systems. Let’s copy the index.html file from GBIF to our local file system.
Just to prove this is indeed a local copy, we can inspect the local file:
fs::file_info("gbif.html")
#> # A tibble: 1 × 18
#>   path       type     size permissions modification_time   user  group device_id
#>   <fs::path> <fct> <fs::b> <fs::perms> <dttm>              <chr> <chr>     <dbl>
#> 1 gbif.html  file    31.6K rw-r--r--   2023-11-05 22:54:15 cboe… cboe…     66307
#> # ℹ 10 more variables: hard_links <dbl>, special_device_id <dbl>, inode <dbl>,
#> #   block_size <dbl>, blocks <dbl>, flags <int>, generation <dbl>,
#> #   access_time <dttm>, change_time <dttm>, birth_time <dttm>
For any object store where we have adequate permissions, we can create new buckets:
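The examples that follow refer to a bucket path stored in play_bucket, which in the transfer log has the form play/play-<random letters>. A minimal sketch of generating such a name in base R is shown here; random_bucket() and the 12-letter suffix are illustrative assumptions, and the bucket-creation call is left as a comment because it needs the package and network access.

```r
# Build a bucket path of the form "play/play-<n random lowercase letters>".
# The helper name and default length are illustrative assumptions.
random_bucket <- function(alias = "play", n = 12) {
  suffix <- paste0(sample(letters, n, replace = TRUE), collapse = "")
  paste0(alias, "/", alias, "-", suffix)
}

play_bucket <- random_bucket()
play_bucket

# With minioclient loaded, the bucket would then be created with the
# make-bucket wrapper (assumed call, not run here):
# mc_mb(play_bucket)
```

A random suffix keeps the name unlikely to clash with other users' buckets on a shared public server.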
We can copy files or directories to the remote bucket:
mc_cp("anon/gbif-open-data-us-east-1/index.html", play_bucket)
mc_cp("R/", play_bucket, recursive = TRUE, verbose = TRUE)
#> `/home/cboettig/cboettig/minioclient/R/mc.R` -> `play/play-hmdzuvevfzdi/mc.R`
#> `/home/cboettig/cboettig/minioclient/R/install_mc.R` -> `play/play-hmdzuvevfzdi/install_mc.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_alias.R` -> `play/play-hmdzuvevfzdi/mc_alias.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_anonymous.R` -> `play/play-hmdzuvevfzdi/mc_anonymous.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_config_set.R` -> `play/play-hmdzuvevfzdi/mc_config_set.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_cat.R` -> `play/play-hmdzuvevfzdi/mc_cat.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_cp.R` -> `play/play-hmdzuvevfzdi/mc_cp.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_diff.R` -> `play/play-hmdzuvevfzdi/mc_diff.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_du.R` -> `play/play-hmdzuvevfzdi/mc_du.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_head.R` -> `play/play-hmdzuvevfzdi/mc_head.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_ls.R` -> `play/play-hmdzuvevfzdi/mc_ls.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_mb.R` -> `play/play-hmdzuvevfzdi/mc_mb.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_mirror.R` -> `play/play-hmdzuvevfzdi/mc_mirror.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_mv.R` -> `play/play-hmdzuvevfzdi/mc_mv.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_rb.R` -> `play/play-hmdzuvevfzdi/mc_rb.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_rm.R` -> `play/play-hmdzuvevfzdi/mc_rm.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_sql.R` -> `play/play-hmdzuvevfzdi/mc_sql.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_stat.R` -> `play/play-hmdzuvevfzdi/mc_stat.R`
#> Total: 0 B, Transferred: 22.00 KiB, Speed: 314.03 KiB/s
Note the use of recursive = TRUE to transfer all objects matching the pattern. In S3 object stores, file paths are really just prefixes, so a query for the prefix R would include not only everything in the R folder but also README.md, since it also matches the prefix. (With the prefix R/, as used above, README.md is not matched and the R scripts go directly into the play_bucket root instead of an R/ subfolder.)
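The effect of a trailing slash on prefix matching can be checked in plain R, without touching any object store. The object keys below are hypothetical and stand in for the contents of a bucket.

```r
# Hypothetical object keys, as they might appear in a bucket holding a
# copy of an R package directory.
keys <- c("R/mc.R", "R/mc_ls.R", "README.md", "DESCRIPTION")

# Prefix "R": matches everything under R/ and also README.md.
with_r <- keys[startsWith(keys, "R")]
with_r

# Prefix "R/": matches only the objects under the R/ "folder".
with_r_slash <- keys[startsWith(keys, "R/")]
with_r_slash
```

Since object stores have no real directories, the trailing slash is the only thing separating "the R folder" from "anything whose name begins with R".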
We can examine disk usage of remote objects or directories:
We can also adjust permissions for anonymous access:
Public objects can be accessed directly over HTTPS using the endpoint URL, bucket name, and path:
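As a rough sketch, such a URL can be assembled with paste0(). The helper public_url() is hypothetical, the path-style layout (endpoint/bucket/path) is an assumption that holds for many but not all S3-compatible servers, and the endpoint and bucket names are illustrative.

```r
# Hypothetical helper (not part of minioclient): build a path-style URL
# from the endpoint host, the bucket name and the object path.
public_url <- function(endpoint, bucket, path) {
  paste0("https://", endpoint, "/", bucket, "/", path)
}

# Endpoint, bucket and object names are illustrative assumptions.
url <- public_url("play.min.io", "play-hmdzuvevfzdi", "index.html")
url

# A publicly readable object could then be fetched with base R, e.g.:
# download.file(url, "index.html")
```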
Any command supported by the MinIO client can be accessed using the function mc(). This function can be used in place of any of the above methods, or to access additional methods for which no wrapper exists; see mc("-h") for a complete list. R functions such as mc_ls() are merely helpful wrappers around the more generic mc() utility, e.g. mc("ls play") is equivalent to mc_ls("play"). Providing helper methods allows tab-completion discovery of functions, R-based documentation, and improved handling of display behavior (e.g. verbose=FALSE by default on certain commands). See the official mc client docs for details.
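The relationship between a generic command runner and its helper wrappers can be sketched as follows. This is not minioclient's code: make_wrapper() and demo_ls() are hypothetical, and identity() stands in for the function that would actually execute the command, so that the composed command string is simply returned.

```r
# Hypothetical sketch of the wrapper pattern; not the package's implementation.
# `runner` stands in for whatever actually executes the command string.
make_wrapper <- function(subcommand, runner) {
  function(target, flags = "") {
    # Drop empty pieces so no stray spaces end up in the command.
    pieces <- c(subcommand, flags, target)
    runner(paste(pieces[nzchar(pieces)], collapse = " "))
  }
}

# Use identity() as the runner to inspect the command that would be run.
demo_ls <- make_wrapper("ls", runner = identity)
demo_ls("play")
demo_ls("play", flags = "--recursive")
```

Each wrapper adds only a subcommand name and argument handling, which is why anything without a wrapper remains reachable through the generic function.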
In addition to the usual R documentation, users can display full help information for any method by passing the argument "-h". This includes details on optional flags and further examples.
We can now use arbitrary mc commands (see quickstart). For example, we can examine file information to confirm that the ETags (md5sums here) match for these objects: