Identifiers for Iterative Forecasting
This document proposes the use of content-based identifiers for publishing products associated with automated, iterative forecasts. Iterative forecasting will frequently involve automatically running code which ingests public data products and generates output forecast products, along with associated metadata. Consequently, the forecasts produced may depend on the code and software which defines the forecast algorithm, as well as the input data used. Researchers must be able to uniquely identify and access each forecast generated by running the algorithm, as well as the associated input data files and code.
We propose that forecast products be identified by their SHA-256 checksum in the Hash-URI format:
hash://sha256/<HASH>
Note that this is an un-salted hash, containing no additional metadata beyond the pure file hash (in contrast to other content-based storage systems such as dat or IPFS). Consequently, the URI tells us everything we need to know to generate the hash (i.e. the algorithm used is sha256). For example, we can create an identifier for the csv serialization of the popular example dataset mtcars (Henderson and Velleman 1981) in R (R Core Team 2020) as follows:
readr::write_csv(mtcars, "mtcars.csv")
hash <- openssl::sha256(file("mtcars.csv"))
paste0("hash://sha256/", hash)
## [1] "hash://sha256/c802190c43e02246da9c6c9c3f13a58f076cc6b77922f4d9766a3c6bdb1b52bd"
Here we have used the openssl package’s implementation of the sha256 algorithm, which binds a fast and widely used C library (Ooms 2020). Many other implementations are readily available (e.g. the digest package in R (Antoine Lucas et al. 2020), or the command-line tool sha256sum (Drepper, Miller, and Madore 2018)), and all will produce the identical hash.
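For instance, a minimal check with the digest package (assuming mtcars.csv was written as above) should yield the same hexadecimal string:
library(digest)
# hash the raw bytes of the file, not the serialized R object
paste0("hash://sha256/", digest("mtcars.csv", algo = "sha256", file = TRUE))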
Unlike a content-based identifier, a DOI can be resolved only through a single central service (the https://doi.org resolution service), even though most robust archival storage requires that backup copies of content be stored in other archives (e.g. the DataONE network for data, or the LOCKSS or CLOCKSS networks used by scientific publishers). The Hash Archive, https://hash-archive.org, provides a service similar to the https://doi.org resolver for content-based identifiers, but instead returns all registered locations. This same property of content-based identifiers underlies other distributed storage algorithms such as “torrents.” Note that this is not a replacement for archival storage repositories: ideally, at least one registered location corresponds to an archival repository for any data that needs to be permanently archived. Anyone who holds a copy of the content can, of course, always compute the sha256 identifier for it.
To facilitate the use of content-based identifiers, we provide a simple R package implementation, contentid. To illustrate a trivial forecasting workflow, we will begin with a table of Carabid beetle species richness derived at biweekly sampling intervals for each site in the National Ecological Observatory Network, NEON (carabid?). We resolve the species richness data using its content identifier:
Sys.setenv("CONTENTID_REGISTRIES"="https://hash-archive.carlboettiger.info")
library(contentid)
richness <- readr::read_csv(resolve("hash://sha256/280700dbc825b9e87fe9e079172d70342e142913d8fb38bbe520e4b94bf11548"))
For illustrative purposes, let us make a baseline probabilistic forecast using the historical mean and standard deviation as our prediction for the monthly species richness that will be observed at each site in 2021:
library(dplyr)
library(tidyr)
richness_forecast <- richness %>%
  group_by(month, siteID) %>%
  summarize(mean = mean(n, na.rm = TRUE),
            sd = sd(n, na.rm = TRUE)) %>%
  mutate(sd = replace_na(sd, mean(sd, na.rm = TRUE))) %>%
  mutate(year = 2021)
readr::write_csv(richness_forecast, "richness_forecast.csv")
We can then compute the content identifier for our forecast using the function content_id():
content_id("richness_forecast.csv")
## [1] "hash://sha256/7351b00997cc8832db5251af5efdfdc700fccd4d57990d10ea9c67758849a957"
In order to resolve this identifier, we must first register it. Note that our call to register() also returns the file’s content identifier, so we don’t need to call content_id() separately.
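The registration chunk itself is not shown in this rendering; a minimal sketch, assuming a local registry file local.tsv (the registry consulted when resolving below):
id <- register("richness_forecast.csv", registries = "local.tsv")
id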
## [1] "hash://sha256/7351b00997cc8832db5251af5efdfdc700fccd4d57990d10ea9c67758849a957"
We can now resolve this id:
resolve(id, registries = "local.tsv")
## [1] "/home/cboettig/Documents/cboettig/contentid/vignettes/technotes/richness_forecast.csv"
Because we registered only a local path to the file, this simply returns the relative local path. This is still sufficient to use within a script.
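The corresponding chunk is likewise missing from this rendering; a sketch of the pattern, reusing the id and local registry from above:
# read the forecast back in by identifier rather than by file path
richness_forecast <- readr::read_csv(resolve(id, registries = "local.tsv"))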
Eventually we may want to make this data file available at some public URL, such as on GitHub or in an S3 bucket, to share it with colleagues or other computational resources before we are ready to publish it. To illustrate this, I’ve placed a copy in an S3 bucket on my MINIO server. We can go ahead and register this new public URL:
register("https://minio.carlboettiger.info/shared-data/richness_forecast.csv")
Note that this again returns the same identifier, which has been freshly calculated from the file. resolve() will work as before, and will still return our local path as long as that file exists and matches the identifier. But if we delete the file, or, worse, accidentally overwrite it with some other data, resolve() will detect that the identifier does not match and fall back on the registered URL:
## [1] NA
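The chunks for this step are not preserved here; a minimal sketch of what they might look like, reusing the id, local registry, and hash-archive registry from above:
# oops: accidentally overwrite the local forecast file with unrelated data
readr::write_csv(mtcars, "richness_forecast.csv")
# the local path now fails verification, so resolve() falls back on the
# registered URL and returns a verified copy downloaded to a temporary directory
resolve(id, registries = c("local.tsv", "https://hash-archive.carlboettiger.info"))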
Note that this time, resolve() has not returned the local file richness_forecast.csv, but instead the path to a temporary file. Internally, resolve() has first confirmed that while the local path richness_forecast.csv still exists, its hash no longer matches the requested identifier. Fortunately, because we also registered a URL for this identifier, resolve() has fallen back on that alternative source, downloaded the file at that URL to the temporary directory, and then computed the content id of the downloaded file to confirm that it matches the requested identifier. This all happens behind the scenes, so that our workflow of reading the data with read_csv(resolve(id)) continues to work unchanged, despite the local copy being corrupted and the data now coming from the remote URL. In similar fashion, once our data is finally uploaded to a permanent data archive, we can add that permanent location to the registry, much as we added the less persistent URL of our own server.
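A sketch of that final step, using a purely hypothetical archive URL:
# once a permanent archive copy exists, register its URL as one more location
register("https://knb.example.org/permanent/richness_forecast.csv")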
As we have just seen, using this pattern of read_csv(resolve(id)) instead of the more common pattern, read_csv("mtcars.csv"), has numerous advantages:
- resolve() will automatically verify that the file read in matches the cryptographic hash, ensuring integrity and reproducibility.
- resolve() will prefer local files when available, avoiding repeated downloads when a script is frequently re-run.
- scripts using resolve() will be more portable than scripts which assume a local file is available at a specific path.
- scripts using resolve() can become more robust to link-rot (Elliott, Poelen, and Fortes 2020).
It is also worth noting that the strategy outlined here can easily be applied independently of the contentid R package, in other computer languages and scripts: these benefits follow immediately from using content-based hashes as object identifiers. The approach taken in contentid is based on previous implementations, including Hash Archive (written in C) and Preston (Java) (Poelen 2020).
Technical notes:
- The hash URI format uses hexadecimal encoding of the hash, a 64-character lower-case alphanumeric string. Alternative content-based identifier formats recognized by Hash Archive, including the named information (ni) and subresource integrity formats, use base-64 encoding. While these are shorter (43 characters), they are case-sensitive and include additional characters such as /, which can lead to confusion or errors.
- While the hash URI format is not a W3C-recognized format or namespace, we have found this format to be more intuitive and practical than the alternatives.
- Because hashes encode the most significant characters first, it is often possible to omit many of the trailing characters and still successfully resolve the identifier uniquely. Of course, using fewer characters increases the chance of a collision. For example:
content_id( resolve("hash://sha256/280700dbc825b9") )
## [1] "hash://sha256/280700dbc825b9e87fe9e079172d70342e142913d8fb38bbe520e4b94bf11548"
The Software Heritage Project (Cosmo and Zacchiroli 2020), for example, periodically archives the content of all public repositories on GitHub (and elsewhere, including the packages of the Comprehensive R Archive Network, CRAN), and also allows us to query for any object in its archive using the SHA-256 signature. We can query the Software Heritage index to see if anyone has already written the popular example mtcars data to a csv file and uploaded it to a public GitHub repository or another location indexed by Software Heritage:
query <- sources_swh("hash://sha256/c802190c43e02246da9c6c9c3f13a58f076cc6b77922f4d9766a3c6bdb1b52bd")
url <- query$source[[1]]
Indeed it has! While some data products will be too large to make available through GitHub or BitBucket repositories, it is worth noting that users who deposit data in those locations can trigger Software Heritage to generate a persistent snapshot of all the content, which can then be queried in this way, by using the store_swh() function from contentid, or the Software Heritage API or web interface.
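Before relying on the location returned above, we could also confirm that the content it serves matches the identifier we queried; a minimal sketch, assuming content_id() accepts a URL (downloading and hashing the content at that address):
# re-hash the content found at the returned location to confirm it matches
content_id(url)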
The DataONE API also allows us to query for any object in its system by content hash (checksum), but unlike the Software Heritage Archive, many objects have only a SHA-1 or MD5 sum recorded. This is not an obstacle for new uploads, which can easily opt in to using sha256. Even more conveniently, the DataONE API allows us to specify our own identifiers (provided they don’t conflict with anything already in the DataONE registry). This allows us to upload data to and download data from DataONE repositories such as the KNB using content-based identifiers, like so:
library(dataone)
library(datapack)
library(mime)
# Choose the DataONE node: the staging test node when a test token is set,
# otherwise the production KNB member node
dataone_node <- function(){
  if(!is.null(getOption("dataone_test_token")))
    return( dataone::D1Client("STAGING2", "urn:node:mnTestKNB") )
  dataone::D1Client("PROD", "urn:node:KNB")
}
# Publish a file to DataONE, using its content identifier as the object id
# and recording the same SHA-256 hash as the object checksum
publish_dataone <- function(file){
  id <- as.character(contentid::content_id(file))
  d1c <- dataone_node()
  d1Object <- new("DataObject", id, format = mime::guess_type(file), filename = file)
  d1Object@sysmeta@checksum <- gsub("^hash://\\w+/", "", id)
  d1Object@sysmeta@checksumAlgorithm <- "SHA-256"
  dataone::uploadDataObject(d1c, d1Object, public = TRUE)
  id
}
Having defined our helper function, we must also create an account / log in to the DataONE portal (https://search.dataone.org for the production system, or https://search-stage-2.test.dataone.org/ for the testing system) and copy over our credential token from the user settings. Note that these tokens expire every 18 hours. Then we can use this helper to publish any CSV file to DataONE:
readr::write_csv(richness_forecast, "richness_forecast.csv")
publish_dataone("richness_forecast.csv")
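The helper functions above look for the authentication token in R’s options; a minimal sketch of supplying it, where the token value is a placeholder copied from the portal’s user settings:
# dataone_node() checks getOption("dataone_test_token");
# production tokens are supplied analogously via options(dataone_token = ...)
options(dataone_test_token = "<paste your token here>")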
Similarly, we can define a function to resolve our object from the DataONE archive using the content-based identifier:
# Construct the DataONE resolve URL for a content-based identifier
resolve_dataone <- function(id, url_only = FALSE){
  d1c <- dataone_node()
  paste0(d1c@cn@baseURL, "/v2/resolve/", utils::URLencode(id, TRUE))
}
url <- resolve_dataone("hash://sha256/c802190c43e02246da9c6c9c3f13a58f076cc6b77922f4d9766a3c6bdb1b52bd")
(Note that this example is run against the testing server, and so the uploaded data will not be accessible on the production node.)
These examples illustrate only how identifiers can be registered and resolved. Published data ought to meet the FAIR principles: Findable, Accessible, Interoperable, and Reusable. Registering an identifier in this way only makes it accessible. Using a recognized, open, standard serialization such as .csv promotes interoperability. To be findable and reusable, however, requires that appropriate metadata accompany the data. Such metadata files can refer to the content they describe by using the content identifiers proposed here. For iterative forecasting of ecologically relevant data, we recommend the EFI Standards extension of the Ecological Metadata Language (EML), https://github.com/eco4cast/EFIstandards.
Only for very large files do cryptographically strong algorithms such as sha256 require non-negligible computational effort (e.g. hashing a 10 GB file takes less than a minute on a laptop), and even then the cost represents a small fraction of the computational effort required for the actual analysis of the file.