---
title: "Working with RCDF Files in R"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{rcdf}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
## Introduction
The `rcdf` package is a toolkit for securely working with RCDF (encrypted Parquet) files in R. RCDF is a custom data format that provides encryption and metadata management for sensitive datasets. With `rcdf`, you can read, write, and export data stored in this format.
This vignette will walk you through the key features of the package, including how to encrypt and save your data in RCDF format, how to decrypt and read RCDF files, and how to export data to other common formats.
## Installation
To use the `rcdf` package, you’ll need to install it first. You can install the package directly from GitHub using the `devtools` package:
```{r, eval=FALSE}
# Install the package from GitHub
devtools::install_github("yng-me/rcdf")
```
Once installed, you can load the package and start working with RCDF files.
```{r, eval=FALSE}
library(rcdf)
```
## Writing data to RCDF format
The core function for writing data to the RCDF format is `write_rcdf()`. This function encrypts your data using AES encryption, generates encrypted metadata for version control using RSA encryption, and saves the data as encrypted Parquet files inside a zip archive. This ensures that the data is stored securely and can only be decrypted using the correct key.
**Usage**:
```{r, eval=FALSE}
write_rcdf(data, path, pub_key, ..., metadata = list())
```
**Parameters**:
- `data`: A list of data frames or tables to be written to RCDF format. Each element of the list is stored as a separate table.
- `path`: The path where the RCDF file will be written. The file will be saved with a `.rcdf` extension if not already specified.
- `pub_key`: The public RSA key used to encrypt the AES encryption keys.
- `...`: Additional arguments passed to helper functions if needed.
- `metadata`: A list of metadata to be included in the RCDF file. Can contain system information or other relevant details.
```{r, eval=FALSE}
# Sample data (list of data frames)
data <- rcdf_list()
data$table1 <- data.frame(x = 1:10, y = letters[1:10])
data$table2 <- data.frame(a = rnorm(10), b = rnorm(10))
# Sample public RSA key (for encryption)
pub_key <- file.path(system.file("extdata", package = "rcdf"), "sample-public-key.pem")
# Write the data to an RCDF file
write_rcdf(data = data, path = "path/to/rcdf_file.rcdf", pub_key = pub_key)
```
In this example:
- `data` is a list containing two data frames. These will be encrypted and saved as separate Parquet files within the RCDF.
- `pub_key` is the RSA public key used to encrypt the AES keys. The AES keys are used for encrypting the data in a fast and secure manner.
The `write_rcdf()` function creates a zip archive containing the encrypted Parquet files and metadata, then saves it to `path`.
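The two-layer scheme described above (AES for the data, RSA for the AES key) can be sketched with the `openssl` package. This is a minimal illustration under assumptions, not the internals of `write_rcdf()`: the use of `openssl`, the CBC mode, the 256-bit key size, and the serialized data frame standing in for a Parquet file are all choices made for this example.

```{r, eval=FALSE}
library(openssl)

# Hypothetical key pair for the demo; real workflows load keys from PEM files
private_key <- rsa_keygen(bits = 2048)
public_key <- private_key$pubkey

# Stand-in payload: a serialized data frame instead of a Parquet file
df <- data.frame(x = 1:3, y = letters[1:3])
payload <- serialize(df, connection = NULL)

# 1. Encrypt the payload with a freshly generated AES key
aes_key <- aes_keygen(length = 32)
iv <- rand_bytes(16)
encrypted_payload <- aes_cbc_encrypt(payload, key = aes_key, iv = iv)

# 2. Encrypt ("wrap") the AES key with the RSA public key
wrapped_key <- rsa_encrypt(aes_key, pubkey = public_key)

# Decryption reverses both steps and needs the RSA private key
unwrapped_key <- rsa_decrypt(wrapped_key, key = private_key)
restored <- unserialize(
  aes_cbc_decrypt(encrypted_payload, key = unwrapped_key, iv = iv)
)
identical(restored, df)
```

Only the small AES key goes through RSA, which is why this pattern stays fast even for large tables.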
## Reading and decrypting RCDF data
To read and decrypt an RCDF file, you can use the `read_rcdf()` function. This function extracts the encrypted Parquet files from the RCDF archive, decrypts them using the provided decryption key, and loads the data back into R as an RCDF object.
**Usage**:
```{r, eval=FALSE}
read_rcdf(path, decryption_key, ..., password = NULL, metadata = NULL)
```
**Parameters**:
- `path`: A string specifying the path to the RCDF archive (zip file).
- `decryption_key`: The key used to decrypt the RCDF contents. This can be an RSA or AES key, depending on how the RCDF was encrypted.
- `...`: Additional parameters passed to other functions, if needed.
- `password`: A password used for RSA decryption (optional).
- `metadata`: An optional metadata object containing data dictionaries and value sets. This metadata is applied to the data if provided.
```{r, eval=FALSE}
# Using sample RCDF data
dir <- system.file("extdata", package = "rcdf")
rcdf_path <- file.path(dir, 'mtcars.rcdf')
private_key <- file.path(dir, 'sample-private-key.pem')
rcdf_data <- read_rcdf(path = rcdf_path, decryption_key = private_key)
rcdf_data
# Using encrypted/password protected private key
rcdf_path_pw <- file.path(dir, 'mtcars-pw.rcdf')
private_key_pw <- file.path(dir, 'sample-private-key-pw.pem')
pw <- '1234'
rcdf_data_with_pw <- read_rcdf(
  path = rcdf_path_pw,
  decryption_key = private_key_pw,
  password = pw
)
rcdf_data_with_pw
```
In this example:
- `path` is the path to the RCDF file that contains the encrypted data.
- `decryption_key` is the key used to decrypt the AES keys and Parquet files. If the RCDF was encrypted using RSA, you’ll need the private RSA key to decrypt it.
The `read_rcdf()` function returns an RCDF object, which is essentially a list of decrypted Parquet files (one for each data frame in the original data) along with metadata about the file.
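The `password` argument matters when the private key file is itself encrypted. As a rough sketch of what such a key looks like, the following uses the `openssl` package (an assumption; the vignette does not say how the sample keys were created) to write a password-protected PEM file and read it back. The password and the temporary file are placeholders.

```{r, eval=FALSE}
library(openssl)

# Generate a throwaway RSA key pair for illustration only
key <- rsa_keygen(bits = 2048)

# Write the private key to a PEM file protected by a password
key_path <- tempfile(fileext = ".pem")
write_pem(key, path = key_path, password = "1234")

# Reading the key back requires the same password
loaded <- read_key(key_path, password = "1234")

# Both keys share the same public half
same_key <- identical(write_pem(loaded$pubkey), write_pem(key$pubkey))
same_key
```

A key file written this way is the kind of input for which `read_rcdf()` would need both `decryption_key` and `password`.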
## Exporting data to other formats
Once the data has been decrypted and read into R, you can export it to other formats using `write_rcdf_as()` or the `write_rcdf_*()` family of functions. These functions support a variety of common formats, including CSV, TSV, JSON, Parquet, Excel, Stata, SPSS, and SQLite.
### Exporting data to CSV format
The `write_rcdf_csv()` function allows you to export data stored in an RCDF object to CSV files. This is useful when you need to share or process the data in a non-encrypted, readable format.
**Usage**:
```{r, eval=FALSE}
write_rcdf_csv(data, path, ..., parent_dir = NULL)
```
**Parameters**:
- `data`: The RCDF object that contains the decrypted data. This is the data you obtained from calling `read_rcdf()` or other decryption methods.
- `path`: The target directory or file where the CSV files will be saved.
- `...`: Additional arguments passed to the `write.csv()` function for customizing the CSV export (e.g., row names or missing-value handling). Note that `write.csv()` ignores attempts to change the delimiter.
- `parent_dir`: An optional parent directory to be included in the path where the files will be written.
```{r, eval=FALSE}
write_rcdf_csv(data = rcdf_data, path = "path/to/output", row.names = FALSE)
```
This will save each table in the RCDF object as a separate CSV file in the specified directory.
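As a rough mental model (not the package's actual implementation), the export amounts to looping over the tables and forwarding `...` to `write.csv()`. The helper name `export_tables_csv()` below is hypothetical and uses only base R.

```{r, eval=FALSE}
# Hypothetical helper: write each table in a named list to its own CSV file
export_tables_csv <- function(tables, dir, ...) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  paths <- file.path(dir, paste0(names(tables), ".csv"))
  for (i in seq_along(tables)) {
    # Extra arguments such as row.names = FALSE go straight to write.csv()
    utils::write.csv(tables[[i]], file = paths[i], ...)
  }
  invisible(paths)
}

tables <- list(
  table1 = data.frame(x = 1:3, y = letters[1:3]),
  table2 = data.frame(a = c(0.5, 1.5))
)
out_dir <- file.path(tempdir(), "rcdf-csv-demo")
csv_paths <- export_tables_csv(tables, out_dir, row.names = FALSE)

basename(csv_paths)
readLines(csv_paths[1], n = 1)  # header row without a row-name column
```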
### Exporting data to TSV format
The `write_rcdf_tsv()` function is similar to the CSV export function but writes the data as tab-separated values (TSV) files.
**Usage**:
```{r, eval=FALSE}
write_rcdf_tsv(data, path, ..., parent_dir = NULL)
```
**Parameters**:
- `data`: The decrypted RCDF object containing the data to export.
- `path`: The target directory or file for the output TSV files.
- `...`: Additional arguments for customizing the TSV export passed to the `write.table()` function (e.g., setting delimiters, handling row names).
- `parent_dir`: An optional parent directory to be included in the path where the files will be written.
```{r, eval=FALSE}
write_rcdf_tsv(data = rcdf_data, path = "path/to/output", row.names = FALSE)
```
This function will save each data frame in the RCDF object as a separate TSV file in the target location.
### Exporting data to JSON format
The `write_rcdf_json()` function allows you to export the decrypted RCDF data to JSON format. This is useful when working with APIs or other systems that require data in JSON.
**Usage**:
```{r, eval=FALSE}
write_rcdf_json(data, path, ..., parent_dir = NULL)
```
**Parameters**:
- `data`: The decrypted RCDF object.
- `path`: The target directory or file for saving the JSON files.
- `...`: Additional arguments to customize the JSON export passed to `jsonlite::toJSON()` (such as specifying indentation or compactness of the JSON output).
- `parent_dir`: An optional parent directory to be included in the path where the files will be written.
```{r, eval=FALSE}
write_rcdf_json(data = rcdf_data, path = "path/to/output", pretty = TRUE)
```
This will convert each data frame in the RCDF object into a separate JSON file and save them in the specified directory. The `pretty = TRUE` option ensures that the output JSON files are human-readable with proper indentation.
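To see what `pretty = TRUE` changes, the sketch below calls `jsonlite::toJSON()` directly on a small, made-up data frame. It only illustrates the serializer that receives the extra arguments, not the exporter itself.

```{r, eval=FALSE}
library(jsonlite)

df <- data.frame(x = 1:2, y = c("a", "b"))

# Compact output: a single line
compact <- toJSON(df)

# Pretty output: indented across multiple lines
pretty <- toJSON(df, pretty = TRUE)

# Both encode the same records
same <- identical(fromJSON(compact), fromJSON(pretty))
same
```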
### Exporting data to Parquet format
The `write_rcdf_parquet()` function exports the decrypted data back into the Parquet format. Parquet is a columnar storage format that is highly efficient for big data processing.
**Usage**:
```{r, eval=FALSE}
write_rcdf_parquet(data, path, ..., parent_dir = NULL)
```
**Parameters**:
- `data`: The decrypted RCDF object.
- `path`: The directory or file path where the Parquet files will be saved.
- `...`: Additional arguments passed to the `write_parquet()` function for customization, such as specifying compression type.
- `parent_dir`: An optional parent directory to be included in the path where the files will be written.
```{r, eval=FALSE}
write_rcdf_parquet(data = rcdf_data, path = "path/to/output")
```
This function will write each data frame in the RCDF object into separate Parquet files, storing them in the specified directory.
### Exporting data to Excel format
The `write_rcdf_xlsx()` function is used to export the decrypted RCDF data to Excel (.xlsx) format. It’s helpful when sharing data with users who prefer spreadsheet software.
**Usage**:
```{r, eval=FALSE}
write_rcdf_xlsx(data, path, ..., parent_dir = NULL)
```
**Parameters**:
- `data`: The decrypted RCDF object.
- `path`: The directory or file path where the Excel file will be saved.
- `...`: Additional arguments to customize the Excel file export in the `openxlsx` package.
- `parent_dir`: An optional parent directory to be included in the path where the files will be written.
```{r, eval=FALSE}
write_rcdf_xlsx(data = rcdf_data, path = "path/to/output.xlsx", sheetName = "Sheet1")
```
### Exporting data to Stata format
The `write_rcdf_dta()` function allows you to export the data to Stata’s .dta file format. This is useful for users who need to work with the data in Stata.
**Usage**:
```{r, eval=FALSE}
write_rcdf_dta(data, path, ..., parent_dir = NULL)
```
**Parameters**:
- `data`: The decrypted RCDF object.
- `path`: The path where the Stata .dta file will be saved.
- `...`: Additional arguments passed to the `write.dta()` function (e.g., specifying version of Stata).
- `parent_dir`: An optional parent directory to be included in the path where the files will be written.
```{r, eval=FALSE}
write_rcdf_dta(data = rcdf_data, path = "path/to/output")
```
### Exporting data to SPSS format
The `write_rcdf_sav()` function is for exporting the decrypted RCDF data to SPSS’s .sav file format.
**Usage**:
```{r, eval=FALSE}
write_rcdf_sav(data, path, ..., parent_dir = NULL)
```
**Parameters**:
- `data`: The decrypted RCDF object.
- `path`: The path where the .sav file will be saved.
- `...`: Additional arguments for customizing the SPSS file export.
- `parent_dir`: An optional parent directory to be included in the path where the files will be written.
```{r, eval=FALSE}
write_rcdf_sav(data = rcdf_data, path = "path/to/output")
```
### Exporting data to SQLite database format
The `write_rcdf_sqlite()` function allows you to export the decrypted RCDF data to an SQLite database (with .db extension). Each data frame is saved as a table within the SQLite database.
**Usage**:
```{r, eval=FALSE}
write_rcdf_sqlite(data, path, ..., parent_dir = NULL)
```
**Parameters**:
- `data`: The decrypted RCDF object.
- `path`: The path where the SQLite database file will be created.
- `...`: Additional arguments for customizing the SQLite export.
- `parent_dir`: An optional parent directory to be included in the path where the files will be written.
```{r, eval=FALSE}
write_rcdf_sqlite(data = rcdf_data, path = "path/to/output")
```
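Conceptually, storing one table per data frame looks like the following `DBI` sketch. The use of `DBI` and `RSQLite` is an assumption for illustration; the vignette does not state which packages `write_rcdf_sqlite()` relies on.

```{r, eval=FALSE}
library(DBI)

tables <- list(
  table1 = data.frame(x = 1:3, y = letters[1:3]),
  table2 = data.frame(a = c(0.5, 1.5))
)

# Create (or open) a SQLite database file
db_path <- tempfile(fileext = ".db")
con <- dbConnect(RSQLite::SQLite(), db_path)

# Write each data frame as a table named after its list element
for (name in names(tables)) {
  dbWriteTable(con, name, tables[[name]])
}

table_names <- sort(dbListTables(con))
row_count <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM table1")$n

dbDisconnect(con)
```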
### Exporting data to multiple formats simultaneously
The `write_rcdf_as()` function allows you to export decrypted RCDF data into multiple file formats simultaneously.
**Usage**:
```{r, eval=FALSE}
write_rcdf_as(data, path, formats, ...)
```
**Parameters**:
- `data`: A named list or RCDF object. Each element should be a table or tibble-like object (typically a `dbplyr` or `dplyr` table).
- `path`: The target directory where output files should be saved.
- `formats`: A character vector of file formats to export to. Supported formats include: `"csv"`, `"tsv"`, `"json"`, `"parquet"`, `"xlsx"`, `"dta"`, `"sav"`, and `"sqlite"`.
- `...`: Additional arguments passed to the respective writer functions.
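The idea of fanning one set of tables out to several writers can be sketched in base R. The `writers` lookup and the `export_as()` helper are hypothetical and cover only CSV and TSV; they are not how `write_rcdf_as()` is implemented.

```{r, eval=FALSE}
# Hypothetical lookup from format name to a writer function
writers <- list(
  csv = function(df, out_file) {
    utils::write.csv(df, out_file, row.names = FALSE)
  },
  tsv = function(df, out_file) {
    utils::write.table(df, out_file, sep = "\t", row.names = FALSE, quote = FALSE)
  }
)

# Hypothetical helper: write every table once per requested format
export_as <- function(tables, dir, formats) {
  unknown <- setdiff(formats, names(writers))
  if (length(unknown) > 0) {
    stop("Unsupported format(s): ", paste(unknown, collapse = ", "))
  }
  written <- character(0)
  for (fmt in formats) {
    # One subdirectory per format keeps the outputs apart
    fmt_dir <- file.path(dir, fmt)
    dir.create(fmt_dir, recursive = TRUE, showWarnings = FALSE)
    for (name in names(tables)) {
      out_file <- file.path(fmt_dir, paste0(name, ".", fmt))
      writers[[fmt]](tables[[name]], out_file)
      written <- c(written, out_file)
    }
  }
  invisible(written)
}

tables <- list(table1 = data.frame(x = 1:3, y = letters[1:3]))
written <- export_as(
  tables,
  file.path(tempdir(), "rcdf-multi-demo"),
  formats = c("csv", "tsv")
)
basename(written)
```

With the package itself, a call following the signature above would look like `write_rcdf_as(data = rcdf_data, path = "path/to/output", formats = c("csv", "xlsx"))`.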