Proper dataset documentation is crucial for reproducible research and effective data sharing. The {qtkit} package provides two main functions to help standardize and automate the documentation process:
- create_data_origin(): Creates standardized metadata about data(set) sources.
- create_data_dictionary(): Generates the scaffolding for detailed variable-level documentation, or can use AI to generate descriptions to be reviewed and updated as necessary.

Let's start by documenting the built-in mtcars dataset:
# Load the packages used in this section
library(qtkit) # create_data_origin(), create_data_dictionary()
library(fs)    # file_temp()
library(dplyr) # glimpse(), mutate(), pull()
library(readr) # write_csv(), read_csv()

# Create a temporary file for our documentation
origin_file <- file_temp(ext = "csv")
# Create the origin documentation template
origin_doc <- create_data_origin(
  file_path = origin_file,
  return = TRUE
)
#> Data origin file created at `file_path`.
# View the template
origin_doc |>
  glimpse()
#> Rows: 8
#> Columns: 2
#> $ attribute   <chr> "Resource name", "Data source", "Data sampling frame", "Da…
#> $ description <chr> "The name of the resource.", "URL, DOI, etc.", "Language, …The template provides fields for essential metadata. You can either open the CSV file in a spreadsheet editor or fill it out programmatically, as shown below.
Here’s how you might fill it out for mtcars:
origin_doc |>
  mutate(description = c(
    "Motor Trend Car Road Tests",
    "Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.",
    "US automobile market, passenger vehicles",
    "1973-74",
    "Built-in R dataset (.rda)",
    "Single data frame with 32 observations of 11 variables",
    "Public Domain",
    "Citation: Henderson and Velleman (1981)"
  )) |>
  write_csv(origin_file)
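If you want to confirm the metadata was saved as expected, you can read the completed file back in (a quick optional check; output omitted):
# Read the completed origin documentation back in to verify it
read_csv(origin_file, show_col_types = FALSE) |>
  glimpse()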
Next, create a basic data dictionary without AI assistance:
# Create a temporary file for our dictionary
dict_file <- file_temp(ext = "csv")
# Generate dictionary for iris dataset
iris_dict <- create_data_dictionary(
  data = iris,
  file_path = dict_file
)
# View the results
iris_dict |>
  glimpse()
#> Rows: 5
#> Columns: 4
#> $ variable    <chr> "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Widt…
#> $ name        <chr> NA, NA, NA, NA, NA
#> $ type        <chr> "numeric", "numeric", "numeric", "numeric", "factor"
#> $ description <chr> NA, NA, NA, NA, NA
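The name and description columns are left as NA for you to complete, either in a spreadsheet editor or programmatically. Here is a sketch of the programmatic route for iris, with the names and descriptions written by hand:
# Fill in human-readable names and descriptions,
# then write the completed dictionary back to disk
iris_dict |>
  mutate(
    name = c("Sepal Length", "Sepal Width", "Petal Length",
             "Petal Width", "Species"),
    description = c(
      "Sepal length in centimeters",
      "Sepal width in centimeters",
      "Petal length in centimeters",
      "Petal width in centimeters",
      "Iris species: setosa, versicolor, or virginica"
    )
  ) |>
  write_csv(dict_file)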
If you have an OpenAI API key, you can generate more detailed descriptions:
# Not run - requires API key
Sys.setenv(OPENAI_API_KEY = "your-api-key")
iris_dict_ai <- create_data_dictionary(
  data = iris,
  file_path = dict_file,
  model = "gpt-4",
  sample_n = 5
)

Example output might look like:
#> # A tibble: 2 × 4
#>   variable     name         type    description                       
#>   <chr>        <chr>        <chr>   <chr>                             
#> 1 Sepal.Length Sepal Length numeric Length of the sepal in centimeters
#> 2 Sepal.Width  Sepal Width  numeric Width of the sepal in centimeters

For larger datasets, you can use sampling and grouping, as sketched below.
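Here is a minimal sketch of what that might look like, using the larger diamonds dataset from ggplot2. The sample_n argument appears in the example above, but the grouping argument is an assumption about the interface, so check ?create_data_dictionary for the exact argument name:
# Not run - requires API key
# Sample a few rows per group so each cut grade is represented
# NOTE: the `grouping` argument name is assumed; see ?create_data_dictionary
diamonds_dict <- create_data_dictionary(
  data = ggplot2::diamonds,
  file_path = file_temp(ext = "csv"),
  model = "gpt-4",
  sample_n = 5,
  grouping = "cut"
)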
The {qtkit} package provides flexible tools for standardizing dataset
documentation. By combining create_data_origin() and
create_data_dictionary(), you can create comprehensive
documentation that enhances reproducibility and data sharing.
For more details on these and other functions, see the package documentation:
help(package = "qtkit")