---
title: "Download Voteview data in parallel"
vignette: >
  %\VignetteIndexEntry{Download Voteview data in parallel}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
knitr:
  opts_chunk:
    collapse: true
    comment: '#>'
    eval: false
---

```{r call library}
library(filibustr)
```

The Voteview functions have the power to download lots of data on many years of Congress. One downside of this power is that downloading many large datasets from the internet can be slow.

One way to speed up your data downloads is to **download data in parallel**. When you call a Voteview function to download data from multiple Congresses (i.e., when `length(congress) > 1`), `{filibustr}` will download data in parallel if you have set up that capability.

**Everything described below is a purely optional way to accelerate your data imports.** If you don't set up parallel computing processes, the Voteview functions will simply download data sequentially.

## Setting up for parallel downloads

Downloading data in parallel requires a short bit of setup at the beginning.

### Make sure the `{mirai}` and `{carrier}` packages are installed

Under the hood, the Voteview functions use `purrr::in_parallel()` for parallel downloads. `purrr::in_parallel()` depends on two packages (`{mirai}` and `{carrier}`) that are not otherwise used in `{filibustr}`, so you may not have them installed.

**To check whether you have the required versions of these packages installed, run this code.** It will prompt you to install any packages you're missing.

```{r check for parallel packages}
rlang::check_installed(c("carrier", "mirai"), version = c("0.3.0", "2.5.1"))
```

### Set parallel processes

**To download Voteview data in parallel, use `mirai::daemons()`** to create parallel processes (`{mirai}` calls these "daemons").
```{r set parallel processes}
# detect the number of cores available on your machine
parallel::detectCores()

# launch a specific number of processes, or
mirai::daemons(4)

# launch a process on all but one of the available cores
mirai::daemons(parallel::detectCores() - 1)
```

#### How many processes should I create?

In general, if you split the work up across more processes, the download will finish faster. Theoretically, N processes can finish the download up to N times faster.

At the same time, there can be diminishing returns to creating a large number of processes.

* First, **there is some overhead** involved with creating and communicating with parallel processes.
* Second, **consider the number of pieces of work.** Multi-Congress data downloads fetch one file per Congress. That file is the unit of work that the parallel processes can work on.
  * If you are downloading data on 5 Congresses but create 8 parallel processes, then the last 3 processes aren't doing anything.
  * Similarly, if you're downloading data on 12 Congresses, there's not much difference between 7, 8, and 9 processes.

Also, there is little benefit to creating more processes than the number of cores available on your machine (which you can see using `parallel::detectCores()`). A good rule of thumb (per the [purrr documentation](https://purrr.tidyverse.org/reference/in_parallel.html#daemons-settings)) is to **use (at most) one less than the number of cores on your machine**, leaving one core open for the main R process.

## Downloading data in parallel

Once you've set up your parallel processes, **just call the Voteview functions like normal**, and they will automatically download data in parallel!

**Reminder:** parallel processing only affects downloads where `length(congress) > 1`.
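Putting that guidance together, one way to pick a daemon count is to take the smaller of the number of Congresses you plan to download and the number of cores minus one. This is just a sketch of the rule of thumb above (the `congress` and `n_daemons` names are illustrative, not anything `{filibustr}` requires):

```{r choose a daemon count}
# Congresses you plan to download
congress <- 95:118

# no more daemons than pieces of work,
# and leave one core free for the main R process
n_daemons <- min(length(congress), parallel::detectCores() - 1)
mirai::daemons(n_daemons)
```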
```{r download Voteview data}
# download Voteview data from multiple Congresses
get_voteview_members(congress = 95:118)
get_voteview_rollcall_votes(congress = 95:118)
```

When you're done with all your parallel processing, you can close the daemon connections with `mirai::daemons(0)` if you'd like. Otherwise, the connections will close automatically when your R session ends.

```{r close daemons}
mirai::daemons(0)
```

## More details

See the documentation for `purrr::in_parallel()` and `{mirai}` (especially `mirai::daemons()`) for additional details on parallel processing.
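For example, `{mirai}` provides a `status()` helper that reports on your current daemon connections, which can be a quick sanity check that your setup worked before starting a large download:

```{r check daemon status}
# report on current daemon connections
mirai::status()
```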