---
title: "Introduction to idealstan"
author: "Robert Kubinec"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Introduction to idealstan}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE, cache=FALSE}
knitr::opts_chunk$set(echo = TRUE,warning=FALSE,fig.align = 'center',fig.width=6, fig.height=5)
library(idealstan)
library(dplyr)
library(ggplot2)
library(tidyr)
library(tinytable)
options(cmdstanr_warn_inits = FALSE)
```

## About `idealstan`

*Note: To report bugs with the package, please file an issue on the [GitHub page](https://github.com/saudiwin/idealstan/issues).*

**If you use this package, please cite the following:**

Kubinec, Robert. "Generalized Ideal Point Models for Time-Varying and Missing-Data Inference". Working Paper. .

**Note: At present, `idealstan` uses the `cmdstanr` package, which is not on CRAN and must be installed separately. Please see below for instructions.**

This package implements IRT (item response theory) ideal point models, which are models designed for situations in which actors make strategic choices that correlate with a unidimensional scale, such as the left-right axis in American politics, disagreement over product design in consumer choice, or psychometric scales in which the direction that items load on the latent scale is unknown. Compared to traditional IRT, ideal point models examine the polarizing influence of a set of items on a set of persons, and have similarities to models based on Euclidean latent spaces, such as multi-dimensional scaling. In fact, this package also implements a version of the latent space model for binary outcomes, which is an alternate formulation of an ideal point model.

The goal of `idealstan` is to offer a wide variety of ideal point models that can model missing data and time-varying ideal points, and that can incorporate a range of outcomes, including binary, count, continuous, and ordinal responses.
In addition, `idealstan` uses the Stan estimation engine to offer full and variational Bayesian inference for all models so that every model is estimated with uncertainty. Variational inference (specifically the [Pathfinder algorithm](https://mc-stan.org/docs/reference-manual/pathfinder.html)) provides informative starting values to ensure convergence, and for very large models, the possibility of estimating ideal points when full MCMC is impractical. However, the MCMC algorithm has the ability to parallelize within chains using multiple cores, which makes it possible to estimate much larger models with full Bayesian inference than was previously possible.

The approach to handling missing data in this package is to model cases where missing data is a function of a person's ideal point. In other words, the package will adjust estimates if missingness appears to be correlated with either high or low values of the latent trait. This general missing data adjustment can be usefully applied to many contexts in which a missing outcome is a function of the person's ideal point (i.e., people will tend to be present in the data when the item is far away or very close to their ideal point). If missingness does not appear to arise as a function of ideal points, the models will still incorporate missing data but will assume it is random conditional on each item.
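
The missing-data adjustment described above can be written as a two-part (hurdle) likelihood. What follows is a minimal sketch, not necessarily the package's exact parameterization: the symbols $r_{ij}$ (an indicator that person $i$'s response to item $j$ is missing), $\alpha_i$ (the ideal point), $\gamma_j$ and $\omega_j$ (item-level discrimination and intercept for missingness), and the logit link are all notation assumed here for illustration.

$$
\Pr(r_{ij} = 1) = \operatorname{logit}^{-1}(\gamma_j \alpha_i - \omega_j)
$$

$$
L(y_{ij}) =
\begin{cases}
\Pr(r_{ij} = 1) & \text{if } y_{ij} \text{ is missing} \\
\left(1 - \Pr(r_{ij} = 1)\right) f(y_{ij} \mid \alpha_i) & \text{if } y_{ij} \text{ is observed}
\end{cases}
$$

Here $f(\cdot)$ stands for whichever outcome model is chosen (binary, ordinal, count, and so on). Under this sketch, setting $\gamma_j = 0$ makes the probability of missingness depend only on the item intercept $\omega_j$, which corresponds to the case in which missingness is random conditional on each item.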

The package includes the following models:

| Model ID | Model Type | Response Type | Missing Data Adjustment |
|----|----|----|----|
| 1 | IRT 2-PL | Binary | No |
| 2 | IRT 2-PL | Binary | Yes |
| 3 | IRT rating scale | Ordinal | No |
| 4 | IRT rating scale | Ordinal | Yes |
| 5 | IRT graded response | Ordinal | No |
| 6 | IRT graded response | Ordinal | Yes |
| 7 | IRT 2-PL (Wordfish) | Poisson | No |
| 8 | IRT 2-PL (Wordfish) | Poisson | Yes |
| 9 | IRT 2-PL | Continuous (Normal) | No |
| 10 | IRT 2-PL | Continuous (Normal) | Yes |
| 11 | IRT 2-PL | Positive-Continuous (Log-Normal) | No |
| 12 | IRT 2-PL | Positive-Continuous (Log-Normal) | Yes |
| 13 | Latent Space | Binary | No |
| 14 | Latent Space | Binary | Yes |
| 15 | IRT 2-PL | Ordered Beta: Proportion (0 to 1 Inclusive) | No |
| 16 | IRT 2-PL | Ordered Beta: Proportion (0 to 1 Inclusive) | Yes |

: `idealstan` List of Specifications {#tbl-models}

In addition, all of these models can be estimated with either time-varying or static ideal points if a column of dates for each item is passed to the model function (see the [Time Series](https://saudiwin.github.io/idealstan/vignettes/Time_Series.html) vignette). This package implements a range of time series processes (random walk, AR(1), Gaussian processes, and splines).

The package also has extensive plotting functions via `ggplot2` for model parameters, particularly the legislator (person) ideal points (ability parameters).

This vignette demonstrates how to install the package and get it up and running. However, for more information about how to run and use the many options in the package, please see the package website at .

## Installation Instructions

To use `idealstan`, you first need to install both `cmdstanr`, an R package, and `cmdstan`, the underlying MCMC library. Unfortunately, `cmdstanr` is not yet available on CRAN.
Simply use the following command to install the package:

``` r
# we recommend running this in a fresh R session or restarting your current session
install.packages("cmdstanr", repos = c('https://stan-dev.r-universe.dev', getOption("repos")))
```

Then you will need to install `cmdstan`, which is the Stan engine. You can do so by loading `cmdstanr` and using the function `install_cmdstan`:

``` r
library(cmdstanr)
install_cmdstan()
```

There are some prerequisites to using `cmdstan`, as you need to be able to compile models on your machine. For example, on macOS you will first need to install Xcode from the Apple App Store. For more details, see complete installation instructions on this page: .

Assuming you install `cmdstan` using the functions provided in the package, please allow it to install in the default location.

## Simulation of Ordinal IRT with Missing Data

To begin with, we can simulate data from an ordinal ideal-point model in which there are three possible responses corresponding to a legislator voting: yes, abstain and no. An additional category is also simulated that indicates whether a legislator shows up to vote or is absent, which traditional IRT models would record as missing data and would drop from the estimation. This package can instead utilize missing data via a hurdle model in which the censoring of the vote/score data is estimated as a function of individual item/bill intercepts and discrimination parameters for the decision to be absent or present. In other words, if the missing data is a reflection of the person's ideal point, such as more conservative legislators refusing to show up to vote, then the model will make use of this missing data to infer additional information about the legislators' ideal points.

The function `id_sim_gen()` allows you to simulate data from any of the sixteen models currently implemented in `idealstan` (see previous list). To include missing data, specify the `inflate` option as `TRUE`.
For example, here we sample data from an ordinal graded response model:

```{r sim_data}
#| cache: false
ord_ideal_sim <- id_sim_gen(model='ordinal_grm',inflate = TRUE)
```

The vote/score matrix in the `idealdata` object `ord_ideal_sim` has legislators/persons in the rows and bills/items in the columns. The `outcome_disc` column has the simulated 3-category ordered outcome.

The function `id_estimate` will take this processed data and run the IRT ideal point model selected by its model ID. To specify the model type, either pass the model ID from @tbl-models as the `model_type` argument of the `id_estimate` function *or*, in the case of multiple models/item types, pass a column `model_id` to the `id_make` function that specifies the model ID for each row in the data. This latter option is useful when you have items of mixed types, such as binary, ordinal and/or continuous items.

The function `id_make` also includes the ability to incorporate hierarchical (person or item-level) covariates, as discussed below.

To speed up processing, all of the models shown in @tbl-models make use of multiple core parallel computation. To use this option, the number of available cores specified in the `ncores` option must exceed the number of MCMC chains `nchains`. `cmdstanr` will automatically assign cores by dividing the number of cores by the number of chains. In all of the examples in this vignette, I use a machine with 8 cores and estimate 2 chains, so there are 4 cores per chain. By default, `id_estimate` parallelizes over persons, although that can be changed to items with the `map_over_id` option (only works with static models).

The package has options for identification that are similar to other IRT packages in which the IDs of legislators/persons to constrain are specified to the `id_estimate` function.
For example, we can use the true values of the simulated legislators to constrain one legislator/person with the highest simulated ideal point and one legislator/person with the lowest ideal point. Each constrained parameter must be fixed to a specific value, preferably at either end of the ideal point spectrum, to identify the model. In particular, two pieces of information are necessary: a value for the high ideal point, and the difference between the high and low points.

In this example I pre-specify which parameters to constrain based on the simulated data as `restrict_ind_high` and `restrict_ind_low`, and use the actual values to pin the parameters to specific values with `fix_high` and `fix_low`.

```{r constrain_sim}
#| eval: false
# true (simulated) ideal points for each person
true_legis <- ord_ideal_sim@simul_data$true_person

# sort in both directions, keeping the original indices in $ix
high_leg <- sort(true_legis,decreasing = TRUE,index.return=TRUE)
low_leg <- sort(true_legis,index.return=TRUE)

# pin the highest and lowest persons to their true values
ord_ideal_est <- id_estimate(idealdata=ord_ideal_sim,
                             model_type=6,
                             fixtype='prefix',
                             restrict_ind_high = as.character(high_leg$ix[1]),
                             restrict_ind_low=as.character(low_leg$ix[1]),
                             fix_high = sort(true_legis,decreasing = TRUE)[1],
                             fix_low = sort(true_legis,decreasing = FALSE)[1],
                             id_refresh=500,
                             ncores=8,
                             nchains=2)
```

We can then check how well the Stan estimation engine was able to capture the "true" values used in the simulation by plotting the true ideal points relative to the estimated ones:

```{r check_true}
#| eval: false
id_plot_persons(ord_ideal_est,show_true = TRUE)
```

Given the small amount of data used to estimate the model, the imprecision with which the ideal points were recovered is not surprising. However, the uncertainty intervals generally include the true values, which indicates that the model recovers the estimates correctly even with substantial measurement error.

To automatically identify the model (that is, identify people to fix high or low), simply change the `fixtype` option to `'vb_full'`.
By default, the model will select the highest and lowest ideal points to constrain by running an approximation to the full posterior using `cmdstanr`'s `pathfinder()` function. While this method works, the exact rotation is not known a priori, and so it may produce a different result across multiple runs. Note that there will be two `pathfinder` runs: the first run identifies the parameters to constrain and the second is used to create starting values for the Hamiltonian Monte Carlo estimation.

For example, using our simulated data and identifying the model automatically with `'vb_full'`:

```{r restrict_auto}
#| eval: false
ord_ideal_est <- id_estimate(idealdata=ord_ideal_sim,
                             model_type=6,
                             id_refresh=2000,
                             fixtype="vb_full",
                             ncores=8,
                             nchains=2)
```

We can see from the plot of the Rhats, an MCMC convergence diagnostic, that all the Rhats are below 1.1, which is a good (though not perfect) sign that the model is fully identified:

```{r rhats}
#| eval: false
id_plot_rhats(ord_ideal_est)
id_plot_persons(ord_ideal_est,show_true = TRUE)
```

In general, it is always a good idea to check the Rhats before proceeding with further analysis. Identification of time-varying ideal point models can be more complicated and is discussed in the accompanying vignette.

As can be seen above, while the Pathfinder algorithm will usually identify *a* unique rotation of the ideal points without using any other prior information, it may not be the rotation that is theoretically interesting. For that reason, I recommend specifying persons or items to pin to specific values for applied use of the package, as in the constrained example above.

*For more information on how to use the package, and also for an empirical example, see the vignettes at the package website at .*
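
The rotation issue noted above can be illustrated with a few lines of base R. This is a minimal sketch under stated assumptions: it uses a plain binary 2-PL form with made-up parameter values, and the names `alpha`, `gamma`, `beta`, and `prob_yes` are hypothetical and not part of `idealstan`. It shows that flipping the sign of every ideal point and every discrimination parameter leaves the predicted probabilities unchanged, which is why some persons or items must be pinned to fix the direction of the scale.

```r
# Hypothetical toy values for illustration only (not idealstan output)
set.seed(42)
alpha <- rnorm(5)  # person ideal points
gamma <- rnorm(3)  # item discrimination parameters
beta  <- rnorm(3)  # item difficulty (intercept) parameters

# Probability of a "yes" for each person (rows) and item (columns)
# under a simple 2-PL form: logit^{-1}(gamma_j * alpha_i - beta_j)
prob_yes <- function(alpha, gamma, beta) {
  linear <- sweep(outer(alpha, gamma), 2, beta)  # subtract beta_j from column j
  plogis(linear)
}

original <- prob_yes(alpha, gamma, beta)
flipped  <- prob_yes(-alpha, -gamma, beta)  # reflect the latent scale

# The two parameter sets imply the same probabilities,
# so the data alone cannot distinguish them
all.equal(original, flipped)
```

Because both orientations fit the data equally well, fixing one person high and one person low, as with `restrict_ind_high` and `restrict_ind_low`, is what selects the substantively meaningful direction.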