---
title: "Getting started with regDIF"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with regDIF}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.width=7, fig.height=4)
```

This vignette introduces `regDIF()` to the general user by providing common use cases.

## Using LASSO regularization to evaluate measurement bias in a 2-parameter logistic IRT model.

In this example, data in `ida` were generated to mimic an integrative data analysis, where
data are pooled across multiple studies and the measurement model is evaluated for between-study and
within-study (e.g., gender, age) measurement bias. These data include 6 item responses (binary)
and 3 background characteristics -- namely, age (continuous, centered), gender (categorical, 
effect-coded), and study (categorical, effect-coded).

DIF was generated to be on items 2 (age, gender, study), 3 (age, gender, study), 4 (age), and 5
(gender, study), for both intercepts and slopes.

```{r}
library(regDIF)
head(ida)
```
The item response data must first be separated from the predictor data (background variables) before
running `regDIF()`. A single value of the tuning parameter, `tau = 2`, is then fit to the data.

```{r,results='hide'}
item.data <- regDIF::ida[,1:6]
pred.data <- regDIF::ida[,7:9]
fit <- regDIF(item.data, pred.data, tau = 2)
```
```{r}
summary(fit)
```
The `summary()` function shows that no DIF effects remain in the model. Only the latent variable
parameters and base item parameters, which were not penalized at all, are estimated to be non-zero. This is shown by using the `coef` method.
```{r}
coef(fit)
```

Now that the data have been properly specified in `regDIF`, a more thorough investigation of DIF is
warranted. The `regDIF()` function defaults to estimating 100 values of the tuning parameter,
starting with a value large enough to penalize all DIF effects to zero. However, for brevity, only
10 values of tau are specified with the `num.tau` argument and we reduce the tolerance for convergence using the `control` argument.

```{r,results='hide'}
fit2 <- regDIF(item.data, pred.data, num.tau = 10, control = list(tol = 1e-3))
```
```{r}
fit2
```

By printing the model object, 10 rows of results appear, one for each value of the tuning parameter.
The first thing to notice is that 4 rows are missing. This occurs because `regDIF()` automatically 
stops model-fitting when a small value of tau would produce a non-identified model. In focusing
attention to the BIC column, it is evident that the smallest value occurs well before the model
would be non-identified. This is an encouraging result. The `summary()` function may be used again,
which, with multiple values of the tuning parameter fit to the data, produces non-zero DIF effects 
corresponding to the model with the minimum value of BIC.

```{r}
summary(fit2)
```

A plot of the regularization path also shows the remaining DIF effects.

```{r}
plot(fit2)
```

To produce other model results, the `fit2` object contains lists of the impact (latent variable)
parameters, base (intercept and slope) item parameters, and DIF parameters for all values of the
tuning parameter. For instance, the impact parameters are printed below.

```{r}
fit2$impact
```

EAP scores and standard deviations may also be produced.

```{r}
lapply(fit2$eap, head)
```

Finally, when data include a large number of items, observations, and predictors, `regDIF()` can
run relatively slowly. An alternative approach, which yields much faster results, is to provide an 
observed proxy for the latent scores. In the case of binary data, this might be sum scores. Note
that using observed proxy scores is identical to performing a multivariate regression, where the
item responses are regressed on the proxy scores and background variables. (The proxy
scores are simultaneously regressed on the background variables as well.)

```{r,results='hide'}
fit3 <- regDIF(item.data, pred.data, prox.data = rowSums(item.data), num.tau = 20)
```
```{r}
summary(fit3)
```

The results show more DIF on both the intercepts (Items 3 and 6) and slopes (Items 2, 3, and 5).

## More modeling possibilities with regDIF.

In addition to LASSO, other penalty functions are possible with `regDIF()`. For instance, the
elastic net penalty combines LASSO and ridge functions, which is useful when many correlated
predictors are evaluated for DIF. The elastic net is controlled by a second tuning parameter, 
`alpha`, and defaults to `alpha = 1`, corresponding to the LASSO penalty. In contrast, `alpha = 0`
corresponds to the ridge penalty. When `alpha` is between 0 and 1, however, the elastic net is 
used to perform DIF selection. For brevity, observed proxy scores are used in all model fitting
below.

```{r,results='hide'}
fit_net <- regDIF(item.data, pred.data, prox.data = rowSums(item.data), num.tau = 20, alpha = .5)
```
```{r}
summary(fit_net)
```

The final elastic net results yield the same DIF effects as the LASSO results, although the amount
of penalization is greater for elastic net (i.e., larger `tau`).

Other penalty functions include the minimax concave penalty (MCP) and the group extensions of LASSO
and MCP, which penalize the intercept and slope DIF effects in tandem. The group LASSO function is 
shown below.

```{r,results='hide'}
fit_grp_mcp <- regDIF(item.data, pred.data, prox.data = rowSums(item.data), num.tau = 20, pen.type = "grp.mcp")
```
```{r}
summary(fit_grp_mcp)
```

Although the MCP results appear largely the same as the LASSO results, the group MCP function
included both the intercept and slope for each background variable remaining in the final model.

In summary, the regDIF R package provides a flexible implementation of using
regularization to identify DIF across multiple background characteristics.

Please reach out to [wbelzak@gmail.com](mailto:wbelzak@gmail.com) for any questions,
and remember to cite regDIF in your work. Thank you kindly!

```{r,echo=F}
citation("regDIF")
```