Simulating datasets is a powerful and varied tool when conducting unmarked
analyses. Writing our own code to simulate a dataset based on a given model is an excellent learning tool, and can help us test if a given model is generating the expected results. If we simulate a series of datasets based on a fitted model, and calculate a statistic from each of those fits, we can generate a distribution of the statistic - this is what the parboot
function does. This can be helpful, for example, when testing goodness of fit. Finally, simulation can be a useful component of power analysis when a closed-form equation for power is not available.
unmarked
provides two different ways of generating simulated datasets, depending on the stage we are at in the modeling process.
For (1), we simply call the simulate
method on our fitted model object and new dataset(s) are generated. This is the approach used by parboot
. In this vignette we will focus on (2), a more flexible approach to simulation, also using the simulate
method, that allows us to generate a dataset corresponding to any unmarked
model from scratch.
We will need to provide several pieces of information to simulate
in order to simulate a dataset from scratch in unmarked
.
unmarkedFrame
of the appropriate type defining the desired experimental designmodel
), if the unmarkedFrame
is used for multiple model typescoefs
) controlling the simulation, which correspond to the formula(s) specified earlier.The easiest way to demonstrate how to use simulate
is to look at an example: we’ll start with a simple one for occupancy.
unmarkedFrame
Suppose we want to simulate a single-season occupancy dataset in which site occupancy is affected by elevation. The first step is to create an unmarkedFrame
object of the appropriate type, which defines the experimental design and includes any covariates we want to use in the simulation. Since we want to simulate an occupancy dataset, we’ll create an unmarkedFrameOccu
.
The unmarkedFrameOccu
function takes three arguments: the observation matrix y
, the site covariates siteCovs
, and the observation-level covariates obsCovs
. The dimensions of y
define how many sites and replicate samples the study includes. We’ll create a blank y
matrix (i.e., filled with NA
s) of dimension 300 x 8, indicating we want our study to have 300 sites and 8 sampling occasions. The values you put in this y
matrix don’t matter, you can put anything in there you want as they’ll be overwritten with the simulated values later. It’s only used to define the number of sites and occasions.
Earlier we said we want to include an elevation covariate, so we’ll simulate the covariate now and add it to a data frame. We could create several covariates here, including factors, etc.
site_covs <- data.frame(elev = rnorm(M))
We’re not using any observation covariates, so we can now make the complete unmarkedFrameOccu
:
umf <- unmarkedFrameOccu(y = y, siteCovs = site_covs)
head(umf)
## Data frame representation of unmarkedFrame object.
## y.1 y.2 y.3 y.4 y.5 y.6 y.7 y.8 elev
## 1 NA NA NA NA NA NA NA NA -0.56047565
## 2 NA NA NA NA NA NA NA NA -0.23017749
## 3 NA NA NA NA NA NA NA NA 1.55870831
## 4 NA NA NA NA NA NA NA NA 0.07050839
## 5 NA NA NA NA NA NA NA NA 0.12928774
## 6 NA NA NA NA NA NA NA NA 1.71506499
## 7 NA NA NA NA NA NA NA NA 0.46091621
## 8 NA NA NA NA NA NA NA NA -1.26506123
## 9 NA NA NA NA NA NA NA NA -0.68685285
## 10 NA NA NA NA NA NA NA NA -0.44566197
Since unmarkedFrameOccu
is used by both the single-season occupancy model (occu
) and the Royle-Nichols occupancy model (occuRN
), we need to tell unmarked
which one to use.
model <- occu
Most unmarkedFrame
types in unmarked
are used by only one model fitting function, so this step is often unnecessary.
Take a look at the help file for occu
. When fitting a single-season occupancy model we need to provide, in addition to the data, the formula
argument defining the model structure. We’ll need to provide these same argument(s) to simulate
. Many fitting functions will have multiple required arguments, such as the mixture distribution to use, key functions, etc.
Here we specify a double right-hand-side formula as required by occu
, specifying an effect of elevation on occupancy.
form <- ~1~elev
The model structure, as defined by the formula above, implies a certain set of parameter/coefficient values (intercepts, slopes) we need to supply to simulate
. These need to be supplied as a named list, where each list element corresponds to one submodel (such as state
for occupancy and det
for detection). Each list element is a numeric vector of the required parameter values. It can be tricky to figure out the structure of this list, so simulate
allows you to not include it at first, in which case the function will return a template for you to fill in.
simulate(umf, model = model, formula = form)
## coefs should be a named list of vectors, with the following structure
## (replace 0s with your values):
##
## $state
## (Intercept) elev
## 0 0
##
## $det
## (Intercept)
## 0
## Error: Specify coefs argument as shown above
We need to supply a list with two elements state
and det
. The state
element contains two values, the intercept and the slope corresponding to elevation. The det
element contains only the intercept since we have no covariates on detection. Note that all values supplied in this list must be on the inverse link scale, which will depend on the specific submodel used. So for example, a value of 0 for det
implies a detection probability of 0.5, because we’re using the logit link function.
plogis(0)
## [1] 0.5
Now let’s make our own coefs
list:
Here we’re setting a negative effect of elevation on occupancy.
We now have all the pieces to simulate a dataset.
out <- simulate(umf, model = occu, formula = ~1~elev, coefs = cf)
## Assumed parameter order for state:
## (Intercept), elev
## Assumed parameter order for det:
## (Intercept)
The result is always a list of unmarkedFrame
s. By default, we just get one, but we can get more with the nsim
argument.
head(out[[1]])
## Data frame representation of unmarkedFrame object.
## y.1 y.2 y.3 y.4 y.5 y.6 y.7 y.8 elev
## 1 1 1 1 1 1 1 1 1 -0.56047565
## 2 0 0 0 0 0 0 0 0 -0.23017749
## 3 0 0 0 0 0 0 0 0 1.55870831
## 4 0 0 0 0 0 0 0 0 0.07050839
## 5 0 0 0 0 0 0 0 0 0.12928774
## 6 1 0 0 0 0 0 0 0 1.71506499
## 7 0 0 0 0 0 0 0 0 0.46091621
## 8 0 0 0 0 0 0 0 0 -1.26506123
## 9 1 1 1 0 1 0 1 0 -0.68685285
## 10 1 0 0 1 1 0 0 0 -0.44566197
The simulated unmarkedFrame
now contains y
values and is ready to use.
As a quick check, let’s fit a model to our simulated dataset.
occu(~1~elev, data = out[[1]])
##
## Call:
## occu(formula = ~1 ~ elev, data = out[[1]])
##
## Occupancy:
## Estimate SE z P(>|z|)
## (Intercept) -0.146 0.120 -1.22 0.223241
## elev -0.572 0.136 -4.20 0.000027
##
## Detection:
## Estimate SE z P(>|z|)
## -0.038 0.0612 -0.621 0.535
##
## AIC: 1929.538
We get out roughly the same parameters that we put in, as expected.
The gdistremoval
function fits the model of Amundson et al. (2014), which estimates abundance using a combination of distance sampling and removal sampling data. When simulating a dataset based on this model, we have to provide several additional pieces of information related to the structure of the distance and removal sampling analyses.
unmarkedFrame
First create the appropriate type of unmarkedFrame
, which is unmarkedFrameGDR
. There’s two y-matrices: one for distance sampling and one for removal sampling. We’ll create a dataset with 4 distance bins and 5 removal periods.
set.seed(123)
M <- 100
Jdist <- 4
Jrem <- 5
y_dist <- matrix(NA, M, Jdist)
y_rem <- matrix(NA, M, Jrem)
We’ll create an elevation site covariate and a wind observation covariate. Observation-level covariates are only used by the removal part of the model, so they should have the same number of values as y_rem
.
site_covs <- data.frame(elev = rnorm(M))
obs_covs <- data.frame(wind = rnorm(M * Jrem))
Finally we can create the unmarkedFrameGDR
. We’ll also need to specify the distance bins and the units for the distance part of the model here. See ?unmarkedFrameGDR
for more information.
umf <- unmarkedFrameGDR(yRem = y_rem, yDist = y_dist, siteCovs = site_covs, obsCovs = obs_covs,
dist.breaks = c(0,25,50,75,100), unitsIn = 'm')
head(umf)
## Data frame representation of unmarkedFrame object.
## yDist.1 yDist.2 yDist.3 yDist.4 yRem.1 yRem.2 yRem.3 yRem.4 yRem.5
## 1 NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA NA NA
## 7 NA NA NA NA NA NA NA NA NA
## 8 NA NA NA NA NA NA NA NA NA
## 9 NA NA NA NA NA NA NA NA NA
## 10 NA NA NA NA NA NA NA NA NA
## elev wind.1 wind.2 wind.3 wind.4 wind.5
## 1 -0.56047565 -0.71040656 0.2568837 -0.24669188 -0.34754260 -0.95161857
## 2 -0.23017749 -0.04502772 -0.7849045 -1.66794194 -0.38022652 0.91899661
## 3 1.55870831 -0.57534696 0.6079643 -1.61788271 -0.05556197 0.51940720
## 4 0.07050839 0.30115336 0.1056762 -0.64070601 -0.84970435 -1.02412879
## 5 0.12928774 0.11764660 -0.9474746 -0.49055744 -0.25609219 1.84386201
## 6 1.71506499 -0.65194990 0.2353866 0.07796085 -0.96185663 -0.07130809
## 7 0.46091621 1.44455086 0.4515041 0.04123292 -0.42249683 -2.05324722
## 8 -1.26506123 1.13133721 -1.4606401 0.73994751 1.90910357 -1.44389316
## 9 -0.68685285 0.70178434 -0.2621975 -1.57214416 -1.51466765 -1.60153617
## 10 -0.44566197 -0.53090652 -1.4617556 0.68791677 2.10010894 -1.28703048
gdistremoval
Looking at ?gdistremoval
, required arguments include lambdaformula
, removalformula
, and distanceformula
. We need to set these formula values to control the simulation. We’ll also use the negative binomial distribution for abundance.
lambdaformula <- ~elev # elevation effect on abundance
removalformula <- ~wind # wind effect on removal p
distanceformula <- ~1
mixture <- "NB"
As in the previous section, we’ll leave the coefs
argument blank at first and get the correct output structure.
simulate(umf, lambdaformula=~elev, removalformula=~wind, distanceformula=~1,
mixture="NB")
## coefs should be a named list of vectors, with the following structure
## (replace 0s with your values):
##
## $lambda
## (Intercept) elev
## 0 0
##
## $alpha
## alpha
## 0
##
## $dist
## (Intercept)
## 0
##
## $rem
## (Intercept) wind
## 0 0
## Error: Specify coefs argument as shown above
We need to set two values for the abundance (lambda
) model on the log scale, one for dist
which represents the distance function sigma parameter (log scale), one for the negative binomial dispersion parameter alpha
(log scale), and two for the removal detection probability model (logit scale).
We’ll pick the (relatively arbitrary) values below:
Now provide everything to simulate
. Note we don’t need to provide the model
argument because unmarkedFrameGDR
is used for only one fitting function (gdistremoval
).
We’ll simulate 2 datasets.
out <- simulate(umf, lambdaformula=~elev, removalformula=~wind, distanceformula=~1,
coefs=cf, mixture="NB", nsim=2)
## Assumed parameter order for lambda:
## (Intercept), elev
## Assumed parameter order for alpha:
## alpha
## Assumed parameter order for dist:
## (Intercept)
## Assumed parameter order for rem:
## (Intercept), wind
lapply(out, head)
## [[1]]
## Data frame representation of unmarkedFrame object.
## yDist.1 yDist.2 yDist.3 yDist.4 yRem.1 yRem.2 yRem.3 yRem.4 yRem.5
## 1 0 1 0 0 1 0 0 0 0
## 2 1 0 1 2 1 1 1 0 1
## 3 4 6 7 3 7 3 5 2 3
## 4 1 0 1 0 0 0 0 1 1
## 5 0 0 1 0 1 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0
## 7 3 3 0 1 1 0 1 3 2
## 8 0 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0 0
## elev wind.1 wind.2 wind.3 wind.4 wind.5
## 1 -0.56047565 -0.71040656 0.2568837 -0.24669188 -0.34754260 -0.95161857
## 2 -0.23017749 -0.04502772 -0.7849045 -1.66794194 -0.38022652 0.91899661
## 3 1.55870831 -0.57534696 0.6079643 -1.61788271 -0.05556197 0.51940720
## 4 0.07050839 0.30115336 0.1056762 -0.64070601 -0.84970435 -1.02412879
## 5 0.12928774 0.11764660 -0.9474746 -0.49055744 -0.25609219 1.84386201
## 6 1.71506499 -0.65194990 0.2353866 0.07796085 -0.96185663 -0.07130809
## 7 0.46091621 1.44455086 0.4515041 0.04123292 -0.42249683 -2.05324722
## 8 -1.26506123 1.13133721 -1.4606401 0.73994751 1.90910357 -1.44389316
## 9 -0.68685285 0.70178434 -0.2621975 -1.57214416 -1.51466765 -1.60153617
## 10 -0.44566197 -0.53090652 -1.4617556 0.68791677 2.10010894 -1.28703048
##
## [[2]]
## Data frame representation of unmarkedFrame object.
## yDist.1 yDist.2 yDist.3 yDist.4 yRem.1 yRem.2 yRem.3 yRem.4 yRem.5
## 1 0 1 0 0 1 0 0 0 0
## 2 0 1 0 0 0 1 0 0 0
## 3 2 1 2 1 4 2 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 1 1 0 1 1 1 0 1 0
## 6 0 0 0 0 0 0 0 0 0
## 7 1 1 1 0 1 0 2 0 0
## 8 0 0 0 0 0 0 0 0 0
## 9 0 0 0 1 1 0 0 0 0
## 10 0 0 0 0 0 0 0 0 0
## elev wind.1 wind.2 wind.3 wind.4 wind.5
## 1 -0.56047565 -0.71040656 0.2568837 -0.24669188 -0.34754260 -0.95161857
## 2 -0.23017749 -0.04502772 -0.7849045 -1.66794194 -0.38022652 0.91899661
## 3 1.55870831 -0.57534696 0.6079643 -1.61788271 -0.05556197 0.51940720
## 4 0.07050839 0.30115336 0.1056762 -0.64070601 -0.84970435 -1.02412879
## 5 0.12928774 0.11764660 -0.9474746 -0.49055744 -0.25609219 1.84386201
## 6 1.71506499 -0.65194990 0.2353866 0.07796085 -0.96185663 -0.07130809
## 7 0.46091621 1.44455086 0.4515041 0.04123292 -0.42249683 -2.05324722
## 8 -1.26506123 1.13133721 -1.4606401 0.73994751 1.90910357 -1.44389316
## 9 -0.68685285 0.70178434 -0.2621975 -1.57214416 -1.51466765 -1.60153617
## 10 -0.44566197 -0.53090652 -1.4617556 0.68791677 2.10010894 -1.28703048
As a check, we’ll fit the same model used for simulation to one of the datasets.
gdistremoval(lambdaformula=~elev, removalformula=~wind, distanceformula=~1, data=out[[1]],
mixture="NB")
##
## Call:
## gdistremoval(lambdaformula = ~elev, removalformula = ~wind, distanceformula = ~1,
## data = out[[1]], mixture = "NB")
##
## Abundance:
## Estimate SE z P(>|z|)
## (Intercept) 1.651 0.152 10.89 1.28e-27
## elev 0.887 0.142 6.25 4.16e-10
##
## Dispersion:
## Estimate SE z P(>|z|)
## 0.118 0.242 0.486 0.627
##
## Distance:
## Estimate SE z P(>|z|)
## 3.89 0.053 73.4 0
##
## Removal:
## Estimate SE z P(>|z|)
## (Intercept) -0.901 0.141 -6.38 1.74e-10
## wind -0.373 0.088 -4.24 2.20e-05
##
## AIC: 1162.882
Looks good.
The simulate
function provides a flexible tool for simulating data from any model in unmarked
. These datasets can be used for a variety of purposes, such as for teaching examples, testing models, or developing new tools that work with unmarked
. Additionally, simulating datasets is a key component of the power analysis workflow in unmarked
- see the power analysis vignette for more examples.