ADMB Framework : Synthetic Data

Here we describe how synthetic data is generated, which is useful when testing various models and admb code.

The Model

The ADMB framework is first tested on synthetic data. In the first step, we start with a simple model, which is the same (simple) model used for real data:

N_{i,t} = (N_{i,t-1} - C_{i,t-1}) e^{-M}   [1]

f(N_{i,0}, q_k, s, M) = \sum_{i,t,k} \left\{ -\frac{1}{2s^2} \left[ \ln(I_{i,t,k}) - \ln(q_k N_{i,t}) \right]^2 - \ln(s) \right\}   [2]

We generate synthetic data using the parameters in Table 1. Next, we attempt to recover the parameters using the objective function defined in [2].

Table 1: Data model parameters

^ Parameter ^ Description ^ Dimension ^
| N_{i,0} | Recruitment for cohort i | m-by-1 |
| M | Natural mortality | Scalar |
| C_{i,t} | Catch of individuals in cohort i, at age t | m-by-(maxage+1) |
| I_{i,t,k} | Survey index for cohort i, at age t, by survey k | At most m-by-(maxage+1)-by-num_surveys |
| s | Standard deviation for I | Scalar |
| q_k | Catchability of survey k | 1-by-num_surveys |
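
To make the estimation problem concrete, here is a minimal MATLAB/Octave sketch of equations [1] and [2], written directly from Table 1. It is illustrative only: the function name and the convention of minimizing the negative of [2] are ours, and the actual ADMB template may organize things differently.

% Minimal sketch of equations [1] and [2] (illustrative only; variable
% names follow Table 1, not the actual ADMB template).
function f = negloglik_sketch(N0, q, s, M, C, I)
  % N0: m-by-1, C: m-by-(maxage+1), I: m-by-(maxage+1)-by-num_surveys
  % (NaN for missing indices), q: 1-by-num_surveys, s and M: scalars.
  [m, nages, nsurv] = size(I);
  N = zeros(m, nages);
  N(:,1) = N0(:);
  for t = 2:nages                              % equation [1]
    N(:,t) = (N(:,t-1) - C(:,t-1)) * exp(-M);
  end
  f = 0;                                       % negative of [2], to be minimized
  for k = 1:nsurv
    r = log(I(:,:,k)) - log(q(k) * N);
    obs = ~isnan(r);
    f = f + sum(r(obs).^2) / (2*s^2) + sum(obs(:)) * log(s);
  end
end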

Assumptions

The number of cohorts m, the recruitments N_{i,0}, the number of surveys, the catchabilities q_k, s and M are input by the user. The catch is randomly generated (see the formula below), and the survey indices are generated to comply with the model's statistical assumption that:

\ln(I_{i,t,k}) \sim N(\ln(q_k N_{i,t}), s^2)

Implementation and Coding

Data can be generated using the file generate_data.m, which can be found in the mfile repository.
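
For a single cohort, age and survey, an index consistent with the assumption above can be drawn as a lognormal perturbation of q_k N_{i,t}. This is only a sketch of the statistical assumption with illustrative values, not the code in generate_data.m itself:

% Draw one survey index so that ln(I) ~ N(ln(q*N), s^2)
% (illustrative values; not taken from generate_data.m).
q = 0.5;  N = 350;  s = 0.1;
I = q * N * exp(s * randn(1,1));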

Example Results

:!: SUMMARY: Below are various experiments conducted in April 2010 to test the performance of the model and the data file generator. The conclusion drawn from them is that unless s is very small, a certain level of catch is needed; otherwise the estimates of the initial population size have enormous variance.

Varying s in the data: single experiments

For the sake of example, let us generate a data set with three cohorts and two surveys:
:!: NOTE: All the experiments regarding varying s were generated using this old version of generate_data.m. It imposes a fixed cfactor = 1.0, and also allows observations of age group 0.

generate_data([200 400 800],[0.5 0.5],1e-4,0.15,'../cod_admb/cod',3,15,2,3*15*2,1980)

In other words, the three cohorts start at initial levels of 200, 400 and 800, respectively, and the catchability of each of the two surveys is 0.5. We set s close to zero, which makes the data essentially deterministic. The natural mortality is set to 0.15, and we follow the cohorts until they are age 15, starting in 1980. The argument 3*15*2 specifies the number of survey indices; in other words, all possible survey indices are available. This gives this .dat file.
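
For reference, here is the same call with each argument annotated. The interpretation is inferred from the description above; consult generate_data.m for the authoritative argument order and meaning.

generate_data([200 400 800],     ... % initial levels N0 for the three cohorts
              [0.5 0.5],         ... % catchability q of each of the two surveys
              1e-4,              ... % s, close to zero: nearly deterministic data
              0.15,              ... % natural mortality M
              '../cod_admb/cod', ... % output file prefix (inferred)
              3,                 ... % number of cohorts
              15,                ... % maximum age followed
              2,                 ... % number of surveys
              3*15*2,            ... % number of survey indices (all possible)
              1980);                 % first year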

…and the following .pin file

cod.pin
# N0
 200   400   800 
# q
 0.50000 0.50000
#logs
-9.21034
#M
 0.15000

After running ADMB, the following estimates are obtained.

cod.par
# Number of parameters = 7  Objective function value = -449.991  Maximum gradient component = 4.90435e-07
# N0:
 199.969 399.929 799.862
# q:
 0.500070 0.500079
# logs:
-4.99999999861
# M:
0.149994820774

The corresponding standard deviations are:

cod.std
 index   name   value      std dev   
     1   N0    1.9997e+02 4.6847e+00
     2   N0    3.9993e+02 9.9539e+00
     3   N0    7.9986e+02 2.0937e+01
     4   q     5.0007e-01 1.2089e-02
     5   q     5.0008e-01 1.2089e-02
     6   logs -5.0000e+00 3.9311e-06
     7   M     1.4999e-01 7.3867e-04

In other words, the recovered values are almost exactly the original values, which makes sense since the data was essentially deterministic. The real population trajectories are shown in the figure below, with a different color for each cohort. (The recovered trajectories are almost identical, and are not shown.)

:!: Note that for all the experiments in this section the true trajectories are the same!


Instead of having deterministic data, let us set s to 0.05, and randomly generate a dataset (datfile, parfile, stdfile – the pinfile from the zero-sigma files defined earlier can still be used). This is then, of course, only one possible realization. If we optimize the model, we get the results in the figure below.

The curves with triangles are the true population trajectories. The plain curves are the computed trajectories, with a 95% confidence interval for the initial estimate marked as a vertical line. The plus signs and circles are survey observations from the first and second survey, respectively, DIVIDED BY the corresponding catchability. The colors of these observations correspond to the cohort colors.


Now, let us increase s to 0.1 and generate a new data set. (datfile, parfile, stdfile.) The results are given in the figure below:

The deviation between the computed and real trajectories is now slightly larger than in the previous case, as one might expect. However, the size of the confidence intervals has grown dramatically. In addition, although the computed trajectories match the observations quite well, the observations do not match the true trajectories (especially for the blue curves), suggesting that the estimate of q is quite inaccurate. Indeed, from the stdfile we see that the estimated qs are about 0.3, as opposed to the true value of 0.5 for each survey.


Finally, let us try s = 0.15. (datfile, parfile, stdfile).

As one can see, the computed trajectories actually match the true curves better than they do in the previous case. This suggests that looking at only one realization per s-value is not enough to make inferences about the effect of s on the estimates. However, it can be noted that in this case the qs were estimated to be about 0.4, which is closer to the true value of 0.5, and this might explain why the fit is better.

Varying s in the data: Monte-Carlo experiments

:!: NOTE: As in the preceding section, all the experiments regarding varying s were generated using this old version of generate_data.m, with cfactor = 1.0.

For the three choices of s (0.05, 0.1, 0.15) we generate data sets with one cohort which is followed for 15 years, with q = 0.5 and M = 0.15. The estimates for N0 vary a great deal, although the median is relatively stable, as shown in the figures below:
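
A hypothetical driver for these experiments could look roughly like the sketch below. The call to generate_data follows the argument interpretation inferred earlier, the executable name ./cod is assumed from the file prefix, and read_par_n0 is a hypothetical helper that extracts N0 from cod.par; none of this is taken from the actual scripts.

% Hypothetical Monte-Carlo driver (illustrative only): generate a data set,
% run the ADMB estimator, and collect the N0 estimate, for each value of s.
svals = [0.05 0.10 0.15];
nrep  = 100;                                    % replicates per s (illustrative)
N0hat = zeros(nrep, numel(svals));
for j = 1:numel(svals)
  for r = 1:nrep
    generate_data(200, 0.5, svals(j), 0.15, '../cod_admb/cod', ...
                  1, 15, 1, 1*15*1, 1980);
    system('cd ../cod_admb && ./cod');          % run the ADMB executable (assumed name)
    N0hat(r,j) = read_par_n0('../cod_admb/cod.par');   % hypothetical .par reader
  end
end
median(N0hat)                                   % medians are relatively stable; the spread is not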

As one might expect, changing s has a significant effect on the number of experiments which produce estimates close to the true value, with a lower s-value giving better results.

Varying q in the data: Monte-Carlo experiments

:!: NOTE: These experiments were generated using the data generation file listed above, with cfactor = 0.65.

For the three choices of q (0.25, 0.5, 0.75) we generate data sets with one cohort which is followed for 15 years, with s = 0.1 and M = 0.15. As before, the estimates for N0 vary a great deal, although the median is relatively stable, as shown in the figures below:

As we can see, varying q has little effect, at least compared to varying s.

Varying M in the data: Monte-Carlo experiments

:!: NOTE: These experiments were generated using the data generation file listed above, with cfactor = 0.65.

For the three choices of M (0.05, 0.1, 0.15) we generate data sets with one cohort which is followed for 15 years, with s = 0.1 and q = 0.5. As before, the estimates for N0 vary a great deal, but changing M does not seem to have an effect on the proportion of experiments which produce estimates for N0 that are much larger than the true value.

Varying cfactor in the data: Monte-Carlo experiments

For this set of experiments we vary the constant cfactor. It appears in the file generate_data on the line:

CC(i-1,j) = rand(1,1) * cfactor/n * N(i-1,j);

Here CC is the generated catch, n the number of years we follow the cohort, and N the population level. This line sets the catch for the year in question to a fraction of the remaining stock, multiplied by a uniformly distributed random variable. Increasing cfactor therefore increases the catch.
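
To see how cfactor controls depletion, one can combine the catch rule above with equation [1] for a single cohort. The sketch below (with illustrative starting values) is not part of generate_data.m, but it reproduces the same catch rule:

% Sketch: apply the catch rule together with equation [1] to a single
% cohort, to see how quickly a given cfactor depletes it.
n = 15;  M = 0.15;  cfactor = 5;           % try e.g. 0.3, 1.5, 5, 10
N = zeros(1, n+1);  CC = zeros(1, n);
N(1) = 200;                                % illustrative initial level
for j = 1:n
  CC(j)  = rand(1,1) * cfactor/n * N(j);   % the line from generate_data
  N(j+1) = (N(j) - CC(j)) * exp(-M);       % equation [1]
end
disp(N)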

FIXME NOTE: The above formula should probably be changed to be independent of n and have less variance.

We test for five different values of cfactor (0.3, 0.6, 0.9, 1.2, 1.5). The results are shown in the following plots:

It appears that the N0 estimates become more stable as the catch increases, with cfactor = 1.2 being a slight exception.

We therefore test two relatively large values, cfactor = 5 and 10. The results are given below:

As we can see, the estimates are now very consistent, but the case cfactor = 10 corresponds to the cohort reaching an almost zero level very quickly (on the order of 2-3 years), whereas cfactor = 5 allows the cohort to maintain reasonably large levels for about 7 years.
