Sabre

Defining, fitting and displaying models

Sabre manual

There are two types of model that can be fitted in SABRE. These are:

A standard model which does not estimate a case-specific term in the linear predictor
A mixture model with a case-specific random effect in the linear predictor

More details on the statistical background and estimation procedures for the models are given in the technical description.

The command to specify the response variable is

YVARIATE variable

where the variable must have values of either zero or one for a binary model, non-negative integer values for a Poisson model, any real values for a linear model, or integer values between one and the number of categories for an ordered response model.

To specify the distribution of the response variable type

FAMILY argument

where argument is either B for binomial (binary data or ordered response models), P for Poisson (count data models) or G for Gaussian (linear models).

To specify the link function for binary data or ordered response models type

LINK argument

where argument is either L for logit, P for probit or C for complementary log-log (binary data models only). For Poisson and linear models the link function is set to the log link and the identity link respectively, and cannot be changed by the user.

To fit an ordered response model type

ORDERED y

For such models the YVARIATE must consist of the integers from 1 to the number of categories. The model fitted will either be an ordered logit or ordered probit depending on the link function specified in the LINK command.

To specify initial estimates for the cutpoint parameters in a random effects ordered response model, precede the FIT command with the CUTPOINTS command as follows:

CUTPOINTS c1 c2 c3 ...

where the arguments are real numbers.

The current values of both the FAMILY and LINK specifications can be obtained using the DISPLAY SETTINGS command.

For mixture models, the name of the case variable must also be specified using

CASE variable

Clearly, one must be careful that this is specified correctly as a sequence of non-descending integer values.

Multivariate models are specified using

MODEL B

for bivariate models, and

MODEL T

for trivariate models.

For multivariate models, the name of the risk variable must also be specified using

RVARIATE variable

where the variable must take the values one or two for bivariate models, and one, two or three for trivariate models. The risk variable is used to associate each observation with the correct linear predictor.

The current specifications of the response, case and risk variables can be obtained using the DISPLAY VARIABLES command.

Multivariate random effects models can either be correlated or uncorrelated; to specify this type

CORRELATED argument

where argument is either Y or N. The default is to fit a correlated model.

Multivariate models are specified in the same way as univariate models via a single variable list and so the number of variables in the first linear predictor must be specified by

NVAR n

where n is a positive integer.

For trivariate models, the number of variables in the second linear predictor must also be specified.

For multivariate models it is necessary to specify more than one argument to some commands. For example, in bivariate models the distributions of both responses are required and each has an associated link function. The full list of commands which can take multiple arguments is: CONSTANT, FAMILY, LINK, MASS, NVAR, RHO, SCALE and SIGMA. In each case, the single argument syntax is extended to

COMMAND FIRST=argument1 SECOND=argument2 THIRD=argument3

In the case of a single argument the option FIRST can be omitted and so the syntax reverts back to

COMMAND argument

So, for example, to fit a bivariate model in which the first response is continuous and the second response is a binary outcome to be modelled as a probit, the family and link commands would be specified as follows:

FAMILY FIRST=g LINK SECOND=p

Note that the second argument to family can be omitted because this will be binomial by default, and the first argument to link can also be omitted because this is set to the identity link automatically by virtue of the associated family being Gaussian.

To fit a standard model with explanatory variables A, B and C use the command

LFIT a b c

where A, B and C can be either variables, factors or pseudo factors.

To fit a mixture model type

FIT a b c

The FIT command will first fit a standard model in order to then use these estimates as starting values for the explanatory variable parameters in the random effects model. Note that the estimates from the standard model fit will not be displayed; if these are of genuine interest then LFIT should be used explicitly followed by DISPLAY ESTIMATES. The user can overide these default starting values using the INITIAL command (see below).

For both the standard and mixture models, the constant (or intercept) term will usually be needed and this should be specified using

CONSTANT variable

if variable is an explicit constant in the dataset, or

CONSTANT cons

to create an implicit constant cons.

The constant should then be included as an argument in the LFIT or FIT command. Note that, regardless of where it appears in the argument list, the constant will be taken to be the first term in the model.

For mutivariate models, dummy variable versions of the risk variate should be used as the constants.

To specify initial estimates for the explanatory variable parameters in a mixture model, precede the FIT command with the INITIAL command as follows:

INITIAL a1 a2 a3 ...

The argument list should contain starting values for each explanatory variable level to be estimated in the model, so factor levels which will be intrinsically aliased during the model fit (usually the first) should not be initialised. The starting values for the remaining factor levels should be supplied in reverse order from the highest back to the lowest. For example, for a model which includes an intercept term, a 4-level factor contributes 3 arguments to the INITIAL command, representing the initial estimates for the 4th, 3rd and 2nd levels of the factor in turn.

To include a lagged response variable in a univariate binary model type

LAG y

prior to the FIT command.

Note that all subsequent univariate binary mixture models will continue to include the lagged parameter until either the

LAG n

or (see below)

MARKOV y

command is issued.

To fit a 2-state Markov univariate binary logistic-normal mixture model type

MARKOV y

prior to the FIT command.

Note that all subsequent univariate binary mixture models will continue to be 2-state Markov until either the

MARKOV n

LAG y

command is issued.

Note that in both the lagged and Markovian frameworks, the initial observation for each case is automatically dropped from the dataset. Switching LAG on switches off MARKOV, while switching MARKOV on switches off LAG.

Note also that, since the standard models assume each observation to be independent and ignore the case structure within the data, the lagged and Markovian environments are relevant only to mixture modelling.

If any parameters in a model are known a priori then the associated explanatory variables may be included in the model through use of the

OFFSET var

command. The offset term var will have its coefficient fixed at 1.0 and thus be taken as a constant part of the linear predictor.

When a model has converged to its maximum likelihood solution, the log likelihood and residual degrees of freedom may be obtained using

DISPLAY MODEL

Similarly, parameter estimates and standard errors are displayed by typing

DISPLAY ESTIMATES

while for univariate models the correlation matrix may be examined via the command

DISPLAY CORRELATIONS

The order of arguments for FIT and LFIT can be very important as this order determines the aliasing structure of the model. SABRE will estimate parameters (for factors or pseudo factors, going from the highest to the lowest level) in order of the arguments until a particular parameter cannot be estimated. The estimate for this parameter is set to zero. If the aliasing is due to the design of the explanatory variates (eg, if fitting an intercept with a factor, one parameter cannot be estimated) then 'ALIASED [I]' is written in the standard error column of the table of estimates. If, however, aliasing is due to other linear dependencies between the explanatory variates then 'ALIASED [E]' is written in this column.

For very long sequences of observations for the same case or count/linear models with large value responses, calculating the likelihood and derivatives can result in numerical underflow. For univariate models SABRE has a method of calculation which overcomes these problems. However, the drawback with this method is that it makes the fitting algorithm considerably slower. Thus, a command ARITHMETIC is provided which allows the user to choose which method of calculation she requires. The syntax is

ARITHMETIC argument

where argument = ACCURATE for the method which does not suffer from underflow problems, and argument = FAST for the fast method which can suffer from underflow. The argument can be shortened to either A or F.

Clearly, whether underflow occurs is dependent on a number of factors. These include the number of observations for each case, the magnitude of the probabilities being multiplied and the accuracy of the machine. If the fast method is being used, SABRE will return an error message if underflow occurs during the fitting of a model. The user can then issue the ARITHMETIC ACCURATE command and refit the model.

SABRE also has two ways for estimating the Hessian matrix of second derivatives of the likelihood used in the Newton-Raphson algorithm. The first of these uses an approximation based upon first derivatives suggested by Meilijson (1989) and Berndt et al (1974), and the second uses the true second derivatives. The approximation is probably more robust when the parameter estimates are a long way from their maximum likelihood solutions and it has the advantage that the estimated Hessian matrix is necessarily positive definite. The drawback with this method is that there is uncertainty about the conditions under which the approximation provides consistent estimates of the parameter standard errors. The problem with the matrix of true second derivatives is that it is often not positive semi-definite if parameter estimates are a long way from their solutions during a particular iteration. This requires a pragmatic intervention which may greatly reduce the speed of the algorithm.

For these reasons, SABRE begins fitting the logistic-normal model (or log-linear normal model for count data) using the Hessian approximation. The number of iterations performed by this method is controlled using the command

APPROXIMATE n

where n must be a non-negative integer which is less than or equal to the maximum number of iterations (see below). The default value of n is 5. If the model has not converged before the end of these n iterations, SABRE then switches to using the true matrix of second derivatives. However, if on this (n+1)st iteration the true Hessian is not positive semi-definite, then the algorithm reverts back to the approximation and only switches methods when the true matrix of second derivatives finally becomes positive semi-definite. The algorithm then continues with the true Hessian, which may or may not become non-positive semi-definite at a later date, until the model converges or the maximum number of iterations for a model has been reached. Note that, irrespective of the number of iterations needed for convergence, the final parameter estimate standard errors are based upon the true Hessian.

The maximum number of iterations is set using

MAXIMUM n

where n must be a positive integer which is greater than or equal to the number of approximate iterations. The default value is 100.

To add comments to a Sabre command file, place a 'C' in the first column of the comment line followed by a space - everything that follows on that line will then be treated as a comment.

To terminate a Sabre session, type STOP.

Other links: Centre for e-Science | Centre for Applied Statistics