Data preparation: Introduction
A Statistical Analysis/Modelling Cycle
This schematic represents the stages a quantitative researcher may go through when working in a substantive area with individual-level panel or clustered data from secondary sources. Other types of quantitative research, e.g. those using aggregate-level data, are not covered here.
The top part of the diagram is concerned with data set selection, data management and preparation; the lower part represents the analysis. In this example, the data selection stage is done using Nesstar on data sets held in a data archive. In the past, this stage would have been performed by reading hard-copy data descriptions and ploughing through coding schedules and questionnaires by hand. Much of this work has been made easier by the online metadata and resource discovery tools provided by the data archives.
However, the online surveys of individual behaviour typically do not contain relevant contextual information. For example, in a study of labour market or occupational mobility the researcher needs some measure of the opportunities and activities in the respondent's Travel To Work Area (TTWA). Differences between areas can have a major effect on an individual's behaviour. In this figure the TTWA data are obtained from NOMIS.
Once the relevant contextual data have been selected (they may need to be disaggregated to the appropriate spatio-temporal scale, e.g. month and Travel To Work Area (TTWA)), they are added to the individuals' records. This is often done in standard desktop software such as SPSS or Stata.
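The matching step described above can be sketched as follows. This is only an illustration in Python using the standard library; the section itself works in Stata 9. The variable names (id, ttwa, month, employed, unemp_rate), the data values and the helper functions read_rows and add_context are all hypothetical. Each individual record picks up the contextual variables for its own TTWA and month, and records with no matching contextual row are kept with an empty value so that they can be inspected later.

```python
import csv
import io

# Hypothetical individual-level records: one row per person per month.
individuals_csv = """id,ttwa,month,employed
1,A,1,0
1,A,2,1
2,B,1,1
2,B,2,1
"""

# Hypothetical contextual data: one row per TTWA per month.
# Note that there is deliberately no row for TTWA B in month 2.
context_csv = """ttwa,month,unemp_rate
A,1,7.2
A,2,6.9
B,1,4.1
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def add_context(individuals, context, keys=("ttwa", "month")):
    """Attach contextual variables to each individual record by matching on keys."""
    extra_vars = [name for name in (context[0] if context else {}) if name not in keys]
    lookup = {}
    for row in context:
        key = tuple(row[k] for k in keys)
        if key in lookup:
            # The contextual file must have at most one row per area and period.
            raise ValueError(f"duplicate contextual row for {key}")
        lookup[key] = row
    merged = []
    for row in individuals:
        match = lookup.get(tuple(row[k] for k in keys))
        new_row = dict(row)
        for name in extra_vars:
            # An empty string marks an individual record with no contextual match.
            new_row[name] = match[name] if match else ""
        merged.append(new_row)
    return merged


merged = add_context(read_rows(individuals_csv), read_rows(context_csv))
for row in merged:
    print(row)
```

Keeping the unmatched records, rather than silently dropping them, makes it easier to see where the contextual source does not cover the areas or periods present in the survey.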
The resulting file is then manipulated into a form (the working data) appropriate for the analysis. This working data set is then saved in the text format needed for Sabre and the analysis is performed. If the researcher finds problems with the analysis, e.g. missing values in a variable make it impossible to use, or the referees suggest a modified analysis, the researcher goes back to the data management step and selects alternative or additional variables. This cycle may be repeated two or three times.
The data management step could be much more complex than this. For example, the British Household Panel Survey has 30 files with related data that may need to be combined to produce a coherent data set (e.g. work history data), and this type of database management cannot currently be performed in Nesstar.
However, many of the data sets we use to illustrate Sabre on this site are much simpler, as they have individual-specific but time-constant explanatory variables for the responses of interest. In this case the data can be arranged in a compact wide form.
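To make the idea of a wide form concrete, the sketch below rearranges a small long-form data set (one row per individual per occasion) into one row per individual, with the repeated response spread across columns. It is an illustration in Python rather than the Stata used in this section. The variable names (id, sex, wave, y), the data values and the function long_to_wide are hypothetical.

```python
# Hypothetical long-form data: one row per individual per occasion.
long_rows = [
    {"id": 1, "sex": "f", "wave": 1, "y": 0},
    {"id": 1, "sex": "f", "wave": 2, "y": 1},
    {"id": 2, "sex": "m", "wave": 1, "y": 1},
    {"id": 2, "sex": "m", "wave": 2, "y": 1},
]


def long_to_wide(rows, id_var, time_var, response, constant_vars):
    """Collapse repeated rows into one row per individual.

    The response at occasion t is stored in a column named response + t,
    e.g. y1, y2. Variables in constant_vars must not change within an individual.
    """
    wide = {}
    for row in rows:
        if row[id_var] not in wide:
            record = {id_var: row[id_var]}
            record.update({v: row[v] for v in constant_vars})
            wide[row[id_var]] = record
        record = wide[row[id_var]]
        for v in constant_vars:
            if record[v] != row[v]:
                # The wide form is only appropriate for time-constant covariates.
                raise ValueError(f"{v} varies over time for {id_var}={row[id_var]}")
        record[f"{response}{row[time_var]}"] = row[response]
    return [wide[key] for key in sorted(wide)]


wide_rows = long_to_wide(long_rows, "id", "wave", "y", ["sex"])
for row in wide_rows:
    print(row)
```

The check on the time-constant variables reflects the condition stated above: if an explanatory variable changes between occasions, a single column per individual can no longer hold it.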
Sabre concentrates on procedures for estimating random effect models and has only a few commands for performing simple transformations. For instance, it does not have facilities for handling data with missing values, reshaping data, or manipulations such as sorting on a particular variable, so these activities are best performed in more general statistical software packages. The commands we use in this section on data preparation are for Stata 9. The Stata versions of the data sets we refer to in this section are not available from this site, but we do provide references that will help you find them. Other software such as R and SPSS could also be used to perform the types of data set preparation we describe here.
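As a rough illustration of the kind of preparation meant here, the following Python sketch drops records with missing values, sorts the remaining records by individual and occasion, and writes them out as plain text. It is not the Stata code used in this section. The variable names and values are hypothetical, and the whitespace-delimited layout with one record per line is an assumption; the exact file layout Sabre expects should be checked against the Sabre documentation.

```python
# Hypothetical working data; None marks a missing value.
records = [
    {"id": 2, "wave": 1, "y": 1, "age": 40},
    {"id": 1, "wave": 2, "y": 1, "age": None},
    {"id": 1, "wave": 1, "y": 0, "age": 31},
    {"id": 2, "wave": 2, "y": 0, "age": 41},
]
variables = ["id", "wave", "y", "age"]


def drop_incomplete(rows, names):
    """Keep only the rows with no missing value in any of the named variables."""
    return [row for row in rows if all(row[name] is not None for name in names)]


def to_text(rows, names):
    """Format rows as space-separated text, one record per line (assumed layout)."""
    return "".join(" ".join(str(row[name]) for name in names) + "\n" for row in rows)


complete = drop_incomplete(records, variables)
# Sort so that each individual's records are contiguous and in time order.
complete.sort(key=lambda row: (row["id"], row["wave"]))
text = to_text(complete, variables)
print(text, end="")
```

Dropping every incomplete record is the simplest possible treatment of missing values and is shown only to keep the example short; whether it is appropriate depends on the analysis.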
Our notes are not meant to be comprehensive, as there are many sites that provide an introduction to Stata and to some of the commands we present here; see, for example, http://www.ats.ucla.edu/stat/stata/.