
Criminal Careers: Discrete or Continuous?

Source: https://link.springer.com/article/10.1007/s40865-016-0029-2

Mixture Modeling of Criminal Career Trajectories

The classification of criminals into discrete criminal types was an important element of late nineteenth-century positivist criminology [8–10, 35, 96], but until recently, the appeal of general theories of crime causation (e.g., strain theory, learning theory, control theory) left the enterprise of constructing taxonomies somewhat marginal to mainstream twentieth-century criminology. It has, however, been revived in the past 40 years, with a refocused interest on criminal careers—that is, the sequence of offenses or arrests committed by law violators over time.

The ground for this revival was paved by Chicago School sociologists whose qualitative investigations focused on the processes by which youths were initiated into delinquent activities through social interactions (Shaw 1930, 1931; [7]). Around the same time, Glueck and Glueck [37, 38] tracked the criminal involvement of Boston youths over time, and carried out quantitative analyses of the patterns they found. Perhaps because of the difficulty in following large numbers of young people over an extended period of time, this type of criminological research was never widely adopted. The studies of self-reported delinquency undertaken in the 1960s and 1970s were usually cross-sectional.

Wolfgang et al. [124] revived the longitudinal study of criminality with their research on the arrest histories of Philadelphia boys. Soon thereafter, criminologists began to develop statistical models describing the temporal patterns disclosed in those arrest histories [14, 16]. In this new body of work, it is the classification of careers, rather than the classification of criminals on the basis of static individual traits, that is at the center of attention.

Statistical analyses of temporal patterns of involvement in criminal conduct have been the subject of a number of books and numerous journal articles. Theory and research on this topic carried out in the 1970s dealt with such issues as the shape of the age-crime curve, which rises in childhood, peaks in adolescence or early adulthood, and then declines [31, 40, 41, 51, 52]. This pattern was, Hirschi and Gottfredson [51, 52] proclaimed, the same for blacks and whites, males and females, and in different times and places—a contention that others disputed [41, 43, 109].

In retrospect, it is striking that the debate over this claim largely focused on the comparison of aggregate figures—that is, on the total numbers of arrests or convictions recorded for persons of different ages—while having little to say about individual variability around the aggregate trends. The lack of long-term longitudinal data sets of individual crime records forced researchers of the late twentieth century to work with data for aggregates. In the interim, information about individual involvement in crime spanning a number of years has become available, making it possible to study individual patterns of criminality over time. This availability has enabled researchers to overcome another weakness in the earlier body of work: it drew inferences about the effect of age from cross-sectional data, potentially confounding age effects with cohort effects [44].

The heated debates about the universality of the age-crime curve have largely been resolved. The existence of diverse patterns of temporal change in individual offenders’ involvement in crime (Footnote 1) is well-established [15, 27, 29, 43, 78, 109]. Research has also identified life events that alter patterns of criminality [13, 59, 60]. Nevertheless, despite vigorous research efforts spanning several decades, some fundamental questions about criminal careers remain unanswered. This paper addresses one of them—the shape of the distribution across individuals of criminal career trajectories. In particular, is this distribution discrete or continuous? Much of the recent research on criminal careers has adopted a particular strategy for studying temporal patterns of crime—the estimation of finite mixture models, also known as group-based trajectory analysis or latent class analysis (Footnote 2) [2, 74–77, 84]. In a review of this research published just 8 years ago, Piquero [94] tallied more than 80 criminological publications using this approach. A more recent review located 105 studies [54]. Sterba et al. [110] observe that hundreds of studies using this method have been carried out in psychology.

There are no doubt numerous reasons for the popularity of finite mixture modeling as a strategy for studying criminal careers; one of them is that it overcomes an important limitation of earlier statistical methods for analyzing longitudinal patterns. Older methods could accommodate heterogeneity in criminal career trajectories only when its sources were known to researchers and measured, typically through interaction terms or subgroup analyses. Yet researchers commonly do not know all of the sources of heterogeneity. Finite mixture modeling allows researchers to take into account heterogeneity due to causes that are not known to the researcher and that are not represented by variables in the data set.

Although some writers have employed other statistical methods for studying criminal careers, such as multilevel modeling [17, 29, 34, 53, 81, 115], these alternatives have been used much less often than finite mixture models. This is so even though several researchers have expressed doubts or misgivings about the finite mixture approach (Footnote 3) [4–6, 26, 28, 97, 101, 102, 107, 110], and even though these methods can also address heterogeneity due to unknown and unmeasured causes. Finite mixture modeling clearly dominates research on this topic. The present paper is intended to clarify some of the issues that have figured in the debates about the potential value of this particular statistical tool to the study of criminal careers and, more generally, to developmental studies. It is also intended to encourage researchers to consider the full range of methodological options available to them, and to offer guidance as to how they might best choose models for their analyses.

The main statistical modeling approaches currently being used for analyzing criminal careers strive to represent data regarding criminal events in simplified form by positing a simple functional dependence of those events on time or on an individual’s age. They then seek to model individual variability in the parameters that define these functions. The methods do this in different ways. In both the multilevel modeling and the finite mixture modeling approaches, this dependence is usually taken to be a polynomial of second or third degree. In most instances, a polynomial of low degree will provide only an imperfect fit to the sequence of criminal events that make up a career. The trajectory is thus a mathematical construct that the researcher employs to approximate the sequence of actual events [72, 75]. Rather than assuming that a single functional dependence holds exactly for the entire population of offenders (the standard assumption in an OLS or Poisson regression), the multilevel approach and the finite mixture approach both allow for individual variability in the parameters that characterize this functional dependence. Multilevel modeling assumes a normal distribution for the unmeasured random effects characterizing individual differences. In contrast, the finite mixture modeler assumes that the complete set of individual sequences can be treated as realizations of a finite number of discrete latent trajectories, each with its own trajectory parameters. Here, the distribution assumed for the unmeasured sources of heterogeneity is taken to be discrete, but not otherwise specified. The procedure estimates the parameters characterizing each latent trajectory, along with the probabilities that a given individual is following each trajectory. One of the appeals of the group-based approach is that it need not assume that the individual effects are normally distributed. Most of the time, researchers have no strong reason for thinking the individual effects to be Gaussian, so the ability to avoid reliance on an uncertain assumption is attractive.
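In symbols (a schematic sketch of the usual specification, not the notation of any particular package): conditional on membership in latent group j, the expected level of offending for individual i at age t is

\lambda_{it}^{(j)} = \beta_0^{(j)} + \beta_1^{(j)} \mathrm{age}_{it} + \beta_2^{(j)} \mathrm{age}_{it}^{2}

(for count outcomes, the left-hand side is usually the log of the expected count). In the multilevel formulation the group superscript disappears, and the individual coefficients (\beta_{0i}, \beta_{1i}, \beta_{2i}) are instead treated as draws from a multivariate normal distribution.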

It is common practice in the group-based approach to assign individuals to the most likely trajectory, given the observed sequence of offenses in that individual’s criminal history. This method treats all the individuals assigned to a given trajectory group as if they were following the same trajectory (Footnote 4) [2, 68, 113]. Consequently, variation in the trajectories being studied is modeled as being entirely due to membership in the classes [58]. The strategy groups together subjects whose sequences of criminal events are fairly similar, and places in separate groups subjects whose sequences are dissimilar. Models of this sort can be estimated in Stata (Partha Deb’s fmm routine and Bobby Jones’s traj), SAS’s PROC TRAJ [55], Latent Gold [118], MPlus [71, 72], and R [11, 62]. In Stata’s fmm routine, the researcher proceeds by estimating a model on the assumption that there are two discrete groups, then three, then four, and so forth. Fit statistics, including the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC), are used, along with several other criteria, to determine the number of groups that provides an optimal fit to the observed sequences of offenses in the sample (Footnote 5) [74, 75, 87]. Researchers studying criminal careers in this way commonly conclude that optimal fits can be obtained with three to seven groups [27, 29, 30, 54, 64, 84, 95, 112].
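The analyses reported below were produced with Stata’s fmm; as a language-neutral illustration of the same workflow (a sketch, not the author’s code), the two steps can be expressed with scikit-learn’s GaussianMixture in Python, with placeholder data standing in for the offense sequences:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
y = rng.normal(size=(1000, 1))  # placeholder outcome data

# Step 1: estimate mixtures with 1, 2, 3, and 4 components and
# record a fit statistic (here BIC, where smaller is better).
for k in range(1, 5):
    gm = GaussianMixture(n_components=k, random_state=0).fit(y)
    print(k, round(gm.bic(y), 1))

# Step 2: refit the preferred model (k = 2 is a placeholder) and
# assign each case to its most likely group by the maximum
# posterior probability rule described in the text.
best = GaussianMixture(n_components=2, random_state=0).fit(y)
posterior = best.predict_proba(y)   # n x k membership probabilities
groups = posterior.argmax(axis=1)   # hard assignment per case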

Statisticians, including the developers of the finite mixture modeling approach, have noted a limitation of the method: it does not test the assumption that the distribution being modeled consists of a discrete number of groups. When a finite mixture model is estimated, the estimation procedure fits the data on the assumption that the assumed number of groups is correct. It will fit the data as closely as possible whether or not the distribution being fit actually consists of discrete groups [18, 19, 75, 76, 110]. This raises the legitimate concern that, if a distribution does not actually contain discrete groups, the algorithms used in finite mixture modeling could nevertheless specify an optimal number of discrete groups that is larger than 1.

Whether the over-extraction of groups is a problem depends on the purpose of the analysis. If the empirical distribution of events under study consists of a mixture of a finite number of discrete groups, and one of the purposes of the analysis is to determine how many groups are present, then one will want to determine this number accurately. If the distribution is actually continuous, with no sharply demarcated groups, and one wants to know this, then the method’s failure to reveal the continuity will be equally troubling. There are other purposes, however, for which finite mixture modeling may be entirely satisfactory even if it does not perfectly capture all of the features of the underlying distribution of trajectories. This paper is primarily concerned with analyses in which the researcher wants to know the shape of the true underlying distribution—in particular, whether it is continuous or discrete. Toward the end, the paper will also offer some brief remarks about analyses where this is not the goal.

A modest number of studies have examined the ability of finite mixture modeling to capture the features of known artificial data sets, or have compared it with multilevel modeling [18, 20, 21, 63, 73, 107, 110, 117]. To date, however, only a limited range of data patterns has been examined. Though there are particular circumstances in which each of the two methods outperforms the other, in many circumstances encountered in research the multilevel approach outperforms the group-based approach, even in the presence of moderate departures from normality [110].

Further simulations carried out by Warren et al. [121] add to concerns about the group-based approach. Comparing six different implementations of the basic idea, these authors found that the implementations did not agree on the optimum number of groups to include in a model, and could not be counted on to identify the correct number. They could not reliably predict the group to which a subject belonged, the proportion of subjects associated with a given trajectory, or the qualitative features of the trajectories estimated. As the authors point out, these simulations were carried out on ideal data sets, with no missing values and no sample attrition. In less than ideal circumstances, they observe, performance might be worse. Other researchers have found the optimal number of groups and the assignment of individuals to groups to depend on the length of the follow-up [28].

Simulations conducted by Schork and Schork [103] and Bauer and Curran [4] are of particular interest in relation to our investigation. Their studies were designed to assess the possibility that finite mixture models will yield optimum fits for models with more than one group even when the data-generating process does not have distinct groups, merely because of skewed distributions in the data. Their simulations show that when there is just one true group, the group-based approach will generally yield one group when the parameters are normally distributed (Footnote 6), but will “find” more than one group when the parameter distributions are skewed.

I confirmed their conclusions by generating 1000 random draws from the beta distribution with shape parameters a = 2, b = 6. The beta distribution is a family of continuous distributions whose probability density is given by

f(x; a, b) = \frac{x^{a-1} (1 - x)^{b-1}}{B(a, b)}, \qquad B(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a + b)},

when x lies in the closed interval [0, 1], and is 0 otherwise. This theoretical distribution is shown in Fig. 1a, while a histogram of the random draws from it is displayed in Fig. 1b. Clearly, there are no distinct groups here. By construction, we have a unimodal, smooth, moderately skewed distribution, apart from irregularities originating in the randomness of the draw.
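For readers who wish to reproduce Fig. 1, a minimal Python sketch (the seed and bin count are arbitrary choices, not taken from the paper):

import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)        # arbitrary seed
draws = rng.beta(2, 6, size=1000)      # 1000 draws from Beta(2, 6)

x = np.linspace(0, 1, 500)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(x, beta.pdf(x, 2, 6))         # panel a: theoretical density
ax1.set_title("Beta(2, 6) density")
ax2.hist(draws, bins=30)               # panel b: histogram of the draws
ax2.set_title("1000 random draws")
plt.show()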

Fig. 1 a Graph of the Beta(2, 6) density. b Histogram of 1000 random draws from a Beta(2, 6) distribution

To assess the performance of finite mixture modeling in relation to distributions with these features, I carried out fmm estimation in versions 11 and 14 of Stata, assuming that the data originate in a mixture of 1, 2, 3, or 4 groups, each with a normal probability density. The optimum fit is the model with the smallest value of BIC (Footnote 7). The BIC statistics displayed in Table 1 identify the model with three groups as optimal (Footnote 8). If, instead of assuming that we have a mixture of normal distributions, we assume that we have a mixture of Student’s t distributions, the optimum number remains 3. If we assume a mixture of lognormal or gamma distributions, the optimum number becomes 2.

Table 1 BIC values (finite mixture estimation of 1000 random draws from a B(2,6) distribution)
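The same exercise can be approximated outside Stata; a sketch with scikit-learn’s GaussianMixture (an EM estimator comparable to, but not identical with, fmm, so the BIC values will not match Table 1 exactly):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
y = rng.beta(2, 6, size=(1000, 1))     # skewed data from a single "group"

# Fit mixtures of 1-4 normal components; smaller BIC = better fit.
for k in range(1, 5):
    gm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(y)
    print(f"{k} component(s): BIC = {gm.bic(y):.1f}")
# On skewed one-group data like this, the smallest BIC typically occurs
# at k > 1, reproducing the over-extraction described in the text.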

The proportions of cases belonging to each group of the three-group mixture-of-normals model are shown in Table 2. The proportions of cases for which a given trajectory is most likely are all substantial enough that we would want to retain all three groups in our model. Table 3 shows the Bayes factor appropriate for comparing the probability that the correct model is i with the probability that it is j (Footnote 9). Using Jeffreys’ interpretation of these factors, in a comparison of the three-group model with the two-group model, the estimated Bayes factor of 1311.60 signifies extremely strong evidence for three groups rather than two; in a comparison of the three-group model with the four-group model, the Bayes factor of 19.01 provides very strong evidence for three groups instead of four. The probability that the three-group model is correct, using the criterion of Kass and Wasserman ([57]; see also [56, 75, p. 70]), is 0.80.

Table 2 Proportions of cases in each class (three group model, mixture of normals) (finite mixture estimation of 1000 random draws from a B(2,6) distribution)
Table 3 Bayes factors (B12) for pairs of groups (three group model, mixture of normals) (finite mixture estimation of 1000 random draws from a B(2,6) distribution)
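Bayes factors and the Kass-Wasserman model probability can be approximated directly from BIC values via the standard Schwarz approximation; a sketch with hypothetical BIC numbers (stand-ins, not the values in Table 1):

import numpy as np

# Hypothetical BIC values for the 1-, 2-, 3-, and 4-group models,
# on Stata's smaller-is-better scale.
bic = np.array([120.0, 95.0, 80.6, 86.5])

# Schwarz approximation to the Bayes factor favoring model i over model j.
def bayes_factor(bic_i, bic_j):
    return np.exp((bic_j - bic_i) / 2.0)

print(bayes_factor(bic[2], bic[1]))    # 3-group model vs 2-group model

# Approximate posterior probability of each model under equal priors
# (subtracting the minimum BIC first avoids numerical overflow).
w = np.exp(-(bic - bic.min()) / 2.0)
print(w / w.sum())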

Several criteria have been proposed to assess the model one has chosen. Nagin [75] recommends the computation of the average assignment probabilities based on the rule of assigning each case to the group that has the largest posterior probability. Table 4 displays these averages (AvePP). If all cases were being assigned by the maximum posterior probability rule to the right group, the average for each group would be 1. Here, the averages are .713 for those assigned to the first group, .856 for those assigned to the second group, and .756 for those assigned to the third group. Nagin [75] expresses a “personal rule of thumb” that all groups should have AvePP values of .70 or higher. Assignments for this model all meet this standard.

Table 4 Average assigned probabilities based on highest posterior probability rule (finite mixture estimation of 1000 random draws from a B(2,6) distribution)

A second criterion Nagin [75] recommends for assessing a model’s adequacy is the odds of correctly classifying a case into the proper group based on the maximum probability rule, compared to the odds of doing so by assigning cases randomly in the proportions estimated to exist in the population. For our three groups, the OCC statistics are, respectively, 11.51, 11.29, and 3.17. The higher the OCC value, the higher the model’s assignment accuracy. Nagin considers values greater than 5 for all groups to indicate high classification accuracy. We easily surpass this standard for groups 1 and 2, and fall somewhat short for the third group. If assignments to groups were being made at random, they would be correct 49.4 % of the time. With the model, they are correct 75.6 % of the time. Not bad!
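Both diagnostics are easy to compute from the posterior membership probabilities; a sketch continuing the beta-draw example (a hypothetical re-implementation, so the numbers will differ from those reported above):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
y = rng.beta(2, 6, size=(1000, 1))
gm = GaussianMixture(n_components=3, n_init=5, random_state=0).fit(y)

posterior = gm.predict_proba(y)        # n x 3 membership probabilities
assigned = posterior.argmax(axis=1)    # maximum-probability assignment
pi = gm.weights_                       # estimated mixing proportions

for j in range(3):
    avepp = posterior[assigned == j, j].mean()  # Nagin's AvePP
    # OCC: odds implied by AvePP over odds implied by the group's share
    occ = (avepp / (1 - avepp)) / (pi[j] / (1 - pi[j]))
    print(f"group {j + 1}: AvePP = {avepp:.3f}, OCC = {occ:.2f}")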

A third criterion available for assessing the usefulness of a model is how well it separates the groups. Do they overlap a little, or a lot? Table 5 addresses this question by showing the estimated means of the three groups, along with their standard errors. It is clear that the three-group model separates the three groups quite effectively. None of the 95 % confidence intervals (LCL to UCL) overlaps.

Table 5 Parameter estimates of three group model (3-group model, mixture of normals) (finite mixture estimation of 1000 random draws from a B(2,6) distribution)

Entropy is another measure of how useful a model is in classifying cases into categories. It can take on values between 0 (useless) and 1 (maximally useful). The three-group model has an entropy of .533, which is respectable but not spectacular.
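The statistic intended here is what latent class software usually reports as normalized (relative) entropy; a sketch of its computation from the posterior probabilities, again using the hypothetical refit above:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
y = rng.beta(2, 6, size=(1000, 1))
posterior = GaussianMixture(n_components=3, n_init=5,
                            random_state=0).fit(y).predict_proba(y)

# Normalized entropy: 1 minus total classification entropy over n ln K.
# 1 means cases are assigned with certainty; 0 means maximal overlap.
n, K = posterior.shape
p = np.clip(posterior, 1e-12, 1.0)     # guard against log(0)
entropy = 1.0 - (-(p * np.log(p)).sum()) / (n * np.log(K))
print(f"entropy = {entropy:.3f}")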

Most published criminological studies of trajectories using the group-based approach rely on the BIC criterion, along with the principle that models should not have trajectories followed by very few individuals. Employing these criteria, our artificial example clearly has more than one group, with the precise number considered optimal depending on the distributional family we posit for the components of the mixture. A researcher considering the additional criteria as well would probably conclude that these models are acceptable. Yet these groups are artifacts of the algorithm used in the computation. They do not demonstrate that more than one group is present. (We will say more about these results at a later point.)

In a more elaborate simulation, I generated trajectory models in which the values of an interval-level outcome variable, measured annually on a sample of individuals between ages 12 and 41, were characterized by a quadratic dependence on age, with the coefficients being random draws from a normal distribution. I again obtained optimal fits with models that had more than one group. These results show that estimation results from a finite mixture model can be untrustworthy guides to the true number of groups actually present in a data set.
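A sketch of the kind of data-generating process described, with all coefficient means and variances chosen arbitrarily for illustration:

import numpy as np

rng = np.random.default_rng(7)
n = 500
ages = np.arange(12, 42)               # annual measurements, ages 12-41
t = ages - ages.mean()                 # centered age

# A single continuous population: each individual's quadratic
# coefficients are drawn from one normal distribution (no groups).
b0 = rng.normal(2.0, 0.5, size=n)
b1 = rng.normal(0.3, 0.1, size=n)
b2 = rng.normal(-0.02, 0.005, size=n)

y = (b0[:, None] + b1[:, None] * t + b2[:, None] * t ** 2
     + rng.normal(0.0, 1.0, size=(n, len(ages))))  # n x 30 outcomes
# Feeding y to a group-based trajectory estimator and selecting the
# number of groups by BIC can still return an "optimal" k > 1.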

The findings of the Bauer and Curran [4] study and of the present simulations are especially troubling in relation to criminal career research because, in many populations, the distribution of offenses is highly skewed. This is just the circumstance in which the group-based approach is known to over-estimate the number of groups present in a data set. Because findings based on finite mixture methods applied to genuine data sets in previous criminal career research may be artifacts of the methods used to analyze the data, their conclusions as to the number of groups present cannot be assumed, without further investigation, to reflect the underlying distribution of the parameters characterizing the individual subjects of the study. Indeed, critics of the finite mixture modeling approach have observed that if there are, in fact, no discrete groups, analyses carried out on the assumption that they exist could be misleading [97]. Our analyses demonstrate the perspicacity of this observation, and motivate the exploration of alternatives to the group-based procedures.

