The Self-Consistency Analysis of Surveillance (SCANS) model of prostate cancer natural history is a statistical model built analytically rather than through computer simulation. It consists of a set of integrated submodels that combine into a joint model of cancer incidence, disease presentation (stage and grade) at diagnosis, disease progression in screen-detected persons incorporating early detection and treatment benefit, and survival and cancer mortality. The submodels are designed so that they can be fit sequentially to cancer incidence, survival, and mortality registry data from the US (Surveillance, Epidemiology, and End Results [SEER] program) and Europe (European Union registry [EUREG]), and to data from the screening trials (Prostate, Lung, Colorectal, and Ovarian [PLCO] cancer screening trial in the US and European Randomized Study of Screening for Prostate Cancer [ERSPC] in Europe). The model consists of the following components.
To model incidence, we use a two-stage model for the screening schedule (a point process). The first stage is defined by the hazard of the first prostate-specific antigen (PSA) test for a man at a given age and calendar time. A second testing intensity is defined for men who have already had their first PSA test. Both intensities of PSA testing are specified using data from a retrospective analysis of PSA testing [1]. Cancer diagnosis is defined as the result of two competing risks, clinical diagnosis (CDx) and diagnosis due to screening (SDx), whichever comes first. The two risks are dependent because they share a common natural history of the disease, and both are zero until the onset of a detectable tumor. Estimation is based on parametric maximum likelihood. Contributions of population data to the likelihood represent an average over the unobserved screening schedule and natural history processes [2]. Once the stochastic process mixed model is fit, lead time, overdiagnosis, age at tumor onset, and other characteristics of the patient and the population are predicted using Bayesian conditional probabilities.
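The structure described above can be illustrated with a small Monte Carlo sketch. SCANS itself is analytic, so this is only a toy illustration of the two-stage screening schedule and the two competing diagnosis risks, not the model's estimation method. The function `simulate_diagnosis`, the constant rates, the fixed `sensitivity`, and the `start_age` default are all assumptions made for illustration; in the model the testing intensities depend on age and calendar time.

```python
import random


def simulate_diagnosis(onset_age, first_test_rate, repeat_test_rate,
                       clinical_rate, sensitivity, rng, start_age=50.0):
    """Return (mode, age_at_diagnosis, lead_time) for one hypothetical man.

    All rates are constant here (an illustrative assumption).
    """
    # Clinical diagnosis: the hazard is zero before onset, constant afterwards.
    cdx_age = onset_age + rng.expovariate(clinical_rate)

    # Stage 1 of the schedule: waiting time to the first PSA test.
    test_age = start_age + rng.expovariate(first_test_rate)
    while test_age < cdx_age:
        # A test can only detect a tumor that has already started.
        if test_age >= onset_age and rng.random() < sensitivity:
            # Lead time: how much earlier screening found the tumor.
            return "SDx", test_age, cdx_age - test_age
        # Stage 2 of the schedule: a different intensity for repeat tests.
        test_age += rng.expovariate(repeat_test_rate)

    # No screening test detected the tumor before symptoms appeared.
    return "CDx", cdx_age, 0.0


rng = random.Random(1)
runs = [simulate_diagnosis(60.0, 0.2, 0.5, 0.1, 0.8, rng) for _ in range(1000)]
share_sdx = sum(mode == "SDx" for mode, _, _ in runs) / len(runs)
print(f"share screen-detected: {share_sdx:.2f}")
```

Because both risks start at the same onset age, they are dependent even though the waiting times after onset are drawn independently, which mirrors the shared natural history described above.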
To model disease presentation (stage, grade) at diagnosis, we represent disease stage and grade as a categorical mark attached to the incident cancer. We use a mixed multinomial logit model to specify the distribution of stage and grade at diagnosis, where the mixing variables represent the key unobserved features of the disease natural history prior to diagnosis (e.g., age at onset), predicted as conditional distributions given information observed on the patient (e.g., age and year of diagnosis). The model is estimated by maximum likelihood. We developed a special method of artificial mixtures and the quasi-EM algorithm to deal with the curse of dimensionality in complex models [3]. Applications to the multinomial model and the stage- and grade-specific incidence model are given in a series of papers [4-6]. Conditional on age, year, stage, grade, and other patient characteristics at diagnosis (e.g., race), we generate model-based predictions of the unobserved characteristics of the latent natural history of the disease prior to onset and, counterfactually, after diagnosis (e.g., point of CDx and lead time in the absence of treatment).
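As a minimal sketch of what "mixed multinomial logit" means here, the following averages ordinary multinomial logit probabilities over a discrete distribution of one latent variable (for example, age at onset). The function names, the single latent variable, and the linear form `intercept + slope * u` are illustrative assumptions, not the parameterization used in the cited papers.

```python
import math


def multinomial_logit(scores):
    """Convert category scores into probabilities (softmax)."""
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def mixed_multinomial(intercepts, slopes, latent_values, latent_probs):
    """Average category probabilities over a discrete latent distribution.

    latent_values / latent_probs describe the (hypothetical) conditional
    distribution of the unobserved variable given what was observed.
    """
    mixed = [0.0] * len(intercepts)
    for u, p_u in zip(latent_values, latent_probs):
        probs = multinomial_logit([a + b * u for a, b in zip(intercepts, slopes)])
        for j, p in enumerate(probs):
            mixed[j] += p_u * p
    return mixed


# Three stage/grade categories, latent variable taking three values.
probs = mixed_multinomial(
    intercepts=[0.0, -1.0, -2.0],
    slopes=[0.0, 0.05, 0.10],
    latent_values=[5.0, 10.0, 15.0],
    latent_probs=[0.5, 0.3, 0.2],
)
print([round(p, 3) for p in probs])
```

The mixing step is what lets unobserved natural history features influence the observed stage and grade distribution without being observed themselves.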
To model disease progression after screen diagnosis, let $Z(\xi)$ be the cancer progression process, where the time $\xi$ (the age of the tumor) is measured from the point of cancer onset in the subject. Given the two potential competing risks of clinical diagnosis (CDx) and screening diagnosis (SDx), we can define the corresponding potential values of the cancer development process, $Z(\xi_{\mathrm{SDx}})$ and $Z(\xi_{\mathrm{CDx}})$, measured on the same subject. Because the two detection mechanisms compete, these values are only partially observed. Since $\xi_{\mathrm{SDx}}$ is undefined for an unscreened subject, we cannot treat $\xi_{\mathrm{SDx}}$ as missing data in a likelihood-based approach. Let the indicator $I_{\mathrm{SDx}}$ be 1 for screening diagnosis and 0 for clinical diagnosis. Let the vector $V = (a, z)$ be the disease presentation at diagnosis, combining age $a$ and stage/grade $z$ at the point of diagnosis. The disease progression model defines the probability of disease progression during the lead time in the absence of treatment, represented by the transition model $[V_0 \mid V_1]$. For a screen-detected subject, let $f_V(V_0 \mid V_1, x)$ be the joint pdf of the disease presentation at counterfactual CDx (with characteristics $V_0$), conditional on the observed presentation $V_1$ at SDx and the birth cohort $x$. The transition probabilities between the two points of diagnosis (SDx, CDx) are modeled as functions of the lead time $\xi_L$, $p_b(z_0 \mid z_1, \xi_L)$, and summarized as a progression probability matrix (PPM). Under the null hypothesis of no screening benefit, the baseline PPM probabilities $p_b$ are not affected by treatment applied at the point of SDx. Two model predictions for cancer incidence are considered: (1) $\lambda_I(a, z \mid \neg S)$ under no screening and (2) $\lambda_I(a, z \mid \mathrm{IS})$ under ignored screening (no screening benefit), in which the patient is left undiagnosed until symptoms appear. The first scenario does not involve the PPM, while the second uses it.
The PPM is estimated by making the two counterfactual incidence predictions as close as possible under a Poisson-type distance measure. When the model was fit to SEER data, we found that only about 5% of patients would progress in stage/grade during the lead time in the absence of treatment.
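The idea of fitting the PPM by matching two incidence predictions can be sketched as follows. This is a toy version under strong assumptions: two stage/grade categories, a PPM that does not depend on lead time, a single progression probability `q`, a Poisson deviance as the "Poisson-type" distance, and a grid search in place of a proper optimizer. All function names and all numbers are made up for illustration and are not SEER estimates.

```python
import math


def apply_ppm(incidence_at_sdx, ppm):
    """Push category-specific incidence through the progression matrix.

    ppm[i][j] is the probability of presenting in category j at the
    counterfactual clinical diagnosis, given category i at screen diagnosis.
    """
    n_out = len(ppm[0])
    return [
        sum(incidence_at_sdx[i] * ppm[i][j] for i in range(len(ppm)))
        for j in range(n_out)
    ]


def poisson_distance(target, predicted):
    """Poisson deviance between two vectors of rates (predicted must be > 0)."""
    total = 0.0
    for t, p in zip(target, predicted):
        log_term = t * math.log(t / p) if t > 0 else 0.0
        total += log_term - (t - p)
    return 2.0 * total


def fit_progression_probability(no_screening, at_sdx, grid_size=100):
    """Grid search for q in the toy PPM [[1 - q, q], [0, 1]]."""
    best_q, best_d = None, float("inf")
    for i in range(grid_size // 2 + 1):
        q = i / grid_size
        ppm = [[1.0 - q, q], [0.0, 1.0]]
        d = poisson_distance(no_screening, apply_ppm(at_sdx, ppm))
        if d < best_d:
            best_q, best_d = q, d
    return best_q


# Made-up incidence by category (low, high).
q_hat = fit_progression_probability(no_screening=[72.0, 28.0], at_sdx=[80.0, 20.0])
print(q_hat)
```

In the model the transition probabilities depend on the lead time and the predictions are functions of age as well as stage and grade, so the actual fit is considerably richer than this one-parameter example.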
To model screening benefit, survival, and mortality, we use a cumulative logit (i.e., proportional odds) regression model to describe the reduced rates of progression in stage/grade during the lead time that result from treatment at screen diagnosis. A two-stage logistic-multinomial model is used to model treatment assignment at diagnosis. A second treatment effect is introduced in the Cox model describing post-lead-time survival, conditional on stage and grade at CDx. The model is fit by maximum likelihood to SEER survival and mortality data, and it is mixed over the partially unobserved lead time, stage, and grade.
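A minimal sketch of a cumulative logit (proportional odds) model shows how a single treatment coefficient can shift the whole distribution of ordered stage/grade categories. The parameterization logit P(Y <= j) = cutpoint_j - effect, the function name, and the numeric values are assumptions chosen for illustration, not the fitted SCANS model.

```python
import math


def cumulative_logit_probs(cutpoints, effect):
    """Category probabilities under logit P(Y <= j) = cutpoints[j] - effect.

    cutpoints must be increasing; categories are ordered from least to
    most advanced. A negative effect shifts mass toward lower categories.
    """
    cdf = [1.0 / (1.0 + math.exp(-(c - effect))) for c in cutpoints]
    cdf.append(1.0)  # the last category closes the distribution
    probs = [cdf[0]]
    for j in range(1, len(cdf)):
        probs.append(cdf[j] - cdf[j - 1])
    return probs


cutpoints = [0.0, 1.5]   # three ordered categories
beta_treatment = -1.0    # hypothetical log odds ratio for treatment

untreated = cumulative_logit_probs(cutpoints, 0.0)
treated = cumulative_logit_probs(cutpoints, beta_treatment)
print([round(p, 3) for p in untreated])
print([round(p, 3) for p in treated])
```

The proportional odds assumption is that the same coefficient applies at every cutpoint, so treatment is summarized by one number rather than one per category.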
References
1. Mariotto AB, Etzioni R, Krapcho M, Feuer EJ. Reconstructing PSA testing patterns between black and white men in the US from Medicare claims and the National Health Interview Survey. Cancer. 2007;109(9):1877-1886.
2. Tsodikov A, Szabo A, Wegelin J. A population model of prostate cancer incidence. Statistics in Medicine. 2006;25(16):2846-2866.
3. Tsodikov A, Liu L, Tseng C. Likelihood Transformations and Artificial Mixtures. Institute of Mathematical Statistics Festschrift in honor of A. Yakovlev: IMS Collections; 2014.
4. Chefo S, Tsodikov A. Stage-specific cancer incidence: An artificially mixed multinomial logit model. Statistics in Medicine. 2009;28(15):2054-2076.
5. Tsodikov A, Chefo S. Generalized Self-Consistency: Multinomial logit model and Poisson likelihood. Journal of Statistical Planning and Inference. 2008;138(8):2380-2397.
6. Wang S, Tsodikov A. A Self-consistency Approach to Multinomial Logit Model with Random Effects. Journal of Statistical Planning and Inference. 2010;140(7):1939-1947.