Thilly Laboratory | Biological Engineering Division

U.S. Mortality Data
(1900-2019, 111 forms of mortality, 95 forms of cancer)

Japanese Cancer Mortality Data
(1952-1995, 54 forms of cancer)

Download all Mortality Data
(Microsoft Excel files, 66 MB)

MIT Database Administrator:

Lohith Kini (lkini86@gmail.com), Prachi Kharade (pkharade14@gmail.com)

This MIT web page has been assembled and is maintained by present and former students, researchers and faculty of the Department of Biological Engineering to provide an interface between the population experience of common mortal diseases in the United States and Japan and quantitative cascade models based on biological and clinical information about these diseases. (See list of contributors below.) It contains two major elements:

The mortality and population data of the United States (1900 to 2019) and Japan (1952 to 1995). U.S. data have been updated annually since 1998 as available from the National Center for Health Statistics and the U.S. Census Bureau. The data were originally collected and organized by the U.S. Census Bureau (1900-1935) and then the U.S. Public Health Service (1936-) as annual records of most forms of cancer and other major forms of mortality and published in Vital Statistics of the United States up to 1998. Each annual record has been organized by gender, ethnic group (European Americans, Non-European Americans (primarily African Americans), Japanese) and age of death (0,1,2,3,4,5-9,…,100-104 years). For 1900-1930, before all U.S. counties reported annually, one of us (P. H.-J.) matched these data with the population of the reporting counties.
A series of computer programs designated "CancerFit" have been written to test quantitative hypotheses about carcinogenesis in humans. The present version, CancerFit v5.0, discussed and applied in Kini et al., 2013, in prep, is based on the two-stage initiation/promotion model of Armitage and Doll (1957) with "n" required events for tumor initiation and "m" required events for promotion but amended to permit:
- (a.) limitation of initiation to the fetal/juvenile period.
- (b.) risk stratification represented grossly as a fraction of the population at risk, F, and a fraction not at risk, (1-F) (Herrero-Jimenez et al., 1998, 2000).
- (c.) competing synchronous forms of mortality with shared risks, represented by f, the fraction of deaths in an age interval, among the forms of death sharing risk, that is accounted for by the specific form of death studied.
In development is v5.1 that permits stratification of multiple biological parameters: the number of stem cells at maturity, juvenile growth rates, rates of initiation and promotion events, and preneoplastic growth rates to reflect observations made in human populations.

Mortality Data: Getting Started

One may select either “View U.S. Mortality Data” or “View Japanese Mortality Data” by clicking on the appropriate icon.

Clicking on "View U.S. Mortality Data" opens a list of forms of mortality as recorded in Vital Statistics of the United States beginning with "All Causes" and ending with "Senility". This site contains most but not all data regarding more common forms of cancer and other forms of mortality. Researchers interested in organizing data for any unlisted disease(s) should contact Prof. W.G. Thilly, thilly@mit.edu, for advice and/or assistance. We would be pleased to include links to historical databases for other countries.

Clicking on a particular group of diseases such as “Digestive organs and Peritoneum (76-89)” under Malignant Neoplasms displays a more specific list of cancers.

Numbers shown in brackets are the International Statistical Classification of Diseases and Related Health Problems (UCR358 and ICD-10) Codes used to categorize diagnoses recorded as the cause of death. In some cases we have combined data from several cancer sites in order to obtain a more complete historical record. For example, Colon Cancer (83) has been recorded separately only since 1958, but when combined with Anal Cancer and Small Intestine Cancer it yields a historical intersection with Lower Gastro-intestinal Tract Cancer, whose records are continuous from 1900-2006. Occasionally, the printed record contained obvious typographic errors or nonsensical data. In these cases interpolations were used to fill in missing values, and such interpolations are clearly printed in red in the primary record of number of deaths on the Excel file sheets "MOR(t)".

Clicking next on a specific cancer site such as “Lower GI Tract” opens a page of summary data recorded from 1900-2019 organized by gender and ethnic groups (EA, European-Americans and NEA, Non-European Americans, predominantly African-Americans) and secondarily with regard to (a.) age of death (displayed chart) (b.) calendar year of birth and (c.) calendar year of death. Charts for (b.) and (c.) are opened by clicking the desired gender and ethnic group for each category.

Shown are summary charts in which the log₁₀ age-specific mortality rates (annual deaths) on the y-axis are shown as a function of age of death on the x-axis. Each birth decade cohort's age-specific mortality rate is depicted by joined symbols so that the form and historical changes in age-specific lifetime mortality rates for this form of death may be observed in a single chart.
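As an illustration of how such a chart can be assembled from the downloaded records, the following is a minimal Python sketch, not the site's own code. It assumes each record is a tuple of (year of death, age at death, deaths, population), that an age interval is represented by its lower bound, and that the rate is deaths divided by population; the function names are hypothetical.

```python
import math
from collections import defaultdict


def birth_decade(death_year, age_at_death):
    """Return the first year of the birth decade, e.g. 1895 -> 1890."""
    birth_year = death_year - age_at_death
    return (birth_year // 10) * 10


def log10_rate(deaths, population):
    """Return log10 of deaths per person per year, or None if undefined."""
    if deaths <= 0 or population <= 0:
        return None
    return math.log10(deaths / population)


def cohort_curves(records):
    """Group records into one (age, log10 rate) curve per birth decade.

    records: iterable of (death_year, age_at_death, deaths, population).
    The record layout is an assumption made for this sketch.
    """
    curves = defaultdict(list)
    for death_year, age, deaths, population in records:
        y = log10_rate(deaths, population)
        if y is not None:
            curves[birth_decade(death_year, age)].append((age, y))
    # Sort each curve by age so the points can be joined left to right.
    return {decade: sorted(points) for decade, points in curves.items()}
```

Each key of the returned dictionary corresponds to one set of joined symbols in the chart: the x-values are ages of death and the y-values are log₁₀ rates.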

Alternatively, one may choose to observe the mortality rates of individual birth decade cohorts displayed over calendar years or as specific age-specific death rates, e.g. 50-54 yrs, displayed over the entire period of recording.

Finally, the complete record for any disease may be downloaded to inspect the raw annual data as recorded by the U.S. Census Bureau or U.S. Public Health Service along with several additional ways to view the data.

If desired, all data on this website may be downloaded as Excel(TM) files (~66 MB) by clicking the icon Download all Mortality Data.

CancerFit

Clicking “CancerFit” below opens a page containing four links.

Link 1: Incidence as a function of birth year cohort and age, INC(h,t)

The first link, when clicked, shows the basic assumptions and equations used in a cascade model including but not limited to the assumptions of CancerFit v.5.0.

THIS LINK MUST BE STUDIED AND UNDERSTOOD BEFORE ANY FURTHER STEPS COULD BE USEFUL.

Model of Cancer:

INC(h,t) is the set of age-specific mortality rates for age-of-death intervals (t = 15-19, 20-24,…,100-104) of a particular population cohort defined by gender, ethnic group, and birth decade, h, e.g. EAM, 1890-99 (European-American males born 1890-99), corrected for (a.) coincident forms of death within each year and (b.) survival due to medical intervention.

CAL(h,t) is the set of age-specific incidence rates predicted by the model as the "best fit" to the data supplied as INC(h,t). In the calculation of CAL(h,t), wide ranges of values are explored for the initiation (R_i,j,…,n) and promotion (R_A,B,…,m) event rates, the preneoplastic colony growth rates, the fraction, "F", of persons at risk of the particular disease for any combination of required inherited or environmental risks, and a function, "f", that represents the fraction of deaths, among synchronously mortal form(s) of disease sharing risks with the disease studied, that is accounted for by that disease.

These data are compared by CancerFit v.5.0 to a cascade model that assumes (a.) 'n' initiation mutations are required in an organogenic stem cell during the fetal/juvenile period to create a first preneoplastic stem cell and (b.) 'm' promotion mutations are required in an initiated preneoplastic stem cell to create a first neoplastic (tumor) stem cell. Goodness of fit, GOF(h,t), measures how closely CAL(h,t) matches INC(h,t). It is calculated as the sum of [log(INC(h,t))-log(CAL(h,t))]² over the age-of-death intervals employed in the comparison, divided by the number of those intervals.
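The goodness-of-fit calculation described above can be sketched in a few lines of Python. This is an illustration only and is not taken from the CancerFit source: the function name is hypothetical, and the logarithm is assumed to be base 10 (the text says only "log").

```python
import math


def goodness_of_fit(inc, cal):
    """Mean squared difference of log rates over the age-of-death intervals.

    inc: observed rates INC(h,t), one per age-of-death interval.
    cal: model rates CAL(h,t) for the same intervals, in the same order.
    Base-10 logarithms are an assumption of this sketch.
    """
    if len(inc) != len(cal) or not inc:
        raise ValueError("inc and cal must be non-empty and of equal length")
    total = sum((math.log10(o) - math.log10(c)) ** 2 for o, c in zip(inc, cal))
    return total / len(inc)
```

A perfect fit gives 0. A model that is off by a factor of ten in one of two intervals and exact in the other gives (1² + 0²)/2 = 0.5.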

Link 2: Source Code for CancerFit v5.0

The second link, when clicked, downloads the entire source code of CancerFit, written for MATLAB v7.6 or higher. The download file (CancerFit v5.0, approximately 66 MB) is a zipped file containing MATLAB source code along with all the mortality data from this M.I.T. repository. After downloading, first unzip the file, titled CancerFitv5_0.zip. On Mac OS X, the zip file will show up in your Downloads list and will be automatically unzipped and available in the location where your downloaded items are sent. The unzipped folder contains the folders “Mortality Files”, “src”, and “util” along with the files “CancerFit.fig” and “CancerFit.m”. The model equations are implemented in the files under the “src” folder, and the interface itself is programmed in “CancerFit.fig” and “CancerFit.m”. The folder “Mortality Files” holds all the mortality and population data for the ~111 diseases available on this website as Excel(TM) and text files, both of which can be directly accessed for analysis by the CancerFit program.
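For users who prefer to unpack the archive from a script, the following Python sketch extracts a zip file and checks for the folders and files named above. The function name is hypothetical, and the possibility that the archive unpacks into a single top-level folder is an assumption about its layout, not something stated on this page.

```python
import zipfile
from pathlib import Path

# Folder and file names as listed in the description of the download.
EXPECTED = ("Mortality Files", "src", "util", "CancerFit.fig", "CancerFit.m")


def extract_and_check(zip_path, dest_dir):
    """Extract the archive and report which expected entries are missing.

    Returns (root, missing): the folder holding the contents and a list of
    expected names that were not found there.
    """
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    # Assumption: the zip may wrap everything in one top-level folder.
    entries = list(dest.iterdir())
    root = entries[0] if len(entries) == 1 and entries[0].is_dir() else dest
    missing = [name for name in EXPECTED if not (root / name).exists()]
    return root, missing
```

For example, `extract_and_check("CancerFitv5_0.zip", "CancerFit")` returns an empty `missing` list when the archive unpacked as described.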

Link 3: CancerFit Tutorial

The third link is a tutorial describing the steps a CancerFit user needs to take in order to analyze a particular age-specific lifetime mortality function, here using cancer of the lower GI tract in European American Males born 1890-1899 as an example.

(a.) Download mortality data from this website, if not already downloaded,
(b.) Open CancerFit in MATLAB,
(c.) Run CancerFit on the example: Cancer of the Lower GI Tract, EAM, birth cohort interval 1890-99, for a wide range of iterations for all parameters of the model. Note: If you download CancerFit as described above containing the folder “Mortality Files”, you do not need to re-download the files as described in the tutorial.

Link 4: Example of CancerFit v5.0 applied to Age-Specific Incidence of Lower GI Tract Cancer in European American Males born 1890-99.

The fourth and final link opens a page containing example results obtained on the Cancer of the Lower GI Tract, EAM, birth interval 1890-99 using estimated post-diagnosis five-year survival rates (See Herrero-Jimenez et al., 1998, 2000) to define INC(h,t). The program CancerFit v.5.0 was run iteratively for all twenty-five pairs of different numbers of initiation events (n = 1,2,3,4,5) and promotion events (m = 1,2,3,4,5).

First, the best fits of CAL(h=1890-99, 15< t <104) were calculated for the twenty-five combinations of n = 1-5 and m = 1-5 under the parsimonious conditions of homogeneous risk, F = 1, and no synchronous mortal diseases sharing risk factors with colorectal cancer, f = 1. Values of (Π_i R_i)^(1/n) and (Π_A R_A)^(1/m) were permitted to range from 10^-9 to 10^0 and the range of mu was set at 0.1 to 0.3.

Second, the best fits of CAL(h,t) to INC(h,t) were assessed under the additional assumption of inhomogeneous risk, i.e., the parameter “F” representing a hypothetical fraction of the population at risk was allowed to range from 0 to 1.

Third, we considered the possibility of both population inhomogeneity, F < 1, and a competing synchronous mortal disease having genetic and/or environmental risks shared with colorectal cancer, i.e., the parameter “f” representing this possibility was allowed to range from 0 to 1. This assumption did not, however, further reduce the values of GOF(h,t).
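The iteration over the twenty-five (n, m) pairs can be pictured with the Python sketch below. It is a schematic of the search loop only, under the first condition (F = 1, f = 1). The function toy_model is a stand-in power law in age, chosen so the example runs; it is not the CancerFit model, whose equations are given under Link 1. All names are hypothetical, and a single rate per pair replaces the separate initiation, promotion and growth parameters.

```python
import math
from itertools import product


def gof(inc, cal):
    """Mean squared difference of log10 rates (base 10 is assumed)."""
    return sum((math.log10(o) - math.log10(c)) ** 2
               for o, c in zip(inc, cal)) / len(inc)


def toy_model(n, m, rate, age):
    """Stand-in only: a power law in age, NOT CancerFit's equations."""
    return rate * age ** (n + m - 1)


def grid_search(ages, inc, model, rates):
    """Try every (n, m) pair with n, m = 1..5 and every candidate rate.

    Returns (score, n, m, rate) for the lowest goodness-of-fit value found.
    """
    best = None
    for n, m in product(range(1, 6), repeat=2):
        for rate in rates:
            cal = [model(n, m, rate, age) for age in ages]
            score = gof(inc, cal)
            if best is None or score < best[0]:
                best = (score, n, m, rate)
    return best


# Candidate rates spanning 10^-9 to 10^0, one per decade.
RATES = [10.0 ** e for e in range(-9, 1)]
```

With the stand-in model, only the sum n + m affects the curve, so pairs such as (1, 2) and (2, 1) tie and the first one encountered is kept. The actual model treats initiation and promotion differently, which is what allows the fits to distinguish them.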

A figure at the bottom of these sample results depicts the degree of concordance of the two trial conditions, given n=2 and m=1, with adult lifetime incidence data, INC(h,t), for lower G.I. tract cancer in European American males born 1890-99. The two conditions are F = 1, f = 1 (population homogeneity, no synchronous competing risk) and F < 1, f = 1 (population inhomogeneity, no synchronous competing risk).

Click here: CancerFit

CancerFit v5.0.0 : Requires MATLAB 7.6 or higher to run.

CASTAT

The "cohort allelic sums test" or "CAST" is provided as the following Excel program, CASTAT(c). This test is described in the Thilly & Morgenthaler (2007) Mutation Research paper, "A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST)."

Click here: CASTAT (c)

People

Dr. Pablo Herrero-Jimenez, Ph.D. (Manually entered all mortality and population data for the U.S. 1900-1991, improved and tested two-stage cancer models and wrote the first computer program for his MIT Ph.D. Thesis in Toxicology and Epidemiology submitted in 2001, titled “Determination of the historical changes in primary and secondary risk factors for cancer using U.S. public health records”.)
Prof. William G. Thilly, Sc.D. (Recognized the possibility that mutations occurring during human growth and development could account in large part for the age-specific cancer rate (1987) and with Prof. Morgenthaler devised mathematical models to explore a series of biological possibilities. Developed technology to measure mutations in human tissues and discovered that mutations in adult human lungs are distributed in cluster sizes consistent with origins limited to the fetal/juvenile period. With Dr. Gostjeva explored metakaryotic stem cells in which initiation mutations are hypothesized to occur.)
Prof. Stephan Morgenthaler, Ph.D. (Chair of the Department of Statistics and Applied Mathematics, Ecole Polytechnique Federale, Lausanne, Switzerland has worked on modeling complex biological processes, often with Prof. Thilly, since his days as Instructor of Mathematics at MIT in the early 1980s. He wrote the original program, now used in CancerFit, in Fortran in 2002.)
José Angel Márquez Jr. (Manually entered all cancer and population data for Japan, 1952-1995, and compared historical shifts in cancer mortality rates between the U.S. and Japan. M.S. Thesis in Toxicology from MIT submitted in 1999, titled “A Comparative Analysis of age-dependent and birth year cohort-specific cancer mortality data between Japan and the United States”.)
Dr. Efren Gutierrez, Jr., M.D. (Entered all mortality and population data for the U.S. 1992-1997 and tested two-stage models for his MIT M.S. Thesis submitted in 2003, titled “The Analysis of Esophageal Cancer using two different epidemiological models”).
David Hensle (Ported the Fortran program written by Prof. Morgenthaler into Java and designed the original user-friendly interface employed in CancerFit today. M.S. Thesis in EECS from MIT submitted in 2003, titled “Computation of Population and Physiological Risk Parameters from Cancer Data”.)
John Kogle (Improved user interface of CancerFit and used CancerFit v.2.0 to explore hypotheses about carcinogenesis. Discovered generally higher rate of cancer deaths in premenopausal females relative to males in organs such as the colon. M.S. Thesis in EECS from MIT submitted in 2004, titled “Multi-parametric Numerical Simulation of Age-Specific Cancer Rates in Human Populations”.)
Dr. Elena V. Gostjeva (Former Soviet Crimean scientist who first discovered the “metakaryotic” cells at MIT (2003) that appear to comprise the stem cell lineages of organogenesis, carcinogenesis, atherogenesis, post-surgical restenosis and wound healing in humans. Provided biological embodiment of hypothetical mutator/hypermutable stem cells in human development. Head of Genetic Risk Assessment Group, Chernobyl Expedition, U.S.S.R. and then Ukraine.)
Lohith Kini (Devised and tested computer model, CancerFit v5.0, for the fetal/juvenile initiation hypothesis 2004-10 as an undergraduate and M.Eng. candidate in Electrical Engineering and Computer Science at MIT.)
Jayodita Sanghvi (As an undergraduate student at MIT, B.S. '07, brought a new understanding as to when and how fast U.S. cancer risks changed historically, especially in the late 19th century. Elucidated sudden historical increases in risk for pancreatic cancer by analyses of historical mortality data for the United States and Japan.)
Tushar Kamath (Updated the CancerFit model to incorporate stratification of initiation and promotion mutation events as a high school student. Currently a B.S. candidate at MIT.)
Ray Kurzweil (Pioneering analyst of computational and biological possibilities, joined in the studies of metakaryotic cells in carcinogenesis and atherogenesis with Drs. Gostjeva and Thilly in 2007. President Kurzweiltech, Inc. Wellesley, MA. Member MIT Corporation.)
Rebecca Kusko (Using ASCII files from the National Center for Health Statistics, updated mortality and population data for the U.S. 1998-2006 and tested interface with CancerFit programs. (2009-10))
Dr. Karl Rexer, Ph.D. (President of Rexer Analytics, Inc. Winchester, MA, donated his time and created a program to transform ASCII files to the Excel files of this database (2009-10).)
Prachi Kharade (As an MIT UROP intern (2021), automated the data upload and mortality file generation process, tested it with CancerFit programs, and extended the data to 2019. Developed a mortality visualization dashboard to analyze multiple causes of death.)

Publications

Thilly, WG. "Have environmental mutagens caused oncomutations in people?". Nature Genetics 34, 255 - 259 (2003).
Thilly, WG. "Looking ahead: algebraic thinking about genetics, cell kinetics and cancer." IARC Sci Publ. 1988;(89):486-92.
Kini et al. "Mutator/hypermutable fetal/juvenile metakaryotic stem cells and human colorectal carcinogenesis" (2013).
Herrero-Jimenez et al. "Mutation, cell kinetics, and subpopulations at risk for colon cancer in the United States" (1998).
Herrero-Jimenez et al. "Population risk and physiological rate parameters for colon cancer. The union of an explicit model for carcinogenesis with the public health records of the United States" (2000).