Jimmy Lin

Genome-wide Mutational Analyses of Breast and Colorectal Cancers

Abstract: With the human genome sequence at hand, it is now possible to sequence the coding regions of cancer cell genomes to identify the mutated genes that drive tumor formation. The clinical importance of breast and colorectal cancer, which together cause 14% of yearly cancer deaths, makes these two tumor types suitable initial candidates for cancer genome sequencing. We recently surveyed more than half of the known human genes for somatic mutations in eleven breast and eleven colorectal cancers, and defined 122 and 69 genes, respectively, as candidate cancer genes in these two diseases. The study design provides a blueprint for future cancer genome sequencing efforts, validated by its ability to detect known and novel cancer genes. The findings shed light on heterogeneity between and within tumor types and provide novel research avenues for cancer biology.


Cancer is a disease of damaged genes. The current multistage model of tumor development postulates a sequential accumulation of mutations in genes controlling cell growth and cell death, all occurring in one pluripotent stem cell or its clonal offspring. Acquisition of new mutations leads to waves of clonal expansion in which the fittest tumor cell clone expands at the expense of other clones. The accumulated mutations cause either loss of function of tumor suppressor genes or gain of function of proto-oncogenes. Some of these mutations are inherited, but the vast majority are somatic, i.e., they occur only in the cancer cell and not in the germline. Mutations in different genes can cause the same phenotype when the genes converge in a common pathway. Examples of such gene pairs and processes frequently affected by mutation during the evolution from normal cell to full-blown cancer are APC/CTNNB1 in the Wnt signalling pathway and TP53/MDM2 in the DNA damage response (Vogelstein and Kinzler, 2004). Some frequently mutated cancer genes can be used for tumor detection in screening tests or for targeting with anticancer therapeutics (Diehl et al., 2005; Druker et al., 2001).

Three decades of cancer research have shown that frequent somatic mutations of an individual gene or of members of a cellular pathway are strong predictors of the importance of that gene or pathway in the neoplastic process. However, the spectrum of mutated tumor suppressor genes and proto-oncogenes varies between tumor types. Nine of ten colorectal cancers have inactivating mutations in the tumor suppressor APC, whereas mutations in APC are rare in breast cancers. More than 25% of late stage cases of both tumor types harbor activating mutations in the PIK3CA oncogene, and more than 50% have inactivating mutations in the tumor suppressor TP53. Twenty-five percent of breast cancers have ERBB2 amplifications, which do not occur in colorectal cancers. Because some cancer genes are tumor type specific, it will be essential to analyze the entire compendium of genes in the tumor type of interest in order to comprehensively identify its oncogenes and tumor suppressor genes.

Studies of familial cancer, cytogenetics, and other candidate gene approaches have identified several mutated oncogenes and tumor suppressor genes. However, these approaches are biased and not comprehensive and have therefore given way to approaches wherein entire gene families or entire cancer genomes are analyzed for the presence of somatic mutations (Bardelli and Velculescu, 2005). Two main approaches to cancer genome sequencing are currently envisioned. The first prioritizes sensitivity by sequencing all genes, or a subset of genes, in a large set of tumors (~100) of a given type, and can thereby detect 99% of the sequenced genes mutated in 5% or more of tumor cases. However, this approach requires large amounts of tumor and matched normal DNA and will be costly in terms of time and labor. If a subset of genes is chosen for study, there is a risk of bias towards selecting certain genes at the expense of others. The second approach relies on a discovery screen with a small number of tumors of a given type (~10) to identify genes with mutations, followed by validation of these mutated genes in an independent set of tumors of the same type. This approach has enough power to identify 99% of the genes mutated in 30% or more of tumors, 90% of the genes mutated in 20% or more of tumors, and 50% of the genes mutated in 6% of tumors. This study design will thus enable rapid identification of the frequently mutated genes at the expense of some rarely mutated genes, and can be accomplished with less cost, labor, and sample consumption. As discovery of frequently mutated genes is more likely to impact the development of novel diagnostic and therapeutic modalities, we implemented the latter approach.
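The power figures quoted above are roughly what a simple "at least one hit" calculation gives. The sketch below is an illustration under stated assumptions, not the study's actual power analysis: it assumes tumors are independent, that a gene counts as detected if it is mutated in at least one screened tumor, and that the discovery screen uses 11 tumors (the number per tumor type given in the abstract). The function name `detection_power` is hypothetical.

```python
def detection_power(mutation_frequency: float, n_tumors: int) -> float:
    """Probability that a gene mutated in the given fraction of tumors
    is found mutated in at least one of n_tumors independent tumors."""
    if not 0.0 <= mutation_frequency <= 1.0:
        raise ValueError("mutation_frequency must lie between 0 and 1")
    if n_tumors < 0:
        raise ValueError("n_tumors must be non-negative")
    # P(at least one hit) = 1 - P(no tumor carries a mutation in the gene)
    return 1.0 - (1.0 - mutation_frequency) ** n_tumors


# Assumed scenarios: ~100 tumors for the first approach, 11 for the discovery screen.
for freq, n in [(0.05, 100), (0.30, 11), (0.20, 11), (0.06, 11)]:
    print(f"gene mutated in {freq:.0%} of tumors, n={n}: "
          f"power = {detection_power(freq, n):.3f}")
```

Under these assumptions the four scenarios give approximately 0.99, 0.98, 0.91, and 0.49, close to the 99%, 99%, 90%, and 50% stated in the text. The sketch ignores failed amplicons and the validation stage, so small discrepancies are expected.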

Mutational analysis of a gene requires isolation of DNA from tumor cells, gene amplification by polymerase chain reaction using specific primers, sequence determination of the alleles present in the tumor, and mutation detection by comparison of tumor-derived sequences to sequences observed in normal cells from the same patient (Figure 1). Cell lines or mouse xenografts established from late stage primary tumors or metastases are suitable for mutational analysis because they provide an ample source of pure genomic DNA and are expected to contain many mutated genes as a result of several rounds of clonal selection. In addition, matched germline DNA from normal tissue must be available for each case. If a particular gene is important in the neoplastic process, one would expect it to be mutated in the majority of cancer cells within an individual lesion. Conventional dideoxy terminator sequencing of coding exons amplified from genomic DNA by polymerase chain reaction will depict the genome of the average tumor cell, with a detection limit of one mutant allele in a background of two wild-type alleles.

Although the human genome sequence is largely known, the number and precise locations of all human genes are still subject to debate. A current estimate of the total number of protein coding genes supported by experimental evidence is 23,200 (Birney et al., 2006). Transcripts from ~13,000 of these genes comprise the Consensus Coding Sequences (CCDS), which are characterized by unambiguous annotation, canonical start and stop codons, consensus splice sites, inter-species conservation evidence, supporting transcripts, and protein homology. This set of high quality transcripts, encompassing 56% of the estimated genes, was chosen as the starting point for our recently published cancer genome sequencing effort (Sjöblom et al., 2006).

The genome-wide mutational analysis was implemented as a two-stage screen to identify genes with recurring mutations, followed by a statistical filter that eliminated genes likely to be mutated by chance alone and ranked the remaining genes by relative mutation frequency (Figure 2). Using 135,483 primer pairs, we successfully amplified and sequenced 19 Mb of gene sequence per tumor, corresponding to 90% of the bases in CCDS. After comparing tumor-derived sequences to the reference genome sequence, we observed 816,986 putative mutations by automated mutational analysis. The biological impact of synonymous mutations was considered negligible, and putative synonymous changes were therefore discarded without further curation. The remaining putative mutations comprised technical artifacts, known and novel germline polymorphisms, and true somatic mutations. After automated removal of known polymorphisms and visual inspection to identify sequencing artifacts, 29,281 putative mutations remained. Upon re-sequencing, two-thirds of these alterations were confirmed. When compared to sequences from matched normal DNA, 93% of the confirmed changes were also present in the germline, and thus represented novel germline variants. The remaining 1,307 mutations in 1,149 genes were somatic, i.e., not present in the germline.
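The automated part of this filtering cascade can be sketched as a sequence of set lookups. The code below is a minimal illustration, not the pipeline used in the study: the `Variant` record, its fields, and the function `somatic_candidates` are hypothetical, and the visual inspection and re-sequencing steps are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    gene: str
    position: int      # hypothetical coordinate within the gene
    change: str        # e.g. "G>A"
    synonymous: bool   # True if the encoded protein is unchanged


def somatic_candidates(tumor_variants, known_polymorphisms, germline_variants):
    """Keep non-synonymous tumor variants that are neither known
    polymorphisms nor present in the patient's matched normal DNA."""
    known = set(known_polymorphisms)
    germline = set(germline_variants)
    kept = []
    for v in tumor_variants:
        key = (v.gene, v.position, v.change)
        if v.synonymous:
            continue  # synonymous changes were discarded without curation
        if key in known:
            continue  # known germline polymorphism
        if key in germline:
            continue  # novel germline variant seen in matched normal DNA
        kept.append(v)
    return kept


# Toy data for illustration only; positions and changes are made up.
tumor = [
    Variant("APC", 100, "C>T", False),
    Variant("APC", 200, "G>A", True),
    Variant("TP53", 50, "G>A", False),
]
print([v.gene for v in somatic_candidates(tumor, {("TP53", 50, "G>A")}, set())])
```

Keying each variant on gene, position, and change means the same lookup serves both the polymorphism database and the matched normal sample.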

To reduce the impact of background (also called “passenger”) mutations, we implemented a validation screen in which the mutated genes were sequenced in their entirety in 24 independent tumors of the same tissue type. These samples were drawn from a larger set of late stage colorectal cancer xenografts and micro-dissected primary breast cancers, and mutational analysis was performed as described above. The validation screen yielded 365 additional mutations in 236 genes, and these were considered “validated” genes.

The identification of cancer genes requires statistical methods to distinguish genes that may be driving the tumorigenic process from passenger genes subject to background mutations during the decades of growth and thousands of cell divisions through which advanced cancer cells have passed. Which of the validated genes had more mutations than would be expected from the background mutation rate alone? To address this question, we implemented a context-specific binomial statistical model. For each validated mutated gene, this model took into consideration the number of base pairs sequenced, the nucleotide composition, the number of mutations observed, and the sequence context of the mutations observed in each gene. If more base pairs were sequenced, there would be a greater number of passenger mutations expected. A higher mutation rate in 5′-TpC-3′ and 5′-CpG-3′ dinucleotide contexts was observed in breast and colorectal cancers, respectively, leading one to expect more background mutations in genes rich in these specific nucleotide contexts. The observed number of mutations was compared to the expected number of mutations with a binomial distribution for each of the nucleotide contexts. With these considerations, we calculated a cancer mutation prevalence (CaMP) score for each validated mutated gene, which reflects the probability that the number of observed mutations in the gene results from a higher mutation frequency than that expected from the background mutation rate alone. This score was normalized so that scores >1.0 translated to a mutation frequency that was likely to be greater than background, and genes fulfilling this criterion were termed candidate cancer (CAN) genes. Based on an approximation of the false discovery rate defined by Hochberg and Benjamini (1990) as well as simulations set up to mimic the type of two-stage study we performed, we predicted that no more than 10% of the CAN-genes were false positives. In addition, the CaMP score provided a rank order of the possible biological importance of the CAN-genes, and it would be reasonable to prioritize future studies on the basis of a gene’s rank. Thus, the CaMP score serves both purposes of the statistical model: it separates likely drivers from passengers and it ranks the candidates.
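The core comparison in such a model, observed versus expected mutations under a binomial distribution for each nucleotide context, can be sketched as follows. This is not the published CaMP computation: the way per-context probabilities are combined, normalized, and corrected for multiple testing is not described here and is not attempted. The function names, the context labels, and all rates and counts are hypothetical.

```python
from math import comb


def binomial_upper_tail(n_bases: int, n_mutations: int, rate: float) -> float:
    """Return P(X >= n_mutations) for X ~ Binomial(n_bases, rate)."""
    # Sum the lower tail P(X < n_mutations) and subtract it from 1.
    # Adequate for a sketch, but it loses precision for extremely small tails.
    below = sum(
        comb(n_bases, k) * rate**k * (1.0 - rate) ** (n_bases - k)
        for k in range(n_mutations)
    )
    return max(0.0, 1.0 - below)


def context_tail_probabilities(bases, mutations, rates):
    """Per-context probability of seeing at least the observed number of
    mutations if only the background rate were acting."""
    return {
        context: binomial_upper_tail(
            bases[context], mutations.get(context, 0), rates[context]
        )
        for context in bases
    }


# Hypothetical gene: bases sequenced and mutations observed per context.
bases = {"CpG": 400, "TpC": 900, "other": 6000}
mutations = {"CpG": 2, "TpC": 0, "other": 1}
# Hypothetical background rates per base; not values from the study.
rates = {"CpG": 1e-5, "TpC": 3e-6, "other": 1e-6}

for context, p in context_tail_probabilities(bases, mutations, rates).items():
    print(f"{context}: P(at least observed | background) = {p:.3g}")
```

A small tail probability in a context means the observed count is hard to explain by background mutation alone. A gene with more sequenced bases or more bases in a highly mutable context needs more mutations to reach the same probability, which is the behavior the text describes.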

The validity and relevance of the CAN-gene set was confirmed by the re-discovery of APC, TP53, KRAS, SMAD4, and FBXW7/CDC4, which constitute the known colorectal cancer genes in CCDS. Additionally, colorectal cancer genes such as NF1, SMAD2/3, and TGFBRII, previously known to be mutated in 10% of cases or less, were among the CAN-genes. By inference, the other colorectal CAN-genes are likely to play important roles in tumor development. However, some infrequently mutated cancer genes were not included in the CAN-gene set. For example, BRAF is known to be mutated in 7% of mismatch repair proficient colorectal cancers. An oncogenic BRAF mutation was detected in the discovery screen, but no BRAF mutations were found in the validation screen. Therefore, some mutated genes outside the CAN-gene set will be true cancer genes whose mutation frequency is too low to be detected by our approach.

Because all genes were sequenced in all samples of the discovery screen, this dataset can be used to estimate the number of mutated genes present in each tumor. The average tumor in the discovery screen had non-synonymous mutations in 60 genes, 11 of them being CAN-genes. Assuming 23,200 human protein coding genes, one would then expect to find ~100 total mutated genes per tumor, and ~20 mutated CAN-genes per tumor. Only two CCDS genes were CAN-genes in both colorectal and breast cancers. A maximum of six mutated CAN-genes were shared between any two tumors of the same histotype. These data imply that at least twice as many genes are mutated during tumorigenesis as predicted from past studies, and that the set of mutated genes differs significantly even between two tumors of the same histotype. The multiplicity of mutation targets will constitute a challenge for diagnostic and therapeutic development.
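The extrapolation from CCDS to the whole genome is a simple proportional scaling. The sketch below assumes that mutated genes are distributed evenly between CCDS genes and the remaining protein coding genes, and uses the approximate CCDS size of 13,000 genes quoted earlier in the article. The function name is hypothetical.

```python
TOTAL_GENES = 23_200  # estimated protein coding genes cited in the article
CCDS_GENES = 13_000   # approximate number of CCDS genes cited in the article


def extrapolate_to_genome(observed_in_ccds: float) -> float:
    """Scale a per-tumor count observed in CCDS genes to all genes,
    assuming mutations are spread proportionally across genes."""
    return observed_in_ccds * TOTAL_GENES / CCDS_GENES


print(round(extrapolate_to_genome(60)))  # 107 mutated genes per tumor
print(round(extrapolate_to_genome(11)))  # 20 mutated CAN-genes per tumor
```

These values agree with the rounded figures of ~100 mutated genes and ~20 mutated CAN-genes per tumor.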

Several CAN-genes were components of the same pathway, as was the case with TGFBRII and SMADs, thus reducing the apparent complexity. Classification of the CAN-genes by literature mining and analysis of gene ontology databases revealed that two-thirds of them belonged to three major groups: cell adhesion proteins, signal transducers, and transcription factors (Figure 3).

Two-thirds of tumors had a mutation in a cell adhesion gene, and 40% of colorectal cancers had mutations in a metalloprotease-encoding gene. Genes involved in transcriptional regulation constituted 20% of the CAN-genes in both tumor types. Nearly half of breast cancer specimens had mutations in a zinc-finger transcription factor. Other groups included CAN-genes involved in transport, metabolism, intracellular trafficking, and RNA metabolism. Seven of the CAN-genes had no known function, and two-thirds had no previous connection to cancer.

Previous gene-family-oriented mutational analyses in tumors of the colon, breast, and brain were based on the notion that the chosen gene families were more likely to contain proto-oncogenes or tumor suppressor genes than the rest of the genome. Though this may be true, the CAN-gene set contains many genes not previously known or predicted to be involved in cancer, supporting the value of unbiased sequencing approaches. More powerful sequencing technologies will soon become available. These will enable sequencing of whole cancer genomes, including intergenic regions. They may also allow detection of alterations present in minor clones of the tumor cell population. Such studies may uncover interesting mutations, but interpreting them will be significantly more difficult than interpreting mutations in protein-encoding genes that are present in every neoplastic cell of the tumor (i.e., clonal).

Our study as well as other genome-wide studies currently underway should be looked upon as a foundation for future studies of cancer genes. In our study, 189 genes were shown to be somatically mutated more frequently than expected by chance alone whereas 12,836 genes were not. This core set of genes provides a starting point for multiple avenues of future research. It is important to realize that they are candidates, not bona fide cancer genes. As a first step, they should be subjected to mutational analyses in a large set of tumors of different stages to confirm their cancer gene status and reveal their role in tumor progression. These studies should be complemented with analyses of CAN-gene expression and methylation status in normal and tumor tissues. Second, the CAN-genes should be subjected to functional studies. Because some breast and colorectal candidate cancer genes are likely to be mutated in other malignancies, it could be worthwhile to perform mutational analyses of the CAN-gene set in other tumor types. All such studies can prioritize genes by their CaMP scores, starting with those with the highest CaMP score and working down. Further genetic, epigenetic, functional, and translational studies of these genes will hopefully lead to new insights into tumorigenesis and improved diagnostic and therapeutic approaches.

References and Further Readings

Bardelli A, Velculescu VE. Mutational analysis of gene families in human cancer. Current Opinion in Genetics and Development 15(1):5-12, 2005.

Birney E, Andrews D, Caccamo M, Chen Y, Clarke L, Coates G, Cox T, Cunningham F, Curwen V, Cutts T, et al. Ensembl 2006. Nucleic Acids Research 34(Database Issue):D556-D561, 2006.

Diehl F, Li M, Dressman D, He Y, Shen D, Szabo S, Diaz LA Jr, Goodman SN, David KA, Juhl H, Kinzler KW, Vogelstein B. Detection and quantification of mutations in the plasma of patients with colorectal tumors. Proceedings of the National Academy of Sciences USA 102(45):16368-16373, 2005.

Druker BJ, Talpaz M, Resta DJ, Peng B, Buchdunger E, Ford JM, Lydon NB, Kantarjian H, Capdeville R, Ohno-Jones S, Sawyers CL. Efficacy and safety of a specific inhibitor of the BCR-ABL tyrosine kinase in chronic myeloid leukemia. New England Journal of Medicine 344(14):1031-1037, 2001.

Hochberg Y, Benjamini Y. More powerful procedures for multiple significance testing. Statistics in Medicine 9(7):811-818, 1990.

Sjöblom T, Jones S, Wood LD, Parsons DW, Lin J, Barber TD, Mandelker D, Leary RJ, Ptak J, Silliman N, et al. The consensus coding sequences of human breast and colorectal cancers. Science 314(5797):268-274, 2006.

Vogelstein B, Kinzler KW. Cancer genes and the pathways they control. Nature Medicine 10(8):789-799, 2004.

[Discovery Medicine, 7(37):13-19, 2007]


