Efficient use of the large data sets generated by gene expression microarray experiments requires computerized data analysis approaches (1, 2). In this chapter we briefly describe and illustrate two broad families of commonly used data analysis methods: class discovery and class prediction methods. Class discovery, also referred to as clustering or unsupervised learning, has the goal of partitioning a set of objects (either the genes or the samples) into groups that are relatively similar, in the sense that objects in the same group are more alike than objects in different groups (3, 4). A typical application is to generate hypotheses about novel disease subtypes (5, 6). Class prediction, also referred to as classification or supervised learning, has the goal of determining whether an object (usually a sample, but sometimes a gene) belongs to a certain class (7, 8). A typical application is classification of patients into existing disease subtypes or prognostic classes (9, 10) using gene expression information.
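
The distinction between the two families can be made concrete with a minimal sketch in Python. It is an illustration under stated assumptions, not the method of any cited study: the toy profiles are invented, and the function names (`discover_classes`, `train_predictor`, `predict`) are hypothetical. A k-means-style partition stands in for class discovery and a nearest-centroid rule stands in for class prediction; many other algorithms could fill either role.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(profiles):
    """Attribute-wise mean of a non-empty list of profiles."""
    return [sum(column) / len(profiles) for column in zip(*profiles)]

def discover_classes(profiles, k, n_iter=10):
    """Class discovery: partition profiles into k groups without any labels.

    Assumption: k-means-style updates with farthest-first initialization,
    chosen only to keep the sketch deterministic.
    """
    centers = [profiles[0]]
    while len(centers) < k:
        # Next center: the profile farthest from all centers chosen so far.
        centers.append(max(profiles,
                           key=lambda p: min(distance(p, c) for c in centers)))
    assignment = []
    for _ in range(n_iter):
        # Assign each profile to its nearest center, then recompute centers.
        assignment = [min(range(k), key=lambda j: distance(p, centers[j]))
                      for p in profiles]
        for j in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == j]
            if members:
                centers[j] = centroid(members)
    return assignment

def train_predictor(profiles, labels):
    """Class prediction, training step: one centroid per known class."""
    return {label: centroid([p for p, l in zip(profiles, labels) if l == label])
            for label in sorted(set(labels))}

def predict(model, profile):
    """Class prediction: assign a new profile to the nearest class centroid."""
    return min(model, key=lambda label: distance(profile, model[label]))

# Invented toy data: six samples measured on three genes.
samples = [[2.1, 1.9, -0.2], [1.8, 2.2, 0.1], [2.0, 2.0, 0.0],
           [-1.9, -2.1, 0.2], [-2.2, -1.8, -0.1], [-2.0, -2.0, 0.0]]
known_labels = ["A", "A", "A", "B", "B", "B"]

clusters = discover_classes(samples, k=2)       # uses no labels
model = train_predictor(samples, known_labels)  # uses known labels
new_label = predict(model, [1.7, 2.3, 0.0])     # classify an unseen sample
```

Here `discover_classes` sees only the expression profiles and proposes groups, whereas `train_predictor` needs the class of every training sample before `predict` can assign a new one.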

In our discussion, 'sample' refers generically to any type of biological material that is processed and hybridized to a chip. For example, in a study of breast cancers, the samples could represent RNA isolated from breast cancer tissues biopsied from a group of women. 'Gene' is used loosely to refer to the features on the arrays, such as sequences from genes or ESTs, single oligonucleotides in Agilent arrays, oligonucleotide sets in Affymetrix arrays and so forth. 'Object' refers to the entity being clustered, and can be either a gene or a sample, as the same algorithms can often be applied symmetrically to both. 'Attribute' is any feature of the object being clustered. If we cluster samples, genes are typically attributes, and vice versa. 'Phenotype' refers to any clinical or biological characteristic of a sample or the person or organism from which the sample is derived, such as disease subtype, age, gender, or time to disease progression.
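
The symmetry between objects and attributes can be sketched as follows. This is a hedged illustration with invented numbers: the matrix layout (rows as genes, columns as samples), the choice of Euclidean distance, and the helper names `transpose` and `pairwise_distances` are all assumptions made for the example.

```python
import math

def transpose(matrix):
    """Swap rows and columns, so that attributes become objects."""
    return [list(row) for row in zip(*matrix)]

def pairwise_distances(objects):
    """Euclidean distance between every pair of objects (one object per row)."""
    return [[math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
             for b in objects]
            for a in objects]

# Invented toy matrix: 4 genes (rows) measured on 3 samples (columns).
expression = [
    [1.2, 0.8, -1.0],
    [1.0, 1.1, -0.9],
    [-0.5, -0.7, 1.3],
    [0.1, 0.0, -0.1],
]

# Objects are genes; the samples are their attributes.
gene_distances = pairwise_distances(expression)

# Objects are samples; the genes are their attributes.
sample_distances = pairwise_distances(transpose(expression))
```

The same distance routine serves both analyses; only the orientation of the matrix changes. Any clustering algorithm that works from such a distance matrix can therefore be applied to genes or to samples.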

To demonstrate the clustering methods in this chapter, we use a gene expression microarray dataset published by Hedenfalk and colleagues (11) and including samples from 22 breast cancers, of which seven are from patients with known BRCA1 mutations, eight from patients with known BRCA2 mutations, and seven are sporadic. Complementary DNA (cDNA) labeled with Cy3 or Cy5 was obtained from each tumor sample and hybridized to two-channel cDNA arrays which included spots for 3226 genes and ESTs. The reference sample was cell line MCF-10, a nontumorigenic breast cell line. Data from this study are available at http://www.nhgn.

Statistical computing environments typically offer a rich set of alternatives for clustering and classification. In particular, the free and open source computing environment R (12) and the associated Bioconductor project (13) cover most standard tools and a wide variety of developmental tools, and offer the flexibility to implement custom solutions. A range of free and open source tools can be accessed via the website The site maintains a catalog of both free and commercial microarray data analysis software.
