Dynamic Graphics for Data Analysis Essay

1. INTRODUCTION

Dynamic graphical methods have two important properties: direct manipulation and instantaneous change. The data analyst takes an action through manual manipulation of an input device and something happens, virtually instantaneously, on a computer graphics screen. Figure 1 shows an example in which a dynamic method is used to turn point labels on and off. The data analyst moves a rectangle over the scatterplot by moving a mouse; the figure shows the rectangle in a sequence of positions. When the rectangle covers a point, its label appears, and when the rectangle no longer covers the point, its label disappears.
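The label-toggling behavior described for Figure 1 can be sketched in a few lines. This is only an illustration under stated assumptions: the `Rect` class, the `visible_labels` function, and the toy points are hypothetical and are not taken from any system described in the paper. A real implementation would redraw the screen on every mouse event.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in data coordinates (hypothetical helper)."""
    xmin: float
    xmax: float
    ymin: float
    ymax: float

    def covers(self, x: float, y: float) -> bool:
        return self.xmin <= x <= self.xmax and self.ymin <= y <= self.ymax


def visible_labels(points, labels, rect):
    """Return the labels of the points currently covered by the rectangle."""
    return [lab for (x, y), lab in zip(points, labels) if rect.covers(x, y)]


# Toy data, made up for illustration.
points = [(1.0, 1.0), (2.0, 3.0), (4.0, 2.0)]
labels = ["a", "b", "c"]

# Simulate the mouse dragging the rectangle through three positions.
for rect in (Rect(0, 1.5, 0, 1.5), Rect(1.5, 2.5, 2.5, 3.5), Rect(5, 6, 5, 6)):
    print(visible_labels(points, labels, rect))
```

Because the visible labels are recomputed from the rectangle's current position alone, a label disappears as soon as the rectangle leaves its point, with no extra bookkeeping.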

1.1 The Importance of Dynamic Methods

In the future, dynamic graphical methods will be ubiquitous. There are two reasons. One is that the addition of dynamic capabilities to the methodology of traditional static data display provides an enormous increase in the power of graphical methods to convey information about data: wholly new methods become possible, and many capabilities that are cumbersome and time consuming in a static environment become simple and fast (Tukey and Tukey, 1985). Huber (1983) aptly describes the importance of the dynamic environment: “We see more when we interact with the picture, especially if it reacts instantaneously, than when we merely watch.” This does not mean that current static methods will be discarded, but rather that there will be a much richer collection of methods. The second reason is that the price and availability of powerful statistical computing environments are rapidly evolving in a direction that will permit the use of dynamic graphics (McDonald and Pedersen, 1985). Thus, it seems likely that the methods described in this paper will be standard methodology in the future. Furthermore, because the number of people who have so far been involved in research in dynamic methods is relatively small, the development of new dynamic methods should accelerate as the appropriate computing environments become more widely available.

1.2 Two Early Systems

A recognition of the potential of direct-manipulation, real-time graphics for data analysis goes back as far as the early 1960s when computer graphics systems [...] by a knob. Points could be deleted by positioning a cursor on them. The system demonstrated that dynamic graphical methods had the potential to be important tools for data analysis.

Another early system was PRIM-9 (Fisherkeller, Friedman and Tukey, 1975), a set of dynamic tools for projecting, rotating, isolating, and masking multidimensional data in up to nine dimensions. Rotation was the central operation; this dynamic method allows the data analyst to study three-dimensional data by showing the points rotating on the computer screen. Isolation and masking were features that allowed point deletion in a lasting or in a transient way. PRIM-9 was an influential system; many subsequent systems were modeled after it, and during the 1970s dynamic graphics and PRIM operations were nearly synonymous. (In fact, in the rush to implement PRIM systems, Fowlkes' ideas were nearly forgotten.) As the reader will see from the descriptions to follow and their origins, it was not until the early 1980s that significantly different methods would begin appearing; this was stimulated in large measure by new computing techniques coming from computer scientists.

1.3 Contents of the Paper

A variety of dynamic graphical methods are described and illustrated in Section 2 of this paper. Sections 2.1 to 2.6 cover identification, deletion, linking, brushing, scaling, and rotation. Section 2.7 describes in a general way what many of the methods are doing (providing dynamic parameter control) and thereby opens the door to a large collection of potential methods. Computing issues are discussed in Section 3 of the paper; hardware and software considerations tend to be much more tightly bound to the success of dynamic methodologies than is the case for static graphical methods. Section 4 of the paper is a brief summary and discussion.

2. METHODS

2.1 Identification of Labeled Data

Identification has two directions. Suppose we have a collection of elements on a graph (e.g., points) and each element has a name or label. In one direction of identification we select a particular element and then find out what its label is; we will call this labeling. In the other direction, we select a label and then find the location on the graph of the element corresponding to this label. We will call this locating. Identification tasks, although seemingly mundane, are so all-pervasive that simple ways of performing them are of enormous help to a data analyst.

Labeling Points. Suppose x_i and y_i, for i = 1 to n, are measurements of two variables that have labels. Figure 2 shows an example. The data are measurements of the brain weights and body weights of a collection of animal species (Crile and Quiring, 1940). Biologists study the relationship between these two variables because the ratio (brain weight)/(body weight)^(2/3) is a rough measure of intelligence (Gould, 1979; Jerison, 1955). In Figure 2, the data are graphed on a log scale and the axes are scaled so that y = 2x/3 is a 45° line; thus, 45° lines are contours of constant intelligence under this measure. Each point on the graph has a label: the name of the species. In analyzing bivariate data with labels, we almost always want to know the labels for all or some of the points of the scatterplot.
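The claim that lines parallel to y = 2x/3 are contours of constant intelligence follows from taking logarithms of the ratio. The short derivation below is a sketch; the symbols b (brain weight), w (body weight), and c (the value of the measure) are introduced here for illustration, and it is assumed that y is log brain weight and x is log body weight.

```latex
\begin{align*}
c &= \frac{b}{w^{2/3}}
  && \text{(intelligence measure)} \\
\log b &= \tfrac{2}{3}\,\log w + \log c
  && \text{(take logarithms)} \\
y &= \tfrac{2}{3}\,x + \log c
  && \text{(with } x = \log w,\; y = \log b\text{)}
\end{align*}
```

For fixed c this is a line of slope 2/3 with intercept log c, so all species with the same value of the measure fall on one such line, and a line with a larger intercept corresponds to a larger value of the measure. Scaling the axes so that slope 2/3 is drawn at 45° turns these contours into 45° lines.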

For example, in Figure 2 the point graphed by a filled circle is interesting because it lies on a 45° line that is above the lines for all other points, which means that this species has a higher intelligence measure than all others. Which is it? With static displays, finding out the labels of points on scatterplots is a cumbersome task. It is not possible to routinely show the labels of all points because overplotting frequently causes an uninterpretable mess; this would be the case in Figure 2. With dynamic displays, getting label information is simple because the data analyst can turn labels on and off very rapidly. One method for doing this has been illustrated in Figure 1. It turns out that the label for the interesting point in Figure 2 is, not surprisingly, “modern man.”

Locating Points. In analyzing labeled points, we often want to go in the other direction and locate a point with a certain name. For example, we might have asked where modern man is in Figure 2. This task also can be accomplished in a particularly efficient way by using dynamic graphics. The data analyst can press a button on a mouse to bring up a menu on the screen that shows the labels, move the mouse to select the label, and then release the button (Becker and Cleveland, 1987); this can result in the point being highlighted, as modern man is highlighted in Figure 2.
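Both directions of identification reduce to simple lookups once the data are held in parallel lists. The sketch below is illustrative only: the function names are hypothetical, and the labels and weights are made-up placeholders, not the data of Figure 2.

```python
import math


def log_measure(brain, body):
    """Log of brain / body**(2/3); a larger value means a higher contour line."""
    return math.log10(brain) - (2.0 / 3.0) * math.log10(body)


def locate(labels, wanted):
    """Locating: return the index of the point whose label is `wanted`."""
    return labels.index(wanted)


# Made-up placeholder values, NOT the species data graphed in Figure 2.
labels = ["species A", "species B", "species C"]
brain = [10.0, 1000.0, 100.0]
body = [1000.0, 1000.0, 1_000_000.0]

scores = [log_measure(b, w) for b, w in zip(brain, body)]

# Labeling: find the point on the highest contour, then read off its label.
top = max(range(len(scores)), key=scores.__getitem__)
print(labels[top])

# Locating: start from a label chosen on a menu and find the point to highlight.
print(locate(labels, "species C"))
```

In an interactive system the index returned by `locate` would drive the highlighting of the corresponding point on the screen.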

Locating Different Data Sets: Alternagraphics. In graphing two-variable data we often encounter another type of identification problem: the quantitative information is partitioned into subsets, and we want to locate the different subsets on the plot. For example, in Figure 2 the data can be categorized into five subsets: primates, nonprimate mammals, fish, birds, and dinosaurs. The subsets are shown in Figure 3. Another way to think of this is that there is a third variable, a categorical one, that we also want to show.

Subsets of the quantitative information on a graph can be enormously varied. Here are just three examples: 1) each subset is a set of points, as in Figure 3; 2) each subset is a collection of contours of a third variable as a function of the two axis variables; 3) the first subset is a scatterplot of points and the remaining subsets are regression curves of y on x resulting from several different models.

Comparing subsets can be surprisingly tricky when using static graphics. Part of the problem is that we want to be able to do more than just figure out to which subset each element of the graph belongs; we want to perceive each subset as a whole, mentally filtering out the others. With static graphics, many methods have been suggested (Cleveland, 1985): use different plotting symbols, use different colors, or connect points of the same subset by lines.

Often, none of these methods works, because the overlap of elements of the graph makes it impossible to distinguish the different subsets. In such cases the only solution is to use juxtaposed panels, as illustrated in Figure 3. The drawback to such a display is that we cannot compare the locations of different subsets as effectively as when all of the data are on the same panel.

A dynamic method that is often effective for identifying subsets is alternagraphics (Tukey, 1973). At a given moment in time the viewer can identify some of the subsets, and the selection of identified subsets can be changed quickly. There are many ways of implementing this idea. One is to cycle through the subsets, showing each for a short time period. More explicitly, a graph of the first subset appears on the screen for, say, 1 sec, then it is replaced for 1 sec by a graph of the second subset, and so forth until we get to the last subset. Then the process repeats. Of course, the subsets are all shown on common axes so that the scale of the pictures remains identical as various subsets are shown. Another technique is to show all of the data at all times and have the cycling consist of a highlighting of one subset at each stage. A third technique is to provide the data analyst with the capability of turning any subset on or off with a simple and rapid action, such as the use of a mouse to point to a subset name on a menu and then clicking a button; with such a technique the analyst can rapidly get any panel of Figure 3 (Donoho, Donoho and Gasko, 1985).

2.2 Deletion

Another fundamental operation that can be easily carried out by using dynamic graphics is deleting points from a graph. Figure 4 illustrates one simple use of deletion. A scatterplot is made and there is an outlier that causes the remaining points on the graph to be crammed into such a small region that their resolution is ruined; the analyst removes the point by touching it with a cursor, and after the deletion the graph is automatically rescaled and redrawn on the screen. For example, Fowlkes (1971) used this dynamic deletion of outliers for probability plots; after points were selected for deletion, the expected order statistics of the reduced sample were recomputed automatically and the graph redrawn.

Deletion is actually a very general concept that can enter dynamic graphical methods in many ways. Its basic purpose is to eliminate certain graphical elements so that we can better study the remaining elements. For example, the outlier deletion lets us focus more incisively on the remaining data, and in alternagraphics, subsets can be temporarily deleted to allow better study of the remaining subsets.
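Two of the operations just described, cycling through subsets and rescaling after a deletion, are easy to sketch. The code below is a minimal illustration and not the implementation of any system cited here; the function names and the toy points are assumptions made for the example, and only the subset names come from the text.

```python
from itertools import cycle, islice


def alternagraphics(subset_names, n_frames):
    """Return the subset shown in each of n_frames frames, cycling in order."""
    return list(islice(cycle(subset_names), n_frames))


def axis_limits(points, deleted=frozenset()):
    """Recompute the x and y ranges from the points that have not been deleted.

    Assumes at least one point remains after deletion.
    """
    kept = [p for i, p in enumerate(points) if i not in deleted]
    xs, ys = zip(*kept)
    return (min(xs), max(xs)), (min(ys), max(ys))


# Cycling: each frame would be drawn on common axes for, say, 1 sec.
print(alternagraphics(["primates", "fish", "birds"], 5))

# Deletion: the last point is an outlier that squeezes the others together.
points = [(1.0, 2.0), (2.0, 3.0), (3.0, 2.5), (100.0, 90.0)]
print(axis_limits(points))               # ranges dominated by the outlier
print(axis_limits(points, deleted={3}))  # ranges after touching the outlier
```

Keeping deletion as a set of indices rather than removing rows makes the transient case trivial: restoring a point only means removing its index from the set.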

Other applications of deletion will be given in later sections.

2.3 Linking

Suppose we have n measurements on p variables and that scatterplots of certain pairs of the variables are made. A linking method enables us to visually link corresponding points on different scatterplots. For example, suppose there are four variables and that we graph one pair of them on a first scatterplot and the other pair on a second. To link points on the two scatterplots means to see by some visual method that the point for a given observation on the first plot corresponds to the point for the same observation on the other plot.

To illustrate this, consider the Anderson (1935) iris data made famous by Fisher (1936). There are 150 measurements of four variables: sepal length, sepal width, petal length, and petal width. Two scatterplots are shown in Figure 5. The data have been jittered, that is, small amounts of noise added, to avoid the overlap of plotted symbols on the graph. Each scatterplot has two clusters, and we immediately find ourselves wanting to know if there is some correspondence between the clusters of separate plots.
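Linking and jittering can both be sketched with parallel lists in which position i always refers to observation i. The example below is a hedged illustration: the function names are hypothetical, the uniform noise is one assumed choice of jitter, and the coordinates are made up rather than taken from the iris data.

```python
import random


def jitter(values, amount, seed=0):
    """Add uniform noise in [-amount, amount] to each value."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]


def selected_indices(plot, xmin, xmax, ymin, ymax):
    """Indices of observations whose point on `plot` lies inside the rectangle."""
    return [i for i, (x, y) in enumerate(plot)
            if xmin <= x <= xmax and ymin <= y <= ymax]


# Two scatterplots of the same four observations; entry i of each list
# belongs to observation i. Values are made up for illustration.
plot_a = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.1, 4.8)]
plot_b = [(9.0, 2.0), (8.8, 2.1), (3.0, 7.0), (3.2, 7.1)]

# Select a cluster on the first plot, then find the same observations
# on the second plot through their shared indices.
chosen = selected_indices(plot_a, 0.0, 2.0, 0.0, 2.0)
linked = [plot_b[i] for i in chosen]
print(chosen, linked)

# Jitter the x coordinates slightly to reduce overlap of plotted symbols.
xs = jitter([p[0] for p in plot_a], amount=0.05)
```

The shared index is the whole mechanism: any visual device for linking (lines, symbols, highlighting) is a way of displaying that index.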

Linking is a concept that has long existed in the development of static displays (Chambers, Cleveland, Kleiner and Tukey, 1983; Diaconis and Friedman, 1980; Tufte, 1983). One method for linking is the M and N plot of Diaconis and Friedman (1980); lines are drawn between corresponding points on the two scatterplots. Another method is to use a unique plotting symbol for each point (Chambers, Cleveland, Kleiner and Tukey, 1983) on a particular plot and to use the same symbol for corresponding observations on different plots. A third method is the scatterplot matrix, all pairwise scatterplots arranged in a rectangular array, which arose, in part, because it provides a certain amount of linking. An example is Figure 6, which shows the iris data.

To maximize the resolution of the plotted points, scale information is put inside the panels of the off diagonal of the matrix; the labels are the variable names and the numbers show the ranges of the variables. Consider the cluster to the northwest in the sepal length and width panel. Does this cluster correspond to one of the two clusters in the petal length and width panel? By scanning horizontally from the sepal panel to an intermediate panel and then vertically to the petal panel, we can see that the top half or so of the northwest sepal cluster corresponds to the top half or so of the northeast petal cluster. By scanning vertically from the sepal panel to an intermediate panel and then horizontally to the petal panel, we can see that the left half or so of the northwest sepal cluster corresponds to the bottom half or so of the northeast petal cluster. The union of these two scans shows that most of the northwest sepal cluster corresponds to most of the northeast petal cluster; it is a good guess that the remaining pieces of the clusters also correspond.
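The question answered by scanning across panels, namely how much of one cluster falls in a given cluster of another panel, can also be stated as a cross-tabulation over observations. The sketch below assumes that each observation has already been assigned a cluster name in each panel; the assignments shown are made up for illustration and are not derived from the iris data, and the function names are hypothetical.

```python
from collections import Counter


def correspondence(clusters_a, clusters_b):
    """Count observations for each (cluster in panel A, cluster in panel B) pair."""
    return Counter(zip(clusters_a, clusters_b))


def share(table, a, b):
    """Fraction of panel-A cluster `a` that lies in panel-B cluster `b`."""
    total = sum(n for (ka, _), n in table.items() if ka == a)
    return table[(a, b)] / total if total else 0.0


# Made-up cluster assignments for six observations; entry i of each list
# refers to the same observation.
sepal_cluster = ["NW", "NW", "NW", "SE", "SE", "SE"]
petal_cluster = ["NE", "NE", "SW", "SW", "SW", "SW"]

table = correspondence(sepal_cluster, petal_cluster)
print(table[("NW", "NE")])        # observations in both clusters
print(share(table, "NW", "NE"))   # share of the NW cluster found in NE
```

A share near 1 in both directions would indicate that the two clusters consist of essentially the same observations.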
