Visual Cognition and Scene Recognition
Our results demonstrate that scene recognition is performed in a manner strikingly similar to object recognition. Subjects took incrementally longer to recognize novel views of familiar scenes rotated in depth. Furthermore, the cost (in response latency) of recognizing novel views was best predicted by their angular distance in three-dimensional space from the nearest trained viewpoint-specific representation in memory. These results closely parallel those reported by Tarr (1995; Tarr & Pinker, 1989) and suggest that representations of inter-object relations may be narrowly tuned around experienced and encoded viewpoints, in the same manner that representations of objects are.
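This account amounts to a simple quantitative prediction: recognition latency should grow with the angular distance between a test view and the nearest stored view. The following minimal sketch (in Python) expresses that prediction; the base latency and per-degree rate are hypothetical placeholder values, not parameters fitted to our data.

    def angular_distance(a, b):
        """Shortest angular distance (deg.) between two views on the viewing circle."""
        d = abs(a - b) % 360
        return min(d, 360 - d)

    def predicted_latency(test_view, trained_views, base_rt=800.0, rate=2.5):
        """Viewpoint-dependent prediction: a base latency plus a cost that grows
        linearly with the angular distance to the nearest trained view.
        base_rt (ms) and rate (ms/deg.) are illustrative placeholders."""
        nearest = min(angular_distance(test_view, v) for v in trained_views)
        return base_rt + rate * nearest

    # A hypothetical scene trained from views at 0 and 135 deg.:
    # predicted cost grows with distance from whichever view is nearer.
    for test in (0, 45, 90, 135, 180):
        print(test, predicted_latency(test, trained_views=[0, 135]))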
Our results also suggest that scene recognition is achieved through the alignment of view-specific, three-dimensional representations (e.g., Ullman, 1989, 1996). Performance as measured by response latencies was a function of viewpoint, but error rates were uniformly low. These results diverge from published work in object recognition suggesting that object recognition may be achieved through view interpolation between essentially two-dimensional, view-specific representations (e.g., Bülthoff, Edelman, & Tarr, 1995; Logothetis & Pauls, 1995). In general, view-interpolation models of recognition predict uniform-cost generalization to views which lie along the minor viewing arc between closely spaced study views, but incremental-cost generalization to views which lie along the major arc. For example, if the study views are labeled 0 deg. and 75 deg., then generalization to 30 deg. will be easier than generalization to 105 deg. (even though each is an equal distance from a study view). Views lying on the minor arc between study views are by convention termed interpolation views, while views lying on the major arc are termed extrapolation views (e.g., Bülthoff & Edelman, 1992). Our results in scene recognition showed no differences between interpolation and extrapolation, suggesting that recognition of both classes of views was achieved similarly: by alignment.

We are currently attempting to evaluate why our results in scene recognition diverge from object recognition in these important respects but are similar otherwise. Several possibilities exist. A majority of view-interpolation results in object recognition were obtained following the temporally proximate presentation of spatially proximate study views. In contrast, in our scene recognition experiments, the study views were often separated from each other by temporal intervals of up to several seconds. It is possible that the temporally proximate presentation of object views in perception is important for the neuropsychological system to link them in memory (e.g., Miyashita, 1993).

In a second series of experiments we have investigated these same issues using a slightly different set of stimuli, which are in a sense an abstraction of the scene domain described above. Scenes were defined by configurations of colored dot arrays displayed in depth on a computer monitor (see attached figure) using linear perspective. On a given trial, two study views (0 deg. and 75 deg.), separated by 75 deg. of viewing angle, were presented in succession over three iterations; each view was presented for one second. This manner of presentation resulted in the experience of visual apparent motion between the study views in three-dimensional space (e.g., Shepard & Judd, 1976) on a majority of trials. Following study, subjects made an old-new recognition judgment on a test view of dots. Recognition was tested for views around the viewing circle in increments of 15 deg. (24 test views). The results of this experiment support view-interpolation models of recognition (e.g., Bülthoff, Edelman, & Tarr, 1995). The presence of apparent motion between study views, however, creates an interesting possibility: perhaps apparently experienced views (or "virtual views") are recognized as easily as actually experienced views.
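The contrasting predictions at issue can be stated concretely for this design. The sketch below labels the 24 test views relative to the 0 deg. and 75 deg. study views and tabulates the generalization cost each account predicts; the cost units are arbitrary (degrees of angular distance), and the cost assignments are our illustration, not fitted values.

    def angular_distance(a, b):
        """Shortest distance (deg.) between two views on the viewing circle."""
        d = abs(a - b) % 360
        return min(d, 360 - d)

    def classify(test, lo=0, hi=75):
        """Label a test view relative to study views at `lo` and `hi` deg."""
        t = test % 360
        if t in (lo, hi):
            return "study"
        return "interpolation" if lo < t < hi else "extrapolation"

    STUDY = (0, 75)

    for test in range(0, 360, 15):  # the 24 test views, in 15-deg. steps
        d = min(angular_distance(test, v) for v in STUDY)
        kind = classify(test)
        # Alignment: cost scales with distance to the nearest study view everywhere.
        # View interpolation (and the virtual-views variant, which treats views
        # along the apparent-motion path as studied): flat cost on the minor arc,
        # distance-scaled cost on the major arc.
        interp = 0 if kind != "extrapolation" else d
        print(f"{test:3d} deg.  {kind:13s}  alignment: {d:2d}   interpolation: {interp:2d}")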
In a first attempt to address this issue we conducted an experiment which served as a perceptual control. In this experiment, subjects observed a study sequence of 11 views oscillating through 75 deg. of viewing angle. Following study, old-new recognition was tested as before. The results generally paralleled those of the apparent-motion experiment: the cost of recognizing interpolation views was uniformly low (not surprising, since these test views were actually perceived at study), but recognition of extrapolation views was achieved with incremental cost. Perhaps, then, the experience of apparent motion creates a representation in memory which includes apparently perceived views. A weaker explanation, however, is that merely presenting study views proximate in space and time is sufficient for recognition to proceed by view interpolation. Possibly, the temporally proximate presentation of scene views in perception is sufficient for the neuropsychological system to link them in visual memory (e.g., Miyashita, 1993). This alternative hypothesis was investigated further by systematically varying the conditions under which views were studied. In one control experiment, study views were shown in temporal proximity, but visual apparent motion between them was eliminated. The previously evinced uniform-cost generalization in the interpolation range disappeared; instead, incremental-cost generalization was in evidence. The lack of visual apparent motion at study evidently affected generalization in the interpolation range at test. We believe these results suggest that the experience of visual apparent motion between study views may be sufficient to create views in memory which lie along the impleted path of motion. That is, memory for visual motion (e.g., Blake, Cepeda, & Hiris, in press) may allow the visual system to span the intervening angular distance, permitting uniform generalization to unseen views which lie along the path of motion. This possibility is intriguing and may further demonstrate the functional equivalence of real and apparent motion. A manuscript reporting these results is in preparation.

In further studies we are planning experiments which may be diagnostic in identifying the specific processes used in such visual reasoning tasks. These experiments will revolve around the issue of whether the pose of a scene (that is, the position from which a scene is being viewed, relative to the view in memory) is computed in visual recognition: during recognition, does the visual system compute the transformation in three-dimensional space necessary to bring the novel view of the scene into correspondence with the represented view? In general, view-interpolation models do not require the computation of pose during recognition, whereas some implementations of alignment models do. Furthermore, at present, no definitive evidence exists to suggest that representations (e.g., Cooper, 1976) or reference frames (e.g., Robertson, Palmer, & Gomez, 1987) are transformed during scene recognition. Our approach to this issue is threefold. We will examine (a) whether the pose of scenes can be computed, (b) whether pose information can allow for the preparation of an upcoming stimulus, and (c) whether pose is computed in the act of recognition. The first set of experiments will examine whether the pose of scenes can be computed at all.
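In computational terms, "computing pose" can be illustrated with a toy alignment calculation: recovering the rotation that brings a stored configuration of scene landmarks into correspondence with a test view. The sketch below is a minimal two-dimensional illustration using the closed-form least-squares angle estimate; the landmark coordinates are hypothetical, and the estimator is our illustrative assumption, not a claim about the algorithm the visual system actually uses.

    import numpy as np

    def estimate_pose(stored, test):
        """Recover the rotation (deg.) aligning a stored view of scene landmarks
        with a test view, via the closed-form least-squares angle estimate."""
        P = stored - stored.mean(axis=0)   # center both configurations
        Q = test - test.mean(axis=0)
        num = np.sum(P[:, 0] * Q[:, 1] - P[:, 1] * Q[:, 0])
        den = np.sum(P[:, 0] * Q[:, 0] + P[:, 1] * Q[:, 1])
        return np.degrees(np.arctan2(num, den))

    # Example: five hypothetical landmark positions, rotated by 30 deg.
    rng = np.random.default_rng(0)
    stored = rng.uniform(-1.0, 1.0, size=(5, 2))
    theta = np.radians(30.0)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    test = stored @ R.T
    print(round(estimate_pose(stored, test), 1))   # 30.0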
Several experiments in the study of inter-object relations in larger-scale spaces (e.g., Rieser, 1989; Shelton & McNamara, in press) have examined the computation of direction between objects as a function of imagined heading (e.g., "Imagine you are standing in front of your sofa and facing the television; now point to the refrigerator."). However, to our knowledge, no experiments exist in which subjects, after being shown a view of a familiar scene, have been asked to indicate explicitly the pose from which it is presented. After learning a scene from a single restricted view, subjects will be shown pictures of unfamiliar views and will indicate the position around the viewing circle from which each view was pictured. This experiment will be of interest for two reasons. First, are response latencies to compute pose systematically related to the absolute angular distance (the complexity of the transformation) between the view and the represented model? If they are, what is the "rate" (the slope of the response latency function) at which such a transformation is computed, and is this rate comparable to those observed in recognition tasks? Second, is there a systematic relationship between the error in estimating pose and the angular distance? This may indicate whether the process used in estimating pose scales with complexity, an indication of the dimensionality of the underlying representation (Ullman, 1989).

Another set of experiments will be designed to address whether subjects can make compensatory adjustments from memory to prepare for an upcoming stimulus. Studies with alphanumeric characters have demonstrated that when subjects are given the identity and the orientation of an upcoming stimulus, they can effect a transformation from memory to offset the cost of processing the stimulus in a noncanonical orientation (e.g., Cooper & Shepard, 1973). We are interested in whether subjects can perform similarly with multi-object scenes. Subjects will learn scenes from a single restricted view and will then participate in a recognition task. On each trial, subjects will be given a cue indicating the pose of the upcoming stimulus. Following the presentation of the pose cue, subjects will indicate their readiness for the test stimulus with a key press, and will then make an old-new judgment to the test stimulus. "Old" stimuli will be views of the familiar scene presented at the cued pose. If subjects prepare for the upcoming stimulus by mentally realigning a stored representation or by aligning an abstract reference frame, they should be in a state of readiness to compute an immediate match with the test view; response latencies to the test view should then be independent of the angular distance between the test view and the learned view. Furthermore, the time subjects take to prepare for the upcoming stimulus will be a further index of whether they undertake some form of alignment.
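Both questions reduce to properties of the latency-by-distance function: the "rate" is its slope, and successful preparation predicts a slope near zero. A minimal analysis sketch on simulated data follows; all latencies below are fabricated placeholders, and the 2.5 ms/deg. generating slope is an arbitrary choice rather than an empirical estimate.

    import numpy as np

    rng = np.random.default_rng(1)
    distance = np.tile(np.arange(15, 181, 15), 8)   # angular distances (deg.), 8 trials each

    # Hypothetical latencies (ms): unprepared trials scale with distance;
    # validly cued (prepared) trials do not. Placeholder values only.
    unprepared = 900 + 2.5 * distance + rng.normal(0, 40, distance.size)
    prepared = 900 + rng.normal(0, 40, distance.size)

    for label, rt in (("unprepared", unprepared), ("prepared", prepared)):
        # Least-squares fit: latency = intercept + slope * distance
        slope, intercept = np.polyfit(distance, rt, deg=1)
        print(f"{label:10s} slope: {slope:5.2f} ms/deg.  intercept: {intercept:6.0f} ms")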
A final set of experiments will address whether pose is actually computed in recognition. Robertson, Palmer, and Gomez (1987) demonstrated facilitation in responding to alphanumeric stimuli as a function of whether their orientation matches the orientation of previously presented stimuli. For example, a handedness judgment on an "R" presented at a 90 deg. clockwise rotation is faster if it is preceded by a set of similarly oriented "F"s than if it is preceded by a set of differently oriented "F"s. The response to the "F" stimuli positions the reference frame used in making the judgment in that particular pose. Because this reference frame is then aligned with the orientation of the subsequent stimulus ("R"), the judgment is facilitated. The question of interest is whether such "pose priming" can be demonstrated in scene recognition. Subjects will learn two scenes, each from a single restricted view. The scenes will consist of the same objects but in different spatial positions. Following learning, subjects will participate in a recognition task in which we will manipulate the relation between successive views. Of interest is whether recognition of unfamiliar views of one layout is facilitated when preceded by unfamiliar views of the other layout in a similar pose. To the extent that it is, this would constitute very strong evidence that some form of alignment (ostensibly of reference frames) is performed in recognition.

Several questions suggest themselves for inquiry. (a) What are the depth and the breadth of the affinities between inter-object and intra-object spatial representations? A primary approach is to test whether individuals process scenes in the same manner in which they process objects. To the extent that they do, tasks like handedness judgments (e.g., Shepard & Cooper, 1982) and left-right judgments (e.g., Jolicoeur, 1988) should be carried out similarly with scenes. Such evidence would allow us to articulate the representational similarities between these classes of representations. (b) Are inter-object relations represented in a manner which faithfully preserves the three-dimensional structure of the stimuli in the world? Evidence from object recognition indicates that object representations may be primarily two-dimensional (with some further embedded dimensional information) even when subjects have access to the three-dimensional structure of objects during perception (e.g., Rock & DeVita, 1987). The low error rates in scene recognition experiments (e.g., Diwadkar & McNamara, in press) suggest that subjects encode scenes in a format approaching three dimensions. (c) How does attention modulate the extraction of features to achieve visual recognition? Does the visual system selectively attend to key features in the image, and is the allocation of visual attention in such cases driven by endogenous or exogenous cues? Whether attention is driven endogenously or exogenously may help us understand whether represented views are compared to images or images are compared to represented views during recognition (e.g., Ullman, 1989). (d) How can we harness other behavioral measures of cognitive performance, such as eye-movement analysis (e.g., Carpenter & Just, 1978) or brain imaging (e.g., Carpenter et al., 1995)? Such methods will potentially allow us to converge further toward an understanding of the processes used in the representation and recognition of inter-object spatial relations.

REFERENCES
Biederman, I., & Gerhardstein, P. C. (1993). Recognizing depth-rotated objects: Evidence and conditions for three-dimensional viewpoint invariance. Journal of Experimental Psychology: Human Perception and Performance, 19, 1162-1182.
Biederman, I., & Gerhardstein, P. C. (1995). Viewpoint-dependent mechanisms in visual object recognition: Reply to Tarr and Bülthoff (1995). Journal of Experimental Psychology: Human Perception and Performance, 21, 1506-1514.
Blake, R., Cepeda, N. J., & Hiris, E. (in press). Memory for visual motion. Journal of Experimental Psychology: Human Perception and Performance.
Bülthoff, H. H., & Edelman, S. (1992). Psychophysical support for a two-dimensional view interpolation theory of object recognition. Proceedings of the National Academy of Sciences, USA, 89, 60-64.
Bülthoff, H. H., Edelman, S., & Tarr, M. J. (1995). How are three-dimensional objects represented in the brain? Cerebral Cortex, 5, 247-260.
Carpenter, P. A., & Just, M. A. (1978). Eye fixations during mental rotation. In J. W. Senders, D. F. Fisher, & R. A. Monty (Eds.), Eye movements and the higher psychological functions. Hillsdale, NJ: Lawrence Erlbaum Associates.
Carpenter, P. A., Just, M. A., Keller, T. A., Thulborn, K. R., Eddy, W. F., & Mockus, A. (1995, November). Brain imaging of mental rotation. Paper presented at the Annual Meeting of the Psychonomic Society, Los Angeles, CA.
Cooper, L. A. (1976). Demonstration of a mental analog of a physical rotation. Perception & Psychophysics, 19, 296-302.
Cooper, L. A., & Shepard, R. N. (1973). Chronometric studies of the rotation of mental images. In W. G. Chase (Ed.), Visual information processing (pp. 75-176). New York: Academic Press.
Diwadkar, V. A., & McNamara, T. P. (in press). Viewpoint dependence in scene recognition. Psychological Science.
Jolicoeur, P. (1988). Mental rotation and the identification of disoriented objects. Canadian Journal of Psychology, 42, 461-478.
Logothetis, N. K., & Pauls, J. (1995). Psychophysical and physiological evidence for viewer-centered object representations in the primate. Cerebral Cortex, 5, 270-288.
Marr, D., & Nishihara, H. K. (1978). Representation and recognition of the spatial organization of three-dimensional shapes. Proceedings of the Royal Society of London, B, 200, 269-291.
Miyashita, Y. (1993). Inferior temporal cortex: Where visual perception meets memory. Annual Review of Neuroscience, 16, 245-263.
Presson, C. C., DeLange, N., & Hazelrigg, M. D. (1989). Orientation specificity in spatial memory: What makes a path different from a map of the path? Journal of Experimental Psychology: Learning, Memory, and Cognition, 15, 887-897.
Rieser, J. J. (1989). Access to knowledge of spatial structure at novel points of observation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 15, 1157-1165.
Robertson, L. C., Palmer, S. E., & Gomez, L. M. (1987). Reference frames in mental rotation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 13, 368-379.
Rock, I., & DeVita, J. (1987). A case of viewer-centered object perception. Cognitive Psychology, 19, 280-293.
Shelton, A. L., & McNamara, T. P. (in press). Multiple views of spatial memory. Psychonomic Bulletin & Review.
Shepard, R. N., & Cooper, L. A. (1982). Mental images and their transformations. Cambridge, MA: MIT Press.
Shepard, R. N., & Judd, S. A. (1976). Perceptual illusion of rotation of three-dimensional objects. Science, 191, 952-954.
Tarr, M. J. (1995). Rotating objects to recognize them: A case study on the role of viewpoint dependency in the recognition of three-dimensional objects. Psychonomic Bulletin & Review, 2, 55-82.
Tarr, M. J., & Bülthoff, H. H. (1995). Is human object recognition better described by geon structural descriptions or by multiple views? Journal of Experimental Psychology: Human Perception and Performance, 21, 1494-1505.
Tarr, M. J., & Pinker, S. (1989). Mental rotation and orientation-dependence in shape recognition. Cognitive Psychology, 21, 233-282.
Ullman, S. (1989). Aligning pictorial descriptions: An approach to object recognition. Cognition, 32, 193-254.
Ullman, S. (1996). High-level vision: Object recognition and visual cognition. Cambridge, MA: MIT Press.