next up previous contents
Next: Robust Estimate Combination Up: Position Estimation Previous: Position Estimation

Estimation by Linear Combination

 

When a position estimate is required, an image is obtained and landmarks are extracted by selecting the local maxima of edge density, as described in Chapter 3. The extracted candidate landmarks must then be matched to the tracked landmarks in the database, which is accomplished using the procedure outlined in Chapter 4, omitting the steps that modify the database. That is, each landmark candidate l undergoes a local position adjustment to find the best match to each tracked landmark T, and the tracked landmark whose prototype is unambiguously closest to the encoding of l is selected as the match. Figure 5.1 shows the results of matching the landmarks observed in an image with the prototypes of a set of tracked landmarks (depicted previously in Figure 4.2(b)). The top row of intensity distributions corresponds to the landmarks observed in the image (after their positions were adjusted to optimise the matching), whereas the bottom row shows the prototypes to which the corresponding landmarks were matched. While at first glance the images appear identical, there are some very subtle differences in appearance, as well as undepicted differences in position in the image.
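The matching step described above can be sketched as a nearest-prototype search over subspace encodings. The following is a minimal illustration, not the thesis implementation: the function names, the distance-ratio ambiguity criterion, and its threshold are our assumptions.

```python
import numpy as np

def match_candidates(candidates, prototypes, E, ambiguity=0.8):
    """Match each candidate landmark to the tracked landmark whose
    prototype is unambiguously closest in the principal subspace.

    candidates: (n, d) normalised intensity vectors of candidate landmarks
    prototypes: (m, k) subspace encodings of the tracked-landmark prototypes
    E:          (d, k) principal directions of the prototype space
    ambiguity:  accept a match only if the best distance is clearly
                smaller than the runner-up (hypothetical criterion)
    """
    matches = []
    for c in candidates:
        e = E.T @ c                                    # encode the candidate
        dists = np.linalg.norm(prototypes - e, axis=1) # distance to each prototype
        order = np.argsort(dists)
        best, second = dists[order[0]], dists[order[1]]
        # keep only unambiguous matches; None marks a rejected candidate
        matches.append(order[0] if best < ambiguity * second else None)
    return matches
```

A candidate roughly equidistant from two prototypes is rejected as ambiguous rather than matched to the marginally closer one.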

Figure 5.1: Landmark-prototype matches for a single image: The top row of intensity distributions corresponds to the landmarks observed in the image (after their positions were adjusted to optimise the matching), whereas the bottom row shows the prototypes to which the corresponding landmarks were matched. While at first glance the images appear identical, there are some very subtle differences in appearance.

Once landmark matching is accomplished, we exploit an assumption of linear variation in the landmark characteristics with respect to camera pose in order to obtain a position estimate. If the landmark does vary smoothly and locally linearly with pose, then the encoding of the landmark observed from an unknown camera position will be a linear combination of the encodings of the tracked models, allowing us to interpolate between the sample positions in the database. We will later present a method for quantitatively evaluating the reliability of the linearity assumption, which will allow us to obtain a measure of confidence in the results. For the remainder of this section, let us assume that we have observed a single landmark l in the world and that it has been correctly matched to the tracked landmark T.

Let us define the encoding $\vec{e}_l$ of a landmark candidate l as the projection of the intensity distribution in the image neighbourhood represented by l into the subspace defined by the principal components decomposition of the set of all tracked landmark prototypes. We repeat Equation 4.2 with slightly different terminology here for reference:
\[ \vec{e}_l = E^T \vec{i}_l \]
where $\vec{i}_l$ is the local intensity distribution of l normalised to unit magnitude, and the columns of $E$ are the principal directions of the space defined by the tracked landmark prototypes.

Let us now define a feature vector $\vec{f}_l$ associated with a landmark candidate l as the principal components encoding $\vec{e}_l$, concatenated with two vector quantities: the image position $\vec{p}_l$ of the landmark, and the camera position $\vec{c}$ from which the landmark was observed:
\[ \vec{f}_l = [\,\vec{e}_l \;\; \vec{p}_l \;\; \vec{c}\,] \]
where, in this particular instance alone, the notation $[\,\vec{a} \;\; \vec{b}\,]$ represents the concatenation of the vectors $\vec{a}$ and $\vec{b}$.
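The feature-vector construction is a plain concatenation, sketched below; the argument names (encoding, image position, camera pose) are illustrative.

```python
import numpy as np

def feature_vector(e, p, c):
    """Concatenate a landmark's subspace encoding e, its image position p,
    and the camera pose c from which it was observed into one vector."""
    return np.concatenate([np.asarray(e), np.asarray(p), np.asarray(c)])
```

The training feature vectors of a tracked landmark can then be stacked column-wise to form the matrix used in the next step.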

Given the associated feature vector $\vec{f}_i$ for each landmark $l_i$ in the tracked landmark $T$, we construct a matrix $F$ as the composite matrix of all the $\vec{f}_i$, arranged in column-wise fashion, and then take the singular value decomposition of $F$,
\[ F = U \Sigma V^T, \]
to obtain $U$, the matrix whose columns are the eigenvectors of the tracked landmark $T$. Note that since a camera position $\vec{c}_i$ is a component of each $\vec{f}_i$, $U$ encodes camera position along with appearance. Now consider the feature vector $\vec{f}$ associated with l, the observed landmark for which we have no pose information - that is, the $\vec{c}$ component of $\vec{f}$ is undetermined. If we project $\vec{f}$ into the subspace defined by $U$ to obtain
\[ \vec{g} = U^T \vec{f}, \]
and then reconstruct $\vec{f}$ from $\vec{g}$ to obtain the feature vector
\[ \vec{f}\,' = U \vec{g}, \]
then the resulting reconstruction $\vec{f}\,'$ is augmented by a camera pose estimate that interpolates between the nearest eigenvectors in $U$. In practice, the initial value assigned to the undetermined camera pose $\vec{c}$ in $\vec{f}$ influences the resulting estimate, so we substitute the new value of $\vec{c}$ back into $\vec{f}$ and repeat the operation, reconstructing $\vec{f}\,'$ until the estimate converges to a steady state. This repeated operation, which constitutes the recovery of the unknown $\vec{c}$, is summarised in Figure 5.2.

Figure 5.2: The recovery operation. The unknown camera position $\vec{c}$ associated with a landmark l is recovered by repeatedly reconstructing the landmark feature vector in the subspace defined by the matching tracked landmark.

Formally,
\[ \vec{f}\,' = U U^T \vec{f}, \]
where $U U^T$ is the optimising scatter matrix of the feature vectors in $T$, and hence $\vec{f}\,'$ corresponds to the least-squares approximation of $\vec{f}$ in the subspace defined by the feature vectors of the tracked landmark $T$. Convergence is guaranteed by the fact that $U$ is column-orthonormal, and hence $U U^T$ is a symmetric, positive semi-definite orthogonal projection. Convergence is typically achieved in two or three iterations, as depicted in Figure 5.3.

Figure 5.3: Convergence properties for a single training set. The average convergence path, expressed in terms of distance from the steady-state, is plotted as a function of the number of iterations.

There are some subtleties to the estimation procedure that we have not yet acknowledged. First, since $\vec{c}$ is unknown at the outset, there is the question of what initial value to assign to $\vec{c}$ in $\vec{f}$. In practice, we set $\vec{c}$ to the mean of all camera poses $\vec{c}_i$ in $T$. One might choose instead to use an a priori pose estimate; we will consider this possibility when we present our experimental results in Chapter 6. Second, there is the question of how the camera pose $\vec{c}$ and the image position $\vec{p}$ should be weighted when constructing a feature vector. Ideally, one would scale $\vec{c}$ down to a tiny fraction of the magnitude of $\vec{e}$ in order to downplay the effect that $\vec{c}$ has on the subspace; if $\vec{c}$ plays too strong a role in the subspace, the reconstruction process becomes ineffective. As for the image position, one can scale $\vec{p}$ arbitrarily in order to weight its importance relative to $\vec{e}$. Such a weighting determines the degree to which we favour image geometry over appearance. We will consider the effects of varying the weights of both $\vec{c}$ and $\vec{p}$ in Chapter 6.
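The weighting described above amounts to scaling the position and pose components before concatenation, for instance as below. The weight values shown are illustrative placeholders, not those used in the thesis.

```python
import numpy as np

def weighted_feature(e, p, c, w_p=1.0, w_c=1e-3):
    """Assemble a feature vector with the image position p scaled by w_p
    (geometry-vs-appearance trade-off) and the camera pose c scaled by a
    small weight w_c, so that pose contributes little to the subspace."""
    return np.concatenate([np.asarray(e),
                           w_p * np.asarray(p),
                           w_c * np.asarray(c)])
```

Note that if the pose entries are scaled by w_c when building the training matrix, a pose recovered by reconstruction must be divided by w_c to return it to world units.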

Figure 5.4 depicts a set of estimates obtained for the landmarks detected in a single image. While most of the estimates are reasonably accurate, at least one point may be considered an outlier, most likely produced by nonlinearities in the tracked landmark, poor tracking, or a match that is altogether incorrect. The next section deals with the problem of detecting and removing outliers, as well as combining the good estimates in a way that is numerically robust.

Figure 5.4: Position estimates for a single test image. Each 'x' marks an estimate obtained from a single landmark in the image. The 'o' represents the actual position. The training images were obtained at the locations of the grid intersections.



Robert Sim
Tue Jul 21 10:30:54 EDT 1998