Missing Data Algorithm
Return series may have gaps in the available time periods or varying start and end dates. For example, a small cap fund may not have as long a performance record as a large cap growth manager, or you may decide that a particular return observation is invalid due to error or accounting differences among time periods. Missing data are marked by red "N/A"s on the Asset Returns Worksheet. When data are missing within the period specified on the Historical Worksheet, "missing data" messages appear in the Historical Worksheet's Data field as well as in the Missing Data field on the Results Worksheet. The Estimator invokes the Missing Data Algorithm to estimate means, standard deviations, and correlations in the presence of the missing data.
The Missing Data Algorithm makes several assumptions about the data. Missing data are said to be Missing Completely at Random (MCAR) when the probability that an observation is missing does not depend on any of the data. This is a strict assumption that is generally not valid for financial data series. Here we make the less strict assumption that the missing data are Missing at Random (MAR), which says that the probability that an observation is missing may depend on observed data from that time period, but not on the missing value itself. The algorithm in the NFA Estimator is valid for data missing at random. The MAR assumption says that the relationships in the observed data are preserved during periods of missing data, but that the missing periods may otherwise differ from the observed periods. This is a much more plausible assumption for financial data than MCAR, which insists that the missing data be distributed identically to the observed data.
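To make the distinction concrete, the following minimal sketch simulates the two missingness mechanisms on a hypothetical pair of return series; the series names, parameters, and missingness probabilities are illustrative assumptions, not part of the Estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated monthly return series (hypothetical data).
n = 120
large_cap = rng.normal(0.008, 0.04, n)
small_cap = 0.6 * large_cap + rng.normal(0.004, 0.03, n)

# MCAR: each small cap observation is dropped with a fixed probability,
# independent of any of the data.
mcar_mask = rng.random(n) < 0.15
small_cap_mcar = small_cap.copy()
small_cap_mcar[mcar_mask] = np.nan

# MAR: small cap observations are more likely to be missing when the
# *observed* large cap return is negative. Missingness depends on
# observed data, but not on the missing small cap value itself.
p_missing = np.where(large_cap < 0, 0.30, 0.05)
mar_mask = rng.random(n) < p_missing
small_cap_mar = small_cap.copy()
small_cap_mar[mar_mask] = np.nan
```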
Historically, there have been several methods for estimation in the presence of missing data. The most naïve is to eliminate all records with missing data, an approach referred to as "complete case analysis". It leads to biased estimates if the data are MAR but not MCAR, and may throw away much information. The obvious correction is "available case analysis", in which means and standard deviations are estimated using all the observed data, and correlations are estimated using all jointly observed records for the asset pair in question. This approach may again lead to biased estimates of the mean when the data are correlated and MAR; although its estimates may be more precise, it has the additional difficulty that the resulting covariance matrices often fail to be positive semidefinite, and are thus invalid. A third approach, commonly known as "hot deck imputation", fills in missing records with other randomly chosen observations from the same asset; it again fails to take advantage of the correlation structure among the variables.
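Continuing the sketch above, this hypothetical comparison shows how the complete case and available case statistics are computed, and why the observed mean is biased upward when missingness concentrates in down markets:

```python
# Complete case analysis: keep only periods where every series is
# observed (here, only small_cap_mar has gaps).
complete = small_cap_mar[~np.isnan(small_cap_mar)]
mean_cc = complete.mean()

# Available case analysis: means and standard deviations from all
# observed values per series; correlations from jointly observed pairs.
mean_ac = np.nanmean(small_cap_mar)   # equals mean_cc when only one series has gaps
std_ac = np.nanstd(small_cap_mar, ddof=1)

obs = ~np.isnan(small_cap_mar)
corr_ac = np.corrcoef(large_cap[obs], small_cap_mar[obs])[0, 1]

# Under the MAR mechanism above, small cap returns are missing mainly in
# down markets, so the observed mean overstates the true mean.
print(f"true mean: {small_cap.mean():.4f}, observed mean: {mean_cc:.4f}")
```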
The NFA missing data algorithm is a more sophisticated approach to estimation with missing data. A multivariate normal model is fit to the data using maximum likelihood techniques. The outputs of the maximum likelihood procedure are exactly the mean and covariance of the data, both adjusted for the missing data: the means are adjusted using information from the observed variables and the correlations, the variances are inflated to account for the uncertainty due to the missingness, and the correlations are consistently estimated given all the other information. When possible, the estimates are computed in closed form using a noniterative technique, which yields the same results as the iterative Expectation-Maximization (EM) algorithm described below, but much more quickly. This closed form maximum likelihood technique is available only when the pattern of missingness is monotone. "Monotonicity" means that each missing data pattern contains the previous one, i.e., a pattern has missing values only for assets that are also missing in all subsequent patterns with more missing values. If the missing data patterns are more complex and nonmonotone, the Estimator employs the EM algorithm, which starts with initial estimates and iteratively refines them until convergence. The EM algorithm can be quite slow to converge when the fraction of missing information is large.
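The following is a minimal sketch of EM for a multivariate normal model with missing values, in the spirit of the approach described above (see Dempster et al. 1977). It is not New Frontier's implementation; the initialization and convergence criteria are illustrative assumptions:

```python
import numpy as np

def em_mvn(X, n_iter=500, tol=1e-8):
    """Maximum likelihood mean/covariance via EM for multivariate normal
    data with entries missing at random. X is an (n, p) array with np.nan
    marking missing returns. A simplified sketch, not NFA's code."""
    n, p = X.shape
    # Initialize from available-case statistics (an assumed starting point).
    mu = np.nanmean(X, axis=0)
    sigma = np.diag(np.nanvar(X, axis=0))
    for _ in range(n_iter):
        sum_x = np.zeros(p)
        sum_xx = np.zeros((p, p))
        for i in range(n):
            obs = ~np.isnan(X[i])
            mis = ~obs
            x_hat = X[i].copy()
            cov_add = np.zeros((p, p))
            if not obs.any():
                # Fully missing period: use the unconditional moments.
                x_hat = mu.copy()
                cov_add = sigma.copy()
            elif mis.any():
                # E-step: replace missing values with their conditional mean
                # given the observed values under the current (mu, sigma).
                s_oo = sigma[np.ix_(obs, obs)]
                s_mo = sigma[np.ix_(mis, obs)]
                coef = np.linalg.solve(s_oo, s_mo.T).T   # = s_mo @ inv(s_oo)
                x_hat[mis] = mu[mis] + coef @ (X[i, obs] - mu[obs])
                # Conditional covariance of the missing block: the term that
                # inflates the variances for missing-data uncertainty.
                cov_add[np.ix_(mis, mis)] = sigma[np.ix_(mis, mis)] - coef @ s_mo.T
            sum_x += x_hat
            sum_xx += np.outer(x_hat, x_hat) + cov_add
        # M-step: update the ML estimates from the expected sufficient statistics.
        mu_new = sum_x / n
        sigma_new = sum_xx / n - np.outer(mu_new, mu_new)
        converged = (np.abs(mu_new - mu).max() < tol
                     and np.abs(sigma_new - sigma).max() < tol)
        mu, sigma = mu_new, sigma_new
        if converged:
            break
    return mu, sigma

# Usage with the illustrative series from the earlier sketches:
# mu_hat, sigma_hat = em_mvn(np.column_stack([large_cap, small_cap_mar]))
```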
The missing data algorithm is meant to fill in small gaps in the data, and New Frontier does not recommend its use to fill in large stretches of data series. When large stretches are missing, we recommend attempting to find another data series to use as a proxy or changing the start and/or end dates of the analysis.
References:
Little, R. J. A. & Rubin, D. B. 1987. Statistical Analysis with Missing Data. New York: Wiley.
Schafer, J. L. 1997. Analysis of Incomplete Multivariate Data. London: Chapman & Hall.
Dempster, A. P., Laird, N. M. & Rubin, D. B. 1977. "Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion)." Journal of the Royal Statistical Society, Series B, 39, 1-38.