
  2. Data Imputation & Outliers


Transcript

- Missing values are the most frequent problem that we encounter in data. We have three different ways to overcome this issue. One, ignoring the missing data. This is usually done when the data is missing at random and the fraction it represents is relatively small, say less than 10% of all the data available. Two, imputing missing data with replacement values given by the mean, median, or most frequent value. This is practical in most circumstances and very commonly used in systematic reviews, but it may fail to properly account for the uncertainty involved in the data. Three, imputing missing data using advanced statistical models based on relational assumptions in the available data. This usually requires a knowledgeable data analyst or data scientist working with domain experts to avoid possible sources of bias when imputing the data. Depending on the type of data, imputation can also be a challenging task. In many cases, data aggregation can provide strong hints on how to proceed in these situations, such as when we aggregate production wells to determine a characteristic trend that allows us to fill some of the data gaps.

A topic that deserves separate treatment is outliers. In simple terms, an outlier is a value that departs from an expected one. Obviously, defining what is expected is quite subjective and can lead to critical decisions. Therefore, retention or deletion of an outlier can be controversial. While mathematical criteria provide objective and quantitative methods for data rejection, they do not necessarily make the practice more scientifically or methodologically sound, especially in small data sets or where a normal distribution cannot be assumed. Therefore, a robust rejection of outliers requires knowledge of the error distributions and domain knowledge associated with the problem.
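The first two imputation options mentioned above (dropping missing values, or filling them with a simple summary statistic such as the median) can be sketched as follows. This is a minimal illustration using pandas; the rate values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy daily oil-rate series with gaps (hypothetical values, for illustration).
rates = pd.Series([1020.0, 998.0, np.nan, 951.0, np.nan, 910.0, 897.0])

# Option 1: ignore the missing data. Reasonable when values are missing
# at random and the missing fraction is small (say, under ~10%).
dropped = rates.dropna()

# Option 2: impute with a replacement value such as the median. Practical,
# but note it understates the uncertainty involved in the data.
imputed = rates.fillna(rates.median())

print(f"missing fraction: {rates.isna().mean():.0%}")
print(imputed.tolist())
```

Here the missing fraction is well above 10%, so in practice dropping these rows would already be questionable.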
To illustrate this point, let us look at a practical case that typically arises in decline curve analysis, or DCA, for production forecasting and reserve estimation. The figure illustrates the declining oil rate during the last 450 days of well production. We can see that the data is quite noisy and shows several potential outliers in the first half of the plot. Let us try to remove some outliers before a certain threshold, represented here by a green vertical line. Using an aggressive outlier removal approach, we keep only the red points. Fitting a typical DCA analytical model, we obtain the red line. The curve has been extended beyond the green threshold to see how well it may predict future values. Let us now take a less aggressive approach to removing outliers. In this case, we keep the points represented by both the red and blue colors. Correspondingly, we fit the same DCA analytical model. We can clearly see that this model predicts future values better than the one based on the red points alone. These two results lead to appreciable differences in the estimated ultimate recovery, or EUR, in this particular case. So, generally speaking, many oil and gas applications, like the one just described, can be quite sensitive to outlier selection.

Now, what are some of the mechanisms to remove outliers? When there is a certain understanding of the problem, we can rely on a few mathematical approaches, including establishing error bounds or tolerance limits that serve to separate wanted from unwanted values. In the already familiar DCA plot, these bounds are denoted by the blue dash-dot band, and the red points have been marked as outliers. Another possibility is to follow a distributional approach. This requires computing histograms to approximate the value distribution and then discarding the less frequent values located at the tails of the distribution.
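The tolerance-band idea can be sketched on synthetic data: fit a simple exponential decline (one common DCA model) by least squares in log space, then flag points whose residual falls outside a band of k standard deviations. The data, the decline parameters, and the choice k = 2 are all assumptions for illustration; in practice k should reflect the error distribution and domain knowledge.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic noisy decline data (hypothetical, for illustration only):
# q(t) = qi * exp(-D * t) with multiplicative noise and two injected spikes.
t = np.arange(0.0, 450.0, 10.0)
qi, D = 1000.0, 0.004
q = qi * np.exp(-D * t) * rng.lognormal(0.0, 0.05, t.size)
q[5] *= 2.0    # inject obvious outliers
q[12] *= 0.4

# Exponential decline is linear in log space: log q = log qi - D * t,
# so an ordinary least-squares line fit recovers the decline parameters.
slope, intercept = np.polyfit(t, np.log(q), 1)
fit = np.exp(intercept + slope * t)

# Tolerance band: mark points whose log-residual exceeds k standard
# deviations. The choice of k is a judgment call, not an objective rule.
resid = np.log(q) - np.log(fit)
k = 2.0
outlier = np.abs(resid) > k * resid.std()
print(f"flagged {outlier.sum()} of {t.size} points as outliers")
```

Refitting after dropping the flagged points (and possibly iterating) would tighten the band, which is exactly where the aggressive-versus-conservative choice discussed above comes into play.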
We can see this clearly depicted in the accompanying plot. The third alternative is to perform a residual analysis. This requires identifying the most prominent residuals resulting from curve-fitting procedures. The residual plot at the right emphasizes that those outliers are located in the lower portion of the DCA plot. Of course, there are more approaches, but the bottom line is that outlier removal can be, at times, more of an art than a science. Let me give you a few additional hints on how to proceed with outlier removal. Outliers can be removed safely when they change neither the results nor the underlying assumptions of the problem. That means neither the presence nor the absence of the outlier in the graph would change the regression line. A gray zone arises when results may change without violating the underlying assumptions. In this case, we need to proceed with caution in the absence of supporting data. For instance, suppose it is known that there should be a decreasing trend in the data; a value should not appear noticeably higher than the previous one in the sequence. When both results and assumptions are violated, there is a chance of introducing spurious relations by keeping the outlier. In the graph, the relation between X and Y is clearly created by the outlier. Without it, there is no clear relationship between X and Y, so the regression coefficient does not truly describe the effect of X on Y.
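The distributional approach mentioned earlier can be sketched with percentile cutoffs, which play the role of discarding the least frequent values at the tails of the histogram. The data and the 1%/99% cutoffs are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical measurements: a roughly normal bulk plus a few extreme
# values injected into the tails, for illustration only.
values = np.concatenate([rng.normal(100.0, 5.0, 200), [60.0, 145.0, 150.0]])

# Distributional approach: approximate the value distribution and discard
# the least frequent values at the tails, here via percentile cutoffs.
lo, hi = np.percentile(values, [1.0, 99.0])
kept = values[(values >= lo) & (values <= hi)]

print(f"kept {kept.size} of {values.size} values")
print(f"cutoffs: [{lo:.1f}, {hi:.1f}]")
```

Note this trims unconditionally, whether or not the tail values are truly erroneous, which is precisely why the safe/gray-zone/unsafe distinction above matters before deleting anything.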