Conducting a benchmarking study is an iterative process. After assembling the raw data, the benchmarking team should screen them carefully to ensure that the quality and quantity of the information gathered are sufficient for a successful project. This step is crucial because poor data quality (inconsistent definitions, missing data, or extreme data values) may lead to biased results. Specifically, attention must be paid to the following aspects when investigating raw data and their quality:

- Missing Data
- Accuracy and Comparability of Data Files

After investigating the data, the benchmarking team can assemble the benchmarking dataset. However, the sample size may not be large enough to support sophisticated analysis. When the sample is too small, the analyst may have to expand it, either by using panel data or by conducting international (cross-regional) benchmarking.

#### Missing Data

Missing data is a common problem that can seriously affect the quality of the benchmarking study. The main issues and remedies are:

- Sample size: Missing data will reduce the sample size. As mentioned above, a reasonable sample size is necessary for estimating sophisticated frontier benchmarking models.
- Non-random missing data: The missing data may be non-randomly distributed. For example, data providers (utilities) could have accidentally failed to supply data (i.e., randomly distributed missing data) or they could have intentionally withheld the data (i.e., non-random missing data). Non-random missing data will distort statistical inferences because the observed sample is no longer representative of the true sample.
- Remedies: There are three ways to address the missing data problem:
  - Delete the observations with missing values, which reduces the sample size;
  - Estimate (impute) the missing values from existing observations and use the estimated values in the data analysis, which may be inappropriate when the data are not missing at random;
  - Ask the utility to provide the missing data.
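
The first two remedies can be sketched in a few lines of Python. This is a minimal illustration, not a recommended procedure: the records, the field names (`opex`, `water_delivered`), and the choice of simple mean imputation are all assumptions made for the example, and mean imputation in particular is only one of many possible imputation methods.

```python
from statistics import mean

# Hypothetical utility records (illustrative values); None marks a missing value.
records = [
    {"utility": "A", "opex": 120.0, "water_delivered": 40.0},
    {"utility": "B", "opex": None, "water_delivered": 55.0},
    {"utility": "C", "opex": 300.0, "water_delivered": 90.0},
    {"utility": "D", "opex": 210.0, "water_delivered": None},
]
FIELDS = ("opex", "water_delivered")


def missing_counts(rows, fields):
    """Count the missing values for each variable."""
    return {f: sum(1 for r in rows if r[f] is None) for f in fields}


def listwise_delete(rows, fields):
    """Remedy 1: keep only observations with every field present."""
    return [r for r in rows if all(r[f] is not None for f in fields)]


def mean_impute(rows, fields):
    """Remedy 2 (assumed variant): fill gaps with the mean of observed values."""
    means = {f: mean(r[f] for r in rows if r[f] is not None) for f in fields}
    return [
        {**r, **{f: means[f] if r[f] is None else r[f] for f in fields}}
        for r in rows
    ]


counts = missing_counts(records, FIELDS)
complete = listwise_delete(records, FIELDS)
imputed = mean_impute(records, FIELDS)

print("Missing values per variable:", counts)
print("Sample size after deletion:", len(complete), "of", len(records))
print("Sample size after imputation:", len(imputed))
```

Deletion shrinks the sample, while imputation preserves its size at the cost of introducing estimated values; neither corrects for data that are missing non-randomly.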

#### Accuracy and Comparability of Data Files

Before moving into formal statistical analysis, it is helpful to engage in preliminary data checks of the computerized file:

- Reduce data entry errors: Compare the computerized data set with the original data set to make sure that the data have been entered correctly.
- Prepare summary statistics: Summary statistics include the mean (or average), median, standard deviation, and minimum and maximum values of the key variables. Based on the summary statistics, analysts need to make sure that the utilities in the sample are really comparable to one another. For example, is utility A really comparable to utility B, which is 100 times larger than A?
- Calculate key ratios: Prepare summary statistics for key ratios such as water delivered/employee and OPEX/water delivered. Pay attention to firms with extremely large or small ratios. These firms may have data input errors, or they may have different organizational structures (e.g., outsourcing).
- Ensure comparability of data definitions: Check the definitions of the key variables to make sure that they are really comparable. For example, does the number of employees refer to full-time employees, part-time employees, or total employees? Is the definition of operating cost similar across firms? In some cases, the number of customers may not be a good output variable due to significant differences in service continuity. For instance, suppose company A has 1000 customers and an average service time of 4 hours/day, while company B has 600 customers and an average service time of 24 hours/day. In this case, the adjusted customer number (reflecting service hours) might be a better output measure, depending on the structure of the model.
- Recognize unique events or characteristics: Some special events, such as natural disasters, may have significant impacts on operating costs and service quality indicators. If the impact is severe, it is better to exclude the affected observation for that year from the sample. Similarly, a utility located near a water source and one requiring substantial investments in water storage and transport will have different costs. Such differences should be incorporated into the analysis.
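
Two of the checks above, screening key ratios and adjusting customer numbers for service hours, can be sketched as follows. The sample values, the flagging rule (a ratio more than three times above or below the sample median), and the adjustment formula (customers scaled by service hours out of 24) are assumptions chosen for illustration; the appropriate threshold and adjustment depend on the study and the structure of the model.

```python
from statistics import median

# Hypothetical sample. Customer numbers and service hours for A and B
# follow the example in the text; all other values are invented.
utilities = [
    {"name": "A", "opex": 100.0, "water_delivered": 50.0, "customers": 1000, "service_hours": 4},
    {"name": "B", "opex": 90.0, "water_delivered": 45.0, "customers": 600, "service_hours": 24},
    {"name": "C", "opex": 110.0, "water_delivered": 50.0, "customers": 800, "service_hours": 24},
    {"name": "D", "opex": 800.0, "water_delivered": 40.0, "customers": 700, "service_hours": 24},
    {"name": "E", "opex": 96.0, "water_delivered": 48.0, "customers": 900, "service_hours": 12},
]


def opex_per_unit(u):
    """Key ratio: OPEX / water delivered."""
    return u["opex"] / u["water_delivered"]


def flag_extreme_ratios(rows, ratio, factor=3.0):
    """Flag utilities whose ratio lies more than `factor` times above or
    below the sample median (an assumed, illustrative screening rule)."""
    values = {r["name"]: ratio(r) for r in rows}
    med = median(values.values())
    return [n for n, v in values.items() if v > factor * med or v < med / factor]


def adjusted_customers(customers, service_hours, full_day=24):
    """Scale the customer count by the share of the day with service
    (an assumed form of the service-hour adjustment)."""
    return customers * service_hours / full_day


flagged = flag_extreme_ratios(utilities, opex_per_unit)
adj_a = adjusted_customers(1000, 4)
adj_b = adjusted_customers(600, 24)

print("Utilities to review:", flagged)
print(f"Adjusted customers, A: {adj_a:.1f}  B: {adj_b:.1f}")
```

Under this assumed adjustment, company A's 1000 customers count as roughly 167 full-service equivalents against company B's 600, reversing the ranking implied by raw customer numbers. A flagged utility is a prompt for review, not automatically an error: the cause may be a data entry mistake or a genuine structural difference such as outsourcing.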