Range checks confirm that quantitative data fall within an acceptable or plausible range. Legal-value checks confirm that categorical data are recorded with appropriate legal values or codes. Both are important basic checks.
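As a minimal illustration of the two kinds of check, the sketch below applies a range check and a legal-value check to a few made-up records. The limits (80 to 250 mm) and the legal sex codes (1 and 2) are assumptions chosen for this example and are not taken from the toolkit:

```r
# Made-up records for illustration only (not taken from rl.ex01)
toy <- data.frame(
  sex  = c("1", "2", "M", "2"),
  muac = c(125, 148, 999, 11.1)
)

# Assumed limits and codes; replace with those in the survey's data dictionary
muac_limits <- c(min = 80, max = 250)
sex_codes   <- c("1", "2")

# Range check for the quantitative variable
bad_range <- toy$muac < muac_limits[["min"]] | toy$muac > muac_limits[["max"]]

# Legal-value check for the categorical variable
bad_code <- !(toy$sex %in% sex_codes)

# Records that fail either check
toy[bad_range | bad_code, ]
```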
We will use the dataset rl.ex01 that is included in the nipnTK package.
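A minimal sketch of loading the data and listing the first few records follows. The object name svy is an assumption made for illustration:

```r
library(nipnTK)   # provides the rl.ex01 dataset

svy <- rl.ex01    # working copy; the name svy is hypothetical
head(svy)         # first six records
```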
#>   age sex weight height muac oedema
#> 1  12   2    6.7   68.5  148      2
#> 2   6   1    6.4   65.0  125      2
#> 3   6   2    6.5   65.6  125      2
#> 4   8   1    7.2   68.4  144      2
#> 5  12   M    6.1   65.4  114      2
#> 6   8   1    7.7   66.5  146      2
The rl.ex01 dataset contains anthropometry data from a SMART survey in Angola.
We can use the summary() function to examine the range (and other summary statistics) of a quantitative variable:
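One way to write this for the muac variable, assuming the working copy of rl.ex01 is called svy (a name chosen here for illustration):

```r
library(nipnTK)
svy <- rl.ex01       # hypothetical working copy of the dataset

summary(svy$muac)    # minimum, quartiles, mean, and maximum of muac
```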
This returns:
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>    11.1   128.0   139.0   140.3   148.0   999.0
A graphical examination can also be made:
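A boxplot is one such graphical examination. The sketch below assumes the working copy is called svy; the orientation and axis label are illustrative choices:

```r
library(nipnTK)
svy <- rl.ex01       # hypothetical working copy of the dataset

# Horizontal boxplot of muac; points beyond the whiskers are drawn individually
boxplot(svy$muac, horizontal = TRUE, xlab = "MUAC (mm)")
```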

The “whiskers” on the boxplot extend to the most extreme data points that lie within 1.5 times the interquartile range of the ends of the box (i.e., the lower and upper quartiles). The limits at 1.5 times the interquartile range are known as the inner fences. Data points that are outside the inner fences are considered to be mild outliers. The NiPN data quality toolkit provides an R function, outliersUV(), that uses the same method to identify outliers:
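A sketch of using outliersUV() to list the flagged records, assuming the working copy is called svy. The function returns a logical vector, which is used here to select rows:

```r
library(nipnTK)
svy <- rl.ex01                    # hypothetical working copy of the dataset

flagged <- outliersUV(svy$muac)   # TRUE for records outside the fences (default fence = 1.5)
svy[flagged, ]                    # list the flagged records
```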
This returns:
#>
#> Univariate outliers : Lower fence = 98, Upper fence = 178
#>     age sex weight height  muac oedema
#> 33   24   1    9.8   74.5 180.0      2
#> 93   12   2    6.7   67.0  96.0      1
#> 126  16   2    9.0   74.6 999.0      2
#> 135  18   2    8.5   74.5 999.0      2
#> 194  24   M    7.0   75.0  95.0      2
#> 227   8   M    6.2   66.0  11.1      2
#> 253  35   2    7.6   75.6  97.0      2
#> 381  24   1   10.8   82.8  12.4      2
#> 501  36   2   15.5   93.4 185.0      2
#> 594  21   2    9.8   76.5  13.2      2
#> 714  59   2   18.9   98.5 180.0      2
#> 752  48   2   15.6  102.2 999.0      2
#> 756  59   1   19.4  101.1 180.0      2
#> 873  59   1   20.6  109.4 179.0      2
We can count the outliers by hand from this listing, or tabulate them:
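A sketch of the tabulation, assuming the working copy is called svy:

```r
library(nipnTK)
svy <- rl.ex01                 # hypothetical working copy of the dataset

# FALSE = not flagged, TRUE = flagged as a potential outlier
table(outliersUV(svy$muac))
```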
This returns:
#>
#> Univariate outliers : Lower fence = 98, Upper fence = 178
#>
#> FALSE TRUE
#> 892 14
We can express this as a proportion:
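One way to do this is to pass the table to prop.table(). The sketch assumes the working copy is called svy:

```r
library(nipnTK)
svy <- rl.ex01                             # hypothetical working copy of the dataset

# Share of records not flagged (FALSE) and flagged (TRUE)
prop.table(table(outliersUV(svy$muac)))
```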
This returns:
#>
#> Univariate outliers : Lower fence = 98, Upper fence = 178
#>
#> FALSE TRUE
#> 0.98454746 0.01545254
You may find it easier to use percentages:
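Multiplying the proportions by 100 gives percentages. Again, svy is an assumed name for the working copy:

```r
library(nipnTK)
svy <- rl.ex01                                   # hypothetical working copy of the dataset

# Percentage of records not flagged (FALSE) and flagged (TRUE)
prop.table(table(outliersUV(svy$muac))) * 100
```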
This returns:
#>
#> Univariate outliers : Lower fence = 98, Upper fence = 178
#>
#> FALSE TRUE
#> 98.454746 1.545254
Some of the muac values identified as potential outliers are plausible values for MUAC:
#>
#> Univariate outliers : Lower fence = 98, Upper fence = 178
#>     age sex weight height  muac oedema
#> 33   24   1    9.8   74.5 180.0      2
#> 93   12   2    6.7   67.0  96.0      1
#> 126  16   2    9.0   74.6 999.0      2
#> 135  18   2    8.5   74.5 999.0      2
#> 194  24   M    7.0   75.0  95.0      2
#> 227   8   M    6.2   66.0  11.1      2
#> 253  35   2    7.6   75.6  97.0      2
#> 381  24   1   10.8   82.8  12.4      2
#> 501  36   2   15.5   93.4 185.0      2
#> 594  21   2    9.8   76.5  13.2      2
#> 714  59   2   18.9   98.5 180.0      2
#> 752  48   2   15.6  102.2 999.0      2
#> 756  59   1   19.4  101.1 180.0      2
#> 873  59   1   20.6  109.4 179.0      2
The outliersUV() function provides a fence parameter which alters the threshold at which a data point is considered to be an outlier.
The default fence = 1.5 defines the inner fence (i.e., 1.5 times the interquartile range below the lower quartile and above the upper quartile). This will identify both mild and severe outliers.
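The fence values reported by outliersUV() can be checked by hand from the quartiles shown in the summary() output above (128.0 and 148.0). The following is a minimal sketch of that arithmetic. The fence_limits() helper is hypothetical, written for this illustration only, and is not part of nipnTK:

```r
# Quartiles of muac, copied from the summary() output
q1 <- 128
q3 <- 148

# Hypothetical helper: limits lying `fence` interquartile ranges beyond the quartiles
fence_limits <- function(q1, q3, fence = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - fence * iqr, upper = q3 + fence * iqr)
}

fence_limits(q1, q3)               # multiplier 1.5: lower = 98, upper = 178
fence_limits(q1, q3, fence = 3)    # multiplier 3:   lower = 68, upper = 208
```

The results agree with the lower and upper fence values printed by outliersUV() for the two settings of fence.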
The value fence = 3 defines the outer fence (i.e., 3 times the interquartile range below the lower quartile and above the upper quartile). This will identify severe outliers only:
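A sketch of listing only the severe outliers, assuming the working copy is called svy:

```r
library(nipnTK)
svy <- rl.ex01                              # hypothetical working copy of the dataset

severe <- outliersUV(svy$muac, fence = 3)   # TRUE for records outside the outer fences
svy[severe, ]                               # list the severe outliers
```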
This returns:
#>
#> Univariate outliers : Lower fence = 68, Upper fence = 208
#>     age sex weight height  muac oedema
#> 126  16   2    9.0   74.6 999.0      2
#> 135  18   2    8.5   74.5 999.0      2
#> 227   8   M    6.2   66.0  11.1      2
#> 381  24   1   10.8   82.8  12.4      2
#> 594  21   2    9.8   76.5  13.2      2
#> 752  48   2   15.6  102.2 999.0      2
There is something wrong with all of these values of muac.
The muac variable is intended to record mid-upper arm circumference (MUAC) in mm. There are some impossibly small values (i.e., 11.1, 12.4, and 13.2) and some impossibly large values (i.e., 999.0).
The three impossibly small values were probably recorded in cm rather than mm. It is probably safe to change them to 111, 124, and 132. It is easiest to do this one record at a time:
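A sketch of editing the records by value, assuming the working copy is called svy. Each assignment selects the records holding one of the cm values and overwrites it with the mm equivalent:

```r
library(nipnTK)
svy <- rl.ex01                        # hypothetical working copy of the dataset

# Replace each value recorded in cm with its equivalent in mm
svy$muac[svy$muac == 11.1] <- 111
svy$muac[svy$muac == 12.4] <- 124
svy$muac[svy$muac == 13.2] <- 132
```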
An alternative approach is to specify row numbers instead of values:
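A sketch of the same edits by row number, assuming the working copy is called svy. The row numbers are read from the outlier listing above, and the sketch assumes that the row names shown there match row positions, which holds for a data frame with default row names:

```r
library(nipnTK)
svy <- rl.ex01          # hypothetical working copy of the dataset

# Rows 227, 381, and 594 hold the values 11.1, 12.4, and 13.2
svy$muac[227] <- 111
svy$muac[381] <- 124
svy$muac[594] <- 132
```

Editing by row number is tied to the current row order, so it should be applied before any sorting or subsetting of the data.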
The three 999.0 values are missing values coded as 999.0. It is safe to set these three values to missing using the special NA value:
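A sketch of recoding the 999.0 values as missing, assuming the working copy is called svy:

```r
library(nipnTK)
svy <- rl.ex01                        # hypothetical working copy of the dataset

# Recode the missing-value code 999.0 as NA
svy$muac[svy$muac == 999.0] <- NA
```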
Range checks should be repeated after editing the data to ensure that the problems have been fixed:
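A sketch of repeating the range check, assuming the working copy is called svy. The edits described above are applied first so that the block runs on its own:

```r
library(nipnTK)
svy <- rl.ex01                        # hypothetical working copy of the dataset

# Apply the edits: cm values converted to mm, 999.0 recoded as missing
svy$muac[svy$muac == 11.1] <- 111
svy$muac[svy$muac == 12.4] <- 124
svy$muac[svy$muac == 13.2] <- 132
svy$muac[svy$muac == 999.0] <- NA

# Repeat the range check; summary() also reports the number of NA values
summary(svy$muac)
```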
The boxplot of the muac variable can be redrawn after the fixes for incorrectly entered data and missing values have been made.
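A sketch of drawing the boxplot on the corrected data, assuming the working copy is called svy. The edits are repeated so that the block runs on its own, and the plot options are illustrative choices:

```r
library(nipnTK)
svy <- rl.ex01                        # hypothetical working copy of the dataset

# Apply the edits: cm values converted to mm, 999.0 recoded as missing
svy$muac[svy$muac == 11.1] <- 111
svy$muac[svy$muac == 12.4] <- 124
svy$muac[svy$muac == 13.2] <- 132
svy$muac[svy$muac == 999.0] <- NA

# boxplot() drops missing values before plotting
boxplot(svy$muac, horizontal = TRUE, xlab = "MUAC (mm)")
```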

There should now be no severe outliers:
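A sketch of this final check, assuming the working copy is called svy. The result is expressed as a percentage, which appears to be the form of the output shown below; that reading is an assumption:

```r
library(nipnTK)
svy <- rl.ex01                        # hypothetical working copy of the dataset

# Apply the edits: cm values converted to mm, 999.0 recoded as missing
svy$muac[svy$muac == 11.1] <- 111
svy$muac[svy$muac == 12.4] <- 124
svy$muac[svy$muac == 13.2] <- 132
svy$muac[svy$muac == 999.0] <- NA

# Percentage of non-missing records inside (FALSE) and outside (TRUE) the outer fences
severe <- outliersUV(svy$muac, fence = 3)
prop.table(table(severe)) * 100
```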
This returns:
#>
#> Univariate outliers : Lower fence = 68, Upper fence = 208
#>
#> FALSE
#> 100
It is usually better to identify and edit only the most extreme univariate outliers, as we have done here, and use the scatterplot and statistical distance methods described elsewhere in this toolkit to identify other potential outliers.