library(nipnTK)

Checking that data are within an acceptable or plausible range is an important basic check to apply to quantitative data. Checking that data are recorded with appropriate legal values or codes is an important basic check to apply to categorical data.

Checking quantitative data

We will use the dataset rl.ex01 that is included in the nipnTK package.
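One way to load and preview the data is sketched below. It assumes the dataset is copied into a working object called svy, which is the name the later calls in this section rely on, and that the preview is made with head():

```r
library(nipnTK)

# Working copy of the example dataset; the object name svy is an assumption
# taken from the calls used later in this section
svy <- rl.ex01

# Show the first six records
head(svy)
```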

#>   age sex weight height muac oedema
#> 1  12   2    6.7   68.5  148      2
#> 2   6   1    6.4   65.0  125      2
#> 3   6   2    6.5   65.6  125      2
#> 4   8   1    7.2   68.4  144      2
#> 5  12   M    6.1   65.4  114      2
#> 6   8   1    7.7   66.5  146      2

The rl.ex01 dataset contains anthropometry data from a SMART survey in Angola.

We can use the summary() function to examine the range (and other summary statistics) of a quantitative variable:

summary(svy$muac)

This returns:

#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    11.1   128.0   139.0   140.3   148.0   999.0

A graphical examination can also be made:

boxplot(svy$muac, horizontal = TRUE, xlab = "MUAC (mm)", frame.plot = FALSE)

The “whiskers” on the boxplot extend to the most extreme data points that lie within 1.5 times the interquartile range (IQR) of the ends of the box (i.e., the lower and upper quartiles). The limits at 1.5 times the IQR beyond the quartiles are known as the inner fences. Data points that are outside the inner fences are considered to be mild outliers. The NiPN data quality toolkit provides an R function, outliersUV(), that uses the same method to identify outliers:

svy[outliersUV(svy$muac), ]

This returns:

#> 
#> Univariate outliers : Lower fence = 98, Upper fence = 178
#>     age sex weight height  muac oedema
#> 33   24   1    9.8   74.5 180.0      2
#> 93   12   2    6.7   67.0  96.0      1
#> 126  16   2    9.0   74.6 999.0      2
#> 135  18   2    8.5   74.5 999.0      2
#> 194  24   M    7.0   75.0  95.0      2
#> 227   8   M    6.2   66.0  11.1      2
#> 253  35   2    7.6   75.6  97.0      2
#> 381  24   1   10.8   82.8  12.4      2
#> 501  36   2   15.5   93.4 185.0      2
#> 594  21   2    9.8   76.5  13.2      2
#> 714  59   2   18.9   98.5 180.0      2
#> 752  48   2   15.6  102.2 999.0      2
#> 756  59   1   19.4  101.1 180.0      2
#> 873  59   1   20.6  109.4 179.0      2
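The fence values reported in this output can be reproduced with base R. The sketch below is illustrative only: the helper name iqr_fences() and the toy vector are hypothetical, and it assumes quartiles computed with the quantile() defaults, which may not match the internals of outliersUV() exactly.

```r
# Hypothetical helper: fences at `fence` times the IQR beyond the quartiles
iqr_fences <- function(x, fence = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - fence * iqr, upper = q[2] + fence * iqr)
}

# Arithmetic check against the summary() output (1st Qu. = 128, 3rd Qu. = 148)
128 - 1.5 * (148 - 128)  # lower inner fence: 98
148 + 1.5 * (148 - 128)  # upper inner fence: 178

# Toy MUAC values in mm, made up for illustration
x <- c(111, 120, 128, 139, 148, 160, 999)
f <- iqr_fences(x)

# Values lying outside the inner fences
x[x < f["lower"] | x > f["upper"]]
```

Passing a larger fence value to the helper widens the interval in the same way.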

We can count the outliers by hand from this listing, or tabulate them:

table(outliersUV(svy$muac))

This returns:

#> 
#> Univariate outliers : Lower fence = 98, Upper fence = 178
#> 
#> FALSE  TRUE 
#>   892    14

We can express this as a proportion:
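A sketch of that step, assuming base R's prop.table() is applied to the table of logical flags, which mirrors the percentage call shown later in this section:

```r
library(nipnTK)

# Reuse the working copy if it already exists; otherwise load it
# (the object name svy is an assumption carried over from the other calls)
if (!exists("svy")) svy <- rl.ex01

# Proportion of records not flagged (FALSE) and flagged (TRUE) at the inner fence
prop.table(table(outliersUV(svy$muac)))
```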

This returns:

#> 
#> Univariate outliers : Lower fence = 98, Upper fence = 178
#> 
#>      FALSE       TRUE 
#> 0.98454746 0.01545254

You may find it easier to use percentages:

prop.table(table(outliersUV(svy$muac))) * 100

This returns:

#> 
#> Univariate outliers : Lower fence = 98, Upper fence = 178
#> 
#>     FALSE      TRUE 
#> 98.454746  1.545254

Some of the muac values identified as potential outliers are possible MUAC values (for example, 96.0 and 180.0), as the listing from svy[outliersUV(svy$muac), ] shows:

#> 
#> Univariate outliers : Lower fence = 98, Upper fence = 178
#>     age sex weight height  muac oedema
#> 33   24   1    9.8   74.5 180.0      2
#> 93   12   2    6.7   67.0  96.0      1
#> 126  16   2    9.0   74.6 999.0      2
#> 135  18   2    8.5   74.5 999.0      2
#> 194  24   M    7.0   75.0  95.0      2
#> 227   8   M    6.2   66.0  11.1      2
#> 253  35   2    7.6   75.6  97.0      2
#> 381  24   1   10.8   82.8  12.4      2
#> 501  36   2   15.5   93.4 185.0      2
#> 594  21   2    9.8   76.5  13.2      2
#> 714  59   2   18.9   98.5 180.0      2
#> 752  48   2   15.6  102.2 999.0      2
#> 756  59   1   19.4  101.1 180.0      2
#> 873  59   1   20.6  109.4 179.0      2

The outliersUV() function provides a fence parameter which alters the threshold at which a data point is considered to be an outlier.

The default fence = 1.5 defines the inner fence (i.e., 1.5 times the interquartile range below the lower quartile and above the upper quartile). This will identify mild and severe outliers.

The value fence = 3 defines the outer fence (i.e., 3 times the interquartile range below the lower quartile and above the upper quartile). This will identify severe outliers only:

svy[outliersUV(svy$muac, fence = 3), ]

This returns:

#> 
#> Univariate outliers : Lower fence = 68, Upper fence = 208
#>     age sex weight height  muac oedema
#> 126  16   2    9.0   74.6 999.0      2
#> 135  18   2    8.5   74.5 999.0      2
#> 227   8   M    6.2   66.0  11.1      2
#> 381  24   1   10.8   82.8  12.4      2
#> 594  21   2    9.8   76.5  13.2      2
#> 752  48   2   15.6  102.2 999.0      2

There is something wrong with all of these values of muac.

The muac variable is intended to record mid-upper arm circumference (MUAC) in mm. It contains some impossibly small values (11.1, 12.4, and 13.2) and some impossibly large values (999.0).

The three impossibly small values are probably due to data being recorded in cm rather than mm. It is probably safe to change these three values to 111, 124, and 132. It is easiest to do this one record at a time:
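A minimal sketch of these edits, assuming the stored values compare equal to the literals 11.1, 12.4 and 13.2 (which holds when they were entered as exactly those decimals):

```r
library(nipnTK)

# Reuse the working copy if it already exists; otherwise load it
# (the object name svy is an assumption carried over from the other calls)
if (!exists("svy")) svy <- rl.ex01

# Replace each value recorded in cm with its equivalent in mm
svy$muac[svy$muac == 11.1] <- 111
svy$muac[svy$muac == 12.4] <- 124
svy$muac[svy$muac == 13.2] <- 132
```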

An alternative approach is to specify row numbers instead of values:
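As a sketch, the row numbers 227, 381 and 594 are taken from the outlier listing above. This assumes the data frame keeps its default row names, so that the printed row labels match row positions:

```r
library(nipnTK)

# Reuse the working copy if it already exists; otherwise load it
# (the object name svy is an assumption carried over from the other calls)
if (!exists("svy")) svy <- rl.ex01

# Overwrite muac in the three affected rows, addressed by position
svy$muac[227] <- 111
svy$muac[381] <- 124
svy$muac[594] <- 132
```

Editing by row number is only safe while the data have not been sorted or subsetted since the listing was produced.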

The three 999.0 values are missing values coded as 999.0. It is safe to set these three values to missing using the special NA value:
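A sketch of that recoding, assuming 999.0 appears in muac only as a missing-value code:

```r
library(nipnTK)

# Reuse the working copy if it already exists; otherwise load it
# (the object name svy is an assumption carried over from the other calls)
if (!exists("svy")) svy <- rl.ex01

# Recode the missing-value code 999.0 as NA
svy$muac[svy$muac == 999.0] <- NA
```

Once NA values are present, summary() reports them in a separate NA's column.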

Range checks should be repeated after editing the data to ensure that the problems have been fixed:

summary(svy$muac)
svy[outliersUV(svy$muac), ]
svy[outliersUV(svy$muac, fence = 3), ]

Once the incorrectly entered values and missing-value codes have been fixed, the boxplot of the muac variable can be redrawn using:

boxplot(svy$muac, horizontal = TRUE, xlab = "MUAC (mm)", frame.plot = FALSE)

There should now be no severe outliers:

prop.table(table(outliersUV(svy$muac, fence = 3))) * 100

This returns:

#> 
#> Univariate outliers : Lower fence = 68, Upper fence = 208
#> 
#> FALSE 
#>   100

It is usually better to identify and edit only the most extreme univariate outliers, as we have done here, and use the scatterplot and statistical distance methods described elsewhere in this toolkit to identify other potential outliers.