Examining Outliers in OpenRefine
Overview
Teaching: 5 min
Exercises: 5 minQuestions
How can we see numeric outliers?
Objectives
Transform a values for a subset of the data.
Lesson
Dates
Dates often cause issues in datasets when entered by hand. Let’s examine the date
column next! Quickly glancing over the column shows that the dates have been entered in various formats, even though the registry manual clearly states dates should be entered day/month/year. But we should be able to correct that easily!
- Click the down arrow at the top of the
date
column. ChooseEdit cells
and thenCommon transformations
and thenTo date
. - Uh oh…it looks like OpenRefine wasn’t able to distinguish between the month/day and the day/month entries! Undo the transformation.
- Let’s try a different approach then. Click the down arrow at the top of the
date
column and this time selectFacet
thenText facet
since these values appear to have been entered as text fields. - We can now edit all of the month/day values to match the format that was outlined in the registry manual!
Finding outliers - numbers
As we saw with the text faceting, OpenRefine allowed us to easily see typos and entry errors in our text fields. Using the facet tool can also bring to light numeric outliers.
It appears as though Jane Doe didn’t read the registry manual very carefully and we’re concerned the weights might have also been misentered. So let’s look at our weight
column next but using a numeric facet this time.
- Click on the down arrow at the top of the
weight
column. - Select
Facet
thenNumeric facet
. - Adjust the slider bars to see the two outliers.
Now that we can see the data associated with the outliers, it appears all of the weight values by Jane Doe were entered in grams instead of kilograms. Let’s correct them all!
- We know we need to correct all of Jane’s entries so remove the numeric facet and create a text facet on the
recorder
column instead. - Restrict the records to just those entered by
Jane Doe
. - Once again, click on the down arrow at the top of the
weight
column and this time selectEdit cells
thenTransform
. - We know that there are 1000 grams in a kilogram so we can edit the entries to the correct unit by dividing the current value by 1000. Let’s use Python this time!
- Select
Python / Jython
then enterreturn value/1000.00
so that we have two places after the decimal point returned. Check the results in the preview pane. - Click
Ok
.
Key Points
OpenRefine also provides ways to get overviews of numerical data.