Data Cleaning with OpenRefine: Glossary

Key Points

Introduction
  • OpenRefine is a powerful, free and open source tool that can be used for data cleaning.

  • OpenRefine automatically tracks every step, allowing you to backtrack as needed and providing a record of all work done.

Working with OpenRefine
  • OpenRefine can import a variety of file types.

  • OpenRefine can be used to explore data using filters.

  • Clustering in OpenRefine can help to identify different values that might mean the same thing.

  • OpenRefine can transform the values of a column.
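
The clustering point above can be sketched in a few lines of Python. This is a simplified illustration of key-based clustering, loosely modelled on the idea behind fingerprint keying; it is not OpenRefine's own implementation, and the function names and sample values are invented for the example.

```python
import string
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a simplified key: trim, lowercase, drop punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values sharing a key; keep groups with more than one distinct spelling."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}


# Hypothetical column values, invented for illustration.
villages = ["Chirodzo", "chirodzo ", "Chirodzo.", "Ruaca", "Ruaca-Nhamuenda"]
clusters = cluster(villages)
print(clusters)
```

Values that differ only in case, surrounding whitespace, or punctuation end up under the same key, so they are offered as one candidate group to review and merge.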

Filtering and Sorting with OpenRefine
  • OpenRefine provides a way to sort and filter data without affecting the raw data.

Examining Outliers in OpenRefine
  • OpenRefine also provides ways to get overviews of numerical data.
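
As a rough illustration of how an overview of numerical data can expose outliers, the sketch below bins a column of numbers and prints a text histogram. The values, the `histogram` helper, and the bin width are all assumptions made up for the example; OpenRefine presents this kind of overview interactively.

```python
from collections import Counter

# Hypothetical household sizes; 120 is a plausible data-entry error.
members = [3, 7, 10, 7, 6, 8, 120, 4, 5, 6]


def histogram(values, bin_width):
    """Count values per bin, keyed by each bin's lower edge."""
    return Counter((v // bin_width) * bin_width for v in values)


bins = histogram(members, bin_width=10)
for lower in sorted(bins):
    # One '#' per value that falls in the bin starting at `lower`.
    print(f"{lower}: {'#' * bins[lower]}")
```

A bin that sits far from the others and holds very few values is a candidate to inspect by hand before deciding whether it is an error.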

Using Scripts
  • OpenRefine tracks all changes, and this information can be used in scripts for future analyses or to reproduce an analysis.
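
To make the idea of a reusable record of steps concrete, here is a minimal sketch that reads a list of recorded steps stored as JSON and prints them in order. The JSON is a hand-written, hypothetical extract: the field names (`op`, `columnName`, `expression`, `description`) are modelled on the general shape of an OpenRefine operation history, but the exact contents of a real export may differ, and the column name is invented.

```python
import json

# Hypothetical operation history, hand-written for illustration only.
history_json = """
[
  {"op": "core/text-transform", "columnName": "village",
   "expression": "value.trim()",
   "description": "Trim whitespace in column village"},
  {"op": "core/mass-edit", "columnName": "village",
   "description": "Merge clustered values in column village"}
]
"""


def summarize(history_text):
    """Return one (operation, column, description) tuple per recorded step."""
    steps = json.loads(history_text)
    return [(s.get("op"), s.get("columnName"), s.get("description", "")) for s in steps]


summary = summarize(history_json)
for number, (op, column, description) in enumerate(summary, start=1):
    print(f"{number}. {op} on {column}: {description}")
```

Because the steps are plain structured text, they can be saved alongside the data, reviewed by a collaborator, or applied again to a new file with the same layout.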

Exporting and Saving Data from OpenRefine
  • Cleaned data or entire projects can be exported from OpenRefine.

  • Projects can be shared with collaborators, enabling them to see, reproduce and check all data cleaning steps you performed.

Other Resources in OpenRefine
  • Other examples and resources online are good for learning more about OpenRefine.

Glossary

OpenRefine can import a variety of file types, including tab-separated (tsv), comma-separated (csv), Excel (xls, xlsx), JSON, XML, RDF as XML, and Google Spreadsheets.

csv
A file extension indicating that a text file has values separated by commas (comma-separated values).
Clustering
A method for finding groups of different values that may actually represent the same thing.
Faceting
A method for exploring the values in a variable. In this episode it is used to explore the values in order to identify errors in data entry.
Filter
To select a subset of data from a dataset.
JSON
A file extension indicating that the values in a text file are structured using JavaScript Object Notation (JSON).
RDF
A file extension indicating that the values in a file are structured using Resource Description Framework (RDF).
Regular expressions (regex)
Text strings that describe a search pattern. They usually incorporate wildcards to match letters, numbers, punctuation, spacing, or some combination of these.
tsv
A file extension indicating that a text file has values separated by tabs (tab-separated values).
xls
A file extension indicating that a file is a spreadsheet created by Microsoft Excel.
xlsx
A file extension indicating that a file is a spreadsheet created by Microsoft Excel using XML.
XML
A file extension indicating that the values in a file are structured using Extensible Markup Language (XML).
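
The "Regular expressions (regex)" entry above can be illustrated with a short sketch. It uses Python's `re` module rather than OpenRefine's own expression language, and the sample values are invented; the pattern syntax shown here (digit wildcards with repeat counts) is common to most regex dialects.

```python
import re

# Hypothetical messy date values, invented for illustration.
dates = ["2016-11-17", "17/11/2016", "2016-4-3", "unknown"]

# Four digits, a hyphen, two digits, a hyphen, two digits, and nothing else.
iso_date = re.compile(r"\d{4}-\d{2}-\d{2}")

matching = [d for d in dates if iso_date.fullmatch(d)]
not_matching = [d for d in dates if not iso_date.fullmatch(d)]

print("match the pattern:", matching)
print("need attention:   ", not_matching)
```

Here `\d` is the wildcard for any digit and `{4}` means "exactly four times", so the pattern separates consistently formatted values from those that need cleaning.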