Getting Started with Google Refine (now OpenRefine)

Start by downloading the .zip file located at: Refine

This download link is currently active, although the project is migrating to GitHub.

Check the project's main website if there are questions about how to download and install from GitHub. The site can be found at:

openrefine.org

(best viewed on Firefox web browser)

Save the file to your desktop, unzip it, and then double-click the file google-refine.exe.

This will launch Refine in a browser window. If it does not automatically launch, paste this address into the address bar of your web browser: http://127.0.0.1:3333/
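
If the browser window does not appear, it can help to confirm that something is listening at that address before pasting it in. The following is a minimal sketch in Python and is not part of Refine. The helper names are made up for illustration, and the check only shows that the port accepts connections, not that the listener is actually Refine.

```python
import socket

REFINE_HOST = "127.0.0.1"  # address given in the text above
REFINE_PORT = 3333         # port given in the text above


def refine_url(host=REFINE_HOST, port=REFINE_PORT):
    """Build the address to paste into the browser (hypothetical helper)."""
    return f"http://{host}:{port}/"


def is_listening(host=REFINE_HOST, port=REFINE_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    if is_listening():
        print("Something is listening; open", refine_url())
    else:
        print("Nothing is listening on", refine_url(), "- is Refine running?")
```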

Refine will work with many types of files. Because our end goal is to upload into Specify, keeping everything in spreadsheet form is preferred. Refine will work with both .xls and .xlsx files.

First, create a project in Refine by uploading a dataset.
Once Refine has finished verifying your data, it shows an intermediary screen where you can name your project (1), select which worksheets are imported (2), and set some data-handling options (3). Select "Create Project" when you are satisfied with the dataset.

By default, Refine displays 10 rows. You can have it display up to 50, but not more (1). Refine is not a tool for modifying data within cells one at a time; it is best used for dealing with whole swaths of data. It does that with a tool called 'facet' (2), an option you find by clicking the down-arrow on whichever column you wish to facet. Faceting works like a filter for selecting data that meets a certain criterion: it can be a word, the length of an entry, or simply a grouping of values by how many times each occurs. You can also facet several columns at once to get a very precise set of data, which you can then act on. In the example below, the facet was set to 'text facet' (3). Faceting the column this way shows the data in the cells (4) and how many times each value is used. The column 'Type status' (4) has only a handful of distinct values (4 choices in this case). Something like 'Collection Number' would have many: 411 choices (5). Please note that facets can be sorted by name or by count.
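
For readers who think in code, the idea behind a text facet can be sketched in a few lines of Python. This is only an illustration of the concept (group identical cell values and count them), not how Refine is implemented. The sample values and function names are made up.

```python
from collections import Counter

# Made-up sample values for a 'Type status' column (illustration only).
type_status = [
    "Holotype", "Figured", "Figured*", "mentioned*",
    "Figured", "Holotype", "mentioned*",
]


def text_facet(values):
    """Count how many times each distinct cell value occurs."""
    return Counter(values)


def sort_by_name(facet):
    """Facet choices in alphabetical order, ignoring case."""
    return sorted(facet.items(), key=lambda item: item[0].lower())


def sort_by_count(facet):
    """Facet choices with the most frequent first; ties broken by name."""
    return sorted(facet.items(), key=lambda item: (-item[1], item[0].lower()))


def rows_matching(values, choice):
    """Row indexes whose cell equals the selected facet choice."""
    return [i for i, v in enumerate(values) if v == choice]


facet = text_facet(type_status)
for name, count in sort_by_count(facet):
    print(f"{name}: {count}")
```

Selecting a choice in the facet corresponds roughly to `rows_matching`: only the rows with that value remain in view, and any later edit applies to just those rows.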
Taking a closer look at the "Type Status" facet, we see many entries that won't upload into Specify. There are 21 entries that read "mentioned*", and six that say "Figured*". Clicking the 'edit' option that appears when we hover over a selection opens an edit box, where we can change all 6 'Figured*' entries at once. The number of choices now becomes 3, and the number of entries that say 'Figured' has gone from 21 to 27. Specify expects to see 'referred', not 'mentioned*', so we use the same process to change those records. This can be done in many other columns as well, such as Building (adding 'Building' to 122 and 33). You can also use 'edit' to fill in blanks, such as adding 'dry' to all the blank entries in 'prep type', and so on. Sorting a facet by name also helps identify typos and misspellings ('Texas' has 300 entries, while 'Texsa' has 4).
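
The bulk edit described above amounts to replacing one value with another in every row that carries it. Here is a rough Python sketch of that idea, assuming the column is a plain list of strings. The counts for 'Figured', 'Figured*' and 'mentioned*' follow the text, while 'Holotype' and its count are invented to stand in for the fourth choice. The function name is hypothetical.

```python
from collections import Counter


def edit_facet_value(values, old, new):
    """Replace every cell equal to `old` with `new`; leave other cells alone."""
    return [new if v == old else v for v in values]


# 'Holotype' and its count are made up; the other counts follow the text.
column = (
    ["Figured"] * 21
    + ["Figured*"] * 6
    + ["mentioned*"] * 21
    + ["Holotype"] * 5
)

before = Counter(column)
print(len(before), before["Figured"])   # 4 21

# Change all six 'Figured*' entries in one step.
column = edit_facet_value(column, "Figured*", "Figured")
after = Counter(column)
print(len(after), after["Figured"])     # 3 27

# Same process for the value Specify does not accept.
column = edit_facet_value(column, "mentioned*", "referred")
final = Counter(column)
```

Filling blanks works the same way if blank cells are represented as empty strings, for example `edit_facet_value(prep_type, "", "dry")` for a hypothetical `prep_type` list.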

Faceting by words can help mine data out of comments fields. From the drop-down on the column you are going to mine, select 'Facet', then 'Customized Facets', and from that submenu select 'Word facet'. You can also facet for patterns (like lat/long entries in degrees, minutes, seconds) using regex.
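
As an illustration of both ideas, the Python sketch below counts in how many rows each word appears and flags comments that contain a coordinate in degrees, minutes, and seconds. It is a conceptual sketch rather than Refine's own logic. The sample comments are invented, and the regular expression assumes coordinates written like 30°17'05"N, which may not match the notation in your data.

```python
import re
from collections import Counter


def word_facet(comments):
    """Count the rows in which each word appears (once per row).

    Words are split on whitespace, so punctuation stays attached.
    """
    counts = Counter()
    for text in comments:
        counts.update(set(text.lower().split()))
    return counts


# Assumed notation: degrees ° minutes ' seconds " hemisphere letter.
DMS_PATTERN = re.compile(
    r"""\d{1,3}\s*°\s*\d{1,2}\s*'\s*\d{1,2}(?:\.\d+)?\s*"\s*[NSEW]"""
)


def has_dms(text):
    """True if the text contains a coordinate in the assumed notation."""
    return DMS_PATTERN.search(text) is not None


# Invented sample comments.
comments = [
    "cast of skull",
    "Cast of femur",
    "collected at 30°17'05\"N 97°44'20\"W",
]

words = word_facet(comments)
print(words["cast"])                      # 2
print([has_dms(c) for c in comments])     # [False, False, True]
```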
Moving the data is easy. Faceting the Comments column for entries with the word 'cast' shows us which entries were noted in the inventory as being casts. We can see in the Text Facet of the Comments column that the word 'cast' is being used to mean a cast of a specimen. We can then see that the PrepType Text Facet is blank for those records. Edit the blank cells in bulk using the edit option in the facet display box: change (blank) to Cast and apply the change. This process can be done on any type of data.
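
The combination of two facets used here (a word in Comments, a blank in PrepType) can be sketched in Python as follows. This is only a conceptual illustration: the rows are invented, blank cells are assumed to be empty strings, and the function name is hypothetical.

```python
def fill_prep_type(rows, word="cast", new_value="Cast"):
    """Set PrepType to `new_value` where Comments contains `word`
    as a whole word and PrepType is blank. Returns new row dicts."""
    updated = []
    for row in rows:
        row = dict(row)  # copy so the input rows are not modified
        has_word = word in row["Comments"].lower().split()
        is_blank = not row["PrepType"].strip()
        if has_word and is_blank:
            row["PrepType"] = new_value
        updated.append(row)
    return updated


# Invented sample rows.
rows = [
    {"Comments": "cast of skull", "PrepType": ""},
    {"Comments": "original specimen", "PrepType": ""},
    {"Comments": "plaster cast", "PrepType": "Mold"},
]

result = fill_prep_type(rows)
for row in result:
    print(row)
```

Only the first row changes: the second has no matching word, and the third already has a PrepType, so it is left alone, just as non-blank cells are untouched when you edit the (blank) choice in the facet.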