Collection dates, collector names and determiner names were edited in a stepwise fashion as outlined in detail below. Two collection dates, a begin date and an end date, were assigned to all records. Null dates were assigned conservative begin and end dates based on other information known about the collectors associated with the record. In some instances, dates were recognized to be incorrect and were edited. Collector and determiner names were edited based largely on information contained in other records in the database.

...

The verbatim collector names as received from our donors were typically stored as a single string in one field, with multiple collectors' names in diverse and inconsistent formats. Names were often misspelled and frequently incomplete, and multiple names were inconsistently separated by various punctuation marks or not separated at all. In Microsoft Excel, a copy of the original verbatim collector name strings was parsed into individual collector names in separate columns (using a mix of automated formulae and manual methods), and each name was assigned a number corresponding to its position in the original text string. Individual collector names were then further parsed and re-ordered to put all in the order last name, first and middle name, prefix. The maximum number of collectors for any record was 8, so 8 new sets of name-related fields were constructed for each record. Most records contain far fewer than 8 collectors and thus many records have blank values for all but the first few collector fields.
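The parsing step was done in Excel with formulae and manual review, so the following Python is only a rough sketch of the same idea, not the procedure actually used. The separator list, the function names `split_collectors` and `last_name_first`, and the sample names are all assumptions made for illustration; real donor strings would need far more manual handling.

```python
import re

# Assumed separators (semicolon, ampersand, slash, the word "and").
# Commas are deliberately NOT treated as separators because some
# strings already use the "Last, First" form.
_SEPARATORS = re.compile(r"\s*(?:;|&|/|\band\b)\s*", flags=re.IGNORECASE)

MAX_COLLECTORS = 8  # maximum number of collectors per record stated in the text


def split_collectors(verbatim):
    """Split one verbatim collector string into (position, name) pairs.

    Positions start at 1 and follow the order in the original string.
    """
    parts = [p.strip(" ,") for p in _SEPARATORS.split(verbatim or "")]
    names = [p for p in parts if p]
    if len(names) > MAX_COLLECTORS:
        raise ValueError(f"more than {MAX_COLLECTORS} collectors: {verbatim!r}")
    return list(enumerate(names, start=1))


def last_name_first(name):
    """Reorder 'First Middle Last' to 'Last, First Middle'.

    Names that already contain a comma, or that have a single token,
    are returned unchanged and left for manual review.
    """
    tokens = name.split()
    if "," in name or len(tokens) < 2:
        return name
    return f"{tokens[-1]}, {' '.join(tokens[:-1])}"
```

For example, `split_collectors("A. Smith; B. Jones & class")` yields three numbered names, and `last_name_first("B. K. Jones")` gives `"Jones, B. K."`. Prefixes and suffixes are not handled here.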

Individual names were then compiled into a single column to facilitate sorting and synonymization of names. During this process names were corrected for spelling and partial names were replaced by full names when possible, but we were sometimes limited by our lack of familiarity with many of the collectors. The sometimes-used 'et al' in original strings was dropped from the collector string, but 'family', university class names, and other group identities were maintained as distinct collectors. After the diverse permutations of collector names were synonymized, they were brought back together to their original positions (multiple collectors per record) for the subsequent steps.
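As a hedged illustration of synonymization, the sketch below maps name variants to one preferred form through a hand-built lookup table and strips a trailing 'et al'. The table contents, the normalization rule, and the names `drop_et_al` and `synonymize` are hypothetical; in practice the mapping was compiled manually by sorting and reviewing the names.

```python
import re

# Matches a trailing "et al" / "et al." (optionally preceded by , or ;).
_ET_AL = re.compile(r"[,;]?\s*\bet\.?\s+al\.?\s*$", flags=re.IGNORECASE)


def drop_et_al(verbatim):
    """Remove a trailing 'et al' from a verbatim collector string."""
    return _ET_AL.sub("", verbatim).strip()


def synonymize(name, synonyms):
    """Return the preferred form of a name, or the name itself if unknown.

    `synonyms` is a hypothetical hand-built dict keyed on a normalized
    variant (lower case, periods removed, whitespace collapsed).
    """
    key = " ".join(name.lower().replace(".", " ").split())
    return synonyms.get(key, name)
```

Group identities such as 'family' or a class name simply pass through `synonymize` unchanged unless they are added to the table, which matches the decision to keep them as distinct collectors.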

...

The database was sorted by the standardized locality name and collection date to approximate collecting events (true collecting events would require grouping by collector as well) and each approximated collecting event was then assigned a group number. Within each group, collectors were reviewed, and in cases where at least one collector was found in all records within the group, all collector names were applied to all records. Thus, for a given group, we assumed that if one record included collector A only, while another record from the group had collectors A, B and C, then A, B and C all participated in the collecting event, and the first record therefore should also have collectors B and C. In many cases records with blank or 'unknown' collectors were assigned the collectors from other records in the group. Efforts were also made to consider the original locality description, and in cases where the georeferenced locality name was more general (e.g., Lake Travis) and the original locality name was more specific (e.g., Lake Travis at site A), collectors were only synonymized to the level of the original donor's verbatim location. During this process additional errors in collector name spellings, missed during step 1, were identified and corrected.
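The propagation rule can be sketched as follows. This is a minimal illustration under stated assumptions: each group is a list of collector-name lists (one per record), blank records are ignored when testing for a shared collector, and the merged list is returned in alphabetical order rather than in original position. The function name `propagate_collectors` is hypothetical, and the sketch does not model the locality-specificity check described above.

```python
def propagate_collectors(group):
    """Apply the union of collectors to every record in a group,
    but only when at least one collector appears in every non-blank record.

    `group` is a list of lists of collector names; an empty list stands
    for a blank or 'unknown' collector field.
    """
    known = [set(names) for names in group if names]
    if not known:
        return [list(names) for names in group]  # nothing to propagate

    shared = set.intersection(*known)
    if not shared:
        # No collector common to all records: leave the group untouched.
        return [list(names) for names in group]

    merged = sorted(set.union(*known))
    return [list(merged) for _ in group]
```

With the example from the text, a group holding one record with collector A, one with A, B and C, and one blank record ends up with A, B and C on all three records.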

...

Using Microsoft Excel, all raw donor dates were reformatted as text in separate year, month, and day fields, manually verified against original donor data, and then preserved as 'verbatim dates' in the database. The database was sorted on our recently edited collector fields and our standardized locality name to approximate collecting events. Each approximated collecting event (true collecting events would require grouping by date as well) was then assigned a group number. Within each group, dates greater than two years apart were reviewed. Many of these groups could easily be determined to be discrete collection events occurring on different dates and thus apparently correct in the database, but in many cases we were able to confidently edit dates. In some rare cases this was a rather subjective act, but in general a conservative approach was taken and we tried to avoid making changes that could be controversial. Problems were often (but not always) identified and corrected under the following conditions: (1) the day and month were the same among all records within a group, but the year differed for one; (2) the month or day was null in a record but present in other records within the same group; or (3) a date was not possible, such as a collection from the future or outside of a collector's lifetime. Not counting completely null dates (their editing process is described in step 4), 118 track 1 records were confidently determined to have erroneous dates, and we felt it appropriate to edit the verbatim date in some way.
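The review itself was manual, but the screening conditions can be approximated in code. The sketch below is an assumption-laden illustration: dates are `(year, month, day)` tuples with `None` for missing parts, "more than two years apart" is approximated by the difference between the earliest and latest year, and the flag names and the function `review_flags` are invented here. Condition (3), impossible dates, needs collector lifetimes and is left out.

```python
from collections import Counter


def review_flags(dates, max_span_years=2):
    """Return a set of reasons why a group of dates deserves manual review."""
    # Completely null dates are handled separately (step 4), so skip them.
    dated = [t for t in dates if any(part is not None for part in t)]
    flags = set()

    years = [y for y, _, _ in dated if y is not None]
    if years and max(years) - min(years) > max_span_years:
        flags.add("wide_span")

    # Condition (1): identical month and day everywhere, one odd year.
    month_days = {(m, d) for _, m, d in dated}
    if len(month_days) == 1 and None not in next(iter(month_days)):
        counts = Counter(years)
        if len(years) > 2 and len(counts) == 2 and min(counts.values()) == 1:
            flags.add("odd_year")

    # Condition (2): month or day missing in some records, present in others.
    months = [m for _, m, _ in dated]
    days = [d for _, _, d in dated]
    for values in (months, days):
        if any(v is None for v in values) and any(v is not None for v in values):
            flags.add("partial_null")

    return flags
```

A flag only marks a group for inspection; whether to edit a date remained a case-by-case judgment.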

...

To ensure that all records could be retrieved by date-based queries, for all records lacking dates, i.e., having 'null' dates (3.7% of the track 1 database; 2,996 records), we either determined actual collection dates or derived conservative estimates of date ranges that we were confident bracketed the actual collection dates. To do this the database was sorted on collector and georeferenced location to approximate collecting events and each approximated collecting event was then assigned a group number. Within each group, null dates (dates with day, month and year empty) were reviewed. All null dates were then converted to ranges determined using one or a combination of 4 methods: 1) For collectors we recognized as being older (before approximately 1930) we reviewed historic documentation about the collector found online, but this was done quickly and conservative estimates of date ranges were always applied. We hope users will keep this in mind as they use the data and report any possible refinements to us. 2) For collections made by collectors well-represented in the database we used the first and last years of their collections in our database. 3) For other records we set the date extent based on our personal knowledge of that person's collecting or determining activity and other evidence from the database. In many cases we were able to use the determination year to define the upper limit of the date range, but usually these records have large date ranges. 4) In cases where none of the above was possible, we defined the date range as 1830 (approximately 2 decades prior to the first date recorded in the database) to the last date in the data track.
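Methods 2 and 4, plus the determination-year cap mentioned under method 3, lend themselves to a short sketch. Methods 1 and 3 rely on outside documentation and personal knowledge and are not modeled. The function names, the `(collector, year)` input shape, and the choice to expand years to 1 January and 31 December are assumptions for illustration; only the 1830 fallback year comes from the text.

```python
from datetime import date

FALLBACK_BEGIN_YEAR = 1830  # stated in the text as the default lower bound


def collector_extents(records):
    """Map each collector to (first_year, last_year) from dated records.

    `records` is an iterable of (collector, year) pairs; year may be None.
    """
    extents = {}
    for collector, year in records:
        if year is None:
            continue
        lo, hi = extents.get(collector, (year, year))
        extents[collector] = (min(lo, year), max(hi, year))
    return extents


def null_date_range(collector, extents, last_date, determination_year=None):
    """Return a conservative (begin, end) pair for a record with no date."""
    if collector in extents:  # method 2: collector's own date extent
        first, last = extents[collector]
        begin, end = date(first, 1, 1), date(last, 12, 31)
    else:  # method 4: widest possible bracket
        begin, end = date(FALLBACK_BEGIN_YEAR, 1, 1), last_date
    if determination_year is not None:
        # A specimen cannot be determined before it is collected.
        end = min(end, date(determination_year, 12, 31))
    return begin, end
```

Here `last_date` stands for the last date in the data track. The sketch does not guard against a determination year that precedes the begin date, which would itself signal a record needing review.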

In some instances, the collectors' verbatim field numbers, which typically include the date or at least the year, could be used to extract accurate date information.

Many of our oldest records, collected during Texas boundary and railroad surveys in the early to mid-1800s, lacked dates. In version 3 of the database we updated most of the oldest records lacking dates by reviewing the original survey reports and maps to find dates or extrapolate date ranges.

Step 5: Editing Determiner

...

Efforts to correct collectors' names also provided a method for examining collection dates. A pivot table was created in Excel that counted the number of records for each collector in each collection year, thus allowing quick detection of date outliers for any collector. However, because this table was very large, we narrowed the search for outliers by extracting only those collectors with gaps of more than 10 years in their collecting activity, a condition that we suspected was likely to indicate errors in either collector name or date of collection. Using this method, we identified 145 collector names in need of examination, and of those we were able to correct 18 by editing either the date of collection or the collector name. We suspect numerous others of being legitimate errors, but could not find sufficient justification for changing the data from the verbatim original, nor did we have the resources to explore this method at a finer temporal scale.
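The gap screen was run from an Excel pivot table; an equivalent check can be sketched in Python as below. The input shape, the function name `collectors_with_gaps`, and the reading of "gap" as the difference between two consecutive years of activity are assumptions, not details confirmed by the text.

```python
def collectors_with_gaps(records, min_gap_years=10):
    """Find collectors whose active years are separated by long gaps.

    `records` is an iterable of (collector, year) pairs; year may be None.
    Returns {collector: [(year_before_gap, year_after_gap), ...]}.
    """
    years_by_collector = {}
    for collector, year in records:
        if year is not None:
            years_by_collector.setdefault(collector, set()).add(year)

    flagged = {}
    for collector, years in years_by_collector.items():
        ordered = sorted(years)
        gaps = [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > min_gap_years]
        if gaps:
            flagged[collector] = gaps
    return flagged
```

A flagged collector is only a candidate for review: a long gap may reflect a real pause in collecting, two people sharing a name, or a mistyped year.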