Your data sucks

I have worked with the user-related data of well over 100 organizations in my last 5 years as an Identity Management consultant. Without exception or reservation, it has all been garbage in one way or another (sometimes in multiple ways).

It may be a bit of hyperbole to call it garbage, but it is at least fair to question whether it is useful. It is useful in the context of HR and Payroll operations, but that is typically the extent of its utility. Even if we disregard the common data-entry challenges, we are still left with a larger root problem: territorialism.

Just like documents on the file system, this data belongs to the organization, not to HR. HR is merely the custodian of that information. It takes concerted, long-term effort and attention to maintain this information in a consistent and useful form. First and foremost among the problem areas is 'simple' name data.

Useless Name Data

Using three common fields as an example (firstName, middleName, lastName), let's take a look at the typical challenges.

The field containing the first or given name is usually the 'cleanest' although some variation of the following is occasionally present:

Value | Condition
Madeleine | Standard and acceptable
MARYANN | Unclear capitalization or punctuation: Maryann, MaryAnn, Mary Ann
Madeleine (Maddy) | Multi-value for preferred name or nickname. This is especially challenging when the convention is not universally applied.
Madeleine R. | Unacceptable inclusion of middle initial. Common when the middle name is not typically captured and the initial is added ad hoc to differentiate a duplicate name. Also occurs when the middleName field is either not present or in a less convenient area for data entry.


The middle name field is where things can really begin to go awry.


Value | Condition
Mitchell | Standard and acceptable
M, M. | Initial, not name
(None), NMN | Absence of a middle name should mean the complete absence of data instead of a notation indicating 'No Middle Name'
Armando Garcia | Inconsistency can cause issues here. Are there two middle names or is 'Garcia' the first part of a multipart surname?
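As a rough illustration of the placeholder problem, here is a minimal Python sketch that turns 'No Middle Name' notations into a true absence of data and flags initials. The function names and the placeholder list are my own assumptions for the example, not part of any HR platform; extend the list with whatever your dataset actually contains.

```python
# Hypothetical placeholder values; extend with what your own data contains.
PLACEHOLDERS = {"", "none", "(none)", "nmn", "n/a"}


def clean_middle_name(raw):
    """Return the stripped middle name, or None when it is only a placeholder."""
    if raw is None:
        return None
    value = raw.strip()
    if value.casefold() in PLACEHOLDERS:
        return None
    return value


def is_initial_only(value):
    """True for values such as 'M' or 'M.' that are an initial, not a name."""
    return value is not None and len(value.rstrip(".")) == 1
```

Note that nothing here can answer the 'Armando Garcia' question. Whether that is two middle names or half of a surname is something only the person (or a consistent convention) can settle.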


If middle names are challenging, all bets are off when it comes to last names. Issues of capitalization, punctuation, and 'bonus' data abound.


Value | Condition
VANNEUMAN | This should be represented elsewhere as VanNeuman; however, programmatic rules for this are impossible to create. A rule that produces VanNeuman would also turn VANDERBILT into VanDerbilt.
VAN DE KAMP | More difficult yet are names where not all parts are capitalized: Van de Kamp
van Dyke | Same issue as above
Hyphenations | Hyphenated names get split inconsistently between users: some take the first part, others the last, and still others use the full version.
Garcia Perez | Data shows a multipart last name but the specification is for one (similar to hyphenations)
Smith, Jr. | Suffixes present with last name, see below.
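To make the capitalization problem concrete, here is a small Python sketch that applies the obvious rule (title-case every word, using the built-in str.title()) to the values above. The helper name is mine and purely illustrative.

```python
def naive_title_case(name):
    """The 'obvious' rule: capitalize the first letter of every word."""
    return name.title()


samples = ["VANNEUMAN", "VANDERBILT", "VAN DE KAMP", "van Dyke"]
results = {name: naive_title_case(name) for name in samples}

for original, converted in results.items():
    print(f"{original!r} -> {converted!r}")
# 'VANNEUMAN' -> 'Vanneuman'       (the person writes VanNeuman)
# 'VANDERBILT' -> 'Vanderbilt'     (happens to be right)
# 'VAN DE KAMP' -> 'Van De Kamp'   (the person writes Van de Kamp)
# 'van Dyke' -> 'Van Dyke'         (destroys a deliberate lowercase 'van')
```

The same rule is right for one name and wrong for the next, and nothing in the all-caps string tells you which case you are in. The preferred spelling has to be captured at entry, not derived afterwards.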


Other issues like generational or professional suffixes can also cause problems. I have seen the following in the same dataset:


Value | Condition
Junior, Jr, JR, Jr., JR. | Even more challenging when the division between name and suffix is not consistent: Smith, Jr / Smith Jr / Smith,Jr
Senior, Sr, SR, Sr., SR., Snr. |
Roman Numerals |
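For cleanup of existing records, a sketch along these lines can pull a trailing suffix out of the last name and map it to a single spelling. The canonical forms ('Jr.', 'Sr.') and the function name are assumptions for the example; the organization has to pick its own convention. It only recognizes the spellings listed, and a surname whose final word happens to match one of them would be split incorrectly, so review the output rather than trusting it blindly.

```python
import re

# Hypothetical convention: every observed spelling maps to one canonical form.
CANONICAL_SUFFIXES = {
    "junior": "Jr.", "jr": "Jr.",
    "senior": "Sr.", "sr": "Sr.", "snr": "Sr.",
    "ii": "II", "iii": "III", "iv": "IV",
}

# Last name, then an optional comma and/or spaces, then one trailing word
# with an optional period.
_SUFFIX_PATTERN = re.compile(
    r"^(?P<last>.+?)\s*,?\s*(?P<suffix>[A-Za-z]+)\.?$"
)


def split_suffix(raw_last_name):
    """Return (last_name, canonical_suffix_or_None) for a raw lastName value."""
    text = raw_last_name.strip()
    match = _SUFFIX_PATTERN.match(text)
    if match:
        key = match.group("suffix").casefold()
        if key in CANONICAL_SUFFIXES:
            return match.group("last"), CANONICAL_SUFFIXES[key]
    # No recognized suffix: leave the value untouched.
    return text, None
```

With this, 'Smith, Jr', 'Smith Jr', 'Smith,Jr' and 'Smith, JR.' all come back as ('Smith', 'Jr.'), while 'Garcia Perez' is returned unchanged with no suffix.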


Fix it

The most important correction is to use each field for its specified data and nothing else. After that, the next step is to extend the schema to support additional values, e.g. preferredFirstName, preferredLastName, initials, suffixLastName, etc.
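As a sketch of what that extended schema might look like, here is a Python dataclass using the field names above. The class name and the display_name helper are hypothetical, added only to show how the extra fields keep nicknames and suffixes out of firstName and lastName.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonName:
    # Legal name: one value per field, nothing else.
    firstName: str
    lastName: str
    middleName: Optional[str] = None
    # Extended fields, so 'bonus' data has somewhere to live.
    preferredFirstName: Optional[str] = None
    preferredLastName: Optional[str] = None
    initials: Optional[str] = None
    suffixLastName: Optional[str] = None

    def display_name(self):
        """Hypothetical helper: prefer the preferred values when present."""
        first = self.preferredFirstName or self.firstName
        last = self.preferredLastName or self.lastName
        parts = [first, last]
        if self.suffixLastName:
            parts.append(self.suffixLastName)
        return " ".join(parts)


person = PersonName(
    firstName="Madeleine",
    lastName="Smith",
    preferredFirstName="Maddy",
    suffixLastName="Jr.",
)
print(person.display_name())  # Maddy Smith Jr.
```

The record that used to read 'Madeleine (Maddy)' / 'Smith, Jr.' now has a clean legal name, and downstream systems can choose which form they need.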

Across the organization, a data standard must be determined, instituted, and maintained. The organization must decide, for example, how to represent 'Junior' across all records.

Finally, if the HR platform supports API connections, it may be beneficial to create a 'front-end' web application for data entry that validates input against the convention, provides a dropdown for acceptable suffix values, and consolidates the relevant data entry into a single view.
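The validation side of such a front-end could be sketched roughly as follows. Everything here is an assumption for illustration: the allowed suffix list, the character rule (letters separated by a single space, apostrophe, or hyphen), and the function name. No real HR platform API is involved; the sketch only checks a submitted record before anything would be sent on.

```python
import re

# Hypothetical dropdown values for the suffix field ("" means no suffix).
ALLOWED_SUFFIXES = ("", "Jr.", "Sr.", "II", "III", "IV")

# Assumed convention: letters only, parts separated by one space,
# apostrophe, or hyphen. No parentheses, commas, periods, or digits.
_NAME_PATTERN = re.compile(r"^[^\W\d_]+(?:[ '\-][^\W\d_]+)*$")


def _check_name(field, value, required, problems):
    """Append a message to problems if value breaks the convention."""
    if not value:
        if required:
            problems.append(f"{field} is required")
        return
    if not _NAME_PATTERN.match(value):
        problems.append(f"{field} contains characters outside the convention")
    elif len(value) > 1 and value.isupper():
        problems.append(f"{field} is all capitals; enter the preferred casing")


def validate_name_entry(entry):
    """Return a list of problems for a submitted record; empty means accept."""
    problems = []
    _check_name("firstName", entry.get("firstName", ""), True, problems)
    _check_name("middleName", entry.get("middleName", ""), False, problems)
    _check_name("lastName", entry.get("lastName", ""), True, problems)
    if entry.get("suffixLastName", "") not in ALLOWED_SUFFIXES:
        problems.append("suffixLastName must be chosen from the allowed list")
    return problems
```

Under these rules 'Madeleine (Maddy)', 'Madeleine R.', 'MARYANN', and 'Smith, Jr.' are all rejected at entry, while 'Van de Kamp' with a suffix of 'Jr.' goes through.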

I have seen an HR system with an 'Allergies' field holding thousands of characters of general employee notes, simply because the field was available on the main screen and 'easy' for HR personnel to reference. If organizations exploit that path of least resistance and make the right field the easiest one to reach, they wind up with more useful information in the form of more complete data.

Plus, it makes my job easier, which could save thousands of dollars on an Identity Management implementation. Take those savings and throw an ice cream party. You're welcome.