What is Data Profiling?

2 Answers

  • 1 decade ago
    Favorite Answer

    Data Profiling is a process whereby one examines the data available in an existing database and collects statistics and information about that data. The purpose of these statistics may be to:

    find out whether existing data can easily be used for other purposes

    give metrics on data quality including whether the data conforms to company standards

    assess the risk involved in integrating data for new applications, including the challenges of joins

    track data quality

    assess whether metadata accurately describes the actual values in the source database

    understand data challenges early in any data-intensive project, so that late-project surprises are avoided. Finding data problems late in a project can cause delays and cost overruns.

    Some companies also look at data profiling as a way to involve business users in what has traditionally been an IT function. Line-of-business users can often provide context about the data, giving meaning to columns that are poorly defined by metadata and documentation.

    Typical types of metadata sought are:

    Domain: whether the data in the column conforms to the defined values or range of values it is expected to take

    for example: ages of children in kindergarten are expected to be between 4 and 5. An age of 7 would be considered out of domain.

    a code for flammable materials is expected to be A, B or C. A code of 3 would be considered out of domain.

    Type: alphabetic or numeric

    Pattern: a North American phone number should be (999)999-9999

    Frequency counts: most of our customers should be in California, so the largest number of occurrences of state code should be CA

    Statistics:

    minimum value

    maximum value

    mean value (average)

    median value

    modal value

    standard deviation

    Interdependency:

    within a table: the zip code field always depends on the country code

    between tables: the customer number on an order should always appear in the customer table
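
    To make these metadata checks concrete, here is a minimal column-profiling sketch using Python and pandas. The customers DataFrame and its column names (age, state, phone, hazard_code) are invented for illustration; a real profiling tool would compute the same measures across an entire source database.

        # Minimal column-profiling sketch; the data and column names are hypothetical.
        import pandas as pd

        customers = pd.DataFrame({
            "age": [4, 5, 5, 7, 4],
            "state": ["CA", "CA", "NY", "CA", "TX"],
            "phone": ["(415)555-0100", "415-555-0101", "(212)555-0102",
                      "(310)555-0103", "(512)555-0104"],
            "hazard_code": ["A", "B", "3", "C", "A"],
        })

        # Statistics: minimum, maximum, mean, median, standard deviation (and mode)
        print(customers["age"].agg(["min", "max", "mean", "median", "std"]))
        print("modal value(s):", customers["age"].mode().tolist())

        # Frequency counts: the most common state code should be CA
        print(customers["state"].value_counts())

        # Domain checks: kindergarten ages should be 4-5; hazard codes should be A, B or C
        print("ages out of domain:", (~customers["age"].between(4, 5)).sum())
        print("codes out of domain:", (~customers["hazard_code"].isin(["A", "B", "C"])).sum())

        # Pattern check: North American phone numbers should look like (999)999-9999
        print("bad phone formats:", (~customers["phone"].str.match(r"^\(\d{3}\)\d{3}-\d{4}$")).sum())

    Each check reduces to a simple aggregate or boolean mask; this is essentially what commercial profiling tools automate at scale across many columns and tables.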

    Broadly speaking, most vendors who provide data profiling tools also provide data quality tools. They often divide the functionality into three categories. The names for these categories differ from vendor to vendor, but the overall process consists of three steps, which must be executed in order:

    Column Profiling (including the statistics and domain examples provided above)

    Dependency Profiling, which identifies intra-table dependencies. Dependency profiling is related to the normalization of a data source: it addresses whether there are non-key attributes that determine, or depend on, other non-key attributes. The existence of such transitive dependencies indicates that the table is not in third normal form.

    Redundancy Profiling, which identifies overlapping values between tables. This is typically used to identify candidate foreign keys, to validate attributes that should be foreign keys (but that may not have constraints enforcing integrity), and to identify other areas of data redundancy. For example, redundancy analysis might reveal that the ZIP field in table A contains the same values as the ZIP_CODE field in table B 80% of the time. A rough sketch of both checks appears below.
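
    As an illustration of the second and third steps, the sketch below tests a single functional dependency and a single cross-table overlap with pandas. The orders and locations tables, and the zip/city/zip_code column names, are hypothetical; vendor tools search for these relationships automatically rather than testing one pair at a time.

        # Rough sketch of dependency and redundancy profiling; tables are hypothetical.
        import pandas as pd

        orders = pd.DataFrame({
            "zip": ["94105", "10001", "M5V1J1", "94105"],
            "city": ["San Francisco", "New York", "Toronto", "San Francisco"],
        })
        locations = pd.DataFrame({"zip_code": ["94105", "10001", "60601", "73301"]})

        def determines(df, lhs, rhs):
            # True if every value of `lhs` maps to exactly one value of `rhs`
            # (an approximate functional-dependency test on the data as it stands).
            return bool((df.groupby(lhs)[rhs].nunique() <= 1).all())

        # Dependency profiling: does zip -> city hold in this sample?
        print("zip determines city:", determines(orders, "zip", "city"))

        # Redundancy profiling: what share of order ZIPs also appear in locations.zip_code?
        overlap = orders["zip"].isin(locations["zip_code"]).mean()
        print(f"ZIP overlap with locations.zip_code: {overlap:.0%}")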

    Column profiling provides critical metadata that is required to perform dependency profiling, and as such must be executed first. Similarly, dependency profiling must be performed before redundancy profiling. While the output of the earlier steps may not be interesting to an analyst, depending on his or her purpose, the analyst will most likely be obliged to move through them anyway. Other information delivery mechanisms may exist, depending on the vendor. Some vendors provide data quality dashboards so that upper management, data governance teams and C-level executives can track enterprise data quality. Still others provide mechanisms for delivering the analysis via XML. Often, these same tools can be used for ongoing monitoring of data quality.

  • Anonymous
    1 decade ago

    Data profiling is a process for assessing current data conditions, or for monitoring data quality over time. It begins with collecting measurements about your data, then examining the results individually and in various combinations to see where anomalies exist.

    Data anomalies are the “needle in the haystack” for technology projects. Even the best systems have them, but they may not cause pain until a data migration or integration project comes along. Once the “needles” are identified, the extract, transformation and load process or tools can remove them.

    Data profiling is attribute, redundancy and dependency analysis. Attribute analysis yields a set of metrics that can be interpreted to reveal inherent business rules, as well as anomalies embedded in source-system data. Redundancy analysis assists in determining the source of record and reduces the occurrence of violated primary keys during integration. Dependency analysis identifies orphan records and validates a normalized model (see the orphan-record sketch below). Together, these analyses make it possible to interpret data meanings and to implement a structured approach to resolving data-related migration and integration issues.
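
    As a small illustration of the orphan-record point, the sketch below uses a left anti-join in pandas; the orders and customers tables and their column names are made up for the example.

        # Hypothetical orphan-record check: orders whose customer_id is missing
        # from the customer table.
        import pandas as pd

        customers = pd.DataFrame({"customer_id": [1, 2, 3]})
        orders = pd.DataFrame({"order_id": [10, 11, 12, 13],
                               "customer_id": [1, 2, 4, 2]})

        merged = orders.merge(customers, on="customer_id", how="left", indicator=True)
        orphans = merged[merged["_merge"] == "left_only"]
        print(orphans[["order_id", "customer_id"]])   # order 12 references missing customer 4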

    Data profiling has a parallel in common business practices. How does your company prove the integrity of its financial position to its owners? “Auditing” is sometimes viewed as a dirty word, but it assures the owners that their decisions are based on reliable financial information. Data profiling ensures that all of your business decisions are based on reliable information.
