
4.2 Stage 2 – Data Integration

4.2.1 Overview

Data Quality is a very important aspect of Data Integration because there is little value in trying to join poor-quality data.

 

 

Wikipedia has an entry on SQL Server Integration Services, one of the commercial ETL products mentioned below :-

http://en.wikipedia.org/wiki/SQL_Server_Integration_Services

 

Question : What does ETL stand for ?

ETL stands for Extract, Transform and Load.

  • Extract means Extracting data from Data Sources.
  • Transform covers many tasks, including –
    • Selection of the data of interest
    • Validation and clean-up of the selected data
    • Changing the format and content of the data
  • Load means Loading the transformed data into the designated Target.

 

In practice, there are three options for implementing ETL:-

  • Develop bespoke SQL (a minimal sketch is shown below)
  • Use a commercial package, such as Informatica or Microsoft’s Integration Services
  • Some combination of these two. For example, developing basic SQL to clarify the Requirements and then looking for a commercial product to meet the Requirements.
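As an illustration of the first option, here is a minimal sketch of bespoke SQL for a single Extract-Transform-Load step. The Staging_Customers and Customers Table and column names are illustrative assumptions, not part of any particular product :-

-- Extract from a staging Table, Transform (simple clean-up), and Load into the Target
INSERT INTO Customers (customer_number, customer_name, street_name)
SELECT customer_number,
       UPPER(LTRIM(RTRIM(customer_name))),        -- standardise the name
       LTRIM(RTRIM(street_name))                  -- remove stray spaces
FROM   Staging_Customers
WHERE  customer_number IS NOT NULL;               -- reject records that fail basic validation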

 

Wikipedia has an entry on ETL which is worth a look :-

http://en.wikipedia.org/wiki/Extract,_transform,_load

 

Case Study

Here’s a Case Study on the Database Answers Web Site about Data Integration in the Clouds

                http://www.databaseanswers.org/data_integration_case_study.htm

 

Connecting Databases

One of the requirements might be to connect separate physical Databases.

In order to achieve this, the requirements can be defined and appropriate products can then be selected from the chosen vendors.

For example, a simple technique is to prefix a Table name with the Database name in an SQL statement.
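In SQL Server, for instance, a query can join Tables in two Databases on the same Server by qualifying the Table names. This is a minimal sketch in which the Sales_DB and Finance_DB names are illustrative assumptions :-

-- Join a Customers Table in one Database to an Orders Table in another
SELECT c.customer_number, c.customer_name, o.order_date
FROM   Sales_DB.dbo.Customers    AS c
       JOIN Finance_DB.dbo.Orders AS o
         ON o.customer_number = c.customer_number;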

 

4.2.2 Best Practice

Data Integration is a major area within Enterprise Data Management.

This Section discusses a number of Topics of importance in Best Practice for Data Integration.

 

Question : What is an Information Catalogue ?

An Information Catalogue can play a very important role in Data Integration.

 

It can also be called a Data Dictionary or an Information Repository.

Wikipedia has an entry for a Data Dictionary that is similar :-

http://en.wikipedia.org/wiki/Data_dictionary

 

 

 

Wikipedia also has an entry for Information Repository which describes something different :-

 

An Information Catalogue can be used to record every activity and decision of importance relating to Data Integration at each Stage.

It can play a role from the beginning, by recording Data Owners and Data Sources.

Then during the Data Integration Stage, it can record details of Tables, Fields, Rules for Validation, Mapping and Translation.

In the Reporting Stage, it can be used to record definitions of Report Templates and specific variations of the Templates for specific Projects.

 

In the Data Governance Stage, it can be used to record decisions and agreements reached in critical aspects of Data Management.
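As an indication of what an Information Catalogue might record, here is a minimal sketch of one possible Table. The Table and column names are illustrative assumptions only :-

-- One generic Table for recording Catalogue entries at each Stage
CREATE TABLE Information_Catalogue_Entries
(
    entry_id       INT          NOT NULL PRIMARY KEY,
    stage_name     VARCHAR(50)  NOT NULL,    -- e.g. 'Data Integration', 'Reporting', 'Data Governance'
    item_type      VARCHAR(50)  NOT NULL,    -- e.g. 'Data Source', 'Table', 'Field', 'Validation Rule', 'Mapping'
    item_name      VARCHAR(128) NOT NULL,
    data_owner     VARCHAR(100) NULL,
    details        VARCHAR(500) NULL,        -- the Rule, Mapping or decision, in plain English
    date_recorded  DATE         NOT NULL
);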

 

The Information Catalogue is best deployed over an Intranet and/or in the Clouds, so that individuals throughout an organization can share the Information, comment on issues and so on.

 

Some commercial products are available, listed here alphabetically :-

 

 

When selecting a commercial product, one criterion should be the ability to exchange data and metadata.

 

This can be achieved by XMI, which is discussed in the answer to the next question.

 

Question : What are XMI and MOF ?

XMI and MOF are both standards put forward by the Object Management Group (OMG).

 

XMI stands for ‘XML Metadata Interchange’ and the most common use is as an interchange format for UML models.

 

MOF stands for ‘Meta-Object Facility’ and can be viewed as a standard for writing Metamodels.

 

Wikipedia has a useful entry for XMI :-

http://en.wikipedia.org/wiki/XML_Metadata_Interchange

and for MOF :-

http://en.wikipedia.org/wiki/Meta-Object_Facility

Question : How do I generate SQL Scripts to load data, manage users and so on ?

Loading data into Databases is a frequent requirement, especially for people playing the role of a Database Administrator (DBA).

 

Loading high volumes of data into an Access database is easy if the data is in a Comma-Separated-Value (‘CSV’) format, where each field in a record is separated by commas.

 

Unfortunately, loading data into other types of database is not so easy.

 

Loading a few records is easy enough to do manually, but loading hundreds or thousands is simply not feasible.

 

Fortunately, help is at hand in the form of a technique that uses Access and the power of SQL to read data from tables and generate Scripts.

The steps involved are as follows :-

1)       Load the CSV file into an Access Database

2)       Run an Access Query that uses SQL to generate a file of INSERT statements

 

For example, if we have a simple Customers Table in both Access and SQL Server that looks like this :-

Customers (customer_number,customer_name,street_name)

 

Then we write an Access Query that looks like this :-

 

SELECT DISTINCT "INSERT INTO Customers (customer_number, customer_name, street_name) VALUES (" & [customer_number] & ", '" & [customer_name] & "', '" & [street_name] & "');" AS insert_statement

FROM Customers

WHERE customer_number IS NOT NULL;

The output of this Query is a set of INSERT statements that can be saved to a text file and run against the SQL Server Database.

 

 

Question : How do I establish a Strategy for Data Quality as an Enterprise Issue ?

A successful Strategy must include both organization and technical aspects.

 

Typical Organization aspects are :-

  • Commitment from senior management
  • Establishment of the slogan “Data Quality is an Enterprise Issue” as a top-down edict
  • Identification of the ‘Top 20’ Applications and Data Owners across the Enterprise
  • Agreement of sign-off procedures with Data Owners and Users

Typical Technical aspects are :-

  • Establish Key Quality Indicators (‘KQIs’), for example duplicate Customer records
  • Agree a target Data Quality percentage
  • Define KQI Reports and Dashboards
  • Develop SQL to measure KQIs (a sketch is shown below)
  • Define procedures to improve KQIs
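Here is a minimal sketch of SQL to measure the example KQI (duplicate Customer records). It assumes a Customers Table with customer_name and postcode columns; real matching rules would be more sophisticated :-

-- Groups of Customer records that share the same name and postcode
SELECT customer_name, postcode, COUNT(*) AS number_of_records
FROM   Customers
GROUP  BY customer_name, postcode
HAVING COUNT(*) > 1
ORDER  BY number_of_records DESC;

Dividing the number of records in these groups by the total number of Customer records gives a simple percentage that can be tracked against the agreed target.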

 

 

 

Question : How do I handle multiple types of Database, such as SQL Server and Oracle ?

The key to handling multiple types of Database is to think of them in terms of an Integrated Data Platform, where all types of data are presented in a common fashion.

This then defines the logical requirement.

There are then a number of options for physically meeting this logical requirement.

The Enterprise-level option is to use an appropriate commercial product, such as Informatica :-

 

Another option, which can be attractive for smaller Projects, is to establish ODBC links to all data sources and then define a Standard Data Integration Platform.
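For example, in SQL Server this can be sketched with a Linked Server defined over an ODBC Data Source Name. The ‘LegacySource’ and ‘LegacyDSN’ names are illustrative assumptions, and the exact four-part name syntax depends on the ODBC driver :-

-- Create a Linked Server that uses the ODBC provider (MSDASQL) and an existing DSN
EXEC sp_addlinkedserver
     @server     = 'LegacySource',
     @srvproduct = '',
     @provider   = 'MSDASQL',
     @datasrc    = 'LegacyDSN';

-- The remote data can then be queried alongside local Tables
SELECT customer_number, customer_name
FROM   LegacySource...Customers;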

 

 

For other commercial Database types there are sometimes specific commercial products available.

For example, for Salesforce.com, a product called DBAmp from ForceAmp.com can be used to provide an SQL Server interface to Salesforce data :-

 

 

 

Question : How do I combine Excel data in my Reports ?

Data in Excel Spreadsheets is structured in a tabular format which corresponds closely to the way in which data is stored in a relational database.

Also, Spreadsheets are commonly used and their data frequently needs to be integrated with other data within an organization.

Therefore we would expect to find a wide range of solutions available to solve this problem.

Here is a small sample :-

  • An ODBC connection can be established for a Spreadsheet.
  • Informatica allows Spreadsheets to be defined as a Data Source.
  • Microsoft’s SQL Server Integration Services provides for Excel to be defined as a Data Source (a related SQL sketch is shown below).
  • Oracle provides a facility to define EXTERNAL tables, which can be based on Spreadsheet data.
  • Salesforce.com provides their Excel Connector.
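As a small related sketch, SQL Server can also query a Spreadsheet directly with an ad-hoc OPENROWSET call. This assumes the ACE OLE DB provider is installed and ‘Ad Hoc Distributed Queries’ is enabled, and the file path and sheet name are illustrative :-

-- Read a worksheet from an Excel file as if it were a Table
SELECT *
FROM   OPENROWSET('Microsoft.ACE.OLEDB.12.0',
                  'Excel 12.0;Database=C:\Data\Customers.xlsx;HDR=YES',
                  'SELECT * FROM [Sheet1$]');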

 

 

 

 

4.2.2.1 Getting Started and Data Architectures

Architectures are vitally important to an understanding of Data Integration.

After the appropriate Architecture has been correctly designed, the choice of Products can be made.

Sometimes these Products might be developed in-house, especially if an organization or individual has relevant experience and a Library of Software Utilities has been established.

 

A number of different Architectures are included in this Section to provide a starting-point for specific Projects.

 

For planning the Steps for a Project, here is a general Approach  :-

Step 1. Establish the Scope of both Sources and Targets.

Step 2. Identify the key Data Owners within the Scope.

Step 3. Define the Mappings between Source and Target Data Items.

Step 4. Agree the minimum acceptable Data Quality standards.

                For example, every Address will be validated.

 

This page lists some useful Web Links for Customer Data Integration :

http://www.databaseanswers.org/customer_data_integration.htm

 
 
4.2.2.2 Mapping Data from Source to Target

Mapping is defined at the field level between all Sources and Targets.

For example, for Local Government, a Voter from the Electoral Register can be mapped to a Customer in the Customer Master Index.

A Parking Ticket Vehicle Owner can also be mapped to the same Customer.
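As a minimal sketch of this Mapping implemented in SQL, assuming an Electoral_Register Source Table and a Customers Target Table (the names and columns are illustrative) :-

-- Map Voters from the Electoral Register to Customers in the Customer Master Index
INSERT INTO Customers (customer_number, customer_name, street_name)
SELECT er.voter_id,
       er.voter_name,
       er.address_line_1
FROM   Electoral_Register AS er
WHERE  er.voter_id IS NOT NULL;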

 
4.2.2.3 Architectures for Data Integration

This diagram shows details of the Data Integration Component in the Architecture shown above.

 

Each of these Steps is a major activity on its own. Several vendors offer software products to help in the implementation of these Steps.

 

An excellent way to gain an understanding of how to solve the problems for a particular organisation is to attend a presentation by a Vendor.

These presentations are free, and are offered on a regular basis all around the world.

 

Data Quality is a major aspect because Data Integration, anywhere in the world, usually involves cleaning up inconsistent and incorrect data.

 

Generic Data Models can be a great help in making progress because they offer productivity gains and an insight into the generic thinking applied to a specific area. Examples available from Database Answers include Customer Services, Entertainment, Law Enforcement and Students.

 


4.2.2.4 Service-Oriented Architecture (SOA)

Question : What is a Service-Oriented Architecture (SOA) ?

An SOA is an Architecture based on loosely-coupled Web Services which can be extended or combined much more easily than is possible with traditional methods. SOA is frequently associated with an Enterprise Service Bus and expensive commercial products, such as Tibco. However, this is not essential and the SOA approach can be adopted and implemented simply as Web Services.

 

This diagram shows that Data is taken from a number of Data Sources, is processed by a Data Services Layer and provides clean and consistent data for a series of Performance Reports.

The SOA Design Approach means that Services can be added to an open-ended Architecture in a flexible manner. It implies that standards are in place based around the use of Web Services.

More details are provided in Section A.19.

The diagram shows that Schedules are loaded from Spreadsheets, then Inspections are made using PDAs and the resulting Inspections data is loaded into a remote Database, where it is used to populate Virtual Data Marts from which Performance Reports are produced.

These Reports include Traffic Lights Maps where areas in trouble are shown in Red and Amber and areas without problems are in Green.

This is based on the use of Key Performance Indicators (‘KPIs’).

 

SOA Architecture

 

 


 

4.2.2.5 Web-Services and Data Quality

The use of Web Services allows some Components in this Architecture to be distributed in the Clouds.

 

This diagram shows a Scenario which will be feasible in the near future, for example, using Informatica OnDemand.

 

In this Scenario, a Data Quality Administrator would use a DQ Admin Console to run Data Quality Procedures on Data which could be either on-premises or in the Clouds.

 

The results of these DQ Procedures would be stored in the Data Dictionary and used for planning, control and audit purposes.

 

 Web Services could be used to apply Clean-Up Rules stored in one remote location to Data Sources stored in another remote location.


4.2.2.6 Data Quality

Data Quality is a very important aspect of Data Integration.

A Strategy for Data Quality as an Enterprise Issue can be established based on the use of Key Quality Indicators (‘KQIs’).

 

Data Profiling Tools can be very effective in measuring Data Quality and KQIs.

 

 

4.2.2.7 Duplicate Records

Question : How do I obtain a Single View of the Customer ?

This requires a method of matching Customers, de-duplication, and the maintenance of a Customer Master Index (‘CMI’), which supports a Single View of a Customer.

 

When there are many sources of similar data, such as Customers, there are frequently duplicate records.

For example, in the US, John Doe could also be called Jon Doe, Johnny Doe, Mr. J. Doe and so on.

In the UK, Joe Bloggs could also be called Joseph Bloggs, Joey Bloggs, Mr. J. Bloggs and so on.

 

The rules for recognizing and resolving this kind of problem have led to the development of software for De-duplicating records. This process is informally referred to as ‘de-duping’, especially by people who do a great deal of it.
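As a very rough sketch of this kind of matching in SQL, assuming a Customers Table with customer_name and postcode columns; commercial De-duplication products apply far more sophisticated rules :-

-- Group Customer records that sound alike (e.g. 'John Doe' and 'Jon Doe') and share a postcode
SELECT SOUNDEX(customer_name) AS name_sound_key,
       postcode,
       COUNT(*)               AS possible_duplicates
FROM   Customers
GROUP  BY SOUNDEX(customer_name), postcode
HAVING COUNT(*) > 1;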

 

Best Practice is to look for a commercial product rather than write your own bespoke software, because bespoke development usually takes longer than expected and commercial products can be quite cheap.

 

This page on the Database Answers Web Site is an excellent starting-point :-

http://www.databaseanswers.org/deduping.htm

 

4.2.2.8 Customer Master Index

Removing Duplicate records for Customers leads to the creation of a Customer Master Index (CMI) which  is very important in establishing a Single View of a Customer.

 

The CMI consists basically of cross-references between each Source System and the single Target System.

 

It will contain one record for each Customer in each Data Source.
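A minimal sketch of such a cross-reference Table, with illustrative names, might look like this :-

-- Cross-reference between Source System Customers and the single 'golden' Customer
CREATE TABLE Customer_Master_Index
(
    master_customer_id  INT         NOT NULL,   -- the single Master Customer
    source_system_code  VARCHAR(20) NOT NULL,   -- e.g. 'ELECTORAL_REGISTER'
    source_customer_id  VARCHAR(50) NOT NULL,   -- the Customer key in that Source System
    date_matched        DATE        NULL,
    PRIMARY KEY (source_system_code, source_customer_id)
);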

 

A Customer Master Index (‘CMI’) supports a Single View of a Customer.

 

Customer Master Index

 


4.2.2.9 Master Data Management

Question : What is Master Data Management (MDM) ?

One of the major components in Master Data Management (‘MDM’) is Customers.

MDM can be defined as ‘Providing a Single View of the Things of Importance within an organisation’.

 

Master Data Management applies the same principles to all the ‘Things of Interest’ in an organisation.

 

This can typically include Employees, Products and Suppliers.

We have discussed ‘A Single View of the Customer’ and MDM involves the same kind of operations as a CMI.

 

That is, identification and removal of duplicates, and putting procedures in place to eliminate duplicates in any new data loaded into the Databases.

 

There is a wide choice of software vendors offering MDM products.

De-duplication and Address validation form a niche market in this area.

 

 

 

On the Database Answers Web Site, there is a Tutorial on Getting Started in MDM :-


4.2.2.10 Data Platform

The building-blocks in this Data Platform represent successive levels that can be put in place in a controlled manner.

 

Each building-block builds on the previous one.

This can be used in the planning and control of Data Management.

 

Data Governance provides a thread of continuity through the process and can ensure the integrity and consistency of the data.

 

Steps to the Data Platform

 

 

 


4.2.2.11 Data Models

Data Integration requires mapping of data from (usually) multiple Sources to a single Target.

 

Mapping is defined at the field level between all Sources and Targets.

 

For example, for Local Government, a Voter from the Electoral Register can be mapped to a Customer in the Customer Master Index.

A Parking Ticket Vehicle Owner can also be mapped to the same Customer.

 

The best way to plan and control this mapping process is to create Data Models for all Sources and Targets.

 

This is a generic Mapping Data Model, and a specific example is shown in the Case Study.
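A minimal sketch of the kind of Tables such a generic Mapping Data Model might contain is shown below; the names are illustrative assumptions :-

-- Data Items on each side, plus a Mapping Table that links them
CREATE TABLE Source_Data_Items
(
    source_data_item_id INT          NOT NULL PRIMARY KEY,
    source_system       VARCHAR(50)  NOT NULL,
    table_name          VARCHAR(128) NOT NULL,
    field_name          VARCHAR(128) NOT NULL
);

CREATE TABLE Target_Data_Items
(
    target_data_item_id INT          NOT NULL PRIMARY KEY,
    table_name          VARCHAR(128) NOT NULL,
    field_name          VARCHAR(128) NOT NULL
);

CREATE TABLE Data_Item_Mappings
(
    source_data_item_id INT NOT NULL REFERENCES Source_Data_Items (source_data_item_id),
    target_data_item_id INT NOT NULL REFERENCES Target_Data_Items (target_data_item_id),
    translation_rule    VARCHAR(255) NULL,      -- e.g. 'convert Voter name to upper case'
    PRIMARY KEY (source_data_item_id, target_data_item_id)
);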


4.2.2.12 Data Lineage

Question : What is Data Lineage ?

Data Lineage can be defined as being able to trace the derivation of all items of data that appear in any important Performance Reports and Management Information.

That includes :-

  • Who owns the original source data
  • What validation and transformations are applied to the data in its life cycle

 

Question : How do I verify the derivation of all data (the ‘Data Lineage’) in my Reports ?

This requires the use of an Information Catalogue that will record the source, processing steps and final delivery in a Report. The transformations in the processing Steps must be specified in both the processing language, for example SQL, and in plain, unambiguous English so that the Data Owner can sign off on the Data Lineage.


 

4.2.3 Templates

Here’s a page on the Database Answers Web Site discussing Performance Reports :-

http://www.databaseanswers.org/tutorial4_integrated_performance_reporting/index.htm

4.2.3.1 Information Catalogue

The Information Catalogue records a range of critical data related to a Data Migration activity.

For example, a list of Entities, Tables, Field Mappings, and Rules for Relationships and Validation.

4.2.3.1.1  Mapping Entities

This Template is used to define the mapping of Entities or Tables from a specific Source to a specific Target.

For example, from an Electoral Register to a Generic Customer Services Data Model (GCDM).

 

This Transformation is supported by Mapping Specifications and the appropriate software.

This software can be either manually-coded SQL, a specialized solution, such as Salesforce’s Excel Connector, or a general-purpose commercial product, such as Informatica.

 

Source Table                    Target Table                    Comment

Example : Electoral Register    Example : Customer

Example : Elections             Example : Customer_Event


4.2.3.1.2  Mapping Attributes

This Template defines the correspondence between Fields in Data Sources and Targets.

An example of this Template in use is included in Section 5.2.3

 

SOURCE TABLE    DATA ITEM    TYPE    VALIDATION    TARGET TABLE    TARGET ATTRIBUTE    COMMENT


4.2.3.1.3  Rules for Relationships

These Business Rules define the conditions that Relationships between Entities must support.

They can be translated into SQL which can be applied as Test Conditions for the Data Warehouse (a sketch is shown after the examples below).

Two sample Rules are provided as examples.

 

  1. Example : An ADDRESS can be associated with zero, one or many CUSTOMER ADDRESSes. For example, many people can live at the same Address.

  2. Example : A CUSTOMER can be associated with zero, one or many CUSTOMER ADDRESSes. For example, Home, Work, Billing, Delivery and so on.
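As an illustration of translating the first Rule into a Test Condition, here is a minimal SQL sketch. It assumes Customer_Addresses and Addresses Tables linked by an address_id Foreign Key, so the names are illustrative only :-

-- Test Condition : every CUSTOMER ADDRESS must refer to an existing ADDRESS
-- (any rows returned are violations of the Rule)
SELECT ca.customer_address_id
FROM   Customer_Addresses AS ca
       LEFT JOIN Addresses AS a
         ON a.address_id = ca.address_id
WHERE  a.address_id IS NULL;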

 


4.2.3.1.4  Rules for Validation

These are the Rules for validation of the data in a Table.

Two example Rules are provided for guidance.

 

DATA ITEM               TYPE       VALIDATION                               COMMENT
Example : address_id    Integer    > 0 and unique                           Unique Identifier for each Address
Example : easting       Integer    A six-digit number, less than 660000     The Easting coordinate for a BLPU
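As a sketch of how these Rules could be applied as SQL checks, assuming an Addresses table with address_id and easting columns (the names are illustrative) :-

-- Rows returned by this query break one of the two example Rules
-- (uniqueness of address_id would be a separate GROUP BY ... HAVING COUNT(*) > 1 check)
SELECT address_id, easting
FROM   Addresses
WHERE  address_id <= 0
   OR  easting <  100000      -- not a six-digit number
   OR  easting >= 660000;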


4.2.4 Tools

It is quite common to develop bespoke software for smaller internal projects with limited scope.

Organisations frequently build up a Library of Data Integration software.

Major vendors for Integration Tools include Informatica and Microsoft.

Details are shown in a separate document.

4.2.5 Tutorials

There are three Tutorials on the Database Answers Web Site that are helpful in planning and executing Data Integration :-

i) Data Quality                      - http://www.databaseanswers.org/presentations/Strategy_for_Data_Quality.ppt

ii) Master Data Mgmt          - http://www.databaseanswers.org/tutorial4_bp_in_mdm/index.htm

iii) MDM and Ref Data        - http://www.databaseanswers.org/presentations/MDM_and_Ref_Data.ppt

4.2.6 How do I ?

4.2.6.1 Plan the Data Integration process ?
  • Identify the Data Stewards
  • Obtain buy-in from key Stakeholders within the organisation.
  • Determine, with the Users, the quality KPIs for key data items.
  • Define the Field Mappings from Sources to Targets.

4.2.7 Qualities for Success in Data Integration

Informatica offers Certification in Data Integration :-

http://www.informatica.com/products_services/education_services/certification/Pages/index.aspx

 

To be competent in this area it is important to have a clear understanding of the end-to-end process of transforming Source data into Target data, and to derive satisfaction from seeing good-quality data loaded and available for subsequent analysis and reporting.

 

Someone who works in this area should also be happy to work with Developers, Managers and End-Users.