Inductis

THE ANALYTIC CONSULTING PROCESS
MARTIN AHRENS, PH.D. Vice President, Methodology & Quality Assurance, INDUCTIS, INC.
 
DEVELOP -- Manage the Data and Create the Model

Data Management

Sidebar 1:

A risk inherent in operational data

The statistical techniques used in analytic consulting were developed for scientific experiments. In an experiment, the levels of the factors to be observed are systematically randomized in order to support a balanced, unbiased analysis. When experimental designs are carefully structured, a lot of information can be gleaned from a relatively small number of observations. On the other hand, when historic business data is mined, there is a lot more data available but no systematic randomization. For example, credit card prices assigned to accounts are intended to optimize profit; issuers do not intend their pricing to be suboptimal. From a scientific point of view, this represents a bias. We can’t draw conclusions about combinations of APR and customer characteristics that don’t exist. To put it more generally: abundance of data can’t compensate for an absence of experimental design.

For example, consider a risk modeling challenge: a large customer portfolio is to be used to build a model to predict the likelihood of default. No matter how large it is, this portfolio has three levels of systematic bias, because no default information is available for (1) applicants who were rejected (the “reject inference” problem), (2) offer recipients who did not apply or (3) other members of the population who were not sent offers. Handling challenges like this requires a combined understanding of statistics, experimental design and the credit card business. Possible solutions include the use of external data sources, ancillary designed experiments and careful limitations on the interpretation and application of conclusions based on operational data.

The data used in most analytic consulting projects was not collected or maintained for that purpose. Rather, it comes from systems designed to manage customer accounts, execute marketing programs, support corporate accounting, manage inventory and perform other routine business functions. One of the key drivers for the rise of analytic consulting over the last 15 years has been the accessibility of all these data sources due to developments in information technology. Sophisticated analytic tools now enable the implementation of comprehensive business models that leverage information about customer behavior at the most granular levels. Marketing, risk and customer management can now be optimized and customized at levels that could only be dreamt of 20 years ago. However, improved data access and better analytic tools also have disadvantages, including the risk of misinformation (see Sidebar 1).

The data management phase of a project begins with the preparation of a Data Request specifying exactly what data is needed. Usually this entails extracts from more than one system (for example, account management, marketing and external credit files), and proper alignment of these sources is a challenge. Records in some files may correspond to customers while in others they are at the account level. In this case, a decision must be made as to which format to retain for modeling and how to incorporate information that is in the other format. Another challenge is the existence of fields in different files that refer to the same (or almost the same) thing. For example, a field representing "current balance" could hold a monthly average, a value from the beginning or end of the month, or a daily value. In addition, not all systems are updated at the same time. The core challenge of the data management phase is decision making and implementation around these issues of alignment and understanding of the data.
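The account-versus-customer alignment issue can be illustrated with a small sketch. It assumes the customer level is the one retained for modeling and rolls account-level rows up to one summary per customer. The field names (customer_id, account_id, balance) and the choice of summaries are hypothetical, not taken from any particular client system.

```python
from collections import defaultdict

# Hypothetical account-level extract: one row per account.
accounts = [
    {"customer_id": "C1", "account_id": "A1", "balance": 1200.0},
    {"customer_id": "C1", "account_id": "A2", "balance": 300.0},
    {"customer_id": "C2", "account_id": "A3", "balance": 50.0},
]

def roll_up_to_customer(rows):
    """Aggregate account-level rows into one summary record per customer."""
    summary = defaultdict(
        lambda: {"n_accounts": 0, "total_balance": 0.0, "max_balance": float("-inf")}
    )
    for row in rows:
        s = summary[row["customer_id"]]
        s["n_accounts"] += 1
        s["total_balance"] += row["balance"]
        s["max_balance"] = max(s["max_balance"], row["balance"])
    return dict(summary)

customers = roll_up_to_customer(accounts)
```

Which summaries to carry forward (totals, maxima, counts) is itself one of the alignment decisions described above; the three shown here are only examples.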

It is often useful to have efficient access to (and familiarity with) major national compiled databases of demographic, household and firmographic data. These sources often significantly extend the datasets available directly from clients.

Once the project master dataset(s) has been prepared, every field must be systematically profiled in order to understand the distribution of available values. Typical summary statistics include:

Data Profile Statistics

Numeric Variables
  • mean
  • median
  • mode
  • standard deviation
  • skew
  • kurtosis
  • minimum
  • maximum
  • various percentiles
  • number of observations
  • number of missing values

Character Variables
  • most common values
  • list of all values found
  • count of values found
  • number of observations
  • number of missing values
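A profile along these lines might be sketched as follows, using only the Python standard library. This is an illustration rather than a description of any specific tool: it assumes missing values are represented as None (or an empty string for character fields), and it omits mode, skew and kurtosis for brevity. The function names are hypothetical.

```python
import statistics
from collections import Counter

def profile_numeric(values):
    """Summary statistics for a numeric field; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "n_obs": len(values),
        "n_missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "std_dev": statistics.stdev(present),  # sample standard deviation
        "min": min(present),
        "max": max(present),
        "quartiles": statistics.quantiles(present, n=4),
    }

def profile_character(values, top=3):
    """Value counts for a character field; None or '' marks a missing value."""
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    return {
        "n_obs": len(values),
        "n_missing": len(values) - len(present),
        "n_distinct": len(counts),
        "most_common": counts.most_common(top),
    }

numeric_profile = profile_numeric([1.0, 2.0, None, 3.0, 4.0])
character_profile = profile_character(["A", "B", "A", None, ""])
```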

Definitive information about the meaning of each field, the range of admissible values and their interpretation is also essential. This is usually obtained from the client, but is not always readily available since most corporations do not have any one person who is deeply familiar with all fields across multiple systems.

Data discovery also entails the examination of relationships among the variables. Usually we are most interested in relationships between key response metrics (for example, default events or spend values) and other variables. This involves the preparation of numerous cross-tabulations, correlation matrices, scatterplots and preliminary regressions.
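As a minimal sketch of one such cross-tabulation, the function below computes the event rate of a binary response within each level of a categorical variable. The field names (segment, defaulted) and the sample records are hypothetical.

```python
from collections import defaultdict

def response_rate_by(records, field, response):
    """Event rate of a 0/1 response within each level of a categorical field."""
    totals = defaultdict(lambda: [0, 0])  # level -> [events, observations]
    for r in records:
        totals[r[field]][0] += r[response]
        totals[r[field]][1] += 1
    return {level: events / n for level, (events, n) in totals.items()}

# Hypothetical records for illustration only.
sample = [
    {"segment": "new", "defaulted": 1},
    {"segment": "new", "defaulted": 0},
    {"segment": "established", "defaulted": 0},
    {"segment": "established", "defaulted": 0},
]
rates = response_rate_by(sample, "segment", "defaulted")
```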

The last major task in data management is the imputation of values strictly for the purpose of subsequent modeling. This entails decisions about the handling of missing and extreme values. Imputation is often necessary in order to enable a particular record to make an appropriate contribution to the analysis. Several decisions need to be made for each field: Why is it missing or extreme? Is there information implicit in this status? What value (if any) should be inserted in order to avoid adding bias to the subsequent analysis? Depending upon the answers to these questions, several different imputation solutions can be employed.
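One possible imputation treatment, shown only as a sketch, fills missing values with the median of the observed values, keeps a separate indicator so that any information implicit in "missing" is not lost, and caps extreme values at bounds chosen by the analyst. The median fill and the caller-supplied caps are assumptions made for illustration; the appropriate treatment depends on the answers to the questions above.

```python
import statistics

def impute_field(values, cap_low, cap_high):
    """Return (imputed_values, missing_flags) for one numeric field.

    Missing values (None) are replaced by the median of the observed values,
    and every value is then capped to the range [cap_low, cap_high].
    """
    present = [v for v in values if v is not None]
    fill = statistics.median(present)
    imputed, was_missing = [], []
    for v in values:
        was_missing.append(1 if v is None else 0)
        v = fill if v is None else v
        imputed.append(min(max(v, cap_low), cap_high))
    return imputed, was_missing

imputed, flags = impute_field([10.0, None, 30.0, 1000.0], cap_low=0.0, cap_high=100.0)
```

The missing-value flag can then be offered to the model as a candidate variable in its own right.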

Create the Model


It is a common joke among statisticians that anyone can operate statistical software packages, even if they don’t know what they are doing! Under these circumstances, selecting the software comes down to choosing the package with the easiest data entry, the friendliest interface and the output that is easiest to interpret, whether or not it is accurate or reliable! Obviously, the critical issue really isn’t software selection. In fact, the selection of a particular statistical procedure, the variables to be incorporated in the model and the particular options chosen (in SAS, for example) all entail specific hypotheses about the shape of the relationships among the variables; these assumptions need to be validated through specific output statistics. One of the appeals of tools that are more robust with respect to distributional assumptions (classification trees, for example) is avoidance of the assumptions of classical statistics. However, models built only with distribution-free tools end up being less accurate over ranges of the data where classical distributions do apply. So, no matter how “user friendly” the tools become, some significant level of understanding is still required for the most reliable results possible. Specification of a particular model also entails the acceptance of particular metrics for model accuracy and precision. Many statistics are available to assess how well the model fits the data as well as the confidence limits on the results.

The model specification phase of the project also entails decisions about how to handle sample bias. This may entail an explicit recognition of the limits to the application of the expected modeling outcome or it may require the development of supplementary models using additional external data sources.

Once the modeling approach has been decided, it is usually necessary to create a number of derived variables. This may include the response variable. For example, in a pricing optimization model the response might be a sensitivity measure relating spend change to price change, with adjustments for differences due to seasonality or other patterns. Some models may require the presence of transformations such as logs, exponents, squares, square roots or splines. Other derived variables could include presence/absence indicators, variables that split continuous values into discrete intervals and sums across multiple fields or over time for the same field, etc.
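A few of these derivations can be sketched in code. The input fields (balance, monthly_spend), the band cut-offs and the three-month window are all hypothetical choices made for the example.

```python
import math

def derive_variables(record):
    """Add illustrative derived fields to a copy of one record."""
    balance = record["balance"]
    derived = dict(record)
    # Log transformation; log1p keeps a zero balance defined (log(1 + 0) = 0).
    derived["log_balance"] = math.log1p(max(balance, 0.0))
    # Presence/absence indicator.
    derived["has_balance"] = 1 if balance > 0 else 0
    # Continuous value split into discrete intervals (cut-offs are arbitrary).
    if balance < 500:
        derived["balance_band"] = "low"
    elif balance < 5000:
        derived["balance_band"] = "medium"
    else:
        derived["balance_band"] = "high"
    # Sum over time for the same field: spend in the three most recent months.
    derived["spend_3m"] = sum(record["monthly_spend"][-3:])
    return derived

example = derive_variables({"balance": 0.0, "monthly_spend": [10, 20, 30, 40]})
```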

After the creation of derived variables, the master dataset is usually split into modeling and validation sub-samples. It is a systematically random split: the sub-samples are systematically similar with respect to key variables such as the response variable, but random with respect to others. Completely separate validation samples may also be prepared in order to assess the performance of the model outside the modeling population. For example, out-of-time validation is performed to assess the robustness of the model over time, or a sample from a different geographic territory might be used to assess the ability to use the model there.
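A split of this kind is commonly implemented as a stratified random split. The sketch below assumes stratification on a single key (here a hypothetical 0/1 field named default) and a 30% validation fraction; both are illustrative choices.

```python
import random
from collections import defaultdict

def stratified_split(records, key, validation_fraction=0.3, seed=42):
    """Split records so each level of `key` has the same validation share."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    strata = defaultdict(list)
    for r in records:
        strata[r[key]].append(r)
    modeling, validation = [], []
    for group in strata.values():
        shuffled = group[:]
        rng.shuffle(shuffled)  # random within each stratum
        n_val = round(len(shuffled) * validation_fraction)
        validation.extend(shuffled[:n_val])
        modeling.extend(shuffled[n_val:])
    return modeling, validation

# Hypothetical portfolio: 20 defaults among 100 accounts.
portfolio = [{"id": i, "default": 1 if i < 20 else 0} for i in range(100)]
modeling, validation = stratified_split(portfolio, key="default")
```

Because the split is made within each stratum, the default rate is the same in both sub-samples, while assignment is random with respect to everything else.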

At this stage it is usually necessary to check the relationships between the final response variable and other variables in the datasets. This serves two ends: it facilitates the selection of variables for inclusion in the model and also acts as a check that variable derivation and dataset sub-sampling haven’t generated any unexpected surprises.

Model estimation is usually done with specialized software so it is essential that modelers have access to a wide range of analytic software (SAS, CART, MARS, SPSS, Excel etc.). In some cases it is necessary to write original code to implement a model that does not exist as an automated procedure in commercial software. This requirement is facilitated by access to a searchable library of macros. In any case, there are a number of optional decisions to be made. Some may result in meaningful differences in the outcome (for example, selecting the link function in a regression or setting misclassification penalties for a classification tree); others are matters of convenience for subsequent processing (for example, saving predicted values for all records to a file or controlling the level of detail in printed output); still others determine the statistics that will be displayed for use in model assessment (for example, statistics for goodness of fit, influence or partial correlations). Most models require the selection of the most predictive set of variables from a much larger number of candidates. Under these conditions, the best subset is selected through one or more iterative processes that can be either manual or automated (depending upon the software). Models are often improved by the incorporation of terms representing the interaction (multiplicative contribution) of two or more independent variables. Not all software procedures make this an easy task so a lot of manual intervention may be required.

The acceptability of a model is assessed on several levels. Statistical significance simply assesses the likelihood that the relationship implied by the model really exists. This often proves to be of little interest: in multivariate modeling situations, very highly significant relationships may be of no practical interest because they fail on other criteria. Other statistics test for the presence of bias over part or all of the data space, the possibility that certain observations have excessive influence over the results, the confidence limits on the model predictions and the relative stability of the contribution of each variable. Graphs and statistics can also be used to identify patterns in residuals, indicating lack of model fit. No model should be accepted if it does not pass muster on all of these dimensions. However, the model also needs to perform adequately in the overall agreement between predictions and the original response values. This is the model performance that is summarized in output such as the Hosmer-Lemeshow statistic (which assesses calibration) and in Lorenz Curves, ROC Curves and the KS Statistic (which assess rank ordering).
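Two of the rank-ordering summaries mentioned above can be computed directly from scores and observed outcomes. The sketch below is a plain-Python illustration for a binary response, not the output of any particular package: the KS statistic is taken as the largest gap between the cumulative score distributions of events and non-events, and the area under the ROC curve is computed as the share of event/non-event pairs in which the event has the higher score (ties count one half).

```python
def ks_statistic(scores, outcomes):
    """Largest gap between cumulative score distributions of events and non-events."""
    n_events = sum(outcomes)
    n_non_events = len(outcomes) - n_events
    pairs = sorted(zip(scores, outcomes))
    cum_events = cum_non_events = 0
    ks = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Advance through all records tied at this score before comparing.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                cum_events += 1
            else:
                cum_non_events += 1
            j += 1
        gap = abs(cum_events / n_events - cum_non_events / n_non_events)
        ks = max(ks, gap)
        i = j
    return ks

def roc_auc(scores, outcomes):
    """Share of (event, non-event) pairs where the event scores higher."""
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    non_events = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum((e > n) + 0.5 * (e == n) for e in events for n in non_events)
    return wins / (len(events) * len(non_events))

# Hypothetical scores (higher = more likely to be an event) and 0/1 outcomes.
scores = [0.1, 0.4, 0.35, 0.8]
outcomes = [0, 0, 1, 1]
ks = ks_statistic(scores, outcomes)
auc = roc_auc(scores, outcomes)
```

Both functions assume at least one event and one non-event in the sample; the pairwise AUC calculation is quadratic in sample size and is meant for illustration only.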

All of the above assesses the model’s performance on the original model-building dataset. In order to test it on new data, a tentative first step is the application of the model to the hold-out validation data sample. By using the algorithm produced by the model, it is hoped that the ranking or classification of the second sample will be similar to that obtained with the first. Alternatively, we might hope to build a new model using the second sample with parameter estimates similar to those obtained on the original. Out-of-time or out-of-market validation goes a step further. Good performance on these samples requires that there have been no pertinent changes in the relationships between the predictors and the response between time periods or between territories. But it is also quite possible that the competitive environment, the economy or even the client’s own marketing or decisioning have introduced changes that degrade the performance of a model built on an earlier dataset. These are, of course, reasons why it is essential to perform this check before launching a model into production. They are also reasons why it is good business practice to regularly use recent data to update models that drive decision systems.

DELIVER

Implement and Operate the Solution

The delivery phase of a project is highly customized depending upon the client’s specifications. Some projects result in a complete handoff while others entail further implementation steps. Given the nature of the historic operational data that drives the learnings in analytic consulting, compelling conclusions from an analytic study can lead to specific controlled market test designs. Other projects lead into the development of a production decisioning environment. In any case, the final deliverable is rarely a model per se – rather it is a business solution or insight into a business challenge that leverages the results of the analytic modeling process.

Model results are delivered in support of business insight. For example, a pricing optimization study might identify specific price changes for specific classes of customers and the expected economic benefit. A model focused on credit card default would be used to identify proposed decisioning rules for use in the application review process and the attendant impact on future write-offs. Irrespective of the particular form of the deliverable, the reliability of the result is directly dependent upon the quality of the analytic process that produced it.

Implementation into the client’s production environment requires extensive testing. The decisioning system is linked through data “pipes” to internal and appended data sources. It is essential that the variables accessed in this way are in the same format as those used in the modeling environment. Scoring with live data needs to be tested for conformity to the results from modeling. Finally, once the system has been launched as part of the client’s decisioning process, regular validation checks must be planned to ensure that performance remains consistent with model projections.
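The conformity test between live scoring and modeling results can be as simple as re-scoring a set of known records in production and comparing the two sets of scores. The sketch below assumes scores are held in dictionaries keyed by a record identifier; that layout and the tolerance are illustrative assumptions.

```python
import math

def score_mismatches(model_scores, production_scores, tolerance=1e-6):
    """Return ids whose production score is missing or differs beyond tolerance."""
    mismatches = []
    for record_id, expected in model_scores.items():
        actual = production_scores.get(record_id)
        if actual is None or not math.isclose(
            actual, expected, rel_tol=0.0, abs_tol=tolerance
        ):
            mismatches.append(record_id)
    return sorted(mismatches)

# Hypothetical scores from the modeling environment and from production.
model_scores = {"a": 0.12, "b": 0.50, "c": 0.90}
production_scores = {"a": 0.12, "b": 0.51}
problems = score_mismatches(model_scores, production_scores)
```

A non-empty result points to a formatting or data "pipe" difference between the two environments that should be resolved before launch.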

Copyright © 2002 - 2008 Inductis Inc.