The development of a predictive model is a complex process, one that encompasses several challenges, assumptions, theoretical implications, and an understanding of the reality being modeled. Models are not developed simply because someone feels inspired to build one. A model is developed from a well-established theoretical conceptualization of the reality to be modeled. With that in mind, the modeler gathers the necessary datasets that they believe can be used to build a model that predicts the outcome of interest.

Thus, it is largely redundant to build a new model every time we want to predict a similar outcome from "good" (in quality and quantity) historical data about the reality being modeled. Once a model is developed, it can be applied to new databases or data tables containing historical data and context similar to those used to develop the model, provided we again understand what is being predicted and what the historical context of the new data looks like.

For example, suppose an auto insurance company developed a predictive model to target new customers who are likely to enroll in its auto insurance policy, and now wants to score these new customers to assess their risk using a range of variables (contained in the model). The insurance company wants to determine a score for each customer, which could be used to compute their insurance cost and premium, as well as to estimate the potential risk and the associated liability.

A predictive model already developed can be applied to these new customers to compute their predictive scores, which represent their levels of risk. In scoring the new dataset, it is important that every variable used in the model development is present in the new data table to be scored. Once the model is applied to the new data table/database, a predictive risk score is computed, and this score can be used by the insurer's system to determine the cost of insurance per year. The higher your score, the less likely you are to have a substantial impact on the insurance company in terms of the number of claims filed within a given year; the lower your score, the more likely you are to file a claim. The scores can be ranked into deciles: decile 1 is preferable to decile 2, decile 2 to decile 3, and so on. Those in the upper deciles (1-2) have a low insurance cost, while those in the lower deciles have a high insurance cost.
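For illustration, here is a minimal sketch of what that scoring step might look like, assuming the existing model is a scikit-learn classifier that was fit on a pandas DataFrame and saved with joblib; the file names, column names, and the meaning of the score (probability of not filing a claim, so that a higher score means lower risk) are hypothetical:

```python
import joblib
import pandas as pd

# Load the previously developed model and the new customer table
# (file names are hypothetical).
model = joblib.load("auto_risk_model.joblib")
new_customers = pd.read_csv("new_customers.csv")

# Every variable used in model development must be present in the new table.
required_vars = list(model.feature_names_in_)
missing = set(required_vars) - set(new_customers.columns)
if missing:
    raise ValueError(f"New data is missing model variables: {missing}")

# Score each new customer. Assuming class 1 encodes "filed a claim",
# we take the probability of class 0, so a higher score means lower risk,
# matching the decile ranking described above.
new_customers["score"] = model.predict_proba(new_customers[required_vars])[:, 0]

# Rank the scores into deciles: decile 1 holds the highest (most preferable)
# scores, decile 10 the lowest.
new_customers["decile"] = pd.qcut(
    new_customers["score"], 10, labels=list(range(10, 0, -1))
).astype(int)
```

The decile column could then feed whatever pricing rule the insurer uses, for example assigning lower premiums to deciles 1-2 and higher premiums to the lower deciles.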

That said, there could be many things wrong with this approach. The assumptions we make, both in theory and in the application itself, could lead to negative outcomes.

It is very important that we understand the historical context of the data we seek to model, or of the data to which we seek to apply an existing model.

A retail company (say, Best Buy) could decide to develop a predictive model to assess which catalog campaign circulations perform well. The historical context of each campaign needs to be well understood, as do the design, paper, day of the year, season, and so on when the catalog was mailed, since these could affect the response/performance. Thus, developing a model on the historical circulation data of one catalog campaign and using that model to score other campaigns could yield positive or negative outcomes.

So, what do you folks think we need to be cautious about when taking a model that has already been developed and applying it to new datasets with similar variables to predict a similar outcome? What could we be doing wrong? What do we need to be careful of? What might we want to change, if necessary?
