One of the growing debates within the data science community concerns the determination of the inputs, or variables, that should be included in a predictive analytics algorithm. This process is more commonly referred to as feature engineering. Historically, it has been the most time-consuming element in building any predictive analytics solution, as the practitioner can usually create hundreds of variables that might be considered in a predictive model. But what is involved in this process? It is not simply a matter of pouring all of this information into a data lake and then feeding it into a predictive model. Instead, we might think of feature engineering as consisting of two key components. The first is the creation and derivation of fields/variables from raw data, while the second is a filtering-out process that identifies the set of variables to be considered within a predictive analytics solution. In this article, we will focus on the first component: the creation and derivation of fields and/or variables. The next article on feature engineering will focus on the techniques used in filtering out variables.
Within the first component, there are a number of stages involved in this feature engineering process. The first is the extraction stage, which has changed quite dramatically in our Big Data world as new capabilities are required to extract data beyond the traditional structured data environment. The ability to read in simple rows and columns of structured data has evolved into the extraction of meaningful information from social media posts, images, sensor data, etc. Although the extraction process has grown more technically complex, the approach to extracting the right information is no different than in the more traditional structured data environment: the practitioner needs to understand the business problem and then identify the critical information required to potentially solve it. At this point, the practitioner has the necessary data elements. But as stated above, this is only the first stage. It is in the second stage of this process that most of the actual data mining/data science work of building a predictive model is done. Intensive data manipulation is conducted against these extracted data elements in order to convert and transform this information into meaningful variables. Arguably, this is the most critical component of the entire predictive analytics exercise. Here the data scientist relies on their knowledge of data structure and linkages as well as their understanding of the business and the underlying business challenge or problem. Let's cite a few examples in both the structured and semi-structured/unstructured worlds to better illustrate the tasks involved in this stage.
In the structured world, many different tables and/or files could be identified as the source data during the extraction process. Under this scenario, the data scientist needs to understand which files to link and how to link them. For example, are the relationships one-to-one, one-to-many, or many-to-many? Once the linkage approach between files has been determined, the data scientist can then derive potentially hundreds of variables from the extracted source data. Routines are written to generate the following types of variables:
Typically, the routines that generate these variables represent the most laborious part of the model-building exercise. But at the end of this process, an analytical file is created in which a dependent or target variable sits alongside hundreds of independent variables.
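The linking and derivation steps described above can be sketched with pandas. This is a minimal illustration, not the author's actual routines: the table names, columns, and the one-to-many customer/transaction relationship are all assumptions made for the example.

```python
# Hypothetical sketch: linking a one-to-many pair of tables (customers ->
# transactions) and rolling the "many" side up into derived variables,
# producing an analytical file with a target plus independent variables.
import pandas as pd

# Assumed source tables (names and fields are illustrative only)
customers = pd.DataFrame({
    "cust_id": [1, 2, 3],
    "responded": [1, 0, 1],   # dependent / target variable
})
transactions = pd.DataFrame({
    "cust_id": [1, 1, 2, 3, 3, 3],
    "amount": [20.0, 35.0, 10.0, 5.0, 15.0, 25.0],
})

# One-to-many link: aggregate transactions up to the customer level
derived = transactions.groupby("cust_id").agg(
    total_spend=("amount", "sum"),
    num_txns=("amount", "count"),
    avg_txn=("amount", "mean"),
)

# Left join the derived variables back onto the customer table
analytical = customers.merge(derived, on="cust_id", how="left")
print(analytical)
```

In practice these routines would produce hundreds of such columns rather than three, but the pattern of aggregate-then-join stays the same.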
In semi-structured data, the practitioner looks at the non-text data, or metadata, which relates to the characteristics of a given event. For example, in Twitter feeds, this information includes when the feed occurred, the number of followers (who you are following vs. who is following you), the type of device, the URL they came from, location, etc. Our challenge with this data is to identify the unique person, or record of interest, that will ultimately be used in creating the analytical file. Once this information has been identified, the derivation routines described above can be employed to create Twitter-related variables. The same approach can also be applied to other types of social media such as LinkedIn, Facebook, YouTube, etc. The critical fields in building any analytical file from metadata within this medium, though, are the unique ID that relates to a unique user and, of course, the timestamp pertaining to the occurrence of the event. It is this ability to identify the record of interest, or user ID, along with when the event occurred that is the key to building powerful inputs for a predictive analytics solution.
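The user ID plus timestamp pairing can be made concrete with a small sketch. The field names (`user_id`, `ts`, `followers`) and the derived variables are assumptions chosen for illustration, not fields from any real social media API.

```python
# Hypothetical sketch: rolling event-level social media metadata up to
# the user level, keyed on the unique user ID and the event timestamp.
import pandas as pd

events = pd.DataFrame({
    "user_id": ["a", "a", "b", "b", "b"],
    "ts": pd.to_datetime([
        "2024-01-01", "2024-01-15", "2024-01-02",
        "2024-01-03", "2024-01-20",
    ]),
    "followers": [100, 110, 50, 52, 60],
})

# Sort by time so first/last observations are chronological, then derive
# per-user variables: posting frequency, recency, follower growth
user_vars = events.sort_values("ts").groupby("user_id").agg(
    n_posts=("ts", "count"),
    last_post=("ts", "max"),
    follower_growth=("followers", lambda s: s.iloc[-1] - s.iloc[0]),
)
print(user_vars)
```

The resulting user-level table can then be joined into the analytical file exactly as in the structured-data case.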
Besides social media data, sensor data represents another source of semi-structured data that is relatively new to the information arsenal of the predictive analytics practitioner. Within the mobile world, information is collected around the device if the phone is Wi-Fi enabled. Each phone has a unique ID, referred to as the MACID, and information can now be collected around this unique ID. If I am at a restaurant and my phone is Wi-Fi enabled, information pertaining to where I am sitting (i.e., distance from the router), when I entered and exited the restaurant, and the specific restaurant itself can all be collected around this event. As long as the phone is unique to me, the MACID in effect represents a type of unique customer ID. New customer behavior about me is now gathered through my movements, which are tracked through my phone. An analytical file can then be built using the MACID as the record of interest when creating all the derived behavioral variables. Examples of this rich behavior are listed below:
Beyond mobile data, we are increasingly seeing sensor-type data used in predictive analytics solutions, particularly within manufacturing processes. As more and more devices become digitally enabled through the Internet of Things, the potential for using sensor information to create meaningful variables in a predictive analytics solution will only grow.
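The kind of behavioral derivation described for MACID events can be sketched as follows. This is a toy illustration under assumed field names (`macid`, `entered`, `exited`); real Wi-Fi sensor feeds would carry more fields (location, signal strength) and far more events.

```python
# Hypothetical sketch: deriving dwell-time and visit-frequency variables
# from Wi-Fi sensor events, with the MACID as the record of interest
# (a proxy for a unique customer ID).
import pandas as pd

visits = pd.DataFrame({
    "macid": ["AA:01", "AA:01", "BB:02"],
    "entered": pd.to_datetime(["2024-03-01 12:00", "2024-03-08 12:30",
                               "2024-03-02 18:00"]),
    "exited": pd.to_datetime(["2024-03-01 12:45", "2024-03-08 13:30",
                              "2024-03-02 19:30"]),
})

# Entry/exit timestamps yield a dwell time per visit, in minutes
visits["dwell_min"] = (
    (visits["exited"] - visits["entered"]).dt.total_seconds() / 60
)

# Aggregate visit-level events up to one row per MACID
behavior = visits.groupby("macid").agg(
    n_visits=("dwell_min", "count"),
    avg_dwell_min=("dwell_min", "mean"),
)
print(behavior)
```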
At this point, we have discussed feature engineering in both structured and semi-structured environments. Although some might argue that sensor data is unstructured, the discussion above refers to the use of metadata, where the data can be contained in some semi-structured format. Most unstructured data, though, is discussed within the realm of text mining and text analytics. Emails, phone calls, tweets, and Facebook posts all represent typical forms of unstructured data that can leverage text mining techniques. Here, text mining tools and processes are applied to the unstructured data, with the practitioner or data scientist attempting to identify themes or topics from the text. At the end of this process, each individual is assigned to a topic or theme based on the content of the communication, with the end result being variables that can potentially be used in a model.
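One common way to operationalize this topic-assignment step is topic modeling; the sketch below uses scikit-learn's LatentDirichletAllocation as one possible technique (the article does not prescribe a specific tool, and the documents and topic count here are purely illustrative).

```python
# Hypothetical sketch: discovering topics in short text documents and
# assigning each document to its dominant topic, yielding a categorical
# variable that could feed a predictive model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "late delivery of my order, shipping was slow",
    "great price and a big discount on the sale",
    "my package delivery never arrived",
    "the discount price made this a great deal",
]

# Convert raw text to a term-count matrix
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

# Fit a 2-topic LDA model (topic count is an assumption for this toy data)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # per-document topic weights

# Each document's dominant topic becomes a model-ready variable
topic_var = doc_topics.argmax(axis=1)
print(topic_var)
```

In a real project each "document" would be one customer's communication, and the dominant-topic label (or the full topic-weight vector) would be joined onto the analytical file by the customer ID.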
We all know that developing predictive analytics solutions is not simply a matter of finding the optimum machine learning or mathematical algorithm. The selection of what you input as variables is arguably the most critical factor for success, presuming that the business problem has been properly identified. In this discussion, we have attempted to summarize some key points in the process of creating the right variables, which is the first component of feature engineering. Beyond the creation of these variables, the ability to filter out certain variables is the second component of feature engineering, one that optimizes the input variables to any predictive analytics solution. This will be discussed in the next article.
Richard Boire, B.Sc. (McGill), MBA (Concordia), is the founding partner at the Boire Filler Group, a nationally recognized expert in the database and data analytics industry, and is among the top experts in this field in Canada, with unique expertise and background experience. The Boire Filler Group was recently acquired by Environics Analytics, where he is currently senior vice-president.
Mr. Boire’s mathematical and technical expertise is complemented by experience working at and with clients in the B2C and B2B environments. He previously worked at and with clients such as Reader’s Digest, American Express, Loyalty Group, and Petro-Canada, among many others, establishing his top-notch credentials.
After 12 years of progressive data mining and analytical experience, Mr. Boire established his own consulting company, Boire Direct Marketing, in 1994. He writes numerous articles for industry publications, is a sought-after speaker on data mining, and works closely with the Canadian Marketing Association in a number of areas, including Education and the Database and Technology councils. He is currently the Chair of Predictive Analytics World Toronto.