After working with a client’s data for over three weeks with no real progress, you finally hit upon a real breakthrough. You’ve been searching for insights that will help identify which customers are most likely to turn into regular purchasers; the ultimate goal is to focus the company’s marketing efforts on this group in order to earn more revenue per advertising dollar. Studying customer purchase history has been unfruitful. Suddenly, you find that customer geography seems to be a better predictor of future purchases. You have a few more weeks to explore that connection, so you should be able to find some real value for the customer, right? Unfortunately, you’re locked into your current course. Because you listed it in the initial Work Breakdown Structure, you are committed to delivering a production-quality purchase history model. If you spend the rest of the project building and deploying that model, you won’t have time left to look into a geography-based model. On the other hand, if you spend the rest of the project exploring the geographic connection to repeat purchasers, you run the risk of failing to complete a major project deliverable. So what now?
This problem occurs all too frequently on data science projects. At project kickoff, customers rightfully require documentation of exactly what the team will accomplish. Yet once the project begins, this initial plan can quickly become irrelevant as new events, priorities, and insights into the data drive the team in new directions. In today’s constantly evolving information and technology economy, changing requirements of this kind are common in many industries, though data analytics is particularly susceptible. The condition of the data, corporate business rules surrounding data usage, technology requirements, and the exploratory nature of data analysis all combine to make it nearly impossible to define up front exactly which tasks will be needed and how long each will take. As shown in the image below, changing requirements can lead to excessive rework in a traditional Waterfall management approach, as each change sends the team back to the drawing board, negating much of the progress that has been made.
Shifting priorities often threaten to derail data analysis projects, but that doesn’t have to be the end of the story. This challenge is not unique to data science, and solutions may be gleaned from other domains. We can learn key lessons from two of them: Agile Software Development (agilealliance.org) and the Lean Startup Method proposed by Eric Ries (theleanstartup.com).
In the field of software engineering, the problem of evolving requirements has been recognized for many years. Through the 1990s and early 2000s, software practitioners began developing new methods that embraced these changing requirements rather than shunning them in favor of plans established before the project kickoff. The methodologies that emerged from this movement are collectively known as Agile Software Development. Though many specific methodologies fall under the “Agile” umbrella, what they have in common is an emphasis on working in close collaboration with the client, releasing or demoing working software early and often, and encouraging the customer to review and adjust development priorities as the product emerges and requirements become clearer. As illustrated below, an Agile approach can save huge amounts of rework through more frequent iterations and adjustments.
Changing requirements cause rework in the Waterfall Method as all work is completed before the customer validates the product. A more agile approach focuses on incremental releases of working software, allowing for earlier course correction.
I experienced the benefits of an Agile approach firsthand during a recent software development project. At our regular biweekly progress review, the client mentioned that he had been fielding questions from customers all week about why a particular table couldn’t be exported to a spreadsheet. It turned out that the workflow for a number of users required them to pull data from our system in this way and upload it into another system, a use case that hadn’t been anticipated in the initial product design. Rather than push back and explain that this feature wasn’t in the original design, we were able to pull up the current development priorities and allow the client to decide where the export functionality fell in terms of business value. Unsurprisingly, he placed this feature at the top of the list. After some discussion, we agreed to begin work on the export feature and to postpone development of a new graph on another page. If we had stubbornly stuck to our initially agreed-upon feature list, we would have produced an application that was worthless to an entire segment of users who required the export functionality in their daily workflow. Instead, Agile management processes allowed us to respond to evolving requirements and develop the tool that actually drove the most value for the customer.
While Agile Software Development is focused on delivering software products that create business value for customers, we can draw a strong parallel to data science projects, where the main products are data insights that similarly deliver business value to customers. Just as with projects in the software world, data science projects can benefit from working in close collaboration with the customer and delivering insights as early as possible in the analytic process. The idea here is that customers are going to have a much stronger understanding of their domain and the main problems they face than the consultants analyzing their data do – certainly at first.
Delivering insights as early as possible allows the customer to see the direction the project is going and to course correct where necessary. A major concept in Agile Software Development is simplicity, which the Agile Manifesto describes as “the art of maximizing the amount of work not done.” This means allowing the customer to direct development effort towards the most impactful areas before much time is spent developing less valuable features. In the same way, encouraging the customer of a data analysis project to review preliminary results and offer feedback as early as possible allows them to correct any improper assumptions and ensures that the consultant spends the greatest percentage of their time solving the problems that will give the customer the most value.
A major challenge in any data science project is that the customer may not completely understand their business problem before the project begins. The consultant’s understanding, then, will surely be even further from the ground truth. This makes it extremely difficult to define requirements for the project before it begins. Any requirements created up front are sure to change as the project goes on, due both to the constantly shifting business environment and to new insights uncovered as the data is analyzed. Again drawing from the Agile approach, we can embrace these changing requirements by developing only a preliminary list of requirements and tasks before the project begins, and inviting the client to reprioritize and change these lists at regular intervals as the project goes on. In this way, we can ensure that the team is adding value based on the most recent understanding of the business environment, and not only on what was known before the project began.
We can see similar planning challenges in the domain of entrepreneurial start-ups. The main goal of a startup is to turn new ideas or technologies into products or services that deliver value to end-users. Of course, before the product or service is created, it can only be assumed that these ideas or technologies can be leveraged in particular ways to drive value. In his groundbreaking The Lean Startup, Eric Ries posits that the most successful startups are the ones that can test these assumptions efficiently and either double down on validated ideas or pivot quickly to new ideas when assumptions prove incorrect. This way of thinking revolutionized the world of startups, and the methodology laid out in The Lean Startup has become best practice across that domain.
The full process behind the Lean Startup method is beyond the scope of this article, but in short, the book argues that to be successful, startups should get a minimum viable product into customers’ hands as early as possible, then proceed by iterating over three steps. First, they should define a metric which measures the success or failure of improvements to their product (such as the percentage of visitors who actually return and become regular users for a website). Second, they should begin making changes or adding features which test individual assumptions (for example, adding a “share” button to test the assumption that users want to use a website in a more social way). Finally, they should observe the changes in the metric based on the changes made to the product or service and use this information to readjust their assumptions before beginning the process again in an iterative manner.
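As a rough illustration of the first and third steps, the return-visitor metric mentioned above could be computed as in the sketch below. The function name `return_rate` and the visit counts are hypothetical, invented only for this example; a real experiment would also need enough users to rule out chance differences between the two groups.

```python
def return_rate(visits_per_user):
    """Fraction of users who came back at least once (more than one visit)."""
    if not visits_per_user:
        return 0.0
    returning = sum(1 for visits in visits_per_user if visits > 1)
    return returning / len(visits_per_user)


# Hypothetical visit counts per user, before and after adding a "share" button.
before = [1, 1, 3, 1, 2, 1, 1, 4]
after = [2, 1, 3, 5, 1, 2, 1, 2]

# A positive change supports the assumption being tested; a flat or negative
# change is a signal to revise it.
change = return_rate(after) - return_rate(before)
print(f"before: {return_rate(before):.3f}, after: {return_rate(after):.3f}, change: {change:+.3f}")
```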
Just as in new business ventures, initial assumptions going into a data analysis project are just that: assumptions. Following the Lean Startup Method, we should first make these assumptions explicit and choose (or design) a metric by which to measure the value of the insights we are drawing. Then, we can iteratively test assumptions by measuring the effect of small changes or new features on our metric, revising assumptions accordingly, and repeating with new changes or features. In this way, we can revise or throw out any assumptions which prove incorrect and pivot towards changes which drive greater value.
To illustrate this process, let’s look at an example. I began this article with the hypothetical story of a data scientist who is analyzing the data of a consumer products business in an attempt to predict which customers are most likely to become regular purchasers. To apply the method outlined above, the first step would be to identify any assumptions and define a metric to measure the success or failure of the experiments that will be used to test them. Data scientists employ a number of metrics to evaluate their models, such as lift charts, ROC curves, Gini coefficients, and mean absolute error; there are many ways to evaluate the effectiveness of a model at predicting a particular target. Using these metrics is a good way to measure the effectiveness of model changes, though I would argue that, since the goal is to measure the value to the customer, it is most helpful to go one step further and create a metric that ties directly to business value. Perhaps in this case, we could measure the money that would be saved by not sending marketing information to users who are unlikely to become repeat purchasers.
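One way such a business-value metric might look is sketched below. Everything in it is an assumption made for illustration: the function name `marketing_savings`, the score threshold, the cost per mailing, and the value of a repeat purchaser are all invented. It also goes one step beyond the savings described above by subtracting the value of repeat purchasers the model would have wrongly skipped.

```python
def marketing_savings(scores, became_repeat, threshold, cost_per_mailing, value_per_repeat):
    """Net money saved by not mailing customers whose model score is below threshold.

    scores           -- model scores, higher meaning more likely to become a repeat purchaser
    became_repeat    -- 1 if the customer actually became a repeat purchaser, else 0
    cost_per_mailing -- assumed cost of sending marketing material to one customer
    value_per_repeat -- assumed value lost when a true repeat purchaser is skipped
    """
    skipped = [outcome for score, outcome in zip(scores, became_repeat) if score < threshold]
    saved = len(skipped) * cost_per_mailing
    lost = sum(skipped) * value_per_repeat
    return saved - lost


# Hypothetical scores and outcomes for five customers.
scores = [0.9, 0.8, 0.3, 0.2, 0.1]
became_repeat = [1, 1, 0, 1, 0]

net = marketing_savings(scores, became_repeat, threshold=0.5,
                        cost_per_mailing=2.0, value_per_repeat=5.0)
print(net)
```

Here three customers fall below the threshold, saving three mailings, but one of them would have become a repeat purchaser, so part of the saving is offset.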
Once assumptions and a viable metric have been established, we can begin testing and revising these assumptions through iterative experiments. At first, this client believed that customer purchase history would be predictive of repeat purchase behavior. To test this, we could create a basic model focused on that predictor, then calculate the value of this model according to our metric. If the results of the model pointed to an improvement in our metric (by allowing us to more accurately target our marketing and achieve similar results on a lower marketing budget), we could consider this assumption validated and move on to test another assumption. If our experiment did not point to an improvement in the metric, we could pivot in our analysis approach by revising or throwing out this assumption. We could continue in this way, continuously testing and revising assumptions, thereby improving our understanding of the business problem and our model (as measured by the metric we chose to represent value to the customer).
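The loop described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a prescribed implementation: the metric `net_savings`, the helper `run_experiments`, the cost and value figures, and the model scores are all hypothetical. The baseline of zero stands for mailing every customer, which saves nothing.

```python
def net_savings(scores, outcomes, threshold=0.5, cost=1.0, value=4.0):
    """Assumed metric: mailing cost saved on skipped customers minus value of missed repeat buyers."""
    skipped = [y for s, y in zip(scores, outcomes) if s < threshold]
    return len(skipped) * cost - sum(skipped) * value


def run_experiments(experiments, outcomes, baseline=0.0):
    """Test each assumption in turn; validate it only if it beats the best metric value so far."""
    best = baseline
    log = []
    for name, scores in experiments:
        result = net_savings(scores, outcomes)
        if result > best:
            best = result
            log.append((name, result, "validated"))
        else:
            log.append((name, result, "pivot"))
    return log


# Hypothetical outcomes (1 = became a repeat purchaser) and scores from two simple models.
outcomes = [1, 0, 1, 0, 0, 1]
experiments = [
    ("purchase history", [0.4, 0.6, 0.7, 0.8, 0.3, 0.2]),
    ("geography", [0.9, 0.2, 0.8, 0.1, 0.3, 0.7]),
]

for name, result, decision in run_experiments(experiments, outcomes):
    print(f"{name}: metric={result:+.1f} -> {decision}")
```

With these made-up numbers, the purchase history model skips two customers who would have become repeat purchasers, so the metric falls below the baseline and the assumption is dropped, while the geography model improves on the baseline and is kept.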
Obviously, this is an overly simplified example of the complex business problems that many data science analyses attempt to solve, but it illustrates how, by systematically defining and testing our assumptions, we can continually refocus our efforts on the areas that drive the most value for our clients.
While there is a lot of value to be gained by embracing, rather than fighting or ignoring, changing requirements in today’s fast-paced business environment, this method also presents some challenges. The most basic of these challenges is customer buy-in. Many customers will be used to more traditional project lifecycles, with deliverables and deadlines set in stone up front. It will take a good amount of trust, especially for external customers, to buy into a plan without a well-defined deliverable. The best way to work around this issue is to work as closely as possible with the customer and integrate them completely in the decision process through which priorities and deliverables can be shifted or changed.
A client on a recent data science project was having a hard time accepting a more Agile management approach. How could he be sure he was getting everything he paid for if the project plan didn’t define precisely which models would be included in the final deliverable, a visualization designed to prioritize cases for investigators? After we explained that it was impossible to say for sure which models would be most useful before they were developed, he reluctantly agreed to our approach. Over the course of the project, he found that Agile’s focus on embracing changing requirements actually gave him more control over which models were included in the final tool. Rather than having to guess what would be useful before we began work, he was able to listen to the feedback of his investigators as models were built and refined, and direct our work towards the areas that made the most difference in the investigators’ daily workflows. When the time came to begin a new project for this client, he insisted that we apply an Agile approach to deliverable and requirements definition!
Even when a customer buys into the process, they may still require formal documentation. Traditional project plans often require an SOW (statement of work), which lays out deliverables and hard deadlines. These plans don’t leave much room for shifting priorities or changing requirements. A great area for future analysis would be finding the best way to create a truly effective Agile project plan that provides flexibility while meeting reporting and communication requirements. A possible starting point for such a document would be an Agile Project Initiation Document, for which a number of different templates are available online. Another possibility would be an adaptation of Alexander Osterwalder’s Business Model Canvas (businessmodelgeneration.com), a lightweight business plan template that is often used alongside Lean Startup practices.
Beyond customer buy-in and process documentation, there are the additional concerns of scope creep and lack of vision. Precise deliverables aren’t fully defined before the project begins, and the customer is invited to change and shift priorities over the life of the project. This can lead to frequent pivots without an overarching vision, which can make it difficult for the project to gain momentum in any one direction. The opposite problem is also possible: without a well-defined scope to keep requests to a reasonable level, the customer may ask for more and more work. This more dynamic approach to priorities, requirements, and deliverables certainly requires a strong project management focus to avoid these problems.
In today’s business climate, priorities can change in the blink of an eye. It is challenging to know at the outset of any project exactly what outcomes would drive the most value for the customer. To make matters worse, data science projects are exploratory in nature, meaning that it is impossible to know, at the outset, what direction the findings will lead. As more data becomes available and projects become more complex, it will only get harder to predict the lifecycle and requirements of a data analysis project before it begins. All of these problems necessitate an approach which embraces changing requirements and uses them to course correct in order to drive the greatest possible value. This can only be done by iterating quickly and working as closely as possible with the customer.
Similar problems have been researched extensively in the domains of software engineering and entrepreneurship. Data science practitioners would do well to draw on the lessons learned from these fields to improve their own processes. Still, a number of challenges remain in doing so. Further analysis of how these principles can be applied to data science projects, and especially how they fit with formal project documentation, would be extremely helpful.
Andy Janaitis is a Software Engineer and Scrum Master for Elder Research (elderresearch.com). In those roles, Andy works collaboratively with developers and customers alike to deliver desktop and web-delivered data analysis and visualization applications. Andy is most passionate about designing elegant solutions around clients’ business needs and driving efficiency through process improvement. Andy is a proud alumnus of the University of Pittsburgh, where he received bachelor’s degrees in Industrial Engineering and History. When not in the office he enjoys running, brewing beer, and supporting Pitt football and basketball.