This article gives an overview of data science, and then walks through a real-world example of applying data science to transportation planning. For additional insights on data science, check out the Domino Data Science blog.
There is currently a lot of hype around data science and especially AI (artificial intelligence). From my 20 years of experience in data science, I’ve found the following definitions to be helpful:
To perform data science well, you need a combination of computer/programming skills, math and statistics knowledge, and industry/domain expertise.
Without math and statistics knowledge, data science models can be misused, and results can be misinterpreted.
Data science within companies has evolved significantly over the last decade. Most Fortune 500 companies are now at the “Proactive” phase and are capable of making accurate predictions in certain areas of their business, such as predicting the response rate to a direct mail marketing campaign. The industry is now aspiring to reach the “Dynamic” phase, in which operations are extensively model-driven.
I frequently see six phases in a data science lifecycle.
You can learn more about the data science lifecycle and how it applies to your business at www.datasciencelifecycle.com. I am now going to walk through how this data science lifecycle applied to a real-world project.
A few years ago I helped a large city optimize its bus routes. The city was trying to encourage citizens to use public transportation, but occasionally the line at a station would grow so long that passengers could not fit on the bus and were left stranded. I was tasked with developing a model so that route tables could be optimized two days in advance, large-capacity buses could be paired with high-demand routes for that day, and people would not be stranded.
The client requested that the entire project, except for the model delivery, be completed in six weeks’ time, which is far too short for a project of this nature. I delivered what I could in that time frame. In a post-mortem of the project, I found it very instructive to review what I was and wasn’t able to deliver. The steps I had to skip under time pressure are precisely the steps that are too often skipped in practice, and they provide insight into what actually happens in a data science project.
Drilling into the Ideation phase, I assessed feasibility and defined the problem statement in a series of sessions with business owners. I did not have time to do a thorough prior-art review. This is a critical step, because you are seldom the first to work on a problem, and a lot can be learned from internal and external peers. Quantifying the value/ROI of the project was also neglected; in this case, it would have been feasible to estimate the cost of stranding a passenger and then aggregate the value as the number of stranded passengers decreased.
After completing the Ideation Phase, I started exploring and acquiring relevant data. I ended up gathering data from the prior two years on the following:
From these data sources I engineered the following additional features to enrich the dataset:
I gathered the data from the BI team (who, I found out later, got their data from IT). In hindsight, going directly to the source (IT) would have been better, as the BI team’s data turned out to be less complete than the original source.
My goal for data preparation was to make each row represent a stop so that I could try a numerical prediction model (regression modeling rather than classic time series forecasting). Data from prior days for that stop needed to be visible on the same row. The image below shows a few of the inputs/columns I engineered. The last three columns were used for graphing and as the target for predictions; the other columns were inputs to the model.
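To make that row-per-stop shape concrete, here is a minimal pandas sketch. The column names (route_id, stop_id, service_date, ridership) and the lag choices are hypothetical stand-ins for the project’s actual fields, not the real schema:

```python
import pandas as pd

# Hypothetical stop-level history: one record per route/stop/service date.
df = pd.DataFrame({
    "route_id": [7, 7, 7, 7, 7],
    "stop_id": [101, 101, 101, 101, 101],
    "service_date": pd.to_datetime(
        ["2017-03-01", "2017-03-02", "2017-03-03", "2017-03-04", "2017-03-05"]),
    "ridership": [42, 45, 39, 51, 48],
})

df = df.sort_values(["route_id", "stop_id", "service_date"])

# Carry prior days' ridership onto each row as lagged input columns.
for lag in (1, 2, 7):
    df[f"ridership_lag_{lag}"] = (
        df.groupby(["route_id", "stop_id"])["ridership"].shift(lag)
    )

# Simple calendar features as additional inputs.
df["day_of_week"] = df["service_date"].dt.dayofweek
df["is_weekend"] = df["day_of_week"] >= 5

# "ridership" on the row's own date is the prediction target;
# the lagged and calendar columns are inputs to the model.
print(df.head())
```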
I wish I had access to a platform like Domino back then, as I would have benefitted from the following Domino features:
After completing the Data Acquisition & Exploration phase, I started the Research & (Model) Development phase. I tested about 30 algorithms serially in the modeling environment, using a single machine learning software package, on datasets with hundreds of millions of rows. For each approach I captured accuracy metrics such as RMSE and R-squared, along with a residual analysis.
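For illustration only, that serial try-many-algorithms loop looks roughly like the scikit-learn sketch below; the candidate list, the placeholder feature matrix, and the target are stand-ins rather than the actual project setup:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# X, y would come from the prepared stop-per-row dataset described above;
# random data stands in here so the sketch runs on its own.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))           # placeholder feature matrix
y = X[:, 0] * 3 + rng.normal(size=1000)  # placeholder target (riders per stop)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidates = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gbm": GradientBoostingRegressor(random_state=0),
}

results = []
for name, model in candidates.items():   # serial: one model at a time
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    residuals = y_test - pred             # inspected for bias/heteroscedasticity
    results.append({
        "model": name,
        "rmse": mean_squared_error(y_test, pred) ** 0.5,
        "r2": r2_score(y_test, pred),
        "mean_residual": float(residuals.mean()),
    })

for row in sorted(results, key=lambda r: r["rmse"]):
    print(row)
```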
The final model, a regression model with regularization, turned out to be fairly predictive: 80% of the stop-level predictions were within +/- one rider of what actually occurred.
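The “within +/- one rider” figure is simply the share of held-out predictions whose absolute error is at most one rider. A minimal sketch of that check, using a ridge-regularized regression on placeholder data rather than the project’s dataset:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the prepared stop-level dataset.
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.8, size=5000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = Ridge(alpha=1.0).fit(X_train, y_train)
pred = model.predict(X_test)

# Share of stop-level predictions within +/- one rider of the actual count.
within_one = np.mean(np.abs(pred - y_test) <= 1.0)
print(f"{within_one:.0%} of predictions within one rider")
```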
I was happy with those results, but again, I wish I had access to a platform like Domino then. During the model development phase, I would have benefitted from much faster development. Domino supports elastic access to hardware, environments, and code packages, so I could have run the models in parallel rather than serially. Domino also supports a wide variety of open-source and proprietary algorithms and IDEs that would have helped in my experiments. Without these constraints, I likely could have developed a more accurate model.
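As a rough local illustration of the serial-versus-parallel point (and not a depiction of how Domino itself schedules work), the same kind of candidate loop can be fanned out across cores with joblib; the data and candidates here are again placeholders:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))           # placeholder features
y = X[:, 0] * 3 + rng.normal(size=1000)  # placeholder target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

candidates = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

def fit_and_score(name, model, X_tr, y_tr, X_te, y_te):
    """Train one candidate and return its name and test RMSE."""
    model.fit(X_tr, y_tr)
    return name, mean_squared_error(y_te, model.predict(X_te)) ** 0.5

# n_jobs=-1 fans the candidates out across all local cores
# instead of fitting them one after another.
scores = Parallel(n_jobs=-1)(
    delayed(fit_and_score)(name, model, X_tr, y_tr, X_te, y_te)
    for name, model in candidates.items()
)
print(sorted(scores, key=lambda s: s[1]))
```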
Validation was not in the scope of this project due to the time constraints. Leaders should be sure to budget for this increasingly important step in the lifecycle to mitigate model risk. The BI team built a Power BI dashboard on top of the Enterprise Data Warehouse to deliver predicted route capacity and allow exploration of future predictions as well as past results.
The client could have benefitted from a platform like Domino here for a few reasons:
Hopefully, this blog gives you a good overview of data science, how it is applied during real-world projects, and where corners are usually cut when constraints do not account for reality. Data science can be an extremely powerful tool for solving predictive problems over large data sets.
As your team grows, a data science platform like Domino can be invaluable to applying data science as effectively as possible. Domino can help throughout the process, including: