Data Science Leaders | Episode 10 | 44:18 | July 06, 2021
Mike Tamir, Chief ML Scientist and Head of Machine Learning/AI
SIG
The title of “Data Scientist” leapt into prominence in 2012 when the Harvard Business Review named it the “sexiest job of the 21st century.” Almost ten years later, what’s changed? And what’s next?
In this episode, Dave Cole is joined by Mike Tamir, Chief ML Scientist and Head of Machine Learning/AI at SIG, to break down the shifting trends in data science, NLP, and ML—and what it all means for leaders in the field.
Hello, welcome to the Data Science Leaders podcast! I’m your host, Dave Cole, and we have Mike Tamir on today’s episode as our guest. Welcome, Mike.
Thanks for having me.
Mike, just a quick background on you. You are currently the Chief ML Scientist and Head of Machine Learning/AI at Susquehanna. Do you want to tell me a little bit about what Susquehanna is?
Yeah, Susquehanna is a financial institution. We work with investments, looking at different signals in the market. It’s a pretty exciting place to do machine learning. I’ve been interested for years in what the challenges are, and whether it’s possible to separate signal from noise in this very low signal-to-noise ratio context, especially with the whole spectrum of ML techniques. It’s pretty exciting to be able to actually test a lot of those hypotheses over the last several years.
Yeah. Well, if you look over your background, your LinkedIn, you’ve also spent time at Uber, but you’ve been a data science leader for many, many years. On today’s podcast, we’re going to dive into a few things. One of the things that also is interesting about Mike is Mike is a very famous data science leader. He’s a prolific Twitter user—please follow him on Twitter—and he sends out multiple tweets a day it seems. We’ll talk a bit about that, sort of building your social profile as a data science leader.
The other thing is he’s on a list of top 100 most influential people in AI from AI Experts. I looked it over and it’s got Elon Musk in there as well as Mike, so your reputation precedes you! You’ve clearly had an interesting career. Love to get your perspective on the future of data science as well as what you’ve seen in the past, sort of where you see the trajectory of AI, machine learning, etc. Also, we’re going to talk a little bit about machine learning platforms. With your experience at Uber… Uber has open-sourced a data science platform called Michelangelo.
Taking a step back, where do you see a place for a data science platform? How important do you feel it is for a team of data scientists in terms of improving their performance? Then maybe we’ll even touch a bit on fake news. We’ll talk a little bit about that towards the end because I know some of your expertise is in NLP and there are some interesting things that you’ve done there. Well, without further ado, let’s talk a bit about what you’ve seen over the last 10 years and then what you see going forward in the world of data science.
Sure. We’re coming up on the 10-year anniversary of that Harvard Business Review article, which I think was the first time most people heard the title of data scientist, right? “The sexiest job of the 21st century.” Certainly, the terminology has evolved quite a bit. Maybe people that were once traditionally referred to as analysts are now… at least some of them are doing business intelligence, or a lot of them have rebranded themselves as data scientists. Something that has certainly changed in the industry is the level and degree to which companies are open to using machine learning. And so data science in many cases encompasses the machine learning engineer’s and the machine learning scientist’s role.
Also, in certain companies data science sort of merged with that role and that function: being the people that do machine learning as well as maybe some data engineering and data exploration, all the way to people that do all of the exploration and the analysis. Whereas then you have specific specialists that focus on doing the research and the architecture and the advanced machine learning modeling, which is something that sort of came in and came out again.
I don’t know what the future is for the term data science, or data scientist specifically, but certainly there are these core functions: doing the engineering on the data side, doing the engineering on the platform side, which I know we’re going to talk about, and doing the actual experimentation, which is more of a science component, as well as analyzing or exploring the results of those experiments. If there’s a common thread in what’s core to data science, I would say that’s it, even more than doing machine learning or using specific techniques.
Right. I guess in summary, if you look just from a role perspective, looking back 10 years, there’s been…first of all the role itself, data scientist, was coined about 10 years ago, but then there’s these other roles that have come to be that maybe always existed, but there wasn’t a specific name for the role. Maybe IT handled it or your statistician was doing some of this and didn’t see it as their full time job. But now you have ML engineers, folks on the MLOps side, folks who are responsible for putting models in production. I mean, is it fair to say that it has improved things, streamlined things, allowed the data scientist to focus more on the experimentation, more on the science?
Certainly we’ve matured in our understanding of how to build a data science team. Those teams have benefited quite a bit from being able to be more precise on the different kind of roles. I don’t just hire data scientists anymore, I hire ML engineers and data engineers, and scientists to do the experimentation, and people for all of those specific roles. There are certainly different skill sets that are required for those. What certainly happens if you don’t do that is either you wait until you hire someone that is excellent in everything…they used to talk about data science unicorns.
Yeah, good luck on that approach!
Yeah, and then you never hire people or you hire people and you’re disappointed because they can do A and B, but not C. Or you figure out specific skills, you test for those skills, maybe it’s good that they have overlap in other skill sets, but if you hire for the specific, just like building any other team, if you have a job to do, make sure that you have coverage of all the skills that are needed and the ability for everybody to grow and maybe move in other directions rather than waiting for the perfect person that can do everything. They talk about, “What if they get hit by a bus?” Then you’re in trouble, right?
Yeah, the key person risk problem. Just for our audience and even maybe for me, if you can just give a one-sentence description for each of these roles. Let’s start with the easy one: how would you define data scientist? And let’s go down to ML engineer, data engineer, and so on.
Data scientist, certainly in the ‘meaning is use’ sense, may be the broadest of the terms. I mentioned earlier the core center of that Venn diagram that is really in every data science role: the data exploration and experimentation. Whether it’s a heavy machine learning role or not, if you have models that are tasked with estimating or predicting or figuring out what’s going to happen given the data, an ineliminable job that any data organization needs is the ability to see how well did this model do… or how well are we doing? Looking at the outputs and seeing how close they are to the target that you want to hit, looking for stratification of the different data points, looking at where your mistakes are and coming up with hypotheses for why. Maybe, “Oh, we’re using this word here and it’s confused about that word, because it said president and it thought it meant Trump and not Biden, because there was just an election.” Things like that.
Yeah, the temporal aspect of it.
Things like that, where you can kind of explain what’s going on and come up with tactics to fix it. You may call it residual analysis: figuring out “how seriously can we take these top line results if we dig down?” On the other end, and we’re going sort of in reverse order, before we even build the model you have to actually understand your data. You have to understand which features there are, what kind of feature engineering you need to do for those features, and you need to come up with hypotheses with subject matter experts on what to throw out. So preparing the data and getting it ready. If it’s a machine learning focused data science role, then you’re also going to build the model, get that ready, and sort of stitch it all together.
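To make that concrete for readers, here is a minimal sketch of residual analysis in Python with pandas; the data and column names are invented for illustration, not taken from anything discussed in the episode:

```python
import pandas as pd

# Hypothetical evaluation frame: one row per example, with the model's
# prediction, the true target, and a feature to stratify on.
df = pd.DataFrame({
    "prediction": [0.9, 0.2, 0.7, 0.4, 0.8, 0.1],
    "target":     [1,   0,   0,   1,   1,   0],
    "region":     ["east", "east", "west", "west", "east", "west"],
})

# Flag misclassifications (rounding the score to a 0/1 prediction).
df["wrong"] = (df["prediction"].round() != df["target"]).astype(int)

# The top-line number alone hides where the model fails...
print("overall error rate:", df["wrong"].mean())

# ...so stratify the error rate by a feature to see where the mistakes live.
print(df.groupby("region")["wrong"].mean())

# Inspect the largest residuals directly to form hypotheses about why.
df["abs_resid"] = (df["prediction"] - df["target"]).abs()
print(df.nlargest(2, "abs_resid"))
```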
Now let’s move out from there. Or we can move out, actually, in two directions. Say you have a data scientist or an ML scientist who’s doing the experimentation: “Hey, I think that I’m seeing this phenomenon, maybe I’m doing the data science function… we’re making this mistake, or maybe the numbers are low in this or that region.” If you’re somebody that’s working with a data scientist, or I am playing that scientist role, then I’m going to come up with hypotheses. “I think that by changing the architecture in this way, or by changing the feature preparation in this way…” Or by whatever other mechanisms you have, like how the model trains, or other techniques that you’re reading about in the research. The best way that the ML scientist can do this is to set up an experiment.
With my Berkeley students, they have to create a term paper every semester. I always say, just like with my son who’s in fifth grade and does his science fair report, “What’s your hypothesis? What’s your treatment? What’s your control? How are you going to…” This is the way, if you’re in an ML scientist or data scientist role, you’re really going to approach it. Now, your phenomenon of experiment is that combination of ML tactics, or modeling tactics more generally, and data. You make a hypothesis that, “Hey, this is the thing that’s happening. I want to fix it. I think tactic A or B or C is going to improve that.” Then you experiment, and now you have to analyze your results. There’s that virtuous loop between the data science role of residual analysis and feature engineering and the ML side, which data scientists hopefully get a chance to do, too. I find that to be some of the most fun. Now, let’s work our way outward, right? That’s kind of the middle part. Let’s work our way outward.
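As a rough illustration of that hypothesis/treatment/control framing, here is a hedged sketch using scikit-learn on synthetic data; the choice of models and metric is arbitrary, just to show the shape of the comparison:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)

# Same split for both arms, so the comparison is apples to apples.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Control: the current model. Treatment: the change we hypothesize will help.
control = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
treatment = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Same metric, same test data: only then does "better" mean something.
print("control   F1:", f1_score(y_te, control.predict(X_te)))
print("treatment F1:", f1_score(y_te, treatment.predict(X_te)))
```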
So, getting the data. There’s also this role, data engineering. I really see that as a role that may be interested in machine learning and what happens to the data and how it’s analyzed, but is really more focused on making sure that at scale, you’re getting the data you need and you’re processing it in the right way, for production purposes especially, but also for high-scale experimentation purposes. That the data is prepared in a distributed computing environment such that it’s ready at the speed that’s needed and goes through the right transformations that will minimize the extra work… you want the scientist to focus on the experimentation, you want the engineers to focus on the optimization and the speed with which the data gets as close to ready as possible for experimentation.
Now, feature engineering involves quite a bit of experimentation. That’s not to say that data scientists don’t do feature engineering, they do quite a bit. That’s a core part. But getting it into that process, and eventually handing off as much of that boilerplate, those known, settled transformations, as possible is going to be important, in particular for production. Maybe going more horizontally: where are you actually building this modeling? Where are you running these experiments?
Again, the name of the game here is the people responsible for experimentation, you want to make sure that they’re focused on running the experiments, analyzing the results of the experiments, and doing that iterative process, and not worrying about doing all the boilerplate of, “Am I running everything on my GPUs right? Am I distributing across my GPUs? Do I have the right metrics? Are these the same metrics that the other departments are using? Is this the right data? Is there any corruption due to data leakage somehow? Am I using a pre-trained model that already saw test data somehow?”
All of these management components ideally are going to be controlled by a platform. A centralized platform that any modeling group, any group responsible for working with data and drawing predictions from the data and experimenting on that data, is going to be able to use. That’s really where I see ML engineering coming into play. There’s basic engineering, some of it overlapping with what data engineering is, which is just making sure you’re keeping track of your datasets, your data leakage type stuff. There’s making sure of the implementation and the environment and the way you utilize your machines, in particular GPUs, and where you store your data: are you storing data in a high-access place or a low-access place? All of those things.
Then, of course, the techniques themselves. What kind of modeling techniques do the scientists want? Do they want to use these models that are available in PyTorch, these models that are available in scikit-learn? And making that streamlined. What a platform should not be is a competitor to PyTorch or TensorFlow. That’s not what I mean by creating a platform, because no matter how many people you hire, you’re going to be outrun by the world’s contributors. What you want to do is be compatible with them and be able to leverage all of the things that the world is creating open source, rather than competing with them and trying to rebuild wheels that have just been built, so to speak.
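One way to picture “compatible, not competing”: a thin platform layer that wraps whatever open-source estimator a scientist chooses, adding shared bookkeeping rather than reimplementing the model. This is a hypothetical sketch, not Michelangelo’s actual API:

```python
# Hypothetical platform layer: wrap any scikit-learn-style estimator rather
# than reimplementing models. The open-source object passes through untouched;
# the platform only adds bookkeeping (here, a shared metrics path).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

class PlatformModel:
    def __init__(self, estimator, name: str):
        self.estimator = estimator  # anything with fit/predict
        self.name = name

    def fit(self, X, y):
        self.estimator.fit(X, y)
        return self

    def evaluate(self, X, y):
        # Every team measures with the same "ruler."
        score = accuracy_score(y, self.estimator.predict(X))
        print(f"[platform] {self.name}: accuracy={score:.3f}")
        return score

X, y = make_classification(n_samples=500, random_state=0)
PlatformModel(LogisticRegression(max_iter=1000), "baseline").fit(X, y).evaluate(X, y)
```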
We’ve hit upon a bunch of topics. Let me try to summarize here a little bit. If you’re looking at sort of the core responsibilities of a data engineer… A core responsibility of a data engineer is getting the data prepped so that the downstream data scientists can do as many high quality experiments as possible, so that for the projects they’re working on, they get more throughput from an experiment perspective. Then what else? The ML engineers are really helping to build the underlying data science platform, which we’ll talk about in a little bit, which stores the actual experiments and the results of the experiments themselves, as well as, I assume, some of the data that’s being used for these experiments, so that those experiments can be recreated and you have collaboration amongst data scientists.
There’s also the challenge of reinventing the wheel. You want to make sure that data scientists aren’t… if you have Fred over here and Mary over there, both data scientists, you don’t want them working on the same experiment just from two different angles. You want to make sure that they’re collaborating and that prior art is being discovered.
Yeah. Collaboration, working on the same data, working on the same problem is something that happens. I’ve never met a scientist who, when someone says, “Here’s the data. Here’s the problem I’m trying to solve. These are our model results,” didn’t say, “Oh, I could do better!”
Right, yeah! That’s the fun part, or the frustrating part, of being in the data science world. It’s not like software engineering where you can develop an app that is to spec and to requirements, then it’s done. There’s always new algorithms, there’s new approaches, there’s different features you can use. If you’re a confident data scientist, you’ll always see a problem and say, “I can do better than that!”
Yeah. So how do you actually get to the bottom of that? The cliché failure is they go off, they take the data, they build a model and yeah, maybe they can do a little bit better. They say, “Hey, my MSE was this or my cross-entropy got down to this or whatever it is. My F1 got to this and that’s better, right?”
Yeah.
That’s where things start to break down, because you say, “Well, I don’t know if it’s better. They built that two years ago in a different department. I don’t know. I don’t have the metrics recorded. Or I don’t know what metrics they used. Or I don’t know what test data they used. Or maybe you got to use more data, so your model’s not actually better, it just got more data.” There are all these different ways in which apples are not compared to other apples, and so a platform helps to standardize that.
It helps to make sure that you’re using the same library for metrics. Even if you don’t know exactly the data, you at least know that you’re using the same ruler. Ideally, you’ll have good data infrastructure and tagging, and you’re versioning the datasets, so you actually know which data was used to get these results, what the volume is, and where there might be leakage.
Yeah, that’s huge.
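A minimal sketch of the dataset versioning just described, using a content hash over a pandas DataFrame; real platforms use dedicated tooling, so treat this as illustrative only:

```python
import hashlib
import json

import pandas as pd
from pandas.util import hash_pandas_object

def dataset_fingerprint(df: pd.DataFrame) -> dict:
    """Record enough about a dataset to tie results back to the exact data."""
    content_hash = hashlib.sha256(
        hash_pandas_object(df, index=True).values.tobytes()
    ).hexdigest()
    return {
        "rows": len(df),
        "columns": sorted(df.columns),
        "sha256": content_hash,
    }

df = pd.DataFrame({"x": [1, 2, 3], "y": [0, 1, 0]})
# Log this fingerprint alongside every experiment's metrics, so any result
# can be traced to the data that produced it.
print(json.dumps(dataset_fingerprint(df), indent=2))
```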
You have that locked in, as well. Then it’s a matter of the particular model type. Maybe you did it offline and then you brought it in and migrated it to your platform. There’s an adoption challenge, especially when you’re first starting out with a Michelangelo or an internal platform, of getting people to use your implementation in the platform versus what they can pull down from a public library. Ideally, you’re pulling down what’s in that public library and just making it accessible within the framework of the platform. Then once people see, “Hey, this is a real benefit,” they’ll migrate over.
From a high leverage point perspective, there are ways of encouraging this. The easiest one is to say, “Hey, what goes into production has to use the standardized library that we have for metrics.” Then people will start to experiment. That’s the stick. The carrot is you make it a lot easier to iterate more quickly, you make it easier to do what you already wanted to do, so you get over that hump of adoption. That is, people are used to doing what they’re doing on the command line or in notebooks or RStudio. They have to not just see that it’s as good, but see that it’s much more effective than whatever they were using before.
As host, I have to say we started with the question of how data science has changed over the last 10 years, and we are actually hitting on it, audience. We started with the fact that the data science role was created, but then these other roles sort of got created, too. I mean, data engineering’s been around a long time. In fact, one could argue that running experiments and building models has been around for decades, as well. There’s a new name for it, but I think the ML engineer role and building out a data science platform are absolutely new. Those, to me, didn’t exist 10 years ago, but they do today, and Michelangelo is one example of that.
Then what we’re talking about now is what are some of those key features that you want to see in a data science platform. The ability to not just look at the code that was developed and understand what algorithm was used for a particular model, and to see what features were being used, but also the dataset used to originally train that model is important, too. Then also, what were the model outputs back in the day? Maybe the various model metrics were just as good as or better than some model that you created right now, and all you really need to do is retrain the model on a newer dataset or something like that and make some small tweaks to the model itself. All of that, data science platforms can help with. What other things did you see in your days at Uber, or do you want to see from a data science platform?
Just to touch back on the question of what’s evolved: I mentioned machine learning, and the willingness to use machine learning has certainly evolved from “Hey, why don’t you try this” to “So you read an article or you thought that this is possible and now you just want it to happen.” Even more so with deep learning. It’s gone from “Go back to academia” to “Why aren’t we using deep learning? We have to have AI and deep learning in whatever we’re doing because that’s the thing that works the best.” Finding a middle ground there, where you’re using traditional methods and then seeing if it’s worth it, whether you can get greater signal from doing more advanced techniques, is really where we’re at now, where the set of rules has evolved.
Yeah, and I sort of glossed over the fact that you’re a professor at Cal as well in your spare time. I don’t know how you lead a normal life with everything you’ve got going on. I believe you have expertise in not just NLP, but also in deep learning as well. I would be remiss if we didn’t mention the cloud as playing a large part in helping data science evolve over the last 10 years. Certainly, this played a huge part in deep learning: the access to GPUs and things like that, which were extremely expensive and very difficult to get your hands on maybe 10 years ago, whereas today, thanks to the cloud and other frameworks, they’re much easier to come by.
Yeah, certainly. We used to beg GCP for vouchers to give to our students so they could actually run their deep learning models, and now, sure, you have to wake up at midnight and restart your GPU if Colab goes down, but for less than the price of a Netflix subscription, you can get a Colab Pro subscription. You get lots of GPUs, you can run them for days, and that’s really night and day from trying to figure out how to make sure you have the right firmware and CUDA and all of that set up on your Linux box, which was not so long ago.
Absolutely. Let’s touch a bit more on getting back to the data science platform. Maybe you can even touch on where you see the future of data science going, maybe that’s bigger and better data science platforms, maybe it’s other things that you want to talk about.
If you look at the lifecycle, it kind of feels like around 2012, 2013 we hit this turning point in image space tasks, and we’ve been making marginal improvements and figuring out all sorts of corner cases and learning the complexities and details of adversarial and generative challenges. With text, that seems to have happened maybe at a five, six year delay from there. Now attention mechanisms are to language as convolutions are to images.
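For readers who haven’t seen one, the core of the attention mechanism he’s referring to fits in a few lines. A minimal NumPy sketch of scaled dot-product attention; the shapes and values are arbitrary:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query attends over all keys; output is a weighted mix of values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # similarity of queries to keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))  # three matrices: 4 tokens, dim 8
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```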
That’s not to say that attention or convolutions are the best and final way of working with these, but you’re sort of getting into that mode where you just kind of pour on more and deeper and longer training using these sorts of fundamental tactics. You have a language model that is better than any other language model, and it’s just a matter of making it bigger and bigger and bigger. What we still haven’t done is really solve the challenge of merging a lot of these together. There are things like image captioning that are impressive but also sobering, right?
Wow, it was able to say “there’s the monkey,” but you can find all these corner cases. In language tasks in particular, there’s the GLUE benchmark, which was replaced by SuperGLUE, which has a slightly different flavor of benchmarks. But BERT and RoBERTa and all those, and T5, and I think ERNIE is now the leader. A lot of the GLUE benchmark tasks were all about natural language inference. And models really just destroyed those natural language inference benchmarks: GLUE came out in 2018, models beat human performance within a year, and so they had to come out with SuperGLUE.
We’re finding that there’s a lot of information leakage there. There are a lot of papers. A really good entry point, if you want to get into this, is the adversarial NLI literature, and also the “right for the wrong reasons” work. That’s a good entry point into seeing what goes wrong with these. These models will be really good at detecting signals that are highly correlated, like, “Hey, they mentioned the consequent in the antecedent.” The model is really good at saying, if the “if” part of the sentence said X and the “then” part said something that sounds similar to that, I’m going to say that the “if” entails the “then” and it’s a good entailment. Whereas maybe if they don’t have that, then they don’t. And you can game the system: when you construct examples that have that pattern but where the “if” clearly doesn’t entail the “then,” the models do much, much worse.
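To make the “cheating” concrete, here is a toy sketch of that heuristic: a fake NLI “model” that predicts entailment purely from word overlap. The sentences and threshold are invented for illustration:

```python
# A deliberately dumb "NLI model": predict entailment whenever the premise
# and hypothesis share enough words. On naturally sampled data this shortcut
# correlates with the label; on adversarial pairs it falls apart.
def word_overlap(premise: str, hypothesis: str) -> float:
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / len(h)

def predict_entails(premise: str, hypothesis: str, threshold=0.8) -> bool:
    return word_overlap(premise, hypothesis) >= threshold

# Looks smart on an "easy" pair...
print(predict_entails("The judge gave an answer to the defendant",
                      "The judge gave an answer"))           # True, correct

# ...but the same shortcut fails on an adversarial pair with high overlap
# and no entailment (the word order is reversed).
print(predict_entails("The judge gave an answer to the defendant",
                      "The defendant gave an answer to the judge"))  # True, wrong
```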
There’s another layer of figuring out how much we are cheating. Are our models cheating and looking at things that maybe they shouldn’t be basing decisions on, even if, from a statistics perspective, it’s a very useful signal to detect within the particular dataset that we are training them on? This is a data quality issue, but it’s also about coming up with better tactics for making sure that our models are not right for the wrong reasons. There’s literature on adversarial imagery, on adversarial features that you can inject into imagery, that may, it’s tough to say, but it may have a similar role: you have these adversarial features that models can use in order to be more accurate on their test set, but you get the feeling that if you just pulled those adversarial features out, or you applied them in some other false positive context, then you’re going to get the wrong answer.
We’ve done a lot of great work over the last five, six years in understanding optimization, thinking about the way models learn, and different kinds of optimization tactics and ensembling and weight averaging, and thinking about how local minima tend to cluster and how that generalizes. What we haven’t done such a good job of is thinking about ways to solve these right-for-the-wrong-reasons style problems. That’s another thing that I think would be really nice to move into. I haven’t even talked about reinforcement learning yet. There’s a lot of really cool stuff happening there in terms of having the agents create these representation models, “world models” is the term of art for what’s going on, so they can do high speed planning on tactics. More than the other two sort of really fun generic playground areas, they’ve been really good at taking what we’ve learned in deep learning with image recognition, using that for generative modeling, and using that to make the reinforcement learning agents better.
In summary, help me out here, Mike. That was a lot for me, personally, to take on here. But NLP has improved quite a bit, although it’s been a laggard compared to image processing, and you see there’s still a way to go when it comes to NLP and improving the inferences that come from the various models. Then, switching gears and looking more at reinforcement learning, again, there are some pretty exciting things going on there. If you were to dumb it down for the likes of me, I won’t speak for anyone else who’s listening, but when you say NLP and you say right for the wrong reasons and you talk about reinforcement learning… just help us understand a little bit more about what you’re referring to.
Yeah, I’m not going to remember some of the examples, presumably, as well as I should. Phrase matching is one of the failure modes. Or let’s just take a simpler case, which is maybe less compelling. “The judge gave an answer to the defendant.” Does that entail “The defendant gave an answer to the judge”? Or “the judge said X to the defendant” versus “the judge said X near the defendant,” or something else that has strings of words that appear in both the antecedent and the consequent.
Yeah, so you’re trying to understand who was the judge talking to and when the defendant replied, was it then replying to the judge or someone else? That kind of thing?
Yeah. What happens, as a symptom of the dataset itself, is that the A sentence and B sentence combinations tend to have entailment more often when just those kinds of brute force patterns emerge, like there is a copy of text in A and a copy of that text in B. There are more subtle versions that they go into also. It turns out that when you remove that overabundance… maybe that is an accurate, natural representation of the way people speak, but for balanced modeling, maybe the model needs to learn how to take care of those cases but also take care of the cases where A and B don’t have that matching pattern of sequences of text. By figuring out how to get a model to learn from those harder cases, it’s going to be more effective at generalizing into strata of text that don’t have this kind of pattern, right?
Right.
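A hedged sketch of what learning from those harder cases might look like in practice: score each training pair for the lexical-overlap artifact, then upweight the low-overlap stratum so the model can’t lean on string matching. The pairs and the threshold are made up:

```python
# Stratify training pairs by the lexical-overlap artifact, then rebalance so
# that high-overlap examples no longer dominate. The data is illustrative.
pairs = [
    ("the judge gave an answer", "the judge gave an answer", "entail"),
    ("the judge gave an answer to the defendant",
     "the defendant gave an answer to the judge", "not_entail"),
    ("a dog ran in the park", "an animal was outdoors", "entail"),
    ("she bought three apples", "she bought fruit", "entail"),
]

def overlap(a: str, b: str) -> float:
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sb)

hard = [(a, b, y) for a, b, y in pairs if overlap(a, b) < 0.6]
easy = [(a, b, y) for a, b, y in pairs if overlap(a, b) >= 0.6]

# Oversample the hard stratum during training so the model has to use more
# than string matching to get the label right.
train_set = hard * 2 + easy
print(len(hard), "hard pairs,", len(easy), "easy pairs")
```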
It’s this weird thing because, well, it works and you might say, “Yeah, but maybe you’re just overfitting to the training set, right?”
Right.
Then the rejoinder there is, “Well, yeah, but we tested it with a whole dataset and we validated it with a whole dataset, so it works on the real data, too.” Then you’re getting into the area of, “Well, yes. It’s effective on the natural samples of text that you might get, the if-then sentences you might get, or the A-and-B combinations you might get where A entails B or A doesn’t entail B. But it’s very fragile in the face of these perturbations, changing the A and B, because it’s gotten overdependent on its sort of cheating way of detecting that A entails B.”
Right, right, right. I mean, the bottom line is if a human can see the text and sort of infer whether A leads to B and interpret it, then gosh darn it, our models eventually should be able to get pretty close to that. Maybe not perfect, but much better, certainly, than where it is today. I think, bottom line, going back to the question at hand of where we are going from a future standpoint: there’s still a lot of room for improvement on the NLP side, as well.
That’s certainly fascinating. It’s been an interesting journey over the last 10 years, from a data science platform perspective and from the various roles that have been carved out because of that. We also touched a bit on the improvements in data science itself, the algorithms that are being used and the advent of deep learning. All of these are improvements we’ll just see more of going into the future. I also think that, just speaking personally, there is a change management aspect to data science: having folks embrace the work and the output that we’re doing as being more than sort of voodoo. And that explainability aspect… I think there’s still some room for improvement there, too.
Actually having models being put into the center of enterprises and being used to make business decisions. I don’t know if you agree, but I tend to think that we still have a little bit of a ways to go there. I think some companies have embraced it, maybe in Silicon Valley more so than the rest of the world. I don’t know. I don’t know if you see that as a challenge, as well.
So actually having, say, the business side embrace the outputs of models and putting them at the core of the business. I imagine at Susquehanna, being a firm that deals in data and has models at the forefront of what it does, that’s probably not a challenge. Then at Uber, probably not either, having been a startup that is all about data, all about driving efficiencies and driving affordable trips for folks getting from A to B. But there are other companies who are not as data driven, who are building out data science teams and who might be struggling to look at the use cases they’ve had traditionally and understand where data science can be applied. I think it takes a strong data science leader, one who’s not just very strong from a data science perspective, but who can actually change hearts and minds and say, “Actually, we can solve this problem through the use of data science.” I think that could be a challenge, right?
Certainly. You don’t necessarily have to make the case that using modeling is going to be valuable, or could potentially be valuable, but when it comes to putting a specific model into production, mistakes can be costly. Mistakes can be especially costly if you’re at a big company compared to a small company, and so the more complex and the more ostensibly non-transparent the model, the harder that case becomes. There’s been a lot of work in transparency, even for deep neural networks, in understanding how they work and why they work.
But that really comes back to one of the things that we started with, which is the residual analysis. Getting a top line number is great, but you need to understand what’s behind that top line number: where is it making mistakes, why is it making mistakes, and where is it getting things right? Is it getting things right because of real signal, or, like the right-for-the-wrong-reasons examples, because of something else? That’s not even just a matter of having data that shouldn’t be in the training set. It’s that there is information in your training data, just by the way it was sampled, or in a large enough proportion of your training data, that the model learns how to cheat with. It learns how to cheat on the test data because of the way it was sampled.
Yeah, bias in your data.
Right. You got it. Yeah. I remember hearing an interview about one of the early Kaggle competitions. The woman who won saw that the serial numbers were actually not random, and she put that in as a feature. It was really just an index, but she ended up winning the competition. That sort of thing is maybe the most extreme example, but usually these are much subtler: less a slap-your-forehead mistake and more a complicated issue in terms of data sampling and data preparation and the stratification of what you’re training your model on.
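A simple sanity check catches the grossest version of that kind of leakage: test whether a supposedly uninformative identifier column predicts the target on its own. A sketch with synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

n = 1000
y = np.repeat([0, 1], n // 2)           # labels happen to be sorted...
serial = np.arange(n).reshape(-1, 1)    # ...so a sequential ID leaks them

# If an ID column alone beats chance by a wide margin, that's leakage,
# not signal: the model is learning the dataset's construction.
score = cross_val_score(RandomForestClassifier(random_state=0),
                        serial, y, cv=5).mean()
print(f"accuracy from the serial number alone: {score:.2f}")
```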
Yeah. I mean, the output of that residual analysis, if it can help explain why your model is off by a certain percent, why it’s not 100% accurate, and if you can explain that to your business counterpart, it can help build trust. It builds faith that there’s actual thought being put into how these models are being architected, designed, and improved over time. The more that happens, the more I think people will embrace data science and put it in critical parts of the decision process. We’ll just see more of that. I think there’s still some room to grow there. There are companies who have data science at the core of who they are; Netflix is an example that comes up all the time. Then there are others that really just started that journey over the last five-plus years, and there are a lot of nascent use cases where I think more and more adoption will be seen, hopefully.
Last but not least, I do want to touch a bit on the brand that you’ve built. You are very prolific on Twitter, I did mention that at the outset. I did want to talk about that. How do you do it? Are you reading all these articles? Because I don’t know how you do it.
Yeah. I don’t do it three times a day like clockwork, if you notice. What I do is every few days I’ll sit down and read all of the different links that I’ve saved throughout the last half week. I’ll go through them, and anything that I think is interesting… I don’t necessarily share something because I think it’s the right tactic, but at least it’s an interesting tactic. I’ll go through those and then I’ll queue them up. I have one of those social media tools that sends them out one at a time.
Yeah, sends it out on a scheduled basis. That’s fancy pants.
Yeah, and doing it all at once every few days is a little bit less onerous. It’s a lot easier than doing it 24/7.
That’s your trick, that’s your trick. It’s @MikeTamir, very easy to find. Follow Mike! He’s on the cutting edge of data science and he’s always sharing lots of great articles. I think it’s clearly done a great job for you in terms of building your brand. I think the important lesson, though, really is more than just making a name for yourself on social media: clearly you’re passionate about the world of data science and you’re doing this to improve yourself as a data science leader. You’re reading these articles, I imagine, to really up your game. I think that is a lesson to all of us, that we always need to be learning, we always need to be reading. There are always improvements, new algorithms, new approaches that we should be aware of that make us better at our jobs. Absolutely.
When I left grad school I had this terribly wrong impression. I was like, “Okay, well, I finished, there’s the learning part. That’s done and now I don’t have to learn anymore.”
If only! Yeah.
It was maybe less a dusting off of my hands and more… I almost felt like I had to mourn the fact that, “Oh, I’m not going to be learning all the time anymore.” And I was pleasantly surprised; it’s not like that at all. It hasn’t even been hard to make learning new things part of my day to day. Almost none of the tools have stayed the same, I guess I still use Python, but very few other tools, and almost none of the tactics, aside from the fundamentals of “this is good experimental practice, this is good machine learning practice.” Those might have stayed, but the particular techniques change so much. It’s fun to keep up with it and to use it and to see these ideas actually get put into practice, which is really where a lot of the juice is for me in my work.
Coming full circle, I mean, that’s why it’s called data science, because there is a science aspect to it. Find a scientist out there who’s not experimenting and trying new things; they’re not really a scientist anymore, right? They’re not actually doing their job. Their job is to experiment, to learn from their mistakes, to uncover additional insights through research, and that’s what makes our space fun: it is purpose-built for learning. Learning is core to being a great data scientist, and it should be core to being a good data science leader, as well.
If you want to learn more about Mike, follow him on Twitter, @MikeTamir. Also, you can find him on LinkedIn and reach out to him. But thank you so much for being on the podcast. I really appreciate it, Mike.
Absolutely. Great talking with you!