Data Science Leaders | Episode 06 | 44:16 | June 08, 2021
Fiona Hyland, Director of R&D, DNA Sequencing Informatics
Thermo Fisher Scientific
The field of bioinformatics plays a critical role in medical breakthroughs like the COVID-19 vaccine.
Fiona Hyland, Director of R&D, DNA Sequencing Informatics at Thermo Fisher Scientific, teaches all of us about how it happened in the latest episode of Data Science Leaders.
Welcome to another episode of the Data Science Leaders podcast, I’m your host Dave Cole, and today we have a special guest, and honestly a special episode. Today our guest is Fiona Hyland. Welcome to the Data Science Leaders podcast, Fiona.
Thank you, thank you. Delighted to be with you.
Great. So Fiona, you’re the Director of R&D, DNA Sequencing Informatics at Thermo Fisher. What we’re going to do today on the Data Science Leaders podcast, I’ll let you briefly explain what Thermo Fisher is as a company. I want to dive into the world of bioinformatics, which is your world. And for the audience, what that’s going to entail is a little bit of biology, so I did some research and I’m going to try and get us all up to speed on your ninth grade biology if you remember back in high school, and then we’re going to dive in and we’re going to talk about something that’s very topical to all of us here on the Data Science Leaders podcast, and listening to the podcast, and that is the coronavirus and what is going on in the world of bioinformatics to come up with a vaccine and get us out of this mess, and I’d love to learn a little bit about the background.
In the course of this episode, obviously we’re going to learn how data science plays a role in both the bioinformatics field as well as in coming up with the vaccine and addressing this coronavirus situation that we’re in. Sound good?
Yeah, great!
Alright, let’s do it. So first of all we need to do some Biology 101 stuff. I’m telling you it’s going to be super helpful. First of all, what is a genome? Let’s start there.
So the genome is like a book of instructions for how to make a cell and how to make a body. And this book of instructions only has four letters: A, T, C, and G. So it’s a very small information space. It codes for 20,000 different genes.
Every human has 20,000 different genes?
Yeah.
And what is a gene?
So what a gene is, it’s a unit of information that codes for a protein that tells the body how to make a protein, what the sequence of the protein should be.
Got it. So we have these, I think they’re called bases, these A, G, C and T. The combination of those, there’s like certain combinations of them that make up a gene, is that correct?
That’s exactly right, yeah.
And then how many, roughly speaking, make up a gene? Does it vary?
It varies a lot. Some genes are really long and some genes are really short. A gene may be as short as 100 of those bases or it may be as long as thousands of those bases.
Okay, alright. Great. So we have 20,000 of those genes. Now, fun fact here, and I did some research here, but the 20,000 genes that are genetic material, the As, Cs, Gs, and Ts, represent only 2% of your entire genome’s genetic information. So there’s 98% that is left remaining. Now is that just garbage? Is that not needed? What is the thinking on that?
Oh, that’s a great question. So there’s been a lot of discussion in the genetics community about what is that 98%? It was called junk DNA a long time ago. Now we don’t consider most of it to be junk, necessarily. So one thing about the genes is that those genes are split up. So you have a quarter of a gene and then you have a big space and then you have the next quarter of the gene, and then you have a big space and then you have the next quarter of the gene. And those are called exons, and that’s the way it is in all plants and animals, but not bacteria and not viruses.
So some of that 98% is actually interspersed between the exons within the gene. Some of it is between genes, and some of that sequence is used for things like regulation of the genes. Which genes get turned on in which cell at what time, and in response to what stimuli.
This is fascinating. I’m sure I learned this way back in the day. So there’s basically switches, and even time-based switches to figure out when the gene should be activated during certain points of your life, I guess?
Yeah, in which cell, in which point of your life. There are also a phenomenal number of feedback loops in biology. And so if one gene gets expressed too much, oftentimes there are systems by which other genes will notice that and push it back down again. So there are these tremendously complex negative feedback loops which keep things in balance, which is called homeostasis.
Okay. We’re going to, like, tenth grade biology now. Okay, maybe not, but let’s get back. The other important statistic I have in front of me here is 99.8% of the genetic material that makes up the human genome is the same amongst all humans and only 0.2% is different. So if you do the math, there’s 3.2 billion bases that make up your DNA, those As, Cs, Gs, and Ts. You do the math, multiply that out, there’s like 6.4 million bases that make up the genetic material that makes us unique. So is that all you work with, is just that 0.2%? So we already talked about that junk DNA, but now we’re talking about the stuff that’s actually unique, or no? Do you think that other 99.8% is also important?
We’re most interested, typically, in what’s different between humans, or indeed what’s different between viruses, but that’s not all we’re interested in. We always have to take account of the full context, and all of those three billion bases of DNA because typically what we want to do is we want to pull out the genes that we’re most interested in, and a lot of the genes are actually very similar to each other. Sometimes they exist in gene families that are quite similar to each other. And so in order to pull out only the genes that we want, we have to understand the relationship between the different genes and how the genes differ from each other, even if they’re the same in every human, how the genes differ from each other so that we can precisely target just the ones that we care about.
Gotcha. Okay.
And then we do that in order to measure the variation that may exist in a particular sample or in a particular individual.
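The back-of-the-envelope figure quoted a moment ago (3.2 billion bases, 0.2% differing between people) can be checked in a couple of lines. This is only the arithmetic from the conversation; the constant names are made up for illustration.

```python
# Figures as quoted in the episode; names are illustrative only.
GENOME_BASES = 3_200_000_000   # bases in the human genome
DIFFERING_PER_THOUSAND = 2     # 0.2% written as parts per thousand

# Integer arithmetic avoids floating-point rounding.
differing_bases = GENOME_BASES * DIFFERING_PER_THOUSAND // 1000
print(f"{differing_bases:,} bases differ between individuals")
```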
Okay, and we’re going to talk a bit about that process too because I’m interested in how this all turns into data and where the data science comes in, so we will talk about that. The last thing I wanted…audience I apologize, but this is last thing, I promise, before we dive into the data science part of this and talk about bioinformatics…but RNA versus DNA. So we’re going to be talking a bit about the coronavirus. So humans have DNA, that much I know. The coronavirus, I believe, has RNA. What is the difference between the two?
Sure. So the DNA is the book that is encoding all of our genes, and the RNA is like a copy of one of the genes in our DNA that’s almost an exact copy except that one of the letters is replaced by U. So we have a slight change in what the letters look like, and most importantly it’s a very short snippet of RNA, mostly corresponding to a single gene and that single gene is then translated into a protein.
Okay, so I believe it’s like the messenger for that single gene that says this is the protein that you need to go build?
That’s right, and it’s one single molecule. The DNA is all together and then you take one little snippet of it, make that copy of that one little snippet and it’s that one snippet that is used to do the translation.
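The letter swap described here, where the RNA copy carries U in place of T, can be sketched in Python. This is a toy under stated assumptions: the function name `transcribe` and the example snippet are hypothetical, and real transcription involves strand orientation and processing steps that are ignored here.

```python
def transcribe(dna: str) -> str:
    """Return an RNA-style copy of a DNA snippet, with every T replaced by U."""
    return dna.upper().replace("T", "U")

gene_snippet = "ATGCTTAG"  # hypothetical snippet, not a real gene
print(transcribe(gene_snippet))  # → AUGCUUAG
```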
Okay. And viruses just have this one snippet, basically. They never had DNA, they just have this RNA thing. Is that right?
There are different kinds of viruses, most viruses are DNA viruses. But there’s a subset of strange and weird viruses that are RNA viruses, and the coronavirus is an RNA virus.
Bingo, okay. Alright, cool. So this is our foundation that we’re going to work from, I hope. Of course I’ll ask questions on behalf of myself and hopefully those in the audience who are not steeped in the world of genetics or bioinformatics will learn a little something along the way. Now what is bioinformatics? We have the basics of the biology, but what is bioinformatics?
Bioinformatics is kind of the intersection of biology, and especially genetics, and informatics. It includes aspects of computer science, it includes aspects of statistics, and it’s applying those disciplines to genetics.
Got it. So if you’re a data scientist out there, by the end of this episode I’m sure they’ll all be wowed and fascinated and want to work for your team, but you need to have a background in biology, specifically genetics, you need to be able to code, you also need to understand statistics. Anything else you need? Maybe even data engineering type work? What else do you need? What’s the logical list of things you need to be in the world of bioinformatics?
Bioinformaticians come out of a lot of different disciplines, actually. I’ve had bioinformaticians who started out as mathematics people, computational chemists, physicists, biologists, genetics people, computer science people. So people come from a lot of different backgrounds and there are, of course, majors in bioinformatics, you can do a masters in bioinformatics if you’re coming out of any one of those disciplines. But quite often people have strengths in some areas and gaps in other areas and they may learn on the job.
Right. So that’s good to know. So there’s hope if you didn’t go to Cornell and get your advanced degree in quantitative genetics for example, Fiona. Great school, by the way, Cornell.
Thank you.
So there is hope. So you have people on your team who have different backgrounds and different disciplines, and have sort of been trained, I imagine, on the biology of it all and are able to work in this field. So that’s great. Obviously it would be a bonus if you had all of these skills, but it is possible. Okay, so what is sort of a day in the life of, I hope I don’t butcher this…a bioinformatician? Okay, there we go.
So typically people are working in a project team and the project team is composed of people from multiple disciplines. The bioinformatics person is focusing on the data, as you might expect. At some stages of the project, they might be identifying what are the targets, like which are the genes that are important? At other stages of the project, there might be a bioinformatics person who is working to design a way to pull out just the genes that we want, so that’s called assay design. At other stages of the project, there might be somebody who is focusing on developing an analysis solution that will give results with high sensitivity, high specificity, reproducibility, and robustness. There are also people who work on more generic pipelines that can be used by a lot of different projects, and they are maybe collaborating with those teams, but focusing on one particular type of a pipeline, an analysis pipeline.
Okay, so let me break this down. Identifying which genes are important, that I assume you need to have somebody who has a genetics background. Well, before you even get there, I assume you’re working with data but what data are you actually working with? How do you extract this data?
Yeah, so the data that we work with is pretty varied actually. The first, most important piece of data is the sequence of the human genome or the sequence of the viral genome, whatever genome you’re working with. So that’s those three billion As, Ts, Cs, and Gs, and that is the core of everything that we do. Then there are other public databases that are repositories of knowledge that have been assembled and curated by academia, by government, and these include databases that list cancer-related variants like COSMIC, disease variants like ClinVar, copy number variants like ClinGen, and then the normal human population variants that you referred to earlier, like the 1000 Genomes Project, like gnomAD.
And so we take all of those public sources of information and also our own proprietary sources of information as the foundation, then of course there’s also the lists of genes. What are the properties of those genes? Where are they located? That’s one type of data. That’s the very pure biological data. And then another type of data is the data that comes from our instruments. So we do the DNA sequencing of these panels that we’ve designed.
Sorry, what is a panel?
A panel might be for example, all of the genes involved in a particular disease, or it might be the COVID-19 virus. So a panel is a collection of sequences that are just the genes that you want to interrogate.
That’s right, just the genes that you’re focusing on for the purpose of whatever project. And can you give me an example? What is a project that you might be focusing on?
Sure. So, for example we might design a lymphoma panel which focuses on the genes that are of interest in lymphoma. My team designed a SARS-CoV-2 panel, which focuses on all of the DNA in the coronavirus.
Right.
You might have a panel that focuses on solid tumors like lung cancer or melanoma. You might have a panel that focuses on the genes that are involved in a specific genetic disease.
Okay. And how do you know which genes are associated with a specific disease? Is that where the databases come in? That sort of collective knowledge of the folks in the field?
Yeah, that’s mostly collective knowledge. It’s the collective knowledge of the research community. Databases and also the scientific literature.
Okay. So I’m working on a project, maybe it’s focused on lymphoma, and I know which genes I want to include in my panel. Now what do I do? Is it looking at drugs…what do I do next?
Yeah, so then what we do is say how can we pull out the genes that we’re interested in? And we look at the human genome, we look at a lot of these public data sources that I told you about, and we also look at our own internal data to understand, how can we design primers that will bind very specifically just to the regions that we want, not to anywhere else? And that will pull out that DNA very effectively and precisely and specifically so that we will be able to sequence it?
Got it. So a primer is a way of sort of chopping up your DNA sequence to find that sequence that is the specific gene that you care about? Is that right?
Exactly. It’s a way of chopping up the genome and pulling out and amplifying, meaning making lots and lots of copies of just the genes that you want.
Got it. So if you want to look at my genetic material, Dave Cole’s genetic material, do you need a blood sample from me? What do you need to be able to look at my sequence?
You would be terrified about how little tissue I need!
Oh dear.
One great thing in our technology actually, we have a technology called AmpliSeq, and one of the great things about that, it can work on a tiny, tiny, minute amount of DNA. And we actually use this in PCR, we use this in forensics, where the most tiny amount of tissue that somebody leaves behind at a crime scene, we can amplify it and we can tell who that person is.
I’ve watched enough crime dramas to probably know that all you need is some flake of skin or whatever…
That’s right. I’ve worked on projects with that team, and that’s pretty fun. I can take a bit of your blood, I can take one of your hairs, I could take a cheek swab. Or if it’s cancer, then it would often be an FFPE sample, which means a sliver of the tumor. But again, because of the ability of our technology to work with tiny amounts of DNA, we can access tumor samples that are much lower in volume. And one of the newest technologies that we’ve been working with is ways to take blood and interrogate the blood and look for the very small number of cancer cells that are in the blood, and we can find out properties of the tumor cells just by looking at the blood.
Okay. So this is really cool. Thankfully I’ve never had cancer so I’ve never gone through this. Family members and I’m sure many in our audience have been impacted either directly or indirectly by cancer, but my understanding is you need to get a biopsy, typically. So you go in and you get some of the cells from the tumor itself and then those are then sequenced, studied, etc. But you’re saying there might be something on the horizon that says we don’t need to do some invasive surgery to get some tissue directly from the tumor itself. Instead we might be able to actually analyze blood samples to identify some of those cancerous cells that somehow have made it from the tumor into your bloodstream?
That’s exactly right, that’s exactly right.
Holy cow. I’m learning something here.
Yeah, you can take blood and pull out just the genes you’re interested in, and some fraction of the cells will have come from the cancer. And it actually turns out, typically, that the more advanced the cancer is, the higher the percentage of the cells that have come from the cancer.
That totally makes sense. Right. Yeah.
Yeah, it does make sense because you’ve got the tumor volume is larger, the mass is larger, and some of those cells are being shed into the bloodstream, and then you can figure out which of those mutations are coming from the cancer cells themselves. They’re at very low frequency, and so that required a new technology where we tag each piece of DNA individually to detect variants that are such low frequency.
So when you’re looking at the DNA sequence of a cancerous cell, there’s some mutation in there? Do I have that right? So there’s something messed up with the DNA sequence, is that right?
That’s exactly right, yeah. Almost by definition there’s something messed up in the DNA sequence if you’ve got cancer, and some of those mutations are called driver mutations and those are the ones that are actually causing the cancer.
Got it.
And what’s especially exciting is that there’s this new field of personalized medicine where now, drugs have been developed that are very precisely targeting those mutations, and if you know which mutations a person has, you can identify which is likely to be the best drug for that particular patient.
That is awesome. I’ve read a lot about personalized medicine, but that’s taking it… actually using your specific cancerous cells and the DNA sequence within those cancerous cells to get a specific type of drug. I’d imagine there’s a lot of data science there that is helpful to determine which drugs should be used. Is that fair to say?
Yeah, absolutely. The data science probably comes more during the drug development, but especially it comes at the stage of developing the methods of testing. The drugs were developed specifically for those mutations, so the drugs come first, I guess. The knowledge of the mutations came first and then the drugs were developed against those mutations, and now what we’re doing is we can sequence the people and their tumors to understand what are the properties of the mutations in their cells.
Got it. Okay. I do want to get to the coronavirus, but we sort of stumbled upon ways in which to help address cancer. I’m reluctant to use the word cure cancer, because I think there’s so many different types of cancer and there’s some treatments that are highly effective, there are types of cancer that are extremely deadly. I’m not an expert, but when I hear ‘cure cancer’ I’m kind of like, well which ones?
I think the way that we cure cancer is to increasingly develop better treatments and identify the best candidates for those treatments, and identify the best treatments for every patient. And also to find cancer earlier and to characterize the mutations in the cancer so that we really understand what’s happening in the patient.
Perfect. So let’s talk a little bit about the coronavirus. So all I know at this point in time, and this is recorded so this is the middle of April right now, but there are multiple variants out there, so my first question is what is a variant? And how does that happen? And then I think what is probably of most interest right now is, we have three different, at least in the US, vaccines that are being created en masse, and that are all being used to hopefully make us less susceptible to getting the coronavirus, substantially so. How was that actually created? What was the approach and the process to that? Maybe we can just start with what is a variant?
Yeah, so a variant is a change in the DNA, or in the RNA in this case, from the normal sequence. So we discovered very quickly the sequence of the entire virus, of the entire SARS-CoV-2 coronavirus. That sequencing was done quickly after the existence of the virus was discovered.
Do you know how quickly? I’m curious, what is quickly in your world?
I believe it was weeks. I believe it was weeks after the virus was discovered. And a variant means at least one base, one of those letters, changed. So in the regular sequence, maybe it’s AAATT. Well, something changed and now it became AACTT. And so that change of one letter to another letter is a variant. It’s not always a change of one letter. We could insert a letter, or we could delete a letter, or there could be multiple letters changed, but that’s fundamentally what a variant is.
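The AAATT to AACTT example can be written as a small comparison. This is a minimal sketch that handles only single-letter substitutions between sequences of equal length; the function name `find_substitutions` is hypothetical, and inserted or deleted letters would need a proper alignment step first, which is not shown.

```python
def find_substitutions(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """List (1-based position, reference base, sample base) wherever the letters differ."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length; indels need alignment")
    return [
        (i + 1, ref_base, sample_base)
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]

print(find_substitutions("AAATT", "AACTT"))  # → [(3, 'A', 'C')]
```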
Got it. Now, going back in time here, we have the RNA sequence complete for the coronavirus. Now what? How did Johnson & Johnson, Moderna, Pfizer all put together a vaccine? Can you tell us a little bit about the approach that they took once they knew what the RNA sequence was?
The first thing to do is to understand what are the genes in the coronavirus, and there are just a handful of genes in the coronavirus.
That’s good.
One of them is called the spike protein, the S protein, and that’s the one if you see these pictures of the coronavirus with these spikes on the outside of it?
Oh, yeah.
That’s the spike protein. So the spike protein is the protein that attaches to the human cells, and that was the candidate for the drugs because it’s on the outside of the virus.
Okay, keep going.
Once the sequence of the spike protein was known, then the decision had to be made, well where within that spike protein are you going to put your vaccine candidate? How are you going to choose your vaccine candidate? And so for that, one of the things that is done actually, is to look at the variants that have been discovered in the population to date, and see where they are in the DNA, and see are there regions in the DNA where there don’t seem to be any variants happening? And those are called conserved regions. So those are very stable, and the hope is that as time goes on you won’t get variants in those regions, and so you should focus on those regions. That’s important in developing tests as well, which is what my company does. Thermo Fisher doesn’t do vaccines, we do tests and we also do sequencing so that people can discover new variants.
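The idea of looking for stretches where no variants have been observed can be sketched as follows. This is a toy illustration, not how any vaccine developer or test designer actually does it: the function name, the positions, and the minimum length are all hypothetical, and positions are assumed to be 1-based and within the sequence.

```python
def conserved_regions(length: int, variant_positions: list[int], min_length: int) -> list[tuple[int, int]]:
    """Return (start, end) 1-based inclusive intervals that contain no known variant."""
    regions = []
    start = 1
    # A sentinel just past the end closes the final gap.
    for pos in sorted(set(variant_positions)) + [length + 1]:
        if pos - start >= min_length:
            regions.append((start, pos - 1))
        start = pos + 1
    return regions

# Hypothetical 30-base sequence with variants seen at positions 5, 6 and 20.
print(conserved_regions(30, [5, 6, 20], min_length=5))  # → [(7, 19), (21, 30)]
```

In practice the confidence that a region is truly stable depends on how many samples have been sequenced; an empty stretch in a small dataset says much less than one in a large dataset.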
Right, so you’re helping the companies that are creating the vaccine.
Yeah. Yeah.
So the vaccine itself, what is it actually… how is it helping? How is it attacking these spike proteins? What is it actually doing?
What’s happening is…this is a novel technology. These RNA vaccines are a novel technology. RNA is going into the body, and then inside the body that RNA is using the cellular machinery to create proteins. So those proteins are being created in the body, unlike a traditional vaccine where you would just put the protein directly into the body, here you’re putting the RNA into the body and then the protein is being generated inside the body using the normal cellular mechanisms. And that protein is then recognized by the body as being an attacker, right? As being strange, as being non-human, and then your body mounts an immune response against that protein.
Got it, so instead of just getting the protein itself, you’re getting the RNA which the body turns into the protein and then the body builds up antibodies to attack that protein which is similar to the coronavirus S protein, and that just helps us fend off the coronavirus?
Yeah, that’s right. So when the body has seen a threat once, it recognizes it. When it sees it again it recognizes it and it can much more quickly and effectively mount an immune response when it sees it again.
Yeah, it’s just kind of the nature of you get the flu once, and typically you’re not going to get the same version of that flu again.
Exactly.
It’s pretty rare, right? Because you’ve built up an immunity to it. But helping us to get immune without us all building up herd immunity and actually getting coronavirus itself, which is deadly, that’s… So somehow this is done in a way that is very safe, right? It gives us the protein, but doesn’t actually give us the coronavirus itself and all the symptoms that go along with it.
Yeah, it’s only one protein, so it’s not the whole virus, and that’s why it’s so safe. You need to have the whole virus for it to replicate inside you and cause the disease. You’re not giving people the whole virus, you’re just giving a portion of one of the proteins, and so that’s why it’s so safe.
Got it. Well this is awesome. Alright, so I think, so far, I’ve learned a lot about bioinformatics. I’ve learned a lot about the coronavirus itself. I’ve learned a lot about just genetics in general. So in your world, getting back to sort of where we started before we segued into all this good stuff and learning about the topic of the day, the projects and the type of work that you do on a daily basis from a data science standpoint, you’re isolating these genes, you’re pulling them out, and then you’re analyzing them somehow. So these sequences, I assume, turn into data, that we can store as data in databases and what have you, but then what would be the best analogy in terms of what type of algorithms or type of approaches you’re using to actually do the analysis of the various genes that you’re focused on?
A typical data set that the instrument would generate might be a million pieces of DNA that might be 150 bases long each.
Okay.
And those are As, Ts, Cs, and Gs. There’s a lot of processing to even get from the raw signal to there, but that’s where bioinformaticians typically start. So you have this file of a million, it could be 100 million…
100 million rows, but with 150 bytes, basically.
150 bytes, that’s right. And so what we’re trying to do is we’re trying to take that raw data and figure out where it belongs in the genome. So the first thing we do is called mapping, and there’s a number of different mapping algorithms, and we take each one of those reads in turn and we place it where it belongs. And we have to account for the noise in the data, we have to account for the similarity of different genes to each other, we have to account for the fact that maybe it’s truncated in some way. But we place it where it belongs.
And then we have a pileup of reads. So you have lots and lots and lots of reads that all pile up on top of each other in the genes of interest, if you’ve done a good job with making the panel. And so then we take a look at that pileup of reads. So you might have hundreds of thousands or tens of thousands of reads on any one gene for a sample, and then you look at every base one by one and you say, does every one of those thousands of reads, do they all have an A at this position? What does the reference say? The reference is the sequence, the canonical sequence. What does the reference say? The reference says A, and what percentage of the reads that you have say something that’s not A?
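Mapping followed by a pileup can be sketched on a toy scale. This is a deliberately naive version under stated assumptions: each read is placed by brute force at the offset with the fewest mismatches, whereas the mapping algorithms referred to in the conversation are far more sophisticated and are not named in the episode. The function names and sequences are hypothetical.

```python
from collections import Counter

def map_read(reference: str, read: str) -> int:
    """Return the 0-based offset where the read matches the reference with fewest mismatches."""
    best_offset, best_mismatches = 0, len(read) + 1
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(window, read))
        if mismatches < best_mismatches:
            best_offset, best_mismatches = offset, mismatches
    return best_offset

def pileup(reference: str, reads: list[str]) -> list[Counter]:
    """Count the bases observed at each reference position across all mapped reads."""
    columns = [Counter() for _ in reference]
    for read in reads:
        offset = map_read(reference, read)
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return columns

reference = "ACGTTGCAAGCT"                        # hypothetical reference
reads = ["GTTGCA", "GTTGCA", "GTAGCA", "TGCAAG"]  # third read carries one mismatch
columns = pileup(reference, reads)

# The reference has T at 0-based position 4; one of four reads says A instead.
print(reference[4], dict(columns[4]))
```

Each `Counter` plays the role of one column of the pileup: the question asked at every position is what fraction of the reads disagree with the reference letter.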
So the reads are like the records basically that you’re working with, for that, in this case, the 150 bytes?
A read is the name of the one molecule of DNA.
Sorry. Got it, okay.
Then you look and say well okay, in this sample it looks like they have 10% Ts instead of As. And then we have a set of algorithms that interrogate the evidence. The two hypotheses are this is noise, or this is the signal. And in this case the signal means there’s a variant at this position in this sample, at a certain frequency. So we’re basically evaluating those two hypotheses, and evaluating the evidence in favor of those two hypotheses. So we need to understand the sources of noise that might make a T, and whether the signal is stronger than the noise. That’s kind of fundamentally what we’re doing.
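One simple way to frame the noise-versus-signal question is a binomial tail probability: if every non-reference read were just an error, how likely is it to see this many of them? This is a sketch under a strong assumption, namely a single flat error rate per read, which is a stand-in chosen for illustration. The episode does not describe the actual models, which account for specific sources of noise. The function name and numbers are hypothetical.

```python
from math import comb

def noise_tail_probability(depth: int, alt_reads: int, error_rate: float) -> float:
    """P(at least alt_reads non-reference reads out of depth) if only noise is at work."""
    return sum(
        comb(depth, k) * error_rate**k * (1 - error_rate) ** (depth - k)
        for k in range(alt_reads, depth + 1)
    )

# Hypothetical site: 200 reads, assumed 1% error rate.
strong = noise_tail_probability(depth=200, alt_reads=20, error_rate=0.01)  # 10% of reads say T
weak = noise_tail_probability(depth=200, alt_reads=2, error_rate=0.01)     # 1% of reads say T

print(f"20 of 200 reads: {strong:.3g}")
print(f" 2 of 200 reads: {weak:.3g}")
```

A very small probability means noise alone is an implausible explanation, which favors the variant hypothesis; a large one means the observation is what noise would produce anyway.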
Got it. So I don’t think this is unique, I think there’s a lot of different industries, right, that have to work with poor signal, right? But for some, I’m thinking data cleanliness, where does this noise come from? Was it in the taking of the sample itself? Is there something in the pipeline? Is there a bug in the pipeline or the code that created the Ts that shouldn’t have been there or something like that? What are some of the typical causes of this noise?
It can come from a number of different places. It can come from physical contamination, so that there was a few cells in your test tube that shouldn’t be there. It can come from the process of interpreting the raw signal and getting the sequence out of the raw signal. And there’s a few other places it can come from but those would be two of the more common ones.
Got it, okay. So now there’s a whole data problem, right? Or data science problem, analytics problem, to try to figure out like you said, is this noise? So was there some contamination? Or is this actually correct? This is signal? And this is a variant and this is something that we need to look into and investigate and try to figure out if there’s someone who has this genetic sequence and has this variant and it could make them more prone to have cancer, or something like that?
Exactly, this could be the driver of their cancer.
Could be the driver of their cancer, yes.
Yeah, that’s right. So we take the properties of the data that we already know about and we create models, like any data scientist would. We try to model the sources of noise and the other thing we do is we take public data or proprietary data, where we can identify in advance, what are the positions in the genome that are most likely to have a clinically relevant variant? And those sometimes we call hot spots, and so we want to interrogate those sites especially thoroughly. And we know what some of the alternatives are. So if it’s normally an A, we can look at data from thousands and thousands of previous cancer patients and notice that that A is sometimes changed to C in patients that have one particular cancer.
So we can take that knowledge and now that C becomes an explicit hypothesis that we’re testing against. We’re not just looking for alternative variants in general, we’re looking at the specific allele, at the specific location in that specific cancer type. And because there’s a limited number of such positions, of those hot spot positions, A) we can use it as an explicit hypothesis that we’re looking at as an alternative hypothesis, and B) we can be more aggressive in identifying that as a true variant because we’re not looking at three billion bases anymore, now maybe we’re looking at just hundreds of thousands of those positions.
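Why testing fewer positions allows a more aggressive call can be illustrated with a Bonferroni-style correction. This technique is swapped in here purely for illustration; the episode does not say which statistical correction, if any, is used. The site counts are the round numbers from the conversation and the overall false-positive budget of 0.05 is an assumption.

```python
def per_site_threshold(family_alpha: float, sites_tested: int) -> float:
    """Bonferroni-style p-value cutoff per site for a given overall false-positive budget."""
    return family_alpha / sites_tested

whole_genome = per_site_threshold(0.05, 3_000_000_000)  # every base tested
hotspots = per_site_threshold(0.05, 100_000)            # only known hot spot positions

print(f"whole genome cutoff: {whole_genome:.2e}")
print(f"hot spot cutoff:     {hotspots:.2e}")
```

With the same overall budget, each hot spot site can be called on much weaker evidence than a site in a genome-wide scan.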
Right. So I imagine, the other thing that could be topical here is that there’s a tremendous opportunity, you mentioned there’s the publicly available data, there’s a tremendous opportunity for cooperation between teams. And I know it’s difficult because sometimes some of the work that you’re doing is proprietary, it could lead to a drug that your company’s creating and they spend a lot of money paying for your R&D team. I know your team isn’t specifically maybe focused on this but you know, coming up with drugs that will help to cure cancer, a form of cancer, in a certain way, shape, or form. But just as a human, just as the host of this podcast, I’d love for everyone to be collaborating as much as possible. What is your general sense for how well the bioinformatics community collaborates? Could it be better?
I think it’s a very collaborative community.
Okay.
The first thing is these public databases that I’ve mentioned a lot. There are a lot of public databases, and there’s a huge drive by people to share data publicly. There are a lot of public tools that people build and develop and put into the public domain, and those tools are pretty widely shared within the bioinformatics community. But then there are other ways that we collaborate. For example, my team developed an assay to measure something called tumor mutational burden, which is how many mutations does your tumor have? And this is a biomarker that can help to predict whether a person will respond well to immunotherapy, which is one of the very newest forms of cancer therapy.
We developed this bioinformatics approach along with our panel, and we actually collaborated in a consortium, a public-private consortium run by the Friends of Cancer Research, where many, many companies and other government and academic labs got together and we all shared what we were doing with tumor mutational burden detection, what methods we were using. We all sequenced the same samples, we compared our results, and we really all tried to work together so that we could learn from each other and develop some best practices, and also so that the TMB estimates from all of these different companies would be cross-comparable and could be used side by side. So whichever company’s assay was being used, the numbers that came out could be interpreted in a similar way for every company. There’s actually quite a lot of collaboration of that nature happening.
The other way that collaboration happens is more indirectly, through the scientific literature and through the conferences. And so people publish what they’re doing, what they’re working on, and they learn from each other in that way.
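As a rough illustration of the tumor mutational burden (TMB) biomarker mentioned above, the sketch below uses the common convention of reporting TMB as mutations per megabase of sequenced territory. The interview does not say how any particular assay computes or filters its counts, so the function name, the parameters, and the example numbers are all hypothetical.

```python
def tumor_mutational_burden(n_somatic_mutations, panel_size_bp):
    """Return TMB as mutations per megabase (Mb) of sequenced territory.

    Assumes the mutation count has already been filtered to the somatic
    variants an assay chooses to count; those rules differ between assays
    and are outside the scope of this sketch.
    """
    if panel_size_bp <= 0:
        raise ValueError("panel_size_bp must be positive")
    if n_somatic_mutations < 0:
        raise ValueError("n_somatic_mutations must be non-negative")
    return n_somatic_mutations / (panel_size_bp / 1_000_000)


# Hypothetical example: 18 counted mutations on a panel covering 1.5 Mb.
tmb = tumor_mutational_burden(n_somatic_mutations=18, panel_size_bp=1_500_000)
print(f"TMB = {tmb:.1f} mutations/Mb")  # prints "TMB = 12.0 mutations/Mb"
```

Normalizing by panel size is one reason estimates from panels of different sizes can be placed on the same scale, although making them truly comparable takes the kind of shared samples and methods described above.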
That is, again as a human, as somebody who wants some of these terrible afflictions out there gone, whether it be cancer or the coronavirus or, gosh, the common cold, it’s great to hear that this collaboration occurs. And I think there are probably some lessons learned there for the wider community outside of bioinformatics, other ways that other industries should be collaborating more. I mean, certainly conferences are one, but I think what can be helpful is when you have an approach, sure, you can patent it, but the fact that you’re creating research and literature that can be leveraged by others, I’d like to see data scientists do more of that, talk a little bit more about it. Obviously there’s a fine line between your proprietary work and work that can be shared, but in some cases we’re all wanting to do… If it’s universal, if it can be helpful to all, certainly the open source community has a huge, huge backbone in overall data science, but I’d like to see more of that, not just from a technology standpoint but also in terms of process, in terms of what algorithms, what features pop up in certain cases and for certain use cases. It’d be great to see more of that collaboration; I don’t think there’s enough of it.
Okay, I’m going to get off my soap box here, but it’s great to see that you all are collaborating as much as you are. Well, I have learned a lot, Fiona. I probably could speak with you for another hour about this; this is absolutely fascinating. Today we learned about the world of bioinformatics, what the projects look like, and some of the data challenges in separating signal from noise. Obviously we just talked about collaboration as well, and we talked about something that’s hitting home for all of us with regard to the coronavirus and how companies went about creating the vaccine. I learned why these vaccines are safe. As you know, there are a lot of people out there who are not as well informed, and hopefully they listen to this podcast to understand why these vaccines are safe.
So, Fiona thank you so much for being on the Data Science Leaders podcast. It was a pleasure.
Thank you, it was a pleasure talking to you.
Alright, well everyone have a great rest of your week, and thanks again Fiona. Take care.
Bye, now.
Data Science Leaders is a podcast for data science teams that are pushing the limits of what machine learning models can do at the world’s most impactful companies.
In each episode, host Dave Cole interviews a leader in data science. We’ll discuss how to build and enable data science teams, create scalable processes, collaborate cross-functionally, communicate with business stakeholders, and more.
Our conversations will be full of real stories, breakthrough strategies, and critical insights—all data points to build your own model for enterprise data science success.