One of the core tenets of creating a drug that can cure a disease is to find a target, such as a protein or gene, that will respond positively to the drug. Historically, positive outcomes might’ve been born more out of luck than anything else. Think of Dr Alexander Fleming returning from holiday in 1928, finding a mouldy petri dish of Staphylococcus bacteria in his lab, and seeing that the mould was preventing the bacteria from growing. Fast-forward two decades, and we had industrial-scale penicillin.
Today, drug discovery is not left to chance. It is conducted through meticulous testing in university research labs and pharma company labs all across the world, producing volumes upon volumes of knowledge about, say, how a protein could be associated with a disease, or how a specific gene could be linked to insulin resistance. Organising all that data, and seeing and understanding the potential relationships that could lead to new drug targets, is no mean feat. But it’s a challenge that a team of data scientists at Novo Nordisk is meeting head on.
“The idea of the project is to use knowledge graphs: a specific kind of data structure where entities – real-world objects such as proteins or genes – are connected through links. The graph essentially represents the knowledge that is available right now in the research community. But some of this knowledge is incomplete. So, what we do is apply machine learning to predict new knowledge; to discover new connections between genes and diseases and find potential new drug targets.”
And how does this work? Well, it’s a question of ranking genes according to their relevance to key characteristics such as insulin resistance. Data sources are processed using data engineering and machine learning models to find and highlight specific properties of a gene or a protein. Specific metrics quantify the connections between the different entities, and the results are presented in a knowledge graph to help scientists in the labs choose where to focus their efforts in creating the next generation of medicine.
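To make the idea of ranking genes in a knowledge graph a little more concrete, here is a minimal sketch in Python. It is not the team's method: the article describes graph machine learning models, whereas this sketch swaps in a much simpler heuristic that counts shared neighbours. All entity names, relation names, and the functions `build_neighbours` and `rank_candidates` are hypothetical placeholders, and the toy links are not real biological findings.

```python
from collections import defaultdict

# Hypothetical toy triples (head, relation, tail). Placeholder names only,
# not real biological findings.
TRIPLES = [
    ("GENE_A", "participates_in", "PATHWAY_X"),
    ("GENE_B", "participates_in", "PATHWAY_X"),
    ("GENE_B", "participates_in", "PATHWAY_Y"),
    ("GENE_B", "expressed_in", "TISSUE_T"),
    ("GENE_C", "participates_in", "PATHWAY_Z"),
    ("GENE_D", "associated_with", "insulin_resistance"),  # already-known link
    ("PATHWAY_X", "implicated_in", "insulin_resistance"),
    ("PATHWAY_Y", "implicated_in", "insulin_resistance"),
    ("TISSUE_T", "affected_by", "insulin_resistance"),
]


def build_neighbours(triples):
    """Map each entity to the set of entities it is directly linked to."""
    neighbours = defaultdict(set)
    for head, _relation, tail in triples:
        neighbours[head].add(tail)
        neighbours[tail].add(head)  # treat links as undirected for this heuristic
    return neighbours


def rank_candidates(triples, candidates, target):
    """Rank candidates by how many neighbours they share with the target.

    Candidates already linked to the target are skipped, because the goal
    is to suggest connections that are missing from the graph.
    """
    neighbours = build_neighbours(triples)
    scores = []
    for gene in candidates:
        if target in neighbours[gene]:
            continue  # known link, nothing to predict
        shared = neighbours[gene] & neighbours[target]
        scores.append((gene, len(shared)))
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


ranking = rank_candidates(
    TRIPLES, ["GENE_A", "GENE_B", "GENE_C", "GENE_D"], "insulin_resistance"
)
for gene, score in ranking:
    print(gene, score)
```

In this toy graph, GENE_B comes out on top because it shares three neighbours with the insulin resistance node, while GENE_D is left out because its link is already known. A learned model would replace the shared-neighbour count with a predicted score, but the output has the same shape: a ranked list of candidates for scientists to review.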
Multiply by machine
The knowledge graph project is a testament to the potential of human ingenuity combined with computational power. The core project team comprised only four people, with subject matter experts dropping in and out, working at the intersection of biology, graph machine learning, programming, data science, data engineering, and cloud computing. The team worked mainly in Python and used MLflow on Databricks for experiment and model tracking. To run the project, they also made use of Novo Nordisk's internal computational options, namely NNEEDL, an internal data lake, and Marjorie, a high-performance computing platform.
The focus for this project has mainly been on insulin resistance and other core therapy areas, but the framework is generally applicable to many therapy areas, and therefore useful to researchers all over the world.
“The research we are doing will be published. So, the approaches that we have established can be generalised to other applications. We worked on insulin resistance right now, but the same approach can also be used to predict potential drug targets, if we stay at Novo Nordisk, for obesity, but also for other diseases potentially.”
Tankred Ott, Data Scientist, AI and Analytics Centre of Excellence, Novo Nordisk
And now, the focus is on outcomes. The Knowledge Graph project, which was initiated in 2022, has concluded, and the team is now working on publishing its results and waiting to see the impact the work will have on target prioritisation within Novo Nordisk. Spoiler: the first indications are positive.