Science in Times of Crisis: Mapping the Ecosystems Behind Coronavirus R&D

A data-driven overview of research and development efforts related to Coronaviruses. Includes data from the 1970s to date, with published R&D outputs on SARS, MERS and COVID-19.

Posted on: April 5, 2020

Motivation

The world's response to the COVID-19 outbreak has activated efforts on multiple interdependent fronts. One of the most visible and dramatic examples of this is the response led by the healthcare sector and its workers. They are our frontline in this war and the ones that sadly endure the heaviest casualties.

Also crucial is the role played by those behind our logistic systems and those deploying and maintaining critical technologies (e.g., ventilators). They keep our essential services and supply lines running, and we can think of them as our "engineer corps" in this crisis.

Finally, we have those relentlessly working on the research and development of new drugs, treatments, epidemiological models, containment policies, and many other R&D areas needed to respond to this outbreak. They are our "intelligence service" and the ones that we trust to develop the new weapons and tools badly needed by our frontline. For most, this last group is almost invisible, but how efficient and effective they are makes a big difference in the length as well as the costs (human and otherwise) of the COVID-19 crisis. They are the focus of this post.

This article and its dashboards are an attempt to advance our understanding of the complex network of global R&D efforts related to Coronaviruses and how science is made during emergencies. The central idea here is what can be described as "R&D capability mapping". With this mapping exercise we seek to describe some of the key characteristics of the evolving ecosystem of Coronavirus R&D research, a system that is spread across multiple locations, organizations, topics and individuals.

A second, follow-up post focuses on what can be described as R&D matchmaking, including prescriptive/predictive analytics. In that post we sketch out additional tools that leverage machine learning and graph embeddings, and that are aimed at improving the efficiency and effectiveness of Coronavirus-related R&D efforts through contextual data-driven insights.

UPDATE May 1st 2020: We are happy to be one of @axios sources for their special report about the science of pandemics. That article is available here.

About the author

Pedro Parraguez (PhD) is a researcher on complex socio-technical systems and a startup founder. He worked for 7 years at the Engineering Systems Group of Danmarks Tekniske Universitet (DTU) and has been the CEO of Dataverz ApS in Copenhagen since 2019. His previous research includes analyses of the worldwide biofuel R&D ecosystem, sustainable energy production R&D, the Danish cleantech industry, and the design process of complex engineering systems, as well as work on national and international R&D ecosystem indicators.

Through Dataverz, he has worked with clients that include Novozymes (Denmark), the Nordic Institute for Studies in Innovation, Research and Education (NIFU, Norway), the Technical University of Berlin (Germany), the OPERA project (OPEn Research Analytics, DK), and the Andes Pacific Technology Access HUB (HUB-APTA, Chile).

Key contributors to this article include Dataverz's CTO Nelson Guaman (MSc) and Duarte O. Carmo (MSc), Business and Analytics Consultant at Jabra.

Challenge and Key Questions

R&D mapping is primarily an exploratory approach. However, we also have specific challenges and questions that we seek to address with this exercise. The overall challenge is to provide an enhanced overview of the worldwide stock of Coronavirus-related R&D capabilities as well as its dynamics. A key objective is to offer data-driven decision support that policy makers and others in positions of power can use to make more informed decisions.

Some of the key questions that we seek to address in the sections below include:

What are the nature and dynamics of the R&D ecosystem responses to Coronavirus outbreaks? What can we learn from those responses?
What are the key knowledge clusters, R&D capabilities and their evolution?
Is there a topical difference between the different coronavirus outbreaks?
What type of indicators and visualizations can we use to better understand the spatio-temporal characteristics of R&D related to Coronaviruses? What about indicators and visualizations for organizations and researchers?
Finally, what are some key takeaways that we can derive from the exploration and analysis of this data?

Data and Methods

Data

Instead of looking exclusively at COVID-19, we decided to analyze R&D outputs related to all Coronaviruses. This includes SARS and MERS as well as hundreds of other reported viruses in this family. One reason for this is how recent COVID-19 is, which restricts the amount of R&D outputs available to analyze. The main reason, though, is that the inherent R&D capabilities we have to fight COVID-19 today are the result of many years of accumulated related research. Since the research community and relevant skills directly connected to previous Coronaviruses are arguably closer to COVID-19 than any other areas, it made sense to define the relevant dataset in this way.

A more complete analysis could expand the boundaries and add to this dataset epidemiology, vaccine development and other areas, including outputs that are not directly connected to Coronaviruses. However, such broadening of the dataset might also make the interpretation of the results much harder.

The two main data sources used for this work include:

Microsoft Academic Graph (MSA) API: Subset of 10.000+ coronavirus R&D outputs, with data from 1970, including patents and scientific publications.
The Global Research Identifier Database (GRID): Used to enrich the organization data extracted from MSA.

After cleaning and enriching the data we extracted a total of 1.100+ organizations and 26.700+ researchers spread across 90+ countries and 700+ cities.

We have made the data and other resources available here.

Model

Our data model connects a document (our research output) with its topics, terms, organizations and authors. Likewise, authors are connected to organizations (their affiliations at the moment of producing the research output) and the location of those organizations. The overall model is shown below in figure 1.

Once we connect our entities (nodes), we end up with 60.500+ entities and 600.000+ relations between those entities. Once we implement the model, each document becomes a small network in its own right. Figure 2 below illustrates part of the network formed by a single document.

As we include more documents, the network grows and patterns start to emerge. The figure below illustrates part of the network generated when we retrieve two documents with shared co-authors.
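The way documents turn into a growing network can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Document` record, the node labels and the relation names (`HAS_TOPIC`, `AUTHORED_BY`, `AFFILIATED_WITH`, `LOCATED_IN`) are hypothetical, the two documents are made up, and none of it is the actual schema of our graph database.

```python
from collections import namedtuple

# Hypothetical record for one research output; field names are illustrative.
# Each author entry is (author name, organization, city) at publication time.
Document = namedtuple("Document", "doc_id topics authors")


def document_to_graph(doc):
    """Turn one document into a small network of nodes and typed relations."""
    doc_node = ("Document", doc.doc_id)
    nodes = {doc_node}
    edges = set()
    for topic in doc.topics:
        topic_node = ("Topic", topic)
        nodes.add(topic_node)
        edges.add((doc_node, "HAS_TOPIC", topic_node))
    for name, org, city in doc.authors:
        author = ("Author", name)
        organization = ("Organization", org)
        location = ("City", city)
        nodes.update([author, organization, location])
        edges.add((doc_node, "AUTHORED_BY", author))
        edges.add((author, "AFFILIATED_WITH", organization))
        edges.add((organization, "LOCATED_IN", location))
    return nodes, edges


def merge_graphs(graphs):
    """Union the per-document networks; shared nodes are stored only once."""
    all_nodes, all_edges = set(), set()
    for nodes, edges in graphs:
        all_nodes |= nodes
        all_edges |= edges
    return all_nodes, all_edges


# Two made-up documents that share one co-author ("A. Author").
doc1 = Document("doc-1", ["virology"],
                [("A. Author", "Org X", "City P"), ("B. Author", "Org Y", "City Q")])
doc2 = Document("doc-2", ["genetics"],
                [("A. Author", "Org X", "City P")])

nodes, edges = merge_graphs([document_to_graph(doc1), document_to_graph(doc2)])
print(len(nodes), len(edges))
```

With these two documents the merged network holds 10 nodes and 9 relations, because the shared author, organization and city are stored once. That overlap is what links the two documents into a single network.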

Methods

Following our model, we loaded the data as a network into a graph database (Neo4j) which was later connected to an interactive dashboard (Tableau) to facilitate the rapid visual exploration of the results.

The network model allows querying the data from multiple entry points (e.g., author and its context, regional dynamics, topical evolution, etc). Having the data modeled as a graph also allows calculating network metrics that better describe relevant features (e.g., centrality, influence and brokerage metrics) about the authors, cities, organizations, etc.
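As a rough illustration of what such network metrics look like, the sketch below computes two simple ones on a tiny co-authorship network: the number of distinct collaborators (degree) and a brokerage proxy that counts pairs of collaborators who are not directly linked to each other. Both the proxy and the author lists are assumptions made for this example. They are not necessarily the metrics or the data behind our dashboards.

```python
from collections import defaultdict
from itertools import combinations


def coauthor_network(author_lists):
    """Build an undirected co-authorship network as {author: set(neighbors)}."""
    neighbors = defaultdict(set)
    for authors in author_lists:
        for a, b in combinations(sorted(set(authors)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def degree(neighbors, node):
    """Number of distinct collaborators of a node."""
    return len(neighbors[node])


def open_pairs(neighbors, node):
    """Brokerage proxy: pairs of collaborators with no direct link between them."""
    return sum(
        1
        for a, b in combinations(sorted(neighbors[node]), 2)
        if b not in neighbors[a]
    )


# Made-up author lists: "broker" links two groups that never co-publish.
papers = [["broker", "a1", "a2"], ["broker", "b1", "b2"], ["a1", "a2"]]
net = coauthor_network(papers)

print(degree(net, "broker"), open_pairs(net, "broker"))
```

Here "broker" has four collaborators and sits between four pairs that are otherwise unconnected, whereas "a1" has no open pairs at all. A graph database or a network library would compute richer versions of these measures at scale.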

Results: Mapping Coronavirus R&D

In what follows we will guide you through what we believe are some of the most salient findings of our R&D capability mapping exercise. You can access the main dashboard view directly and explore the data at your own pace here. The last time we updated the underlying data was on March 11th, 2020.

Important notes:

1) We do not have Coronavirus-specific knowledge. Our relevant strength is in data modeling, visualization and analyses of R&D research outputs in general. These results and dashboards are meant to serve just as a complementary source of information. Also, our interpretations are provided only as a subjective (although data-driven) reference, and should not be taken as expert COVID-19 advice.

2) Many relevant R&D outputs are not publicly published and/or indexed by our data sources. The ~ 10.000 documents in our database are a subset of the total, much larger document universe containing relevant R&D outputs. However, after running queries in alternative databases, we estimate that this sample set provides a reasonably good window into the worldwide coronavirus-related R&D efforts, especially when it comes to formal R&D outputs that are openly published as patents or peer-reviewed research.

3) We strived to improve and clean the data as much as possible, taking care not to introduce our own biases. However, some mistakes might still remain within our database. If you find issues or have any concerns please let us know and we will try to fix them.

4) This post is not a scientific article. To compress the information and make it more approachable I'm skipping some steps and leaving out a more detailed description of the methods and findings. We might add additional information and refine the analyses later on depending on the feedback we receive.

R&D Ecosystem Overview

Research on Coronaviruses represents only a tiny percentage of total worldwide research: in 2019 it accounted for less than 0,02% of all the medical papers published that year. Even so, the number of actors involved and the breadth of knowledge make it a fairly complex socio-technical system in its own right, a system that has many thousands of interconnected parts and no single actor able to hold or understand them all.

We developed the main screen of our dashboard as a first entry point to the R&D system and its context. The dashboard allows the simultaneous exploration of the people, organizations, geographical coverage, topics, evolution and connections between them within the same screen. The aim of this view is to quickly get a feeling for some of the main trends and actors, so that we can decide where to focus our attention in more detail later on.

We will present more specific takeaways in the sections below. What is worth noting at this point is that the number of R&D outputs connected to Coronaviruses per year seems to clearly follow four distinct periods that overlap with the 3 major Coronavirus outbreaks:

1) the pre-SARS period that lasted until 2002, where there was no particular outbreak but many isolated incidents with different strains of Coronaviruses found mostly in animals,

2) the "SARS period", directly connected to the SARS outbreak that started in Asia in November 2002 and reached its peak in 2003,

3) the "MERS period", connected to the MERS outbreak that started in November 2012 and that most intensely affected the Middle East and South Korea, and

4) the current COVID-19 period (2020).

From figure 5 below, we can also see that the overall trajectories of R&D outputs in medicine and on Coronaviruses are rather different. Outputs in medicine show an almost uninterrupted upward trend (which is normal, as every year more researchers and resources are added to the worldwide R&D system). In contrast, Coronavirus outputs are driven and rapidly activated by outbreaks.

Whenever a new outbreak hits, new research resources are rapidly added to the baseline amount of work in the area, some of which remain for a period of time after each outbreak. This shows the type of response capacity that exists to face new crises, something that can be thought of as research elasticity. Clear evidence of this response capacity is the rapid increase of research outputs between January and March of 2020 in response to COVID-19. In less than three months in 2020 we have the equivalent of almost 4 years of prior Coronavirus outputs. Of course, this counts only quantity and not quality, but it is still a remarkable demonstration of the flexibility and ability to refocus efforts within the R&D ecosystem.

Topics

To identify relevant R&D outputs on Coronaviruses, and later explore all related topics within those outputs, we leveraged the Shen, Zhihong, et al. (2018) method for concept discovery, concept-document tagging, and concept hierarchy generation. In brief, this method uses a combination of a) multi-label text classification, taking as a core training set Wikipedia and b) topic hierarchy construction, using word embeddings plus graph link analysis of the underlying graph structure.

In plain English, this means that whenever a document/research output is classified, the process inspects the text at multiple levels. For example, following figure 7, an article can be hierarchically classified at level "0" within "Biology", at level "1" within "genetics", at level "2" within "virus", at level "3" within "RNA", and at level "4" within "RNA-dependent RNA polymerase". This hierarchical classification is important because it later allows more nuanced analyses of topic similarity and of the potentially untapped combinatorial possibilities that exist between researchers, research groups, technologies, etc.

In our case, for the Coronaviruses subset, this analysis led to a total of more than 5.500 topics divided into 6 hierarchical levels. The first level includes topics such as medicine and biology, whilst the last level includes very specific topics such as "mycophenolic acid", "vancomycin resistant enterococcus", and "reverse transcription loop mediated isothermal amplification".

Figure 6 below shows an animation with a breakdown of the topics in their different hierarchies and across each of the 4 main R&D periods described above. Additionally, we also ran the Louvain method for community detection to identify topic clusters in a network based on topic co-occurrence in documents.
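To make the idea of hierarchy-aware topic similarity more concrete, here is a minimal sketch. It assumes each topic can be written as a path from level 0 downwards and measures similarity as the number of levels two paths share from the top. This shared-depth measure is our own simplification for illustration. It is not part of the Shen et al. method, and the second path is made up.

```python
def shared_depth(path_a, path_b):
    """Count how many hierarchy levels two topic paths share from the top."""
    depth = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        depth += 1
    return depth


# The first path follows the example in the text; the second is a made-up
# path that diverges below "virus".
rdrp = ("biology", "genetics", "virus", "RNA", "RNA-dependent RNA polymerase")
other = ("biology", "genetics", "virus", "DNA")

print(shared_depth(rdrp, other))
```

The two paths share their first three levels, so a flat keyword comparison would miss how closely related they are. A path-based view is itself a simplification, since a topic can have more than one parent.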

The community detection method led us to identify five large topic clusters: 1) "Virology", 2) "Symptoms and epidemiology", 3) "Genetics", 4) "Biochemistry", and 5) "Other viruses and Vet. Medicine", plus three smaller groups. The names for each of the large clusters were determined using as guidance the higher levels of the hierarchy plus a qualitative inspection of the members within each cluster. This, in addition to the simplified hierarchical topical structure shown in figure 7, as well as the evolution of the topic clusters in figure 8, provides an overview of the main knowledge blocks involved in the study of Coronaviruses.
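The input to this kind of community detection is a weighted network where two topics are linked whenever they are tagged on the same document. The sketch below shows one way such a co-occurrence network could be built. The topic tags are made up, and the Louvain step itself is not shown, since it would normally be delegated to a graph library or database.

```python
from collections import Counter
from itertools import combinations


def topic_cooccurrence(doc_topics):
    """Count, for each pair of topics, the number of documents tagged with both."""
    weights = Counter()
    for topics in doc_topics:
        for pair in combinations(sorted(set(topics)), 2):
            weights[pair] += 1
    return weights


# Made-up topic tags for three documents.
docs = [
    ["virology", "genetics"],
    ["virology", "genetics", "epidemiology"],
    ["epidemiology", "virology"],
]
weights = topic_cooccurrence(docs)

for (topic_a, topic_b), weight in sorted(weights.items()):
    print(topic_a, topic_b, weight)
```

Each entry is a weighted edge between two topics. Community detection then looks for groups of topics that are more densely connected to each other than to the rest of the network.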

Some key highlights that can be derived from the study of the topics include:

Many areas of knowledge come together when it comes to the study and the development of solutions for Coronaviruses. What is required to create treatments, vaccines, diagnostic tests and appropriate containment measures combines applied medicine, fundamental biology, chemistry, genetics, virology and many other fields of study, which also include topics in computer and social sciences. One of the key challenges when it comes to managing so many knowledge fields is to strengthen the efficiency and effectiveness of cross-boundary work. In other words, in times like these we need the best possible knowledge interfaces, so that we can increase the speed and positive impact of new solutions.
Despite the relatively clean depiction of topics and hierarchical structures offered in the figures below, the reality is much more complex and hard to represent visually. Many cross-cluster connections exist, showing the bridges through which we build cross-domain solutions. What is more, the hierarchical structures evolve and their shape is more like a complex network than a linear tree. This means that strict, static taxonomies built by experts from the top down will often fall short of what is needed to understand the complex dynamics of knowledge production. This also means that traditional indicators are insufficient to describe some of the key characteristics of the research outputs and of the processes that produce them.
We tested whether the proportion of documents in each of the topic clusters experienced statistically significant changes over time. The result is that, at the level of the main clusters, their proportion remains largely unchanged. This might be a sign that the proportion of the macro building blocks used to tackle Coronaviruses has not changed much across the different outbreaks (the proportion of topics further down in the hierarchy does change, though). One noticeable exception is the new topic cluster concerning "Data Science" that appears timidly in 2015 and that is already showing signs that it might experience significant growth in the coming months.
Although at the aggregate level the proportion of documents on each topic cluster does not seem to change much over time, we do see clear signs of topical specialization at the geographical and organizational levels.
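One way a test of changing cluster proportions could be set up, shown here only as an illustration, is a chi-square test of independence on a table of document counts per period and topic cluster. The counts below are invented for the example and are not our data, and the choice of test is an assumption rather than a description of the exact procedure behind the result above.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if cluster shares were identical in every period.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Invented counts: rows are periods, columns are topic clusters.
# The second period has twice the volume but the same cluster shares.
counts = [
    [30, 20, 50],
    [60, 40, 100],
]
statistic, dof = chi_square_statistic(counts)
print(statistic, dof)
```

In this invented table the cluster shares are identical in both periods, so the statistic is zero even though the volume doubles. A statistic above the critical value for the given degrees of freedom (about 5.99 for 2 degrees of freedom at the 5% level) would instead point to a shift in proportions.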

Click here to open an interactive version of figure 6

Geographical Analysis

We used country and city-level information about the authors' affiliations to connect each R&D output to one or more places in the world. As a result, we can explore the geographical variable in connection with time, topics, R&D output volume, or any other variable in our model. In what follows, to avoid branching out in too many directions, we will focus primarily on the geographical evolution, but using the main dashboard, you can also explore some of the other angles yourself.

Figures 9, 10 and 11 below show a few different views of the evolving geographical distribution of R&D work around the world. Some of the main takeaways include:

Each period has a distinctive geographical footprint. Although the general trend of R&D becoming more global over time is also true for Coronavirus R&D, what we see here is that the main countries and cities involved in each period are heavily connected to the places most affected by each outbreak. What is more, at the beginning of each outbreak we see a sharp local response that is then followed by a more global response.
During the first pre-SARS years, we had primarily Western R&D activity on Coronaviruses. However, since the mid-90s and certainly from SARS onwards, the East has ramped up its local R&D activity (we will see more about specific collaboration in the next section). This meant that its research capabilities were relatively "fresh" and able to deploy a fast response to the current COVID-19 crisis. If we compare this with the case of Ebola or other crises hitting countries with less developed R&D capabilities, we can start to imagine how much worse things could have been in Asia if the current local capabilities had not been in place.
The "glocalization" of Coronavirus R&D: We can borrow the glocalization concept, usually used in business literature, to describe part of the dynamics reported in the previous two points. "Glocalization indicates that the growing importance of continental and global levels is occurring together with the increasing salience of local and regional levels". This is particularly important when it comes to something like a virus, which starts locally and whose trajectory is heavily dependent on how it is first managed at that level, but which ultimately requires global strategies and collaboration to contain and mitigate its effects.

R&D Organisations

Looking within each country and city, we can find the R&D organizations working on Coronaviruses and their collaboration patterns, which we summarize in figures 12 to 15 below. Some of the key takeaways of these views are:

As with the analysis of geography, we see the heavy influence on the organizations' ranking of the original location of each outbreak and the places where it later spread. Once an organization appears, it tends to "stick around", but its ranking (in terms of number of research outputs) can experience significant fluctuations. This is particularly the case between the pre-SARS period and the following periods, where previously dominant organisations such as Utrecht University and the University of Southern California are rapidly superseded by "newcomers" such as the University of Hong Kong and the Chinese Academy of Sciences.
Collaboration clusters (identified based on the co-occurrence of organisations in documents) tend to be based on geographical closeness. One key broker that escapes this tendency and is able to consistently bridge geographical boundaries is the CDC (US Centers for Disease Control and Prevention). Historically, the CDC seems to be one of the few "high-bandwidth" actors connecting Asia to the West, hence it plays a crucial role in information flow and overall network connectedness. Another noticeable actor is the University of Hong Kong, which has a very central position in the overall network and, alongside the CDC, has one of the most international collaboration profiles.
Some of the most recognized R&D organizations worldwide barely appear in the network. For example, the University of Cambridge, Harvard and Stanford, which often top the rankings in the fields of medicine and biology, have a relatively low number of records in our database. This is probably due to a combination of the very niche nature of this type of research, the lack of local drivers (outbreaks) for those institutions and possibly the slower response time that a larger and more prestigious institution might have due to longer-term strategic commitments.

Researchers

Without going into the analysis of specific people, which is much more sensitive and would require a more detailed and cautious analysis, what we want to introduce here are examples of what can be done when looking at the data at this higher level of detail. In what follows we provide a short summary of the main takeaways for each of the three figures below:

Fig 16, a one-page summary profile for each researcher: Each person in the database has a sort of unique fingerprint based on the organisations, location, topics, collaborators and the time when each research output was published. One potential application for this profile is to use it to quickly characterise a person or a group of people that could be potential collaborators (or our own research group). However, a more interesting and scalable use is to treat the fingerprint as a mathematical object (a vector or matrix) that can be employed to create clusters and offer recommendations and other predictive analytics. In our follow-up post we will dive deeper into this.
Fig 17, identifying collaboration brokers using the network structure of co-authorships: A frequent problem when assessing the impact and productivity of a given researcher is that the metrics used typically end up being something like number of citations and/or number of research outputs. Although these metrics are certainly relevant, they are very narrow, and do not account for the multiple types of roles that a researcher can take. For example, some researchers spend energy and time building bridges between different knowledge areas and research groups. By doing that, they create value by facilitating critical information flows and enabling recombinant innovation. However, this activity can be riskier and takes energy that could be used writing within a narrower field, and hence possibly getting more papers out. In figure 17 we see how, in general, the more research outputs a researcher has, the higher their brokerage indicator. People above the diagonal tend to focus their collaborations within narrower research groups that exist in tighter clusters. In turn, people below the diagonal tend to produce research outputs collaborating with a more diverse group that goes beyond their own natural neighbors. This brokerage profile, alongside the more general profile described before, can be used in times of crisis as a way of identifying those that are in a better position to act as bridges between different knowledge areas and institutions, something that can be used to accelerate the response rate and facilitate drawing on resources and capabilities that might not be locally available.
Fig 18, collaboration dynamics as a crisis evolves: Examining the average number of people per publication led us to an interesting finding: the number of collaborators on a research output seems to change during each outbreak. At the beginning of each outbreak the average number of people participating in a document starts relatively low, then peaks towards the end of the outbreak and then goes down again. This might be an indication of a natural trade-off between the response time available and the number of collaborators that can be efficiently managed. In other words, at the beginning of an outbreak, smaller teams with faster reaction times and lower coordination overheads are assembled to study and report results as fast as possible. Later, as the crisis matures, larger teams can come together to provide a more comprehensive response, including not only more people but also more organisations.
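The idea of treating a researcher's fingerprint as a mathematical object can be sketched as follows. We assume, purely for illustration, that a fingerprint is a sparse count vector over the topics and organizations of a researcher's outputs, and that two researchers are compared with cosine similarity. The feature choice, the similarity measure and the three researchers below are all assumptions made for this example, not the profile used in our dashboards.

```python
import math
from collections import Counter


def fingerprint(outputs):
    """Build a sparse count vector from (topics, organization) pairs, one per output."""
    vector = Counter()
    for topics, organization in outputs:
        for topic in topics:
            vector[("topic", topic)] += 1
        vector[("org", organization)] += 1
    return vector


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up researchers: r1 and r2 share topics and organization, r3 shares nothing.
r1 = fingerprint([(["virology", "genetics"], "Org X")])
r2 = fingerprint([(["virology", "genetics"], "Org X")])
r3 = fingerprint([(["economics"], "Org Z")])

print(round(cosine_similarity(r1, r2), 3), cosine_similarity(r1, r3))
```

Identical profiles score close to 1 and profiles with nothing in common score 0. Once profiles are vectors, standard clustering and recommendation techniques can be applied to them, and richer fingerprints could add locations, collaborators and time.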

Final Reflections

As we saw above, the complex nature of a research ecosystem like the one behind R&D related to Coronaviruses, combined with the rich data and the number of analytical possibilities available, provides ample space to develop new indicators and decision-support tools. Despite these possibilities, the adoption rate of this type of data-driven approach is low, which can lead to suboptimal decisions when it comes to resource allocation and the orchestration of global responses in times of crisis.

To finish, I would like to offer a few final reflections based on the data and the points made above:

"Normal" science is a slow, incremental process that requires time and team efforts. The reason why we can see relatively fast responses in times of crisis is the lengthy and mostly invisible process that happened before, and that allowed enough R&D capabilities and collaborative social capital to be built. At the same time, it is easy to destroy crucial capabilities, either because a key organisation in the ecosystem is left without enough resources or because the wrong short-term incentives are put in place. For this reason we should protect and better understand our critical research ecosystems, so that they can be made more resilient.
Although we did not get to look at individual researchers at a high level of detail, one very interesting possibility of this type of approach is the analysis of inter-temporal brokerage, i.e., the characterization of the individuals and the dynamics behind the passing of knowledge over time. Just like regional or institutional brokerage, one could argue that there are individuals who end up performing a crucial role directly passing knowledge and experience from the past (e.g. previous outbreaks) to the future. Unfortunately, it is hard to identify those individuals using standard measures.
We can learn some important lessons from systems engineering that apply both to the challenge of manufacturing thousands of new ventilators in record time and to the orchestration of complex R&D ecosystems. When it was suddenly clear that there would be a severe shortage of ventilators, the first impulse of many countries was to demand that some of their big manufacturers take immediate action and go from manufacturing cars, vacuum cleaners and aerospace components to ventilators. Soon thereafter, it became clear that this is not that easy, not because it is technically impossible, but because no company manufactures any complex device in isolation. Instead, things are done in small manageable modules and the pieces are sourced from a myriad of suppliers and then carefully integrated and tested. As a result, the latest and most successful emergency strategies to build ventilators are based on going back to the key principles of systems engineering and supply chains, using modularity and complexity management as a way to design solutions that can be created from pre-existing modules. Likewise, in the R&D process of vaccines, treatments and other time-critical solutions, we need to leverage our pre-existing solution modules as much as possible and use data science to help us be better system orchestrators. The key is to provide contextual insights about the best combinations of capabilities available to us based on our relational, geographical and knowledge bases.
We are gaining local R&D capabilities in multiple parts of the world. This is important for the speed at which we can tackle crises. If the virus had emerged within a country with fewer resources than China, we would probably be significantly worse off. This shows the importance of “glocal” science.
The CDC (USA) has so far played a crucial role as an institution with high global outreach and capabilities. However, in a world where the USA has turned increasingly inwards and away from its traditional international leadership, we need to make sure that in the future we still have an institution with the research muscle and resources necessary to complement the role of the CDC at the world level. Such an institution could be an independent CDC spinoff that is globally funded, a European institute or an R&D organization connected to the WHO.

You can learn more about me and our work at www.dataverz.net and at www.parraguezr.net
