Matchmaking experiments using ML and Graph Embeddings

Link prediction for R&D collaborations

Posted on: May 15, 2020. Last updated May 18, 2020

This article and its analyses were written by Pedro Parraguez and Duarte O.Carmo

TL;DR (Summary)

What if we could use the network structure of co-authoring collaborations to identify untapped collaboration opportunities? This is what this post is about. To achieve this, we combine traditional machine learning methods (e.g. neural network regression) with graph embeddings (node2vec).

In the results, we show the pairs of researchers with the highest collaboration deltas, i.e. the cases where we found the biggest differences between the number of collaborations our model predicted between two researchers and the actual collaborations between them. The key idea is: when we predict a much larger number of co-authoring collaborations than what we see in the data, we have good grounds to believe that those authors have potential for increased collaboration.

This is part two of a two-part article about mapping and matchmaking coronavirus-related R&D. For the first part and additional context, please read part one here.

A Jupyter notebook with all the analyses and additional material is available here.

Duarte O.Carmo wrote some additional reflections about the machine learning part of the process here.


As we saw in the first part of this post, the additional speed and scale of research during a crisis can significantly increase the complexity of an R&D ecosystem. One crucial element here is how to balance this increase in speed and scale with the need for efficiency and effectiveness, avoiding too many unnecessary duplicated efforts and information silos.

One way of supporting this balancing act is through evidence-based recommendations for collaborations, i.e. the deployment of link prediction models based on hard data about the researchers and their collaboration history. From a management and research policy perspective this could, for example, help to identify areas where a re-allocation of resources is most likely to improve the chances of forming impactful collaborations.


Over time, researchers build a history of co-authoring collaborations. The reasons behind those collaborations are multiple. For example, we know that collaborations are more likely to occur when certain key "distances" are low (e.g. (1), (2), (3)). This includes distances such as:

  • The geographical distance, e.g. physical distance between the places where two researchers work.

  • The institutional distance, e.g. the overlap between the affiliations of two given researchers.

  • The knowledge distance, e.g. the overlap of research topics between two given researchers.

  • The relational distance, e.g. the overlap of common co-authors between two given researchers.

Here we use historical information about these distances and their relation to collaboration to develop a machine learning model that describes how these distances affect the number of effective collaborations. We then apply such a model to identify pairs of researchers that are collaborating less than what the model predicts and flag those results as areas of potentially underdeveloped collaboration.
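
As a rough illustration of what an overlap-based distance can look like, the sketch below computes a Jaccard distance between the co-author sets of two researchers. The Jaccard measure, the function name and the sample data are assumptions made for this example; the article does not state which overlap measure was actually used.

```python
def jaccard_distance(a, b):
    """Return 1 - |a ∩ b| / |a ∪ b|: 0.0 for identical sets, 1.0 for no overlap."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets count as maximally distant.
        return 1.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical co-author sets for two researchers (illustrative data only).
coauthors = {
    "researcher_a": {"r1", "r2", "r3"},
    "researcher_b": {"r2", "r3", "r4", "r5"},
}

relational_distance = jaccard_distance(
    coauthors["researcher_a"], coauthors["researcher_b"]
)
print(f"relational distance: {relational_distance:.2f}")
```

The same function could be applied to sets of affiliations or research topics to approximate the institutional and knowledge distances.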

More details about the overall data model are available here.

In this context, network visualizations can provide us with a useful graphical way of thinking about the distances we described above. For example, the animation below shows the co-authoring network for our coronavirus dataset, where each node is one author (we are showing the largest cluster, which contains 14,000+ researchers) and each link is a set of co-authoring relationships (1 or more co-authored documents).

In this case, the animation shows how the network layout unfolds the structure to provide visual clues about the relative distance between researchers.

Network visualizations are two-dimensional representations of an n-dimensional object. As a result, the relative distances they represent are imperfect, and there are multiple possible representations of a given network object.

For this reason, they should only be used as a rough visual approximation of the actual network structure, not as if they were a map of physical distances in a two-dimensional space. Instead, it is better to rely on network metrics and other multidimensional representations (which is what we do in this article).

Our methods in a nutshell

  1. Our starting point is the graph database and model that we originally applied to the description of the R&D ecosystem. For this part of the article, we mostly focus on the network of researchers and its attributes.

  2. We then apply a set of graph algorithms to gather additional information about each node (researcher) and each relation (collaboration) in our network. This includes global influence and centrality algorithms (e.g. PageRank and Betweenness Centrality) as well as algorithms designed for link prediction, which are applied at the level of each relationship between a pair of researchers.

  3. Having the network enriched with additional network metrics on nodes and relations, we create data extracts in the form of node lists (a table with all the attribute data for each researcher) and an edge list (a table with all the pairs of collaborating researchers and attribute data for each collaboration).

  4. At this point, we can move directly into link prediction with machine learning or add an additional step: node2vec. Adding information from node2vec allows us not only to acquire local data about each node and the relations between those nodes, but also to include information about the way a node is "embedded" within the whole network (a sort of coordinate for each node that provides a more accurate proxy for the distance between two given researchers, whether they are directly connected or not). This is useful because, otherwise, the machine learning model is blind to the global structure of the network.

  5. We added an optional visualization step to validate the results from the node2vec embeddings using Google's Embedding Projector. We will share some examples below in the results section.

  6. Using the vectors produced by node2vec, we calculate the "similarity" or "distance" between each pair of nodes in the network. This is used as one of the many independent variables that become features in our prediction model.

  7. Finally, we run multiple machine learning methods, using as inputs the node-level and relation-level information gathered in the previous steps. The aim is to get a predicted value for the number of collaborations that we can then compare against the actual value (for details see our Analyses Documentation section and the Jupyter notebook). We then use the trained models to get predicted collaboration values for all the researcher pairs in the network that already have at least one collaboration, as well as a full set of predictions for all possible pairs for one researcher at a time.
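
To make steps 2 and 3 more concrete, here is a minimal sketch of how pair-level link prediction features could be computed from an edge list and collected into a feature table. It uses plain Python rather than a graph database, and the Adamic-Adar and common-neighbours scores are picked only as familiar examples of link prediction metrics. The edge data, the column names and the choice of metrics are all assumptions for illustration, not the pipeline used in the notebook.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical edge list: (researcher, researcher, number of co-authored documents).
edges = [
    ("ana", "ben", 3),
    ("ana", "carla", 1),
    ("ben", "carla", 2),
    ("carla", "dev", 1),
    ("dev", "eli", 4),
]

# Build an undirected adjacency map and a lookup of collaboration counts.
neighbours = defaultdict(set)
weight = {}
for u, v, w in edges:
    neighbours[u].add(v)
    neighbours[v].add(u)
    weight[frozenset((u, v))] = w


def common_neighbours(u, v):
    """Number of co-authors that u and v share."""
    return len(neighbours[u] & neighbours[v])


def adamic_adar(u, v):
    """Shared co-authors count for more when they have few co-authors themselves."""
    shared = neighbours[u] & neighbours[v]
    # A shared neighbour is linked to both u and v, so its degree is at least 2
    # and the logarithm is always positive.
    return sum(1.0 / math.log(len(neighbours[z])) for z in shared)


# One row per researcher pair, including pairs that have never collaborated.
rows = []
for u, v in combinations(sorted(neighbours), 2):
    rows.append({
        "source": u,
        "target": v,
        "common_neighbours": common_neighbours(u, v),
        "adamic_adar": adamic_adar(u, v),
        "n_collaborations": weight.get(frozenset((u, v)), 0),
    })

for row in rows[:3]:
    print(row)
```

In a table like this, the feature columns would play the role of inputs to the prediction model and the collaboration count would be the value to predict.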


We describe the data sources here and provide access to the raw data here.

In the following sections, we describe part of the work we did to turn the original data into our input source for the machine learning models and the node2vec graph embedding.

Additional Details about our Analyses

To read a more detailed explanation of each step in the process please visit the analyses documentation page.


The analyses and a more detailed documentation of our results are available here. In this section we only share a few highlights.

Node2vec results

Node2vec and the embedding projector allow us to have the first glimpse into our results. This method also helps us have a more intuitive understanding of what later on becomes a key feature to train our machine learning models.
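
As background on where these embeddings come from: node2vec learns its vectors from biased random walks over the graph, which are then fed to a word2vec-style model. The sketch below covers only the walk-sampling step, following the return parameter p and in-out parameter q from the published node2vec algorithm. The toy graph, the function name and the parameter values are assumptions for illustration, and the embedding training itself is left out.

```python
import random


def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Sample one second-order biased random walk of up to `length` nodes."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        candidates = sorted(adj[current])  # sorted for reproducibility
        if not candidates:
            break  # dead end: the node has no neighbours
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(candidates))
            continue
        previous = walk[-2]
        weights = []
        for nxt in candidates:
            if nxt == previous:
                weights.append(1.0 / p)  # step back to the previous node
            elif nxt in adj[previous]:
                weights.append(1.0)      # stay within the previous node's neighbourhood
            else:
                weights.append(1.0 / q)  # move further away from the previous node
        walk.append(rng.choices(candidates, weights=weights, k=1)[0])
    return walk


# Hypothetical co-authoring graph as an adjacency map.
adj = {
    "ana": {"ben", "carla"},
    "ben": {"ana", "carla"},
    "carla": {"ana", "ben", "dev"},
    "dev": {"carla", "eli"},
    "eli": {"dev"},
}

walk = node2vec_walk(adj, "ana", length=8, p=1.0, q=0.5, rng=random.Random(42))
print(walk)
```

With q below 1 the walk is nudged outward, which tends to capture broader network structure; with q above 1 it stays closer to the starting neighbourhood.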

In the figure below we show a projection of our node2vec embeddings where each node is one researcher and the "physical" distance between each node is an approximation of the relational distance between two researchers (in this case the approximation simply uses PCA). The colours show the results of the Louvain community detection algorithm.

In the second figure, we see how each time we click on a researcher it shows the group of researchers closer to that person using cosine similarity (note that the visual layout provided by PCA does not always correspond to the results obtained via cosine similarity).
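
The nearest-neighbour lookup described here can be sketched in a few lines: given one embedding vector per researcher, rank the other researchers by cosine similarity. The three-dimensional vectors and names below are made up for illustration; real node2vec embeddings typically have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, one vector per researcher.
embeddings = {
    "ana": [0.9, 0.1, 0.0],
    "ben": [0.8, 0.2, 0.1],
    "eli": [0.0, 0.1, 0.9],
}


def most_similar(name, k=2):
    """Return the k researchers whose embeddings are closest to `name`."""
    scores = [
        (other, cosine_similarity(embeddings[name], vector))
        for other, vector in embeddings.items()
        if other != name
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:k]


nearest = most_similar("ana")
for other, score in nearest:
    print(f"{other}: {score:.3f}")
```

Because cosine similarity is computed in the full embedding space, its ranking can differ from what the distances in a two-dimensional PCA plot suggest.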

ML results

We have made available a more detailed list of our results in this Jupyter notebook and in more narrative form on this page. Here we will only focus on two aspects of our results: the most salient features/variables detected during the training of the machine learning models, and some results of our prediction models.

Key features/variables

As we expected, the variables/features identified by the model as the best predictors of collaboration intensity include network metrics at the node and relationship level (we also tested a wide range of non-network metrics coming from author metadata, but those inputs did not have much impact on the model).

At the node level, the network metrics that most consistently appeared across the models as good predictors include:

At the relation level, the network metrics that most consistently appeared across the models as good predictors include:

Prediction results - examples

As we previously explained, the idea is that pairs of researchers with the highest "collaboration deltas" are the most likely to have opportunities for increased collaborations/exchanges.
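
A minimal sketch of how such a ranking could be produced is shown below. It assumes the delta is computed as predicted minus actual, so that large positive values flag pairs collaborating less than the model expects; the field names and numbers are invented for illustration.

```python
# Hypothetical actual and predicted collaboration counts for pairs that
# involve one focal researcher.
pairs = [
    {"target": "researcher_x", "actual": 1, "predicted": 6.4},
    {"target": "researcher_y", "actual": 5, "predicted": 5.2},
    {"target": "researcher_z", "actual": 0, "predicted": 2.1},
]

# Assumed sign convention: positive delta = fewer collaborations than predicted.
for row in pairs:
    row["delta"] = row["predicted"] - row["actual"]

# Largest deltas first: these are the candidate "untapped" collaborations.
ranked = sorted(pairs, key=lambda row: row["delta"], reverse=True)

for row in ranked:
    print(f'{row["target"]}: delta = {row["delta"]:.1f}')
```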

In order to test our model, we picked one researcher in the network, Professor "Luis Enjuanes" of the Autonomous University of Madrid, and ran the collaboration prediction pipeline against all other authors in the network.

Some results, ordered by collaboration delta, can be seen in the table below.

"rweight" is the number of actual collaborations, "rweight_neural_reg" is the predicted number of collaborations (produced in this case by the Neural Network Regression model) and "delta" is the difference between the actual and the predicted numbers.

Some important notes regarding the results:

  • The number of real-world co-authoring collaborations is not always well reflected in the input data. Some reasons for this divergence include:

    • Not all the produced documents are indexed and appear in the data source.

    • Sometimes we have more than one ID for the same person. For example, the same person can use variations of their name on different papers, or the parsing/indexing process can lead to errors.

  • As a result of the previous points, we have situations where we see a big collaboration delta, but in reality, if we were to combine all the IDs for the same researcher, the delta might be smaller or insignificant.

  • Despite the issues pointed out above, one advantage of our method is that, besides being useful as a way to identify collaboration deltas, it can also be applied to identify author duplication problems and other consistency issues within the data.

Learnings and next steps

This work was an internal proof-of-concept to test a few ideas connected to link prediction in the context of scientific collaboration. The results are encouraging and demonstrate that collectively we have the data and the computational methods necessary to increase our ambition level regarding data-driven collaboration support tools.

In this process, we have identified the following next steps:

  • We should run this process iteratively, using a first round of the method as a mechanism to identify duplicates and other issues in the input data. We are currently developing an improved pipeline that integrates node deduplication and merging strategies as an additional step of the process detailed here.

  • On-the-ground validation is essential to fine-tune the ML models. We are seeking opportunities to work alongside researchers that appear in our database to review the results and identify the models that produce the best results.

  • Depending on the objective and challenges that trigger the desire to explore new collaboration opportunities, the parameters that matter the most are likely to change. For example, a clear and urgent gap for highly specific knowledge in a well-defined technical area would require prioritizing a search space based on that specific expertise gap. In such a situation, the algorithm should focus on the topical layer and then provide recommendations that rank results based on the shortest overall distance. In contrast, more open challenges, where the problem and solution spaces are less clear, where there is more time available, and where the ambition is to find creative or unexpected solutions, can reduce the importance of geographical and relational distance and allow for more unusual connections based on, for example, topical connections that have not been tried yet but that sit in sufficiently close "neighbourhoods".
