Documentation of our analyses
Additional notes about our analyses
Once the data was exported in CSV format, it contained a total of 166,329 rows and 48 attributes.
Every row corresponds to a connection between two nodes, p1 and p2. The attributes of each row can be grouped as follows:
- Data related to the nodes themselves (i.e. node metrics): ID, PageRank, total citations, organisational affiliations, betweenness centrality in the network, etc.
- Data related to the relationship between these nodes, derived from the network (i.e. network metrics): topic similarity, the Louvain community they belong to, the resource allocation score, etc.
One preprocessing step was also applied, numeric type conversion: columns stored as floating point numbers but holding integer values (for example, the ID column) were converted into integer columns.
In order to work with the Node2Vec approach (described in the forthcoming Methods part), the data was first compiled into an “edgelist” format, where every row describes the IDs of two nodes of the network and their weighted edge, or link.
Every row contains the following:
node1_id_int node2_id_int <weight_float, optional>
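As a minimal sketch of producing this format (the CSV column names node1_id, node2_id, and rweight are illustrative assumptions, not the exact names in the exported dataset):

```python
import csv
import io

def rows_to_edgelist(csv_text):
    """Convert exported CSV rows into "node1 node2 weight" edgelist lines.

    Column names here (node1_id, node2_id, rweight) are assumed
    placeholders for the real exported columns.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # IDs were exported as floats, so cast through float to int.
        lines.append(f"{int(float(row['node1_id']))} "
                     f"{int(float(row['node2_id']))} "
                     f"{float(row['rweight'])}")
    return "\n".join(lines)

sample = "node1_id,node2_id,rweight\n12.0,57.0,1.0\n12.0,98.0,2.0"
print(rows_to_edgelist(sample))  # 12 57 1.0 / 12 98 2.0
```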
The data made available from neo4j amounted to 24,675 distinct node IDs, with an average edge weight of 1.0.
The main goal of the link prediction part of this study was to find reliable ways of predicting the probability of a collaboration occurring between two researchers. In the described dataset, this amounts to predicting the continuous variable “rweight” from the rest of the attributes.
Since the value we are predicting is known, this constitutes a supervised learning problem; more specifically, a regression problem.
In order to build several prediction techniques, we first need to split the data into a training and test set.
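A sketch of this split using scikit-learn (the 80/20 ratio and random seed are illustrative assumptions, as the report does not state them):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real attribute matrix X and the rweight target y.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Hold out 20% of the rows as a test set (ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 16 4
```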
To predict this probability, three regression techniques were used:
A linear regression fits a linear model that multiplies each attribute by a coefficient w. Mathematically, it minimizes the sum of squares between the observed targets and the predicted ones.
This method solves the following optimization problem:
min_w ||Xw - y||_2^2
where w is the vector of coefficients, X the matrix of observations, and y the observed values.
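A minimal sketch of this least-squares fit with scikit-learn, using synthetic data (the real attributes and target are the dataset's, not shown here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data where the true relation is y = 2*x0 + 3*x1 + 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] + 3 * X[:, 1] + 1

# Ordinary least squares recovers the generating coefficients exactly
# because the synthetic data is noiseless.
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # ~[2. 3.], ~1.0
```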
ElasticNet is a linear regression model that combines feature elimination with feature coefficient reduction. The basic premise of ElasticNet is that it can drop some of the features in our dataset entirely (a property of the Lasso model), while also reducing the impact of features that are less relevant for predicting the target variable (a property of the Ridge model).
The method solves the following optimization problem:
min_w (1 / (2n)) ||Xw - y||_2^2 + alpha * rho * ||w||_1 + (alpha * (1 - rho) / 2) * ||w||_2^2
where n is the number of samples, and alpha and rho (or l1_ratio) need to be predetermined. To estimate these parameters, a 5-fold cross-validation was used.
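A sketch of this parameter search with scikit-learn's ElasticNetCV on synthetic data (the candidate l1_ratio grid is an illustrative assumption):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic data: y depends on features 0 and 2 only, plus small noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 0] - X[:, 2] + rng.normal(scale=0.1, size=200)

# 5-fold cross-validation selects alpha and l1_ratio from the candidates.
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X, y)
print(model.alpha_, model.l1_ratio_)
```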
Finally, the last technique used to predict collaboration probability was a neural network regression.
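The report does not specify the network's architecture; as a hedged sketch, scikit-learn's MLPRegressor (a multi-layer perceptron) serves as a minimal neural network regressor, with the single 32-unit hidden layer being an illustrative choice:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic, mildly non-linear target to motivate a neural network.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = X[:, 0] ** 2 + X[:, 1]

# One hidden layer of 32 units; the architecture is an assumption,
# not the one actually used in the study.
model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000,
                     random_state=0).fit(X, y)
preds = model.predict(X[:5])
print(preds.shape)  # (5,)
```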
Node2Vec is a technique and framework originally developed at Stanford University by A. Grover and J. Leskovec in 2016 that learns continuous feature representations for nodes in networks. Put simply, it transforms a weighted network of nodes and edges into a tabular representation in which each node is a vector of N dimensions, while preserving a wide set of network characteristics.
This implementation allows the creation of such vector representations of nodes (i.e. node embeddings) from a list of nodes and their corresponding weighted edges. It also lets us choose the number of dimensions the final representation will have.
The variable of prediction interest is the number of co-authored documents between two researchers. In the dataset, this column is named “rweight”; it describes the number of such collaborations between two nodes (p1 and p2).
Let us look at the distribution of this variable in our dataset:
Rweight has a mean of 1.25, a standard deviation of 0.99, and a 75th percentile of 1.00. As shown in the figure, this means the variable is heavily skewed: most pairs of authors collaborate only once. However, there are outliers; for example, one row has an rweight of 56, meaning those two authors co-authored 56 documents together.
One additional metric introduced is the cosine similarity between nodes. Taking the node embeddings described in the second part of the study (LINK), it is possible to calculate the cosine similarity between two nodes by applying the cosine similarity function to their embedding vectors.
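For two embedding vectors u and v, this similarity is the dot product divided by the product of their norms; a stdlib-only sketch:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```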
Since we are looking to predict the number of collaborations between authors, let us investigate some of the strongest correlations between rweight and the other attributes in the dataset:
The figure below is a correlation heatmap of our dataset.
Looking closely, the attributes that have a higher correlation with the number of collaborations (rweight) are:
- resourceAllocationscore (0.50 correlation score)
- preferentialAttachmentscore (0.32 correlation score)
- adamicAdarscore (0.29 correlation score)
These are all metrics derived from the original neo4j network (LINK). This leads us to believe that some of the calculated network metrics will be very important in describing the collaboration between two authors.
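These rankings can be reproduced with pandas; a sketch on a tiny fabricated stand-in for the dataset (only the column name "rweight" comes from the text, the values and other names are illustrative):

```python
import pandas as pd

# Tiny hypothetical stand-in for the real 166,329-row dataset.
df = pd.DataFrame({
    "rweight":                 [1, 1, 2, 3, 1, 5],
    "resourceAllocationscore": [0.1, 0.2, 0.4, 0.6, 0.1, 0.9],
    "topic_similarity":        [0.5, 0.4, 0.4, 0.6, 0.3, 0.5],
})

# Pearson correlation of every attribute with rweight, strongest first.
corr = df.corr()["rweight"].drop("rweight").sort_values(ascending=False)
print(corr)
```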
The first model applied, the linear regression, produced the following results:
Moreover, when looking at the main features and their coefficients (figure below), we can note that the main features used by the linear regression are:
- Resource Allocation Score
- Cosine similarity
- Topic similarity
Here are the metrics after fitting an ElasticNet with cross-validation:
We can also identify the main features used by this model by plotting its coefficients:
By applying the Neural Network Regression, the following results were achieved:
Moreover, we can also observe how the model fits the test set:
In addition to the ML methods previously described, we also ran our analyses using Google AutoML Tables. We decided to do this in order to evaluate Google's service, but also as a way to validate our results against a more "packaged" ML service.
The figure below shows a summary of the results in one of our early tests.
Taking the edge list of our network, the node2vec algorithm was run successfully. An important detail is that a weighted edge list was fed into the algorithm.
python src/main.py --input corona.edgelist --output corona_128.emd --weighted --dimensions=128
The output of running this algorithm is a table where every row corresponds to a node and every column corresponds to the embedding value (a float) in one dimension.
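The reference node2vec implementation typically writes this table in the word2vec text format: a header line with the node count and the number of dimensions, then one line per node. A stdlib-only sketch of reading it back (the two-node sample contents are fabricated for illustration):

```python
def parse_embeddings(text):
    """Parse word2vec-style embedding text into {node_id: [floats]}."""
    lines = text.strip().splitlines()
    # Header: "<number_of_nodes> <number_of_dimensions>".
    n_nodes, n_dims = map(int, lines[0].split())
    embeddings = {}
    for line in lines[1:]:
        parts = line.split()
        embeddings[int(parts[0])] = [float(x) for x in parts[1:]]
    assert len(embeddings) == n_nodes
    return embeddings

# Fabricated two-node, three-dimension example of the output format.
sample = "2 3\n7 0.1 0.2 0.3\n9 0.4 0.5 0.6"
emb = parse_embeddings(sample)
print(emb[7])  # [0.1, 0.2, 0.3]
```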