Reflections on Inspecting and Visualizing Real-Life Graph Learning Problems

Anas AIT AOMAR · Published in Nerd For Tech · 5 min read · May 27, 2021


Disclaimer: This article and the ones coming soon are reflections that summarize my personal perspective, which I hope to refine with your comments. We are also planning to start an open-source project for inspecting and visualizing large networks in the context of graph learning, and it will feed directly on the ideas in these articles and, of course, on your contributions (comments, DMs, …).

Graph learning problems: Overview and context

Over the last few months I got interested in graph learning problems, and GNNs (Graph Neural Networks) in particular. I like the idea of local message passing and of building latent representations that encode the graph structure. On top of that, many libraries such as DGL or PyTorch Geometric make the model-creation part easy.
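As an illustration of what those libraries make easy, here is a minimal sketch of a two-layer graph convolutional network written with PyTorch Geometric; the layer sizes and class name are my own illustrative choices, not something from a specific project.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)      # first message-passing layer
        self.conv2 = GCNConv(hidden_dim, num_classes) # second message-passing layer

    def forward(self, x, edge_index):
        # Each GCNConv aggregates messages from a node's neighbors, building a
        # latent representation that encodes the graph structure.
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)
```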

Yet graphs are hard to visualize and thus inspect

However, my research on graph neural network applications showed that real-life problems are built on top of large graphs (social networks, knowledge graphs, relational databases). In that case it is hard to visualize the whole graph, whether with simple visualization libraries like NetworkX or Pyvis or with more advanced tools like Gephi, because the graph becomes too complex. Furthermore, users have to move their data to another platform for visualization and inspection, and do some manual work before they can see any results and start debugging their model. This makes it hard for researchers to iterate rapidly or identify problems quickly.

Can we at least ask the right questions?

Figuring out how to deal with these problems and complexities forces us to formulate the problem properly: asking the right questions and locating the pain points in a typical graph learning pipeline.

Why are large graphs hard to visualize?

Graphs are not Euclidean objects, so we need what we call layout methods to project them into a 2D space without losing the local structures in the graph. Many layout methods exist to help us, for example force-based methods that treat the graph edges as a system of springs. There are other approaches such as tree-based methods, but in my opinion force-based methods give a neat visualization when dealing with small-world networks.
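To make the idea concrete, here is a small sketch of a force-based layout with NetworkX and matplotlib; the karate-club graph is just an illustrative small-world-like example.

```python
import networkx as nx
import matplotlib.pyplot as plt

G = nx.karate_club_graph()            # small example graph
# spring_layout treats edges as springs (a force-directed model)
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, node_size=50, width=0.5)
plt.show()
```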

The visualization complexity starts when dealing with large graphs: not because the layout methods cannot render your graph, but because it looks messy once too many edges are rendered. For more information about large-graph visualization problems and solutions, check this article.

Example of a large graph [from the lastfm interactive map, CC BY-NC-SA 2.5 HU]

Visualizing the graph as a whole will not let researchers or practitioners debug easily: we do not have access to local trends unless we zoom in, not to mention the computational cost of keeping the graph interactive.

What does a GL (geometric learning) practitioner want from graph visualization?

A graph learning pipeline looks much like a traditional deep learning pipeline. You start by choosing your model, you feed in your data, and then you look at some metrics, such as how the loss value changes over epochs, which show you in an aggregate way how your model behaves given new data. Any suspicious behavior, such as stagnation in the loss values or a sharp drop, can then be inspected further with more targeted analysis.

An example that comes to mind is MNIST (the MNIST handwritten digit database) classification. You train your model, get your loss-over-epochs curve [see figure below], and you may notice, for example, a stagnation in the first iterations followed by a drop. You investigate further by inspecting the loss curve for each class and see that the stagnation appears in only two classes, say 1 and 7. Could your model be confusing these two classes? To investigate, we apply a dimension reduction to the softmax outputs of our model and visualize the result in 2D space.

(Left) loss values over epochs, overall and per class | (Right) softmax outputs after t-SNE dimension reduction [from a distill.pub article, CC BY 4.0]

And yes, unfortunately the softmax outputs for our two classes overlap before separating in later iterations. Note that this may not resolve itself in other, more complex problems, where a solution could be to add more labeled data for the problematic classes.
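As a rough sketch of that inspection step (not the exact code behind the figure), one could reduce the softmax outputs with t-SNE from scikit-learn and color the points by their true class; the arrays below are random placeholders standing in for a trained model's outputs.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholders: in practice these come from your trained MNIST classifier.
probs = np.random.rand(1000, 10)           # softmax outputs, one row per sample
labels = np.random.randint(0, 10, 1000)    # true digit labels

# Project the 10-dimensional softmax outputs down to 2D.
xy = TSNE(n_components=2, random_state=0).fit_transform(probs)
plt.scatter(xy[:, 0], xy[:, 1], c=labels, cmap="tab10", s=5)
plt.colorbar(label="digit class")
plt.show()
```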

But what about graph learning problems: do this technique and others still work? Yes, maybe, but the question is whether we can do more. In other words, if we put ourselves in the node-level learning context, the node features are not independent points in our space; they have spatial relations through their links, which are represented by the graph structure. So why not build on top of this geometric bias to improve the inspection techniques already used in traditional deep learning problems?

So let's try to build on this bias. First, we have loss values that change during training, but we also have the graph structure, so we can look at the distribution of the error over each region of the graph at each epoch.

[Illustration by author]

As you can see, looking at this loss distribution on top of the graph's structure may help you spot problems with your model or with the data itself. For example, we may detect an error concentration within a specific community of the graph (a data problem) or some bias caused by hub nodes that affects weakly connected nodes (a model problem). These are just examples, but you get the main idea: this kind of visualization can help during inspection.
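Here is a minimal sketch of this "loss over the graph" view, assuming a node-classification setting; the graph, the logits, and the labels below are random placeholders for your own model's outputs at a given epoch.

```python
import torch
import torch.nn.functional as F
import networkx as nx
import matplotlib.pyplot as plt

G = nx.karate_club_graph()                     # placeholder graph (34 nodes)

# Placeholders: in practice the logits come from your GNN at a given epoch.
logits = torch.randn(G.number_of_nodes(), 4)   # per-node class scores
labels = torch.randint(0, 4, (G.number_of_nodes(),))
per_node_loss = F.cross_entropy(logits, labels, reduction="none")

# Color each node by its loss on top of a force-based layout of the graph.
pos = nx.spring_layout(G, seed=0)
nx.draw(G, pos, node_color=per_node_loss.numpy(),
        cmap="viridis", node_size=80, width=0.3)
plt.show()
```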

But getting this type of visualization requires a full rendering of the graph in order to localize the regions where the error distribution is anomalous. As discussed before, this can be hard in the case of large graphs.

Solution?

To reduce the complexity of rendering the graph, we could use some kind of embedding that preserves local structure (preferably a featureless embedding, as DeepWalk produces, for example), together with a dimension reduction so we can view the embeddings in a 2D plane. This way we keep the goal of the previous visualization: after identifying local regions with anomalies, the practitioner can select that part and visualize the related subgraph without the computational burden.

[Illustration by author]
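Here is a hedged sketch of that embedding view: a DeepWalk-style, featureless embedding built from plain random walks and gensim's Word2Vec, then reduced to 2D with PCA. The walk length, number of walks, and embedding size are arbitrary illustrative choices, and the karate-club graph again stands in for your own data.

```python
import random
import networkx as nx
from gensim.models import Word2Vec
from sklearn.decomposition import PCA

G = nx.karate_club_graph()                      # placeholder graph

def random_walk(graph, start, length=10):
    """Uniform random walk used to build DeepWalk-style 'sentences'."""
    walk = [start]
    for _ in range(length - 1):
        neighbors = list(graph.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return [str(n) for n in walk]               # Word2Vec expects string tokens

walks = [random_walk(G, n) for n in G.nodes() for _ in range(20)]
w2v = Word2Vec(walks, vector_size=32, window=5, min_count=1, sg=1, epochs=5)

# Featureless node embeddings, then a 2D reduction for the lightweight view.
embeddings = [w2v.wv[str(n)] for n in G.nodes()]
coords = PCA(n_components=2).fit_transform(embeddings)
```

From the 2D coordinates, an anomalous region can be selected and only its subgraph rendered in full, which keeps the interactive view cheap.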

Conclusion

In this short article, I tried to reflect on the state of graph visualization for graph learning problems. It has to be objective-driven, in the sense that we start from our objective, which could be a loss function, graph expressiveness, or something else, and then personalize the inspection and analysis. But we also need caution when dealing with large graphs, where visualizing the whole structure is complex. Hence we start from simple views (the embedding view in our case) to detect anomalies, and then investigate the corresponding subgraph.
