Random Sum-Product Networks: A Simple and Effective Approach to Probabilistic Deep Learning

Is structure learning even necessary? By generating overparametrised, random model structures for sum-product networks, a class of tractable probabilistic graphical models, Peharz et al. manage to obtain models that are competitive with far more complex generative learning approaches as well as with discriminative approaches, whilst yielding well-calibrated probabilities and being robust to missing features.

## Sum-Product Networks

A sum-product network (SPN) is a deep probabilistic graphical model that compactly represents a joint probability distribution over a set of random variables $\mathbf{X}$ [Poo11S]. An SPN consists of

• leaf nodes that represent (marginal) distributions of random variables,
• sum nodes that represent a weighted sum (convex combination) of their children,
• product nodes that represent factorisations over their children,
• edges that connect nodes to form a directed acyclic graph.

The nodes ultimately connect to a root node which computes the desired joint probability distribution over $\mathbf{X}$. SPNs can also be used to represent discriminative models, in which case there can be more than one root node (e.g. one per predicted class).

To render inference tractable, two important structural constraints are typically imposed in SPNs:

• completeness: For every sum node, the children being summed must all be functions of the same subset of random variables.

• decomposability: For every product node, the children must be functions of disjoint subsets of random variables.

Importantly, these properties enable all marginalisations over subsets of the random variables to be reduced to marginalisations at the leaves, such that the desired quantities can still be computed in time linear in the size of the representation.

Intuitively, sum nodes represent a mixture of alternatives for the same set of random variables, and we can interpret each child as computing the probability of one of the values of a latent variable that the sum is marginalising over. Product nodes increase the scope of variables being considered by assuming (conditional) independence.

## Random and Tensorised Sum-Product Networks (RAT-SPNs)

The core idea of RAT-SPNs is to randomly construct an overparametrised, deep SPN structure satisfying completeness and decomposability, and to subsequently learn the model’s parameters using standard methods within a tensor-based deep learning framework [Peh20R]. By retaining generative semantics, the resulting models can satisfy desirable properties such as the ability to generate well-calibrated probabilities (that reflect empirical frequencies) or the ability to soundly handle missing data.

### Generating Randomised Model Structures

As an intermediate representation for the model structure, a region graph is created. A region graph is a bipartite tree comprised of region nodes and partition nodes. In essence, the region graph is constructed by recursively splitting the set of random variables $\mathbf{X}$ into two non-empty subsets of approximately equal size at random, representing the resulting subsets of $\mathbf{X}$ as region nodes and representing the splits that produced them explicitly as partition nodes. The root node is thus a region node representing the full set $\mathbf{X}$. In contrast to other region nodes, the root node is split multiple times in order to produce a variety of different decompositions of the full set of random variables. The construction algorithm is parametrised by

• $R$, the number of times the splitting algorithm is applied to the root node
• $D$, the depth (number of recursive splits).

A region graph is translated into an SPN structure as follows:

• For every leaf node (each being a region node), we create $I$ parametrised input distribution nodes in the SPN, each of which represents a probability distribution over the random variables that are contained in the leaf’s region. Specifically, the input distributions are Gaussians for real-valued variables and categorical distributions for discrete variables.

• For every partition node, we construct the set of all pairs of SPN nodes that were created for its children and add a product node connecting them for each pair.

• For the other region nodes, we create a number of sum nodes in the SPN: In the case of the root node, we create $C$ sum nodes, otherwise we create $S$ sum nodes. Every sum node computes the weighted sum of all SPN product nodes that were created for its partition node child.
For density estimation applications, we set $C=1$ such that the single SPN output corresponds to the probability/density to be estimated, and for classification, we set $C$ to the number of classes in question.

The translation thus introduces additional parameters $I$, $C$ and $S$.

Figure 1 shows a randomly generated region graph along with the corresponding SPN structure. The full set of seven random variables was split twice ($R=2$) with depth two ($D=2$), resulting in two levels of partition nodes (shaded rectangles). The SPN outputs are given by three sum nodes ($C=3$), which would be a suitable for a classification use case involving three classes.

The construction mechanism results in an SPN of depth $2D$ which satisfies completeness and decomposability.

### Learning

With the model structure pre-determined, the parameters $w$ of the SPN (i.e. the weights of sum nodes and the parameters of the input distributions) can be learnt straightforwardly. Given a set of i.i.d. observations of the random variables under consideration, we can formulate a corresponding optimisation problem, which we can then solve using standard methods.

For generative learning, we can maximise the log-likelihood using the EM algorithm for SPNs, which is free from hyperparameters that would require tuning and can be implemented via simple forward and backward evaluations in order to compute the required sufficient statistics.

For discriminative learning (classification), we can minimise the cross-entropy, or, alternatively, a hybrid generative-discriminative loss $\mathrm{H}(w)$, which trades off the cross-entropy $\mathrm{CE}(w)$ and the log-likelihood $\mathrm{LL}(w)$ via $$\mathrm{H}(w) = \lambda\ \mathrm{CE}(w) - (1-\lambda) \frac{\mathrm{LL}(w)}{|\mathbf{X}|}.$$ This can be beneficial in the case of missing values. In either case, the authors applied the Adam optimiser to perform the optimisation.

In the typical deep learning fashion, RAT-SPNs are usually configured to be overparametrised (selecting the generation parameters accordingly), and early stopping is used in order to control overfitting. The authors additionally implement an SPN-specific randomised dropout mechanism that further helps to control overfitting:

• Dropout at an input, which simulates a missing value, can be realised by setting the input distribution node to 1.
• Dropout at a sum node, which simulates the absence of a mixture component, can be realised by setting the respective child node’s value to 0.

The authors implemented RAT-SPNs in TensorFlow in order to leverage GPU acceleration, performing all computations in the log domain.

## Experimental Results

The authors provide an extensive evaluation, comparing the performance of RAT-SPNs to relevant alternative methods on a variety of benchmarks. The key results are:

• On generative learning tasks, RAT-SPNs produce results that are largely on par with those achieved by state-of-the-art SPN structure learning algorithms. Even the highly sophisticated structure learning algorithm ID-SPN was signifiantly better only in 7 out of 20 experiments. This result calls into question the importance of structure learning; If model compactness is not important, we can, apparently, do without in many cases.
Figure 2. Subfigure (a) shows the impact of discarding fractions of the input pixels on the accuracy of MNIST (top) and Fashion-MNIST (bottom) classifiers. RAT-SPNs are more clearly more robust to missing inputs than the MLP+ model using dropout, especially as the value of the coefficient $\lambda$ controlling the importance of the discriminative loss component is reduced.
Subfigure (b) shows the ability of RAT-SPNs to detect outliers/out-of-sample data. For RAT-SPN ($\lambda=0.2$) and MLP classifiers that were trained on MNIST, histograms of the input likelihoods of data points from three digit datasets were computed. In addition to MNIST, the Street View House Numbers (SVHN) dataset and the Semeion Handwritten Digit dataset were considered. In RAT-SPNs, the input probabilities can be computed by averaging over the $C$ model outputs. In MLPs, we can technically do the same, noting that the result does not correspond to a proper likelihood (hence the term “mock-likelihood”). Note that the RAT-SPN (top) is able to perfectly separate MNIST from Semeion, and less than 1% of the SVHN data points overlap with the MNIST data. By contrast, the MLP (bottom) achieves no meaningful separation. [Peh20R]
• In discriminative tasks, deep neural networks were considered as the obvious competitors. The authors explored multi-layer perceptrons of different sizes (varying both depth and width), matching the size of RAT-SPNs by adjusting the generation parameters accordingly, and then selecting the best size for each type of model via cross-validation. They show that RAT-SPNs are on par with multi-layer perceptrons (MLPs) that were trained using dropout. Even when comparing against more advanced MLPs that use batch normalisation and Xavier initialisation (MLP+), RAT-SPNs are sigificantly outperformed in only 2 out of 7 classification experiments.

Notably, RAT-SPN classifiers that were trained with the hybrid objective can leverage their generative semantics, as shown in Figure 2:

• RAT-SPNs can gracefully handle missing data.
• RAT-SPNs “know what they don’t know”: They can detect out-of-sample data/outliers by computing the likelihood of the input data. RAT-SPNs are thus able to provide another channel that can inform our decisions on whether to trust their predictions.

While the retention of generative reasoning capabilities through the hybrid objective is ultimately a trade-off, we do observe sweet spots where the discriminative performance hit is almost negligible. Thus, given their mostly competitive performance on the classification tasks themselves, the above aspects can render RAT-SPNs an attractive alternative to purely discriminative models in cases where the additional reasoning capabilities are deemed relevant.