The goal of density estimation is to find an estimator $\hat{p}(\mathbf{x})$ of an unknown density $p(\mathbf{x})$, based on i.i.d. data points $\mathbf{x}_i$ drawn from that distribution. Traditional approaches to the task use histograms or kernel density estimators. However, these do not scale well to higher dimensions due to the curse of dimensionality. More recent neural network-based approaches to density estimation have yielded promising results. Normalizing Flows, for instance, transform a Gaussian distribution into the target distribution via a diffeomorphism, and the change of variables formula can be used to compute the density on the target space [Pap21N]. In order to ensure that the learned mapping is bijective, Normalizing Flows are subject to strong architectural constraints. Autoregressive flows, for example, impose an artificial ordering and dependence between the variables. Furthermore, the change of variables formula requires input and output to have the same dimension, so it is impossible to use a latent space of lower dimension than the data space.
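For reference, denoting the learned diffeomorphism that maps data $\mathbf{x}$ to the base variable $\mathbf{z}$ by $f$ (notation introduced here for illustration), the change of variables formula reads

$$ p_{x}(\mathbf{x}) = p_{z}(f(\mathbf{x})) \left\vert \det \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} \right\vert. $$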
To overcome these limitations, [Liu21D] propose the use of Generative Adversarial Networks (GANs) for density estimation. The GANs directly transform variables from the latent space into the data space, in contrast to learning the parameters of an invertible transformation, and they allow the latent and data spaces to have different dimensions. The idea of the proposed Roundtrip framework is to approximate the target distribution as a convolution of a Gaussian with a distribution induced on a manifold by transforming samples of the base distribution. The transformations to and from the data space are represented by the generators $G$ and $H$ of two GANs.
Consider two variables $\mathbf{z} \in \mathbb{R}^m$ and $\mathbf{x} \in \mathbb{R}^n$, where $\mathbf{z}$ has a known distribution $p_{z}(\mathbf{z})$ and the target distribution $p_{x}(\mathbf{x})$ is unknown. The forward and backward mappings are learned using two neural networks in a bidirectional GAN architecture, known from the field of computer vision [Zhu17U]. If the base variable has a lower dimension, the approximation lives on a manifold embedded in $\mathbb{R}^n$ with intrinsic dimension $m < n$.
$$ G(\mathbf{z}) = \tilde{\mathbf{x}},~H(\mathbf{x}) = \tilde{\mathbf{z}} $$
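To make the setup concrete, the following is a minimal sketch of $G$ and $H$ as fully connected networks in PyTorch; the layer sizes and dimensions are illustrative assumptions, not the configurations used in the paper.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    # Simple fully connected network; purely illustrative, the paper's
    # architectures differ per experiment.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.LeakyReLU(0.2),
        nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
        nn.Linear(hidden, out_dim),
    )

m, n = 2, 10      # latent and data dimensions; m < n is allowed here
G = mlp(m, n)     # forward mapping  G: z -> x_tilde
H = mlp(n, m)     # backward mapping H: x -> z_tilde

z = torch.randn(64, m)   # samples from the Gaussian base distribution
x_tilde = G(z)           # generated data samples
z_tilde = H(x_tilde)     # mapped back to the latent space
```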
The authors assume a Gaussian error in the approximation and use a standard Gaussian base distribution $p_{z}(\mathbf{z}) = \mathcal{N}(\mathbf{z}; \mathbf{0}, \mathbf{I}_m)$.
$$ \tilde{\mathbf{x}} = \mathbf{x} + \epsilon,~\epsilon \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}) $$
Under this assumption and choice of base distribution, the conditional distribution $p_{x \mid z}(\mathbf{x} \mid \mathbf{z})$ is Gaussian, and the target density can be obtained via marginalization over the base distribution. The resulting integral is then evaluated using either importance sampling or a Laplace approximation [Liu21D].
$$ \begin{eqnarray*} p_{x} (\mathbf{x}) & = & \int p_{x \mid z} (\mathbf{x} \mid \mathbf{z})\, p_{z} (\mathbf{z})\, d \mathbf{z}\\ & = & \left( \frac{1}{\sqrt{2 \pi}} \right)^{m + n} \sigma^{- n} \int \exp \left( - \frac{\Vert \mathbf{z} \Vert^2_{2} + \sigma^{- 2} \Vert \mathbf{x} - G (\mathbf{z})\Vert^2_{2}}{2} \right) d \mathbf{z} \end{eqnarray*} $$
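The importance sampling variant can be sketched as follows: latent samples are drawn from a proposal centered at $H(\mathbf{x})$ and the integrand, reweighted by the proposal density, is averaged over them. This is a simplified sketch under assumed settings (proposal shape, sample size, numerical details), not the paper's exact estimator.

```python
import math
import torch

def log_density_is(x, G, H, sigma, n_samples=1000, proposal_std=1.0):
    # Importance-sampling estimate of log p_x(x) for a single data point x
    # (1D tensor of length n). Proposal q(z) = N(H(x), proposal_std^2 I);
    # the concrete proposal and sample size here are illustrative choices.
    with torch.no_grad():
        mu = H(x)                      # centre the proposal at the backward mapping
        m, n = mu.shape[-1], x.shape[-1]
        z = mu + proposal_std * torch.randn(n_samples, m)

        # log p_z(z): standard Gaussian base density
        log_pz = -0.5 * z.pow(2).sum(dim=1) - 0.5 * m * math.log(2 * math.pi)

        # log p_{x|z}(x|z) = log N(x; G(z), sigma^2 I)
        resid = x.unsqueeze(0) - G(z)
        log_px_z = (-0.5 * resid.pow(2).sum(dim=1) / sigma**2
                    - 0.5 * n * math.log(2 * math.pi * sigma**2))

        # log q(z): proposal density
        log_q = (-0.5 * (z - mu).pow(2).sum(dim=1) / proposal_std**2
                 - 0.5 * m * math.log(2 * math.pi * proposal_std**2))

        # log-mean-exp of the importance weights for numerical stability
        log_w = log_pz + log_px_z - log_q
        return torch.logsumexp(log_w, dim=0) - math.log(n_samples)
```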
In the forward GAN, the generator $G$ tries to generate samples $\tilde{\mathbf{x}}$ that resemble the true data $\mathbf{x}$, while the accompanying discriminator $D_x$ tries to distinguish generated from real samples. In the backward GAN, the generator $H$ maps data points to the latent space such that they resemble draws from the base distribution; its discriminator is denoted by $D_z$. The losses for both GANs are as follows:
\begin{align} \mathcal{L}_{\text{GAN}}(G) &= \mathbb{E}_{\mathbf{z}\sim p_{z}(\mathbf{z})} \left( D_x(G(\mathbf{z})) - 1 \right)^2 \\ \mathcal{L}_{\text{GAN}}(D_x) &= \mathbb{E}_{\mathbf{x} \sim p_{x}(\mathbf{x})}\left( D_x(\mathbf{x}) - 1 \right)^2 + \mathbb{E}_{\mathbf{z} \sim p_{z}(\mathbf{z})} D_x^2(G(\mathbf{z})) \\ \mathcal{L}_{\text{GAN}}(H) &= \mathbb{E}_{\mathbf{x}\sim p_{x}(\mathbf{x})} \left( D_z(H(\mathbf{x})) - 1 \right)^2 \\ \mathcal{L}_{\text{GAN}}(D_z) &= \mathbb{E}_{\mathbf{z} \sim p_{z}(\mathbf{z})}\left( D_z(\mathbf{z}) - 1 \right)^2 + \mathbb{E}_{\mathbf{x} \sim p_{x}(\mathbf{x})} D_z^2(H(\mathbf{x})) \end{align}
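In code, these least-squares objectives could look roughly as follows. This is a sketch reusing the $G$, $H$ definitions from the snippet above; `D_x` and `D_z` are assumed to be networks mapping their inputs to a scalar per sample.

```python
def gan_losses(G, H, D_x, D_z, x_real, z_prior):
    # Least-squares GAN objectives for both directions; D_x and D_z are
    # assumed to map their inputs to a scalar per sample.
    x_fake = G(z_prior)
    z_fake = H(x_real)

    loss_G   = (D_x(x_fake) - 1).pow(2).mean()
    loss_D_x = (D_x(x_real) - 1).pow(2).mean() + D_x(x_fake.detach()).pow(2).mean()
    loss_H   = (D_z(z_fake) - 1).pow(2).mean()
    loss_D_z = (D_z(z_prior) - 1).pow(2).mean() + D_z(z_fake.detach()).pow(2).mean()
    return loss_G, loss_D_x, loss_H, loss_D_z
```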
Furthermore, a Roundtrip loss is added to ensure that data points transformed back and forth stay close to their initial value. The goal is for $\mathbf{x}^{\prime}$ to be close to $\tilde{\mathbf{x}}^{\prime}$, the result of the cycle $\mathbf{x}^{\prime} \to H(\mathbf{x}^{\prime}) \to G(H(\mathbf{x}^{\prime})) = \tilde{\mathbf{x}}^{\prime}$, and analogously for samples $\mathbf{z}^{\prime}$ and $\tilde{\mathbf{z}}^{\prime} = H(G(\mathbf{z}^{\prime}))$ from the base distribution. The similarity is measured using the squared Euclidean norm. The Roundtrip loss is defined as follows:
$$ \mathcal{L}_{\text{RT}} = \alpha \Vert \mathbf{x}^{\prime} - \tilde{\mathbf{x}}^{\prime} \Vert^2_2 + \beta \Vert \mathbf{z}^{\prime} - \tilde{\mathbf{z}}^{\prime} \Vert^2_2, $$
where $\alpha, \beta \in \mathbb{R}$ are weighting coefficients. This kind of cycle-consistency loss has been used before, e.g. in CycleGAN [Zhu17U]. During training, the Roundtrip loss is added to the losses of both generators.
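A corresponding sketch of the Roundtrip loss; the default weighting coefficients used here are illustrative, not the values from the paper.

```python
def roundtrip_loss(G, H, x_prime, z_prime, alpha=10.0, beta=10.0):
    # Cycle terms: x' -> H(x') -> G(H(x')) and z' -> G(z') -> H(G(z'));
    # alpha and beta are illustrative defaults, not the paper's values.
    x_cycle = G(H(x_prime))
    z_cycle = H(G(z_prime))
    return (alpha * (x_prime - x_cycle).pow(2).sum(dim=1).mean()
            + beta * (z_prime - z_cycle).pow(2).sum(dim=1).mean())
```

In the overall objective, this term would simply be added to $\mathcal{L}_{\text{GAN}}(G)$ and $\mathcal{L}_{\text{GAN}}(H)$ before each generator update.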
The authors tested and compared the Roundtrip framework to popular density estimation methods on several problems, including both simulated and real-world studies. To assess the density estimation, three 2D data sets were generated, consisting of 20k i.i.d. samples each. After training, the Roundtrip model was compared to MADE [Ger15M], RealNVP [Din17D] and Masked Autoregressive Flows [Pap17M]. The results are shown in Figure 2; on these data sets, the Roundtrip model clearly outperformed the other methods.
In summary, with the Roundtrip framework the authors have shown that
- GANs require less restrictive model assumptions than normalizing flows
- Principles of previous neural density estimators can be seen as a special case of the Roundtrip framework
- Roundtrip performs well on synthetic datasets (see Figure 2), although the chosen baseline methods might not be optimal [Kim20S] for the tasks at hand