__ Summary and Contributions__: The paper proposes using sparse discrete distributions in problems where the objective contains a full sum over a discrete distribution. The imposed sparsity allows exact marginalization and deterministic gradients. The authors use top-k sparsemax and sparseMAP to represent sparse parametric distributions. Experiments include a semi-supervised VAE with a categorical class variable, an emergent communication game, and a VAE with Bernoulli latent variables.

__ Strengths__: The idea of imposing sparsity on a parametric distribution in cases where the real distribution is sparse sounds useful. In particular, for a categorical distribution in a setting where training converges to a single choice or a few choices, this approach can potentially bring significant compute savings. The idea seems novel to me.

__ Weaknesses__: The main weakness is the issue of scaling this approach to a setting with a large number of categories, or a large number of binary variables.
1. The experiment on the VAE with binary latents shows that if the discrete distribution has large support, then the proposed method isn't very useful.
2. Another situation where I suspect the method won't work very well is when learning a distribution over a very large number of classes: early in training this categorical distribution will have full support, and it will collapse only closer to the end of training. In this case it's possible that other baselines will perform better, because the compute budget will be dominated by the early part of training, where the number of loss evaluations will be >>1. At the same time, using top-k sparsemax might hurt the learning process, again in comparison to baselines.
3. Comparison to existing baselines is very sparse: given the generality of the claims in the paper, it would be useful to compare with more of the available baselines (e.g. REBAR, ARM).

__ Correctness__: I find the methods used in the paper appropriate for supporting the claims. However, it seems that the proposed method might have scalability issues, which should be acknowledged alongside the main claims in the abstract and throughout the paper.

__ Clarity__: The paper is well written.

__ Relation to Prior Work__: As far as I know, the proposed method differs from previous approaches in its use of sparse discrete distributions. This distinction is clearly stated in the paper.

__ Reproducibility__: Yes

__ Additional Feedback__: The paper would be much stronger if the authors included an analysis of their method in cases with a large number of categories or binary variables. Plotting training objectives as a function of computational resources would also be useful for understanding the trade-offs. Finally, adding standard baselines (Gumbel, REBAR, ARM, etc.) to all experiments, including the bit-vector VAE, would be helpful.
****************************post rebuttal****************
I'd like to thank the authors for addressing the above concerns. I have updated my score.

__ Summary and Contributions__: I read the rebuttal and I don't think these questions are well-explained, probably due to the space limit. I hope the authors will keep their promise and make major updates to the paper in the final version.
------------------------------- update -----------------------------------------
This work proposes a new method to solve the gradient estimation problem of marginalization over discrete latent variables.
The idea is to use sparsemax (a sparse alternative to softmax which serves as a differentiable relaxation of argmax) to define the distribution over discrete configurations, so that marginalization can be carried out efficiently given that only a small number of configurations have non-zero probability.
To support combinatorial latent variables, where plain sparsemax is computationally infeasible, the authors propose two principled modifications: 1) top-k sparsemax, which restricts the feasible set of sparsemax to have at most k non-zero probabilities;
2) sparseMAP, which finds sparse solutions over combinatorial structures through an active set algorithm.
The two methods are evaluated against several baselines on 1) a semi-supervised VAE on MNIST; 2) an emergent communication game; and 3) a Bernoulli latent VAE.
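
For readers less familiar with these operators, the following is a minimal NumPy sketch of sparsemax and of a top-k variant. It is not the authors' implementation. The function names `sparsemax` and `topk_sparsemax` are illustrative, and the top-k variant assumes that the k-sparse projection can be obtained by keeping the k largest scores and projecting only those onto the simplex.

```python
import numpy as np

def sparsemax(scores):
    """Euclidean projection of a score vector onto the probability simplex."""
    s = np.asarray(scores, dtype=float)
    z = np.sort(s)[::-1]                 # scores in decreasing order
    cumsum = np.cumsum(z)
    ks = np.arange(1, s.size + 1)
    in_support = 1.0 + ks * z > cumsum   # holds for a prefix of the sorted scores
    k = ks[in_support][-1]               # size of the support
    tau = (cumsum[k - 1] - 1.0) / k      # threshold subtracted from every score
    return np.maximum(s - tau, 0.0)

def topk_sparsemax(scores, k):
    """Sketch of top-k sparsemax: at most k entries of the output are non-zero.

    Assumption: the k largest scores are selected first, then projected.
    """
    s = np.asarray(scores, dtype=float)
    top = np.argsort(s)[::-1][:k]        # indices of the k largest scores
    p = np.zeros_like(s)
    p[top] = sparsemax(s[top])
    return p
```

With scores such as `[1.0, 0.8, -1.0]`, `sparsemax` assigns exactly zero probability to the last entry, which is what makes the subsequent marginalization cheap. In an actual model these operators would be written in an autodiff framework so that gradients flow back to the scores.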

__ Strengths__: This is the kind of work that I have been waiting to see for some time.
Sparsemax and sparseMAP are very beautiful tools and have the potential to address many combinatorial sum problems in differentiable programming, with or without structure.
These methods are well recognized in the structured prediction community while remaining under-explored in the gradient estimation literature.
This paper makes a timely contribution to the gradient estimation literature by exploring such ideas.
Although the idea of exact marginalization under sparsity is straightforward to apply to discrete gradient estimators, its execution poses many challenges, which the authors make clear in the paper.
Specifically, as pointed out in L139, ``solving the problem in Eq. 2 still requires explicit manipulation of the large vector s \in R^|Z|, and even if we could avoid this, in the worst case (s = 0) the resulting sparsemax distribution would still have exponentially large support.''
The two solutions to this challenge are technically sound. I particularly like the top-k sparsemax formulation, which builds the constraint into the feasible set (instead of using post-hoc truncation) and remains differentiable using results on sparse projection onto the simplex.
The sparseMAP algorithm is more complex than top-k sparsemax, and I have been wondering what the solutions look like when sparseMAP is applied to multiple independent Bernoulli latents (as in section 5.3).

__ Weaknesses__: * I read the sparseMAP paper some time ago and I remember that the implementation details of the active set algorithm were a bit unclear to me.
I'd appreciate a more detailed discussion (ideally pseudocode) in a future version of the paper, especially of how it is used in gradient estimators for multiple **independent** discrete latent variables.
* What does the q distribution with sparseMAP look like in the Bernoulli latent VAE experiment (section 5.3), where the structure is essentially "independence"? How sparse are the optimal sparseMAP solutions in this case? Are they the same as the solutions obtained by the active set algorithm (which is guaranteed to be sparse)?
* Because the sparseMAP solution largely depends on the active set algorithm, it would be helpful to show such solutions on real examples.
* The experiments would be more convincing if they compared sparsemax/sparseMAP to advanced gradient estimation approaches such as VIMCO (multi-sample), Gumbel-softmax, REBAR, ARM, direct loss minimization, etc. in the Bernoulli latent VAE experiment (section 5.3).

__ Correctness__: Yes.

__ Clarity__: Yes. The clarity can be further improved with more details on the active set algorithm used in sparseMAP.

__ Relation to Prior Work__: * As far as I know, the idea has not been explored in any published work or preprint on gradient estimators for latent variable models.
* Some related work that also borrows the idea from structured prediction (e.g., direct loss minimization) to apply to gradient estimators is missing:
Direct Optimization through $\arg\max$ for Discrete Variational Auto-Encoders.
* The top-k sparsemax formulation is related to
A Truncated EM Approach for Spike-and-Slab Sparse Coding.
Though the approach taken by the submission here is considerably better (it is unified into the feasible set of sparsemax and remains differentiable).

__ Reproducibility__: Yes

__ Additional Feedback__: Please see the above suggestions. I'm willing to raise the rating if the questions are well-addressed.

__ Summary and Contributions__: The paper addresses the problem of training latent variable models with discrete latent variables. Current methods either marginalize over latent variables explicitly (which becomes intractable quite fast), or use biased or unbiased gradient estimators. The paper tackles the problem in a different way, by using sparse discrete latent variables. It does this by using k-sparse projections onto the simplex of appropriate dimension. With a sparse representation and low k, marginalization can be done efficiently and exactly (although forcing k-sparse representations leads to efficient marginalization, it may be suboptimal).

__ Strengths__: Novelty: Sparse projections onto the simplex were previously studied in the paper by Kyrillidis et al. (2013). This work presents a novel way to apply these methods to discrete latent variable models, which appears to work quite well in practice.
Relevance and significance: I think this paper is definitely relevant to the NeurIPS community, since new methods to train discrete latent variable models are continuously being developed, and there are several applications that use this kind of model.

__ Weaknesses__: I have some comments regarding the empirical evaluation:
1- Gumbel-softmax has a temperature parameter that might have a significant effect on the method's performance. For the second set of experiments the appendix states that a temperature of 1 was used. Were other values tested? Was this value used for the semi-supervised VAE too?
2- Other baselines that have been observed to perform well could be included. For instance, VIMCO ("Variational Inference for Monte Carlo Objectives" by Mnih et al.), and REBAR ("REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models" by Tucker et al.).

__ Correctness__: They appear to be correct.

__ Clarity__: Yes.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: *** After rebuttal ***
I will maintain my score. I like the idea, it proposes a different way of dealing with discrete variables, and results look promising.
*** ***
- I think ELBO vs. epoch plots would be nice to have for the last set of experiments. Why were these not included? I usually find them quite informative and straightforward to interpret.

__ Summary and Contributions__: The paper introduces the idea of using sparse distributions in latent variable models to efficiently perform exact marginalization of discrete variables. This is achieved by using the sparsemax operator (or its variants) instead of the usual softmax. The benefits of this approach are shown in 3 different settings, where the introduced method achieves similar results to dense marginalization while needing a much lower number of loss evaluations.

__ Strengths__: Being able to use discrete variables in deep latent variable models is a fundamental yet challenging task. The main issue lies in the fact that exact marginalization is often intractable, and the approximations commonly used (e.g. score function estimator, continuous relaxations like Gumbel-Softmax) are practically difficult to get to work consistently.
This paper introduces the novel idea of using sparse distributions over the discrete variables to solve this issue. In this way it becomes computationally feasible to perform exact marginalization, since only a small number of terms will be non-zero (therefore greatly reducing the number of loss evaluations needed).
The method is sound and fairly easy to implement, so I believe it could have an important impact in the community.
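
To make the saving in loss evaluations concrete, here is a minimal sketch of exact marginalization restricted to the support of a sparse distribution. The names `sparse_expected_loss` and `loss_fn` are hypothetical and do not come from the paper; in a real model `loss_fn` would typically be a decoder forward pass, and the whole computation would live in an autodiff framework so that the gradient with respect to the probabilities is deterministic.

```python
import numpy as np

def sparse_expected_loss(probs, loss_fn):
    """Exact expectation of loss_fn(z) under probs, skipping zero-probability z.

    Returns the expectation and the number of loss evaluations performed.
    """
    probs = np.asarray(probs, dtype=float)
    support = np.flatnonzero(probs)      # indices z with probs[z] != 0
    value = sum(probs[z] * loss_fn(z) for z in support)
    return value, support.size
```

With a dense (softmax) distribution the support is the full set of configurations, so the same function would evaluate the loss once per configuration. The result is exact in both cases; only the number of evaluations changes.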

__ Weaknesses__: The introduced method relies on sparse distributions, which is quite a strong assumption. While the authors address the main implications of this assumption, I think a more detailed discussion and empirical evaluation would increase the impact of this work in the community:
- In the semi-supervised learning experiments in section 5.1 you use a VAE model which is relatively simple and by now quite outdated. Do you expect these results to generalize to more complex architectures? For example, if I took any of the SOTA semi-supervised deep generative models and just replaced the softmax with sparsemax, would you expect similar improvements?
- How does this method behave on challenging tasks that may contain many ambiguous data points? Would the model just use lots of loss evaluations throughout the whole training procedure (and not only at the beginning, as in your experiments), or would the sparsity assumption make the model learn to be certain even for ambiguous data?
- Since sparsemax is such a core component of this method, it would be useful to add some details on its forward/backward passes and their computational complexity relative to softmax.
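
As a rough indication of what such a discussion could cover, below is a sketch of the backward pass of plain sparsemax, based on its known Jacobian diag(m) - m m^T / |S|, where m is the 0/1 indicator of the support S. The function name `sparsemax_backward` is illustrative and this is not taken from the paper. Under these assumptions the backward pass costs time linear in the number of categories, as for softmax, while a sort-based forward pass costs O(K log K) for K categories compared with O(K) for softmax.

```python
import numpy as np

def sparsemax_backward(probs, grad_output):
    """Jacobian-vector product of sparsemax, given its output probs.

    The Jacobian is symmetric, so the same product serves forward-mode
    and reverse-mode differentiation.
    """
    probs = np.asarray(probs, dtype=float)
    g = np.asarray(grad_output, dtype=float)
    mask = probs > 0                              # support of the distribution
    mean_on_support = g[mask].sum() / mask.sum()  # average gradient over the support
    # Inside the support: centre the gradient; outside the support: zero.
    return np.where(mask, g - mean_on_support, 0.0)
```

Entries outside the support receive zero gradient, and the gradients inside the support sum to zero, which reflects the constraint that the output stays on the simplex.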

__ Correctness__: Yes.

__ Clarity__: Yes, I enjoyed reading it.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: *** reply to author feedback ***
Thanks for your rebuttal. After reading it I still argue for acceptance, since I believe that this relatively simple idea could have a good impact in the community.