In the spirit of cognitive information disruption, I thought it was fun how the acronym of this paper is completely unrelated to the subject matter. I do not support intentional cognitive dissonance (i.e. mind control), but I am used to being enveloped in it, as I imagine you guys are now. https://arxiv.org/abs/2211.01340 https://arxiv.org/pdf/2211.01340.pdf arXiv:2211.01340v1 [cs.LG] 2 Nov 2022 Randall Balestriero Meta AI, FAIR rbalestriero@meta.com NYC, USA ABSTRACT Deep Neural Networks (DNNs) outshine alternative function approximators in many settings thanks to their modularity in composing any desired differentiable operator. The formed parametrized functional is then tuned to solve a task at hand from simple gradient descent. This modularity comes at the cost of making strict enforcement of constraints on DNNs, e.g. from a priori knowledge of the task, or from desired physical properties, an open challenge. In this paper we propose the first provable affine constraint enforcement method for DNNs that requires minimal changes into a given DNN’s forward-pass, that is computationally friendly, and that leaves the optimiza- tion of the DNN’s parameter to be unconstrained i.e. standard gradient-based method can be employed. Our method does not require any sampling and provably ensures that the DNN fulfills the affine constraint on a given input space’s region at any point during training, and testing. We coin this method POLICE, standing for Provably Optimal LInear Constraint Enforcement. 1. INTRODUCTION Deep Neural Networks (DNNs) are compositions of inter- leaved linear and nonlinear operators forming a differentiable functional fθ governed by some parameters θ [1]. DNNs are universal approximators, but so are many other alternatives e.g. Fourier series [2], kernel machines [3] and decision trees [4]. The strength of DNNs rather lies in the ability of the func- tion fθ to quickly and relatively easily fit most encountered target functional f ∗ –often known through a given and finite training set– only by performing gradient updates on its param- eters θ [5]. This success, combined with the development of software and hardware making DNNs’ training and inference efficient, have made DNNs the operator of choice for most tasks encountered in machine learning and pattern recognition [6]. Although ubiquitous for general purpose function fitting, DNNs have had a much harder time to shine in settings where specific constraints must be enforced exactly on the learned mapping. A motivating example follows from the following Yann LeCun Meta AI, FAIR, NYU yann@meta.com NYC, USA Fig. 1. Classification task of orange versus purple 2-dimensional samples, the learned decision boundary is depicted in red and comes from a SGD trained POLICEd DNN of depth 3 with leaky-ReLU nonlinearities interleaved with dense layers of width 256 with the constraint that the network is affine (with unconstrained affine param- eters) in the region R delimited by the black lines (see Eq. (3)). This constraint is strictly enforced by the proposed POLICE algorithm (Algorithm 1), see Fig. 2 for POLICEd DNNs on regression tasks. optimization problem min∥fθ −f∗∥+λR(θ) θ s.t. Jfθ(x) = A, ∀x ∈ R, (1) where J is the Jacobian operator and A is a target matrix that the model fθ’s Jacobian should match for any input x in a given region R of the input space. We employed explic- itly a regularizer R with associated hyper-parameter λ as one commonly employs weight-decay i.e. l2 regularization on θ [7]. The type of constraint of Eq. (1) is crucial in a variety of applications e.g. based on informed physical properties POLICE: PROVABLY OPTIMAL LINEAR CONSTRAINT ENFORCEMENT FOR DEEP NEURAL NETWORKS that fθ should obey [8, 9, 10] or based on (adversarial) noise robustness constraints [11, 12]. Another motivating example emerges when one has a priori knowledge that the target map- ping f ∗ is itself linear with parameters A∗ , b∗ within a region R leading to the following optimization problem min∥fθ −f∗∥+λR(θ) θ s.t. fθ (x) = A∗ x + b∗ , ∀x ∈ R. (2) It is easy to see that one might arrive at numerous variants of Eqs. (1) and (2)’s constraints. Crucial to our study is the observation that in all cases the necessary constraint one must provably and strictly enforce into fθ is to be affine within R. As a result, we summarize the generalized affine constraint that will interest us as follows min∥fθ −f∗∥+λR(θ) θ s.t. fθ(x) is affine on R, (3) and from which any more specific constraint can be efficiently enforced. For example, given Eq. (3), imposing Eq. (1) is straightforward by using a barrier method [13] with ∥J fθ (v)− A∥ which can be done only by sampling a single sample v ∈ R and computing the DNN’s Jacobian matrix at that point. The same goes for imposing Eq. (2) given that Eq. (3) is fulfilled. The main question that naturally arises is how to leverage DNNs while imposing Eq. (3). Current constraint enforce- ment with DNNs have so far only considered questions of verification [14, 15, 16] or constrained parameter space e.g. with integer parameters θ [17] leaving us with little help to solve Eq. (3). A direct regularization-based approach suffers from the curse of dimensionality [18, 19] since sampling R to ensure that fθ is affine requires exponentially many points as the dimension of R increases. A more direct approach e.g. defining an alternative mapping gθ,A,b such as gθ,A,b(x) = 1{x∈R}(Ax + b) + 1{x̸∈R}fθ(x), would destroy key properties such as continuity at the boundary ∂R. In short, we are in need of a principled and informed solution to strictly enforce Eq. (3) in arbitrarily wide and/or deep DNNs. In this paper, we propose a theoretically motivated and provably optimal solution to enforce the affine constraint of Eq. (3) into DNNs. In particular, by focusing on convex poly- topal regions R, we are able to exactly enforce the model to be affine on R; a visual depiction of the method at work is given in Fig. 1, and the pseudo-code is given in Algorithm 1. Our method has linear asymptotic complexity with respect to the number of vertices P defining (or approximating) the region R i.e. it has time-complexity O(KP ) where K is the complexity of the model’s forward pass. Hence the proposed strategy will often be unimpaired by the curse of dimensionality since for example a simplicial region R ⊂ RD has only D + 1 vertices [20] i.e. our method in this case has complexity growing lin- early with the input space dimension. Our method is exact for any DNN architecture in which the nonlinearities are such as (leaky-)ReLU, absolute value, max-pooling and the likes, extension to smooth nonlinearities is discussed in Section 3. We coin our method POLICE standing for Provably Optimal LInear Constraint Enforcement; and we denoted a constrained DNN as being POLICEd. Two crucial benefits of POLICE are that the constraint enforcement is exact without requiring any sampling, and that the standard gradient-based training can be used on the POLICEd DNN parameters without any changes. 2. STRICTENFORCEMENTOFAFFINE CONSTRAINTS FOR DEEP NEURAL NETWORKS In this section, we develop the proposed POLICE method that relies on two ingredients: (i) a DNN fθ using continu- ous piecewise-affine (CPA) nonlinearities which we formally introduce in Section 2.1; and (ii) a convex polytopal region R. The derivations of the proposed method will be carried in Section 2.2 and will empirically validated in Section 2.3. 2.1. Deep Neural Networks Are Continuous Piecewise Linear Operators We denote the DNN input-output mapping as fθ : RD → RK with θ the governing parameters of the mapping. In all generality, a DNN is formed by a composition of many layers as in f =f(L) ◦···◦f(1) , (4) θ θ(L) θ(1) where each layer mapping f(l) : RD(l) → RD(l+1) produces a feature map; with D(1) D and D(L) K. For most architectures, i.e. parametrizations of the layers, the input- output mapping producing each feature map takes the form f(l) (v) = σ(l)(h(l)(v)) with h(l)(v) = W(l)v + b(l) θ(l) (5) where σ is a pointwise activation function, W (l) is a weight matrix of dimensions D(l+1)×D(l), and b(l) is a bias vector of length D(l+1), h(l) is denoted as the pre-activation map, and the layer parameters are gathered into θ(l) {W (l), b(l)}. The matrix W (l) will often have a specific structure (e.g., circulant) at different layers. Without loss of generality we consider vectors as inputs since when dealing with images for example, one can always flatten them into vectors and reparametrize the layers accordingly leaving the input-output mapping of the DNN unchanged. The details on the layer map- ping will not impact our results. One recent line of research that we will heavily rely on consists in formulating DNNs as Continuous Piecewise Affine (CPA) mappings [21, 22], that be expressed as fθ(x) = (Aωx + bω)1{z∈ω}, (6) ω∈Ω where Ω is the input space partition induced by the DNN architecture1 , ω is a partition-region, and Aω , bω are the cor- responding per-region slope and offset parameters. A Key result that we will build upon is that the CPA formulation of Eq. (6) represents exactly the DNN functional of Eqs. (4) and (5), when the nonlinearities σ are themselves CPA e.g. (leaky-)ReLU, absolute value, max-pooling [22]. 2.2. POLICEAlgorithmandOptimality Equipped with the CPA notations expressing DNNs as per- region affine mappings (recall Eq. (6)) we now derive the proposed POLICE algorithm that will strictly enforce the PO- LICEd DNN to fulfill the affine constraint from Eq. (3)) on a region R which is convex. We will then empirically validate the method in the following Section 2.3. As mentioned above, our goal is to enforce the DNN to be a simple affine mapping on a convex region R. In particular, we will consider the V -representation of the region R i.e. we consider P vertices v1 , . . . , vP so that the considered region R is, or can be closely approximated by, the convex hull of those vertices as in Algorithm 1 POLICE procedure given a DNN fθ (recall Eq. (4)) with activations σ which are CPA e.g. (leaky-)ReLU and the region R expressed by its vertices (recall Eq. (8)). POLICE strictly enforces that fθ be affine in R; for clarity we denote by MajorityVote(., axis = 0) the s vector from Proposition 1. Train time procedure for each mini-batch X: Require: X∈RN×D,V ∈RV×D 1: V (1) V , X(1) X 2: for l = 1,...,L do 3: H←V(l)(W(l))T +1Nb(l) 4: s = MajorityVote(sign(H), axis = 0) 5: c = ReLU(−Hdiag(s)).max(axis = 0) ⊙ s 6: X(l+1) ← σ X(l)(W (l))T + 1N (b(l) + c)T 7: V(l+1) ←σV(l)(W(l))T +1N(b(l) +c)T 8: endfor Ensure: X(L) ◃ Evaluate loss and back-propagate as usual Test time procedure to be run once: Do the train time procedure but without X, X(1), . . . , X(L) and by setting b(l) ← b(l) + c for each layer with the per- layer c from Line 5 above. are themselves convex polytopes, it implies that all the points within ω∗, and thus R since we showed that R ⊂ ω∗, will also have the same pre-activation patterns, i.e. the entire DNN input-output mapping will stay affine within R concluding the proof. Theorem 1 provides us with a necessary and sufficient condition but does not yet answer the principal question of “how to ensure that the vertices have the same pre-activation sign patterns”. There are actually many ways to enforce this, in our study we focus on one method that we deem the simplest and leave further comparison for future work. Proposition 1. A sufficient condition for a set of points in layer l’s input space column-stacked into V to have the same layer l’s sign pattern is 0 ≥ min Hp,ksk, (9) p∈[P ] withH V(W(l))T +1Nb(l) andwithsk being1if#{i: (H)i,k > 0} ≥ P/2 and −1 otherwise. Combining Theorem 1 and Proposition 1 is now enough to provide the POLICE algorithm which is given in Algorithm 1. We now turn to empirically validate it in a variety of tasks in the following section. 2.3. Empirical Validation We now propose to empirically validate the proposed POLICE method from Algorithm 1. First, we point out that POLICE can be applied regardless of the downstream task that one aims at solving e.g. classification or regression as presented ◃ Initialization ◃ Forward-pass R=VTα:αp ≥0,∀p∧αT1P =1, where we gathered the P vertices into the P × D matrix (7) V [v1,...,vP]T. We highlight that the actual ordering of the rows of V in (8) Eq. (8), i.e. which vertex is put in which row, will not impact (l) th our result. We will also denote by vp the p vertex mapped through the first l − 1 layer, starting with vp for l = 1 and the corresponding matrix V (l). By leveraging the CPA property of the DNN fθ , one should notice that a necessary and sufficient condition for the model to stay affine within R is to have the same pre-activation sign patterns (recall Eq. (5)) which we formalize below. Theorem 1. A necessary and sufficient condition for a CPA DNN as per Eq. (6) to stay affine with a region R given by Eq. (7) is to have the same pre-activation sign patterns between the vertices vp , ∀p ∈ [P ] and that for each layer. Proof. The proof is direct. If the pre-activation signs are the same for all vertices, e.g. sign(h(1) (vp )) = sign(h(1) (vq )), ∀p, q ∈ [P ]2 for the first layer, then the vertices all lie within the same partition region i.e. ∃ω∗ ∈ Ω such that vp ∈ ω, ∀p. Using Eq. (7) we obtain that R ⊂ ω∗. Now, because the parti- tion regions ω that form the DNN partition Ω (recall Eq. (6)) 1exact relations between architecture and partition are given in [23] and are beyond the scope of this study target function POLICE constrained DNN approximation target function Fig. 2. Regression task of a 2-dimensional space to a 1-dimensional space with target function depicted in the left column with leaky- ReLU MLP of depth 4 and width 256 and for which the POLICE constrains the DNN to be an affine map- ping within the region R de- limited by the black lines as depicted in the second column. On the right col- umn a zoom-in is provided demonstrating how gradi- ent based learning adapted the DNN parameters to very accurately approximate the target function. POLICE constrained DNN Table 1. Average (over 1024 runs) time (in ms.) and std to compute a forward-pass and a backward-pass with gradient accumulation in a standard DNN without any constraint enforced versus a POLICEd DNN for various input dimensions, widths and depths; given a mini- batch size of 1024. We recall that at test time a POLICEd DNN has zero overhead (recall Algorithm 1). input dim. D 2 2 2 784 784 3072 depthL 2. 4 4 2 8 6 step 5 step 50 step 10000 widths D(l) 256 64 4096 1024 1024 4096 Fig. 3. Evolution of the DNN function approximation to the target function (left) at different training steps (5,50,10000) in each column. At any point during training POLICE strictly enforced the constraints on the region R delimited by the black lines in a differentiable way to allow gradient based learning to discover a good value for the DNN parameters. naive Pytorch reproduction of Algorithm 1 and we expect the slow-down factor to greatly diminish by optimizing the implementation which we leave for future work. 3. CONCLUSIONANDFUTUREWORK We proposed a novel algorithm –POLICE– to provably en- force affine constraints into DNNs that employ continuous piecewise affine nonlinearities such as (leaky-)ReLU, absolute value or max-pooling. Given polytopal region R, POLICE enforces the DNN to be affine on R only by a forward-pass through the DNN of the vertices defining R (Algorithm 1) making it computationally efficient (Table 1) and lightweight to implement (no change in the model or optimizer). We be- lieve that strict enforcement of the affine constraint is key to enable critical applications to leverage DNNs e.g. to enforce Eqs. (1) and (2). Among future directions we hope to explore the constraint enforcement on multiple regions simultaneously, and the extension to DNN employing smooth activation func- no constr. 0.9±0 1.3±0 27.9±3 0.9±0 3.1±0 51.9±5 POLICEd 4.3±0 7.7±1 40.3±1 5.2±0 20.4±1 202.2±12 slow-down ×4.5 × 5.7 × 1.4 × 5.5 × 6.6 × 3.9 in Figs. 1 and 2 respectively. In either case, the exact same methodology is employed (Algorithm 1). To complement the proposed classification case, we also provide in Fig. 3 the evolution of the DNN approximation at different training steps. We observe how the POLICE method enable gradient based learning to take into account the constraints so that the approximation can be of high-quality despite the constraint enforcement. Crucially, POLICE provides a strict constraint enforcement at any point during training requiring no tuning or sampling. We provide training times in Table 1 for the case with R being the simplex of the considered ambient space, and recall the reader that POLICE has zero overhead at test time. We observe that especially as the dimension or the width of the model grows, as POLICE is able to provide relatively low computational overhead. In any case, the computational slow-down seems to lie between 1.4 and 6.6 in all studied cases. Additionally, those numbers were obtained from a tions using a probabilistic argument as done e.g. in [24, 25]. Regarding the application of POLICE, we believe that implicit representation DNNs [26] could be fruitful beneficiaries. 4. REFERENCES [1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, “Deep learning,” nature, vol. 521, no. 7553, pp. 436– 444, 2015. [15] Eric Wong and Zico Kolter, “Provable defenses against adversarial examples via the convex outer adversarial polytope,” in International Conference on Machine Learning. PMLR, 2018, pp. 5286–5295. [16] Changliu Liu, Tomer Arnon, Christopher Lazarus, Christopher Strong, Clark Barrett, Mykel J Kochenderfer, et al., “Algorithms for verifying deep neural networks,” Foundations and Trends® in Optimization, vol. 4, no. 3-4, pp. 244–404, 2021. [17] Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi, “Training and inference with integers in deep neural net- works,” arXiv preprint arXiv:1802.04680, 2018. [18] Richard E Bellman and Stuart E Dreyfus, Applied dy- namic programming, vol. 2050, Princeton university press, 2015. [19] Mario Köppen, “The curse of dimensionality,” in 5th online world conference on soft computing in industrial applications (WSC5), 2000, vol. 1, pp. 4–8. [20] Herbert Edelsbrunner, Algorithms in combinatorial ge- ometry, vol. 10, Springer Science & Business Media, 1987. [21] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio, “On the number of linear regions of deep neural networks,” Advances in neural information processing systems, vol. 27, 2014. [22] Randall Balestriero and Richard Baraniuk, “A spline theory of deep learning,” in International Conference on Machine Learning. PMLR, 2018, pp. 374–383. [23] Randall Balestriero, Romain Cosentino, Behnaam Aazhang, and Richard Baraniuk, “The geometry of deep networks: Power diagram subdivision,” Advances in Neural Information Processing Systems, vol. 32, 2019. [24] Randall Balestriero and Richard G Baraniuk, “From hard to soft: Understanding deep network nonlinearities via vector quantization and statistical inference,” arXiv preprint arXiv:1810.09274, 2018. [25] Ahmed Imtiaz Humayun, Randall Balestriero, and Richard Baraniuk, “Polarity sampling: Quality and di- versity control of pre-trained generative networks via singular values,” in Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, 2022, pp. 10641–10650. [26] Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein, “Implicit neural representations with periodic activation functions,” Ad- vances in Neural Information Processing Systems, vol. 33, pp. 7462–7473, 2020. [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] Ronald Newbold Bracewell and Ronald N Bracewell, The Fourier transform and its applications, vol. 31999, McGraw-Hill New York, 1986. Jooyoung Park and Irwin W Sandberg, “Universal ap- proximation using radial-basis-function networks,” Neu- ral computation, vol. 3, no. 2, pp. 246–257, 1991. Mehryar Mohri, Afshin Rostamizadeh, and Ameet Tal- walkar, Foundations of machine learning, MIT press, 2018. YannLeCun,IdoKanter,andSaraASolla,“Eigenvalues of covariance matrices: Application to neural-network learning,” Physical Review Letters, vol. 66, no. 18, pp. 2396, 1991. Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio, Deep learning, vol. 1, MIT Press, 2016. Anders Krogh and John Hertz, “A simple weight de- cay can improve generalization,” Advances in neural information processing systems, vol. 4, 1991. Lipman Bers, Fritz John, and Martin Schechter, Partial differential equations, American Mathematical Soc., 1964. Stanley J Farlow, Partial differential equations for scien- tists and engineers, Courier Corporation, 1993. Lawrence C Evans, Partial differential equations, vol. 19, American Mathematical Soc., 2010. Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy, “Explaining and harnessing adversarial ex- amples,” arXiv preprint arXiv:1412.6572, 2014. Xiaoyong Yuan, Pan He, Qile Zhu, and Xiaolin Li, “Ad- versarial examples: Attacks and defenses for deep learn- ing,” IEEE transactions on neural networks and learning systems, vol. 30, no. 9, pp. 2805–2824, 2019. Joseph Kelaghan, George L Rubin, Howard W Ory, and Peter M Layde, “Barrier-method contraceptives and pelvic inflammatory disease,” Jama, vol. 248, no. 2, pp. 184–187, 1982. Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska, and Daniel Kroening, “Concolic testing for deep neural networks,” in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, 2018, pp. 109–119.