[ot][spam][ai] POLICE: Provably Optimal Linear Constraint Enforcement for Deep Neural Networks
Undescribed Horrific Abuse, One Victim & Survivor of Many
gmkarl at gmail.com
Thu Nov 3 10:22:57 PDT 2022
In the spirit of cognitive information disruption, I thought it was
fun how the acronym of this paper is completely unrelated to the
subject matter. I do not support intentional cognitive dissonance
(i.e. mind control), but I am used to being enveloped in it, as I
imagine you guys are now.
https://arxiv.org/abs/2211.01340
https://arxiv.org/pdf/2211.01340.pdf
arXiv:2211.01340v1 [cs.LG] 2 Nov 2022
Randall Balestriero
Meta AI, FAIR
rbalestriero at meta.com
NYC, USA
ABSTRACT
Deep Neural Networks (DNNs) outshine alternative function
approximators in many settings thanks to their modularity in composing
any desired differentiable operator. The formed parametrized
functional is then tuned to solve a task at hand from simple gradient
descent. This modularity comes at the cost of making strict
enforcement of constraints on DNNs, e.g. from a priori knowledge of
the task, or from desired physical properties, an open challenge. In
this paper we propose the first provable affine constraint enforcement
method for DNNs that requires minimal changes into a given DNN’s
forward-pass, that is computationally friendly, and that leaves the
optimiza- tion of the DNN’s parameter to be unconstrained i.e.
standard
gradient-based method can be employed. Our method does not require any
sampling and provably ensures that the DNN fulfills the affine
constraint on a given input space’s region at any point during
training, and testing. We coin this method POLICE, standing for
Provably Optimal LInear Constraint Enforcement.
1. INTRODUCTION
Deep Neural Networks (DNNs) are compositions of inter- leaved linear
and nonlinear operators forming a differentiable functional fθ
governed by some parameters θ [1]. DNNs are universal approximators,
but so are many other alternatives e.g. Fourier series [2], kernel
machines [3] and decision trees [4]. The strength of DNNs rather lies
in the ability of the func- tion fθ to quickly and relatively easily
fit most encountered target functional f ∗ –often known through a
given and finite training set– only by performing gradient updates on
its param- eters θ [5]. This success, combined with the development of
software and hardware making DNNs’ training and inference efficient,
have made DNNs the operator of choice for most tasks encountered in
machine learning and pattern recognition [6].
Although ubiquitous for general purpose function fitting, DNNs have
had a much harder time to shine in settings where specific constraints
must be enforced exactly on the learned mapping. A motivating example
follows from the following
Yann LeCun
Meta AI, FAIR, NYU
yann at meta.com
NYC, USA
Fig. 1. Classification task of orange versus purple 2-dimensional
samples, the learned decision boundary is depicted in red and comes
from a SGD trained POLICEd DNN of depth 3 with leaky-ReLU
nonlinearities interleaved with dense layers of width 256 with the
constraint that the network is affine (with unconstrained affine
param- eters) in the region R delimited by the black lines (see Eq.
(3)). This constraint is strictly enforced by the proposed POLICE
algorithm (Algorithm 1), see Fig. 2 for POLICEd DNNs on regression
tasks.
optimization problem
min∥fθ −f∗∥+λR(θ) θ
s.t. Jfθ(x) = A, ∀x ∈ R, (1)
where J is the Jacobian operator and A is a target matrix that the
model fθ’s Jacobian should match for any input x in a given region R
of the input space. We employed explic- itly a regularizer R with
associated hyper-parameter λ as one commonly employs weight-decay i.e.
l2 regularization on θ [7]. The type of constraint of Eq. (1) is
crucial in a variety of applications e.g. based on informed physical
properties
POLICE: PROVABLY OPTIMAL LINEAR CONSTRAINT ENFORCEMENT FOR DEEP NEURAL NETWORKS
that fθ should obey [8, 9, 10] or based on (adversarial) noise
robustness constraints [11, 12]. Another motivating example emerges
when one has a priori knowledge that the target map- ping f ∗ is
itself linear with parameters A∗ , b∗ within a region R leading to the
following optimization problem
min∥fθ −f∗∥+λR(θ) θ
s.t. fθ (x) = A∗ x + b∗ , ∀x ∈ R. (2)
It is easy to see that one might arrive at numerous variants of Eqs.
(1) and (2)’s constraints. Crucial to our study is the observation
that in all cases the necessary constraint one must provably and
strictly enforce into fθ is to be affine within R. As a result, we
summarize the generalized affine constraint that will interest us as
follows
min∥fθ −f∗∥+λR(θ) θ
s.t. fθ(x) is affine on R, (3)
and from which any more specific constraint can be efficiently
enforced. For example, given Eq. (3), imposing Eq. (1) is
straightforward by using a barrier method [13] with ∥J fθ (v)− A∥
which can be done only by sampling a single sample v ∈ R and computing
the DNN’s Jacobian matrix at that point. The same goes for imposing
Eq. (2) given that Eq. (3) is fulfilled.
The main question that naturally arises is how to leverage DNNs while
imposing Eq. (3). Current constraint enforce- ment with DNNs have so
far only considered questions of verification [14, 15, 16] or
constrained parameter space e.g. with integer parameters θ [17]
leaving us with little help to solve Eq. (3). A direct
regularization-based approach suffers from the curse of dimensionality
[18, 19] since sampling R to ensure that fθ is affine requires
exponentially many points as the dimension of R increases. A more
direct approach e.g. defining an alternative mapping gθ,A,b such as
gθ,A,b(x) = 1{x∈R}(Ax + b) + 1{x̸∈R}fθ(x),
would destroy key properties such as continuity at the boundary ∂R. In
short, we are in need of a principled and informed solution to
strictly enforce Eq. (3) in arbitrarily wide and/or deep DNNs.
In this paper, we propose a theoretically motivated and provably
optimal solution to enforce the affine constraint of Eq. (3) into
DNNs. In particular, by focusing on convex poly- topal regions R, we
are able to exactly enforce the model to be affine on R; a visual
depiction of the method at work is given in Fig. 1, and the
pseudo-code is given in Algorithm 1. Our method has linear asymptotic
complexity with respect to the number of vertices P defining (or
approximating) the region R i.e. it has time-complexity O(KP ) where K
is the complexity of the model’s forward pass. Hence the proposed
strategy will often be unimpaired by the curse of dimensionality since
for
example a simplicial region R ⊂ RD has only D + 1 vertices [20] i.e.
our method in this case has complexity growing lin- early with the
input space dimension. Our method is exact for any DNN architecture in
which the nonlinearities are such as (leaky-)ReLU, absolute value,
max-pooling and the likes, extension to smooth nonlinearities is
discussed in Section 3. We coin our method POLICE standing for
Provably Optimal LInear Constraint Enforcement; and we denoted a
constrained DNN as being POLICEd. Two crucial benefits of POLICE are
that the constraint enforcement is exact without requiring any
sampling, and that the standard gradient-based training can be used on
the POLICEd DNN parameters without any changes.
2. STRICTENFORCEMENTOFAFFINE CONSTRAINTS FOR DEEP NEURAL NETWORKS
In this section, we develop the proposed POLICE method that relies on
two ingredients: (i) a DNN fθ using continu- ous piecewise-affine
(CPA) nonlinearities which we formally introduce in Section 2.1; and
(ii) a convex polytopal region R. The derivations of the proposed
method will be carried in Section 2.2 and will empirically validated
in Section 2.3.
2.1. Deep Neural Networks Are Continuous Piecewise Linear Operators
We denote the DNN input-output mapping as fθ : RD → RK with θ the
governing parameters of the mapping. In all generality, a DNN is
formed by a composition of many layers as in
f =f(L) ◦···◦f(1) , (4) θ θ(L) θ(1)
where each layer mapping f(l) : RD(l) → RD(l+1) produces a feature
map; with D(1) D and D(L) K. For most architectures, i.e.
parametrizations of the layers, the input- output mapping producing
each feature map takes the form
f(l) (v) = σ(l)(h(l)(v)) with h(l)(v) = W(l)v + b(l) θ(l)
(5)
where σ is a pointwise activation function, W (l) is a weight matrix
of dimensions D(l+1)×D(l), and b(l) is a bias vector of length D(l+1),
h(l) is denoted as the pre-activation map, and the layer parameters
are gathered into θ(l) {W (l), b(l)}. The matrix W (l) will often
have a specific structure (e.g., circulant) at different layers.
Without loss of generality we consider vectors as inputs since when
dealing with images for example, one can always flatten them into
vectors and reparametrize the layers accordingly leaving the
input-output mapping of the DNN unchanged. The details on the layer
map- ping will not impact our results. One recent line of research
that we will heavily rely on consists in formulating DNNs as
Continuous Piecewise Affine (CPA) mappings [21, 22], that
be expressed as
fθ(x) = (Aωx + bω)1{z∈ω}, (6)
ω∈Ω
where Ω is the input space partition induced by the DNN architecture1
, ω is a partition-region, and Aω , bω are the cor- responding
per-region slope and offset parameters. A Key result that we will
build upon is that the CPA formulation of Eq. (6) represents exactly
the DNN functional of Eqs. (4) and (5), when the nonlinearities σ are
themselves CPA e.g. (leaky-)ReLU, absolute value, max-pooling [22].
2.2. POLICEAlgorithmandOptimality
Equipped with the CPA notations expressing DNNs as per- region affine
mappings (recall Eq. (6)) we now derive the proposed POLICE algorithm
that will strictly enforce the PO- LICEd DNN to fulfill the affine
constraint from Eq. (3)) on a region R which is convex. We will then
empirically validate the method in the following Section 2.3.
As mentioned above, our goal is to enforce the DNN to be a simple
affine mapping on a convex region R. In particular, we will consider
the V -representation of the region R i.e. we consider P vertices v1 ,
. . . , vP so that the considered region R is, or can be closely
approximated by, the convex hull of those vertices as in
Algorithm 1 POLICE procedure given a DNN fθ (recall Eq. (4)) with
activations σ which are CPA e.g. (leaky-)ReLU and the region R
expressed by its vertices (recall Eq. (8)). POLICE strictly enforces
that fθ be affine in R; for clarity we denote by MajorityVote(., axis
= 0) the s vector from Proposition 1.
Train time procedure for each mini-batch X:
Require: X∈RN×D,V ∈RV×D
1: V (1) V , X(1) X
2: for l = 1,...,L do
3: H←V(l)(W(l))T +1Nb(l)
4: s = MajorityVote(sign(H), axis = 0)
5: c = ReLU(−Hdiag(s)).max(axis = 0) ⊙ s 6: X(l+1) ← σ X(l)(W (l))T
+ 1N (b(l) + c)T 7: V(l+1) ←σV(l)(W(l))T +1N(b(l) +c)T 8:
endfor
Ensure: X(L) ◃ Evaluate loss and back-propagate as usual Test time
procedure to be run once:
Do the train time procedure but without X, X(1), . . . , X(L) and by
setting b(l) ← b(l) + c for each layer with the per- layer c from Line
5 above.
are themselves convex polytopes, it implies that all the points within
ω∗, and thus R since we showed that R ⊂ ω∗, will also have the same
pre-activation patterns, i.e. the entire DNN input-output mapping will
stay affine within R concluding the proof.
Theorem 1 provides us with a necessary and sufficient condition but
does not yet answer the principal question of “how to ensure that the
vertices have the same pre-activation sign patterns”. There are
actually many ways to enforce this, in our study we focus on one
method that we deem the simplest
and leave further comparison for future work.
Proposition 1. A sufficient condition for a set of points in layer l’s
input space column-stacked into V to have the same layer l’s sign
pattern is
0 ≥ min Hp,ksk, (9) p∈[P ]
withH V(W(l))T +1Nb(l) andwithsk being1if#{i: (H)i,k > 0} ≥ P/2 and
−1 otherwise.
Combining Theorem 1 and Proposition 1 is now enough to provide the
POLICE algorithm which is given in Algorithm 1. We now turn to
empirically validate it in a variety of tasks in the following
section.
2.3. Empirical Validation
We now propose to empirically validate the proposed POLICE method from
Algorithm 1. First, we point out that POLICE can be applied regardless
of the downstream task that one aims at solving e.g. classification or
regression as presented
◃ Initialization ◃ Forward-pass
R=VTα:αp ≥0,∀p∧αT1P =1, where we gathered the P vertices into
the P × D matrix
(7)
V [v1,...,vP]T.
We highlight that the actual ordering of the rows of V in
(8) Eq. (8), i.e. which vertex is put in which row, will not impact
(l) th
our result. We will also denote by vp the p vertex mapped
through the first l − 1 layer, starting with vp for l = 1 and the
corresponding matrix V (l). By leveraging the CPA property of the DNN
fθ , one should notice that a necessary and sufficient condition for
the model to stay affine within R is to have the same pre-activation
sign patterns (recall Eq. (5)) which we formalize below.
Theorem 1. A necessary and sufficient condition for a CPA DNN as per
Eq. (6) to stay affine with a region R given by Eq. (7) is to have the
same pre-activation sign patterns between the vertices vp , ∀p ∈ [P ]
and that for each layer.
Proof. The proof is direct. If the pre-activation signs are the same
for all vertices, e.g. sign(h(1) (vp )) = sign(h(1) (vq )), ∀p, q ∈ [P
]2 for the first layer, then the vertices all lie within the same
partition region i.e. ∃ω∗ ∈ Ω such that vp ∈ ω, ∀p. Using Eq. (7) we
obtain that R ⊂ ω∗. Now, because the parti- tion regions ω that form
the DNN partition Ω (recall Eq. (6))
1exact relations between architecture and partition are given in [23]
and are beyond the scope of this study
target function
POLICE constrained DNN approximation
target function
Fig. 2. Regression task of a 2-dimensional space to a 1-dimensional
space with target function depicted in the left column with leaky-
ReLU MLP of depth 4 and width 256 and for which the POLICE constrains
the DNN to be an affine map- ping within the region R de- limited by
the black lines as depicted in the second column. On the right col-
umn a zoom-in is provided demonstrating how gradi- ent based learning
adapted the DNN parameters to very accurately approximate the target
function.
POLICE constrained DNN
Table 1. Average (over 1024 runs) time (in ms.) and std to compute a
forward-pass and a backward-pass with gradient accumulation in a
standard DNN without any constraint enforced versus a POLICEd DNN for
various input dimensions, widths and depths; given a mini- batch size
of 1024. We recall that at test time a POLICEd DNN has zero overhead
(recall Algorithm 1).
input dim. D 2 2 2 784 784 3072 depthL 2. 4 4 2 8 6
step 5
step 50
step 10000
widths D(l) 256 64 4096 1024 1024
4096
Fig. 3. Evolution of the DNN function approximation to the target
function (left) at different training steps (5,50,10000) in each
column. At any point during training POLICE strictly enforced the
constraints on the region R delimited by the black lines in a
differentiable way to allow gradient based learning to discover a good
value for the DNN parameters.
naive Pytorch reproduction of Algorithm 1 and we expect the slow-down
factor to greatly diminish by optimizing the implementation which we
leave for future work.
3. CONCLUSIONANDFUTUREWORK
We proposed a novel algorithm –POLICE– to provably en- force affine
constraints into DNNs that employ continuous piecewise affine
nonlinearities such as (leaky-)ReLU, absolute value or max-pooling.
Given polytopal region R, POLICE enforces the DNN to be affine on R
only by a forward-pass through the DNN of the vertices defining R
(Algorithm 1) making it computationally efficient (Table 1) and
lightweight to implement (no change in the model or optimizer). We be-
lieve that strict enforcement of the affine constraint is key to
enable critical applications to leverage DNNs e.g. to enforce Eqs. (1)
and (2). Among future directions we hope to explore the constraint
enforcement on multiple regions simultaneously, and the extension to
DNN employing smooth activation func-
no constr. 0.9±0 1.3±0 27.9±3 0.9±0 3.1±0 51.9±5
POLICEd 4.3±0 7.7±1 40.3±1 5.2±0 20.4±1 202.2±12 slow-down ×4.5 × 5.7
× 1.4 × 5.5 × 6.6 × 3.9
in Figs. 1 and 2 respectively. In either case, the exact same
methodology is employed (Algorithm 1). To complement the proposed
classification case, we also provide in Fig. 3 the evolution of the
DNN approximation at different training steps. We observe how the
POLICE method enable gradient based learning to take into account the
constraints so that the approximation can be of high-quality despite
the constraint enforcement. Crucially, POLICE provides a strict
constraint enforcement at any point during training requiring no
tuning or sampling. We provide training times in Table 1 for the case
with R being the simplex of the considered ambient space, and recall
the reader that POLICE has zero overhead at test time. We observe that
especially as the dimension or the width of the model grows, as POLICE
is able to provide relatively low computational overhead. In any case,
the computational slow-down seems to lie between 1.4 and 6.6 in all
studied cases. Additionally, those numbers were obtained from a
tions using a probabilistic argument as done e.g. in [24, 25].
Regarding the application of POLICE, we believe that implicit
representation DNNs [26] could be fruitful beneficiaries.
4. REFERENCES
[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, “Deep learning,”
nature, vol. 521, no. 7553, pp. 436– 444, 2015.
[15] Eric Wong and Zico Kolter, “Provable defenses against adversarial
examples via the convex outer adversarial polytope,” in International
Conference on Machine Learning. PMLR, 2018, pp. 5286–5295.
[16] Changliu Liu, Tomer Arnon, Christopher Lazarus, Christopher
Strong, Clark Barrett, Mykel J Kochenderfer, et al., “Algorithms for
verifying deep neural networks,” Foundations and Trends® in
Optimization, vol. 4, no. 3-4, pp. 244–404, 2021.
[17] Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi, “Training and
inference with integers in deep neural net- works,” arXiv preprint
arXiv:1802.04680, 2018.
[18] Richard E Bellman and Stuart E Dreyfus, Applied dy- namic
programming, vol. 2050, Princeton university press, 2015.
[19] Mario Köppen, “The curse of dimensionality,” in 5th online world
conference on soft computing in industrial applications (WSC5), 2000,
vol. 1, pp. 4–8.
[20] Herbert Edelsbrunner, Algorithms in combinatorial ge- ometry,
vol. 10, Springer Science & Business Media, 1987.
[21] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua
Bengio, “On the number of linear regions of deep neural networks,”
Advances in neural information processing systems, vol. 27, 2014.
[22] Randall Balestriero and Richard Baraniuk, “A spline theory of
deep learning,” in International Conference on Machine Learning. PMLR,
2018, pp. 374–383.
[23] Randall Balestriero, Romain Cosentino, Behnaam Aazhang, and
Richard Baraniuk, “The geometry of deep networks: Power diagram
subdivision,” Advances in Neural Information Processing Systems, vol.
32, 2019.
[24] Randall Balestriero and Richard G Baraniuk, “From hard to soft:
Understanding deep network nonlinearities via vector quantization and
statistical inference,” arXiv preprint arXiv:1810.09274, 2018.
[25] Ahmed Imtiaz Humayun, Randall Balestriero, and Richard Baraniuk,
“Polarity sampling: Quality and di- versity control of pre-trained
generative networks via singular values,” in Proceedings of the
IEEE/CVF Con- ference on Computer Vision and Pattern Recognition,
2022, pp. 10641–10650.
[26] Vincent Sitzmann, Julien Martel, Alexander Bergman, David
Lindell, and Gordon Wetzstein, “Implicit neural representations with
periodic activation functions,” Ad- vances in Neural Information
Processing Systems, vol. 33, pp. 7462–7473, 2020.
[2]
[3]
[4]
[5]
[6] [7]
[8]
[9] [10] [11]
[12]
[13]
[14]
Ronald Newbold Bracewell and Ronald N Bracewell, The Fourier transform
and its applications, vol. 31999, McGraw-Hill New York, 1986.
Jooyoung Park and Irwin W Sandberg, “Universal ap- proximation using
radial-basis-function networks,” Neu- ral computation, vol. 3, no. 2,
pp. 246–257, 1991.
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Tal- walkar, Foundations
of machine learning, MIT press, 2018.
YannLeCun,IdoKanter,andSaraASolla,“Eigenvalues of covariance matrices:
Application to neural-network learning,” Physical Review Letters, vol.
66, no. 18, pp. 2396, 1991.
Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio,
Deep learning, vol. 1, MIT Press, 2016.
Anders Krogh and John Hertz, “A simple weight de- cay can improve
generalization,” Advances in neural information processing systems,
vol. 4, 1991.
Lipman Bers, Fritz John, and Martin Schechter, Partial differential
equations, American Mathematical Soc., 1964.
Stanley J Farlow, Partial differential equations for scien- tists and
engineers, Courier Corporation, 1993.
Lawrence C Evans, Partial differential equations, vol. 19, American
Mathematical Soc., 2010.
Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy, “Explaining
and harnessing adversarial ex- amples,” arXiv preprint
arXiv:1412.6572, 2014.
Xiaoyong Yuan, Pan He, Qile Zhu, and Xiaolin Li, “Ad- versarial
examples: Attacks and defenses for deep learn- ing,” IEEE transactions
on neural networks and learning systems, vol. 30, no. 9, pp.
2805–2824, 2019.
Joseph Kelaghan, George L Rubin, Howard W Ory, and Peter M Layde,
“Barrier-method contraceptives and pelvic inflammatory disease,” Jama,
vol. 248, no. 2, pp. 184–187, 1982.
Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska,
and Daniel Kroening, “Concolic testing for deep neural networks,” in
Proceedings of the 33rd ACM/IEEE International Conference on Automated
Software Engineering, 2018, pp. 109–119.
More information about the cypherpunks
mailing list