
Generalization in Learning from examples

Generalization in Learning from examples. Function Space. Hilbert Space.


Presentation Transcript


  1. Generalization in Learning from examples

  2. Function Space

  3. Hilbert Space • A Hilbert space H is a real or complex inner product space that is also a complete metric space with respect to the distance function induced by the inner product. To say that H is a complex inner product space means that H is a complex vector space on which there is an inner product ⟨x, y⟩ associating a complex number to each pair of elements x, y of H that satisfies the following properties:

  4. Hilbert Space properties • ⟨y, x⟩ is the complex conjugate of ⟨x, y⟩. • ⟨x, y⟩ is linear in its first argument: for all complex numbers a and b, ⟨ax1 + bx2, y⟩ = a⟨x1, y⟩ + b⟨x2, y⟩. • The inner product is positive definite: ⟨x, x⟩ ≥ 0, where the case of equality holds precisely when x = 0.

  5. Hilbert Space properties • The norm defined by the inner product ⟨•,•⟩ is the real-valued function ‖x‖ = √⟨x, x⟩. • The distance between two points x, y in H is defined in terms of the norm by d(x, y) = ‖x − y‖. • That this function is a distance function means: • it is symmetric in x and y, • the distance between x and itself is zero, and otherwise the distance between x and y must be positive, • the triangle inequality holds, meaning that the length of one leg of a triangle xyz cannot exceed the sum of the lengths of the other two legs: d(x, z) ≤ d(x, y) + d(y, z). • This last property is ultimately a consequence of the more fundamental Cauchy–Schwarz inequality |⟨x, y⟩| ≤ ‖x‖ ‖y‖, with equality if and only if x and y are parallel.
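
A quick numeric sanity check of these relations (not from the slides): on Rⁿ with the ordinary dot product, a finite-dimensional real Hilbert space, the norm, distance, Cauchy–Schwarz and triangle inequalities can be verified directly.

```python
# A minimal check of the Hilbert-space relations on R^n with the ordinary
# dot product; the vectors are arbitrary random examples.
import numpy as np

rng = np.random.default_rng(0)
x, y, z = rng.standard_normal((3, 5))

inner = np.dot                               # <x, y> on R^n
norm = lambda v: np.sqrt(inner(v, v))        # ||v|| = sqrt(<v, v>)
dist = lambda a, b: norm(a - b)              # d(a, b) = ||a - b||

# Cauchy-Schwarz: |<x, y>| <= ||x|| ||y||
assert abs(inner(x, y)) <= norm(x) * norm(y) + 1e-12
# Triangle inequality: d(x, z) <= d(x, y) + d(y, z)
assert dist(x, z) <= dist(x, y) + dist(y, z) + 1e-12
# Positive definiteness: <x, x> > 0 for x != 0, and <0, 0> = 0
assert inner(x, x) > 0 and inner(np.zeros(5), np.zeros(5)) == 0
print("all Hilbert-space relations hold")
```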

  6. Examples of Hilbert space

  7. Overfitting vs. Generalization

  8. 2. Generalization • A problem is well-posed if its solution: • exists, • is unique, • depends continuously on the data (i.e. it is stable). • A problem is ill-posed if it is not well-posed. In the context of this class, well-posedness is mainly used to mean stability of the solution.

  9. Generalization • Eidetic generalization: • the process of imagining possible cases rather than observing actual ones. • Eidos: properties, kinds or types of ideal species that entities may exemplify. • Eidetic variation: • the possible changes an individual can undergo while remaining an instance of a given type of an essence.

  10. Stabilizer • Popper's claim: • empirical data are not sufficient for obtaining any pattern; • in addition to empirical data, one needs some conceptual data expressing prior knowledge about the properties of the desired function. • In the 1990s, Poggio and Girosi proposed modifying the empirical error functional to Ez(f) + γ Ψ(f), • where Ψ is a functional expressing some global property (such as smoothness) of the function to be minimized.

  11. Unstable example

  12. Stable example

  13. 3. Inverse Problems • For an operator A : X → Y between two Hilbert spaces X and Y, an inverse problem determined by A is the task of finding, for g ∈ Y (called data), some f ∈ X (called solution) such that • A(f) = g. • When X and Y are • finite dimensional: linear operators can be represented by matrices; • infinite dimensional: typical operators are integral ones. • Fredholm integral equations of the first and second kind: g(x) = ∫ K(x, t) f(t) dt and f(x) = g(x) + λ ∫ K(x, t) f(t) dt.
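
A minimal sketch (the smooth Gaussian-shaped kernel and uniform quadrature grid are illustrative assumptions) of what happens when a first-kind Fredholm equation is discretized: the resulting matrix is severely ill-conditioned, which is the hallmark of an ill-posed inverse problem.

```python
# Discretizing g(x) = ∫ K(x, t) f(t) dt on a uniform grid with
# rectangle-rule weights gives a linear system A f ≈ g; for a smooth
# kernel the matrix A is severely ill-conditioned.
import numpy as np

n = 100
t = np.linspace(0.0, 1.0, n)
h = t[1] - t[0]                                      # quadrature weight

K = np.exp(-(t[:, None] - t[None, :]) ** 2 / 0.1)    # smooth kernel K(x, t)
A = K * h                                            # discretized operator

print(np.linalg.cond(A))   # enormous: the discrete problem is ill-conditioned
```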

  14. Well-posed and ill-posed problems • Hadamard introduced the definition of ill-posedness. Ill-posed problems are typically inverse problems. • As an example, assume g is a function in Y and u is a function in X, with Y and X Hilbert spaces. Then, given the linear, continuous operator L, consider the equation g = Lu. • The direct problem is to compute g given u; the inverse problem is to compute u given the data g. In the learning case L is somewhat similar to a “sampling” operation, and the inverse problem becomes the problem of finding a function that takes the values f(xi) = yi, i = 1, ..., n. • The inverse problem of finding u is well-posed when the solution • exists, • is unique, and • is stable, that is, depends continuously on the initial data g.

  15. 4. Pseudosolutions of Inverse Problems • When there is no solution: • For an operator A : X → Y, let • R(A) = {g ∈ Y | (∃f ∈ X)(A(f) = g)} • denote its range, and let πclR(A) : Y → clR(A) be the projection of Y onto the closure of R(A) in Y. • Every continuous operator A between two Hilbert spaces has an adjoint A* satisfying, for all f ∈ X and all g ∈ Y, ⟨A(f), g⟩ = ⟨f, A*(g)⟩.

  16. Pseudosolutions of Inverse Problems • If the range of A is closed, then there exists a unique continuous linear pseudoinverse operator A+ : Y → X such that for every g ∈ Y: • AA+(g) = πclR(A)(g) • A+ = (A*A)+A* = A*(AA*)+

  17. Pseudosolutions of Inverse Problems • To solve more general least-squares problems, Moore–Penrose pseudoinverses are defined for all continuous linear operators A : X → Y between two Hilbert spaces X and Y. • Not every continuous linear operator has a continuous linear pseudoinverse; • just the ones whose range is closed in Y. • If the range is not closed, then A+ is only defined for those g ∈ Y for which πclR(A)(g) ∈ R(A).
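
A finite-dimensional sketch of a pseudosolution (the rank-deficient matrix below is a made-up example): f+ = A+g is the minimum-norm least-squares solution, and AA+g is the projection of g onto the range of A.

```python
# Pseudosolution of A f = g when no exact solution exists.
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],    # row 2 = 2 * row 1, so rank(A) = 2
              [0.0, 1.0, 1.0]])
g = np.array([1.0, 0.0, 2.0])

A_plus = np.linalg.pinv(A)         # Moore-Penrose pseudoinverse
f_plus = A_plus @ g                # pseudosolution

residual = A @ f_plus - g
print(np.allclose(A.T @ residual, 0.0))   # residual is orthogonal to range(A)

# The identity A+ = (A* A)+ A* from the previous slide also holds numerically:
print(np.allclose(A_plus, np.linalg.pinv(A.T @ A) @ A.T))
```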

  18. Condition number • Using the pseudoinverse and a matrix norm, one can define a condition number for any matrix: cond(A) = ‖A‖ ‖A+‖. • A large condition number implies that the problem of finding least-squares solutions to the corresponding system of linear equations is ill-conditioned in the sense that small errors in the entries of A can lead to huge errors in the entries of the solution.
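
A short illustration (the nearly singular 2×2 matrix and the perturbation size are made-up examples): for an ill-conditioned matrix, a tiny perturbation of the data moves the least-squares solution by a large amount.

```python
# cond(A) = ||A|| * ||A+|| in the spectral norm, and its effect on stability.
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
A_plus = np.linalg.pinv(A)
cond = np.linalg.norm(A, 2) * np.linalg.norm(A_plus, 2)
print(cond, np.linalg.cond(A))          # the two definitions agree (~4e4)

g = np.array([2.0, 2.0001])
g_noisy = g + np.array([0.0, 1e-4])     # tiny perturbation of the data
print(A_plus @ g)                       # ~ [1, 1]
print(A_plus @ g_noisy)                 # ~ [0, 2]: a huge change
```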

  19. 5. Regularization • Regularization is a method of improving the stability of solutions of ill-conditioned inverse problems. • The basic idea in the treatment of ill-conditioned problems: • use some a priori knowledge about solutions to disqualify meaningless ones. • Such knowledge can be: • some regularity condition on the solution, expressed as the existence of derivatives up to a certain order with bounds on the magnitudes of these derivatives; • some localization condition, such as a bound on the support of the solution or on its behavior at infinity. • Tikhonov's regularization: penalizes undesired solutions by adding a term called a stabilizer.

  20. Stabilizer • Ψ is a functional called a stabilizer. • The regularization parameter γ plays the role of a trade-off between the least-squares solution and the penalization expressed by Ψ. • A typical choice of stabilizer is the square of the norm on X, for which the original problem is replaced with the minimization of the functional ‖A(f) − g‖² + γ ‖f‖².
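
A minimal sketch of Tikhonov regularization with the squared-norm stabilizer (the discretized smoothing operator, noise level, and γ below are illustrative choices): the minimizer of ‖Af − g‖² + γ‖f‖² solves (AᵀA + γI)f = Aᵀg, and it is far more stable against noise than the plain pseudosolution.

```python
# Tikhonov regularization vs. the unregularized pseudosolution.
import numpy as np

def tikhonov(A, g, gamma):
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + gamma * np.eye(n), A.T @ g)

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 50)
A = np.exp(-(t[:, None] - t[None, :]) ** 2 / 0.1) * (t[1] - t[0])
f_true = np.sin(2 * np.pi * t)
g = A @ f_true + 1e-4 * rng.standard_normal(50)   # noisy data

f_naive = np.linalg.pinv(A) @ g          # pseudosolution: noise is amplified
f_reg = tikhonov(A, g, gamma=1e-6)       # regularized solution: stable
print(np.linalg.norm(f_naive - f_true), np.linalg.norm(f_reg - f_true))
```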

  21. Regularization • For this stabilizer, regularized solutions always exist, • unlike pseudosolutions, which in the infinite-dimensional case do not exist for those data g for which πclR(A)(g) ∉ R(A). • For every continuous operator A : X → Y between two Hilbert spaces and for every γ > 0, there exists a unique operator Aγ = (A*A + γI)⁻¹A* assigning to each g ∈ Y the regularized solution Aγ(g).

  22. Regularization • Even when the original inverse problem does not have a unique solution, for every γ > 0 the regularized problem has a unique solution. • due to the uniform convexity of the functional. • With γ decreasing to zero, the solutions Aγ(g) of the regularized problems converge to the normal pseudosolution A+(g).
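
A short numeric illustration of this convergence (the rank-deficient matrix is the same made-up example used earlier): the Tikhonov solutions approach the normal pseudosolution A+(g) as γ decreases to zero.

```python
# Convergence of A_gamma(g) to A+(g) as gamma -> 0.
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [0.0, 1.0, 1.0]])
g = np.array([1.0, 0.0, 2.0])
f_plus = np.linalg.pinv(A) @ g                      # normal pseudosolution

for gamma in [1e-1, 1e-3, 1e-5, 1e-7]:
    f_gamma = np.linalg.solve(A.T @ A + gamma * np.eye(3), A.T @ g)
    print(gamma, np.linalg.norm(f_gamma - f_plus))  # distance shrinks with gamma
```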

  23. Examples of stabilizers • Localization: e.g. a bound on the support of the solution or on its behavior at infinity. • Smoothness: e.g. bounds on the magnitudes of derivatives up to a certain order.

  24. 6. Learning from data as an inverse problem • Learning of neural networks from examples is also an inverse problem: • for a given training set, find an unknown input-output function; • the operator performs the evaluations of an input-output function at the input data from the training set.

  25. Learning from data as an inverse problem • Empirical error functional: Ez(f) = (1/m) Σi (f(ui) − vi)², which with the sampling operator Lu(f) = (f(u1), . . . , f(um)) can be written as Ez(f) = (1/m) ‖Lu(f) − v‖². • So minimization of the empirical error functional is an inverse problem Lu(f) = v, where v is the output data vector. • Finding a pseudosolution of this inverse problem is equivalent to the minimization of the empirical error functional Ez over X.
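
A sketch of this equivalence over a finite-dimensional hypothesis space (polynomials of a fixed degree, used here only as a stand-in for the Hilbert space X; data and degree are illustrative): the sampling operator Lu becomes a matrix, and minimizing the empirical error is exactly the least-squares inverse problem Lu f = v.

```python
# Empirical-error minimization as a least-squares inverse problem.
import numpy as np

u = np.array([0.0, 0.3, 0.5, 0.8, 1.0])       # input data
v = np.sin(2 * np.pi * u)                     # output data vector
d = 3                                         # polynomial degree

L_u = np.vander(u, d + 1, increasing=True)    # L_u[i, j] = u_i ** j
coef = np.linalg.pinv(L_u) @ v                # pseudosolution = minimizer of E_z
print(L_u @ coef - v)                         # residual of the least-squares fit
```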

  26. Conditions of the inverse problem • To take advantage of the characterizations of pseudosolutions and regularized solutions from the theory of inverse problems, solutions of the inverse problem defined by the operator Lu should be searched for in suitable Hilbert spaces, on which • all evaluation operators of the form (5) are continuous, • norms can express some undesired properties of input-output functions.

  27. 7. Reproducing Kernel Hilbert Spaces (RKHS) • A reproducing kernel Hilbert space (RKHS) is a Hilbert space of pointwise defined real-valued functions on a nonempty set Ω such that all evaluation functionals are continuous, i.e., for every x ∈ Ω, the evaluation functional Fx, defined for any f ∈ X as Fx(f) = f(x), is continuous (bounded).

  28. Properties of RKHS • Every RKHS is uniquely determined by a symmetric positive semidefinite kernel K : Ω × Ω → R, i.e., a symmetric function of two variables satisfying, for all m, all (w1, . . . , wm) ∈ Rm, and all (x1, . . . , xm) ∈ Ωm: • K is symmetric: K(x, y) = K(y, x), • K is PD: Σi,j wi wj K(xi, xj) ≥ 0.
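
A small numeric check of this defining property for the Gaussian kernel K(x, y) = exp(−‖x − y‖²) on arbitrary points (points and coefficients are random examples): the Gram matrix is symmetric with non-negative eigenvalues, so every quadratic form Σ wi wj K(xi, xj) is non-negative.

```python
# Positive semidefiniteness of a Gaussian Gram matrix.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((20, 2))                          # arbitrary points in R^2
sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists)                                     # Gram matrix

print(np.allclose(K, K.T))                                # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)              # eigenvalues >= 0
w = rng.standard_normal(20)
print(w @ K @ w >= -1e-10)                                # quadratic form >= 0
```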

  29. Reproducing Kernel

  30. RKHS and kernels

  31. Examples of pd kernels

  32. Using the inverse problem • As on every RKHS HK(Ω) all evaluation functionals are continuous, for every sample of input data u = (u1, . . . , um) the operator Lu : HK(Ω) → Rm, Lu(f) = (f(u1), . . . , f(um)), is continuous. Moreover, its range is closed because it is finite dimensional. So one can apply results from the theory of inverse problems.

  33. Pseudosolution and RKHS • The pseudosolution has the form f+ = Σi ci Kui with coefficients c = K[u]+ v, • where K[u] is the Gram matrix of the kernel K with respect to the vector u, defined as K[u]i,j = K(ui, uj). • f+ minimizes the empirical error. • f+ can be interpreted as an input-output function of a neural network with one hidden layer of kernel units and a single linear output unit.
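
A sketch of this pseudosolution (the Gaussian kernel, its width, and the training data are illustrative assumptions): f+(x) = Σ ci K(x, ui) with c = K[u]+ v reproduces the output data, i.e. it drives the empirical error to zero, and reads exactly like a one-hidden-layer kernel network.

```python
# RKHS pseudosolution f+ = sum_i c_i K(., u_i) with c = K[u]+ v.
import numpy as np

def K(x, y, width=0.1):
    return np.exp(-(x - y) ** 2 / width)

u = np.array([0.0, 0.25, 0.5, 0.75, 1.0])    # input data
v = np.array([0.0, 1.0, 0.0, -1.0, 0.0])     # output data vector

K_u = K(u[:, None], u[None, :])              # Gram matrix K[u]
c = np.linalg.pinv(K_u) @ v                  # coefficients of f+

f_plus = lambda x: K(x, u) @ c               # f+(x) = sum_i c_i K(x, u_i)
print([round(f_plus(x), 6) for x in u])      # reproduces v, so E_z(f+) = 0
```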

  34. Regularization and RKHS

  35. RKHS and inverse problem • f+ and fγ, minimizing Ez and Ez + γ‖·‖²K, respectively, are linear combinations of the representers Ku1, . . . , Kum of the input data u1, . . . , um, but the coefficients of the two linear combinations are different.
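
A sketch contrasting the two coefficient vectors (kernel, data, and γ as in the previous illustrative example; the regularized coefficients use the usual representer-theorem form c = (K[u] + γ m I)⁻¹ v, an assumption consistent with the stabilizer γ‖f‖²K):

```python
# Same representers K(., u_i), different coefficients for f+ and f_gamma.
import numpy as np

def K(x, y, width=0.1):
    return np.exp(-(x - y) ** 2 / width)

u = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
v = np.array([0.0, 1.0, 0.0, -1.0, 0.0])
m, gamma = len(u), 1e-2

K_u = K(u[:, None], u[None, :])
c_plus = np.linalg.pinv(K_u) @ v                            # coefficients of f+
c_gamma = np.linalg.solve(K_u + gamma * m * np.eye(m), v)   # coefficients of f_gamma

print(c_plus)
print(c_gamma)   # same representers, different (smaller) coefficients
```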

  36. Ill-posedness of K[u] • When K is positive definite, • the row vectors of the matrix K[u] are linearly independent. • But when the distances between the data u1, . . . , um are small, the row vectors might be nearly parallel and the small eigenvalues of K[u] might cluster near zero. • Then small changes of v can cause large changes of f+.

  37. Types of ill-posedness of K[u] • The matrix can be rank-deficient: • it has a cluster of small eigenvalues and a gap between large and small eigenvalues. • The matrix can represent a discrete ill-posed problem: • its eigenvalues gradually decay to zero without any gap in the spectrum.
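
A quick illustration of this ill-conditioning (kernel width and point spreads are illustrative): packing the inputs ui closer together makes the rows of the Gaussian Gram matrix nearly parallel, so its smallest eigenvalues cluster near zero and the condition number explodes.

```python
# Eigenvalues and condition number of K[u] as the inputs cluster.
import numpy as np

def gram(u, width=0.1):
    return np.exp(-(u[:, None] - u[None, :]) ** 2 / width)

for spread in [1.0, 0.1, 0.01]:
    u = np.linspace(0.0, spread, 10)     # 10 input points packed into [0, spread]
    K_u = gram(u)
    eig = np.linalg.eigvalsh(K_u)
    print(spread, eig.min(), np.linalg.cond(K_u))
```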

  38. 8. Three reasons for using kernels in ML • Linear separation simplifies classification. In some cases, even data that are not linearly separable can be transformed into linearly separable ones.
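
A hedged illustration of this first reason (the circular data and the explicit lifting are made-up examples): points inside vs. outside the unit circle are not linearly separable in R², but the lifted features φ(x) = (x1, x2, ‖x‖²), the kind of map a polynomial kernel induces implicitly, are separated by the plane ‖x‖² = 1.

```python
# Non-separable data in R^2 become linearly separable after a feature map.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(200, 2))
y = (X ** 2).sum(axis=1) > 1.0             # class: outside the unit circle

Phi = np.c_[X, (X ** 2).sum(axis=1)]       # lifted data in R^3
w, b = np.array([0.0, 0.0, 1.0]), -1.0     # separating plane: z - 1 = 0
pred = Phi @ w + b > 0
print((pred == y).mean())                  # 1.0: perfectly separated
```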

  39. Reasons for using kernels in ML • Stabilizers of the form Ψ(f) = ∫ |f̂(ω)|² / k̂(ω) dω are special cases of squares of norms on RKHSs generated by convolution kernels K(x, y) = k(x − y) with positive Fourier transform k̂. • For such kernels, the value of the stabilizer at any f in the RKHS is expressed as the squared RKHS norm ‖f‖²K. • The Gaussian kernel is an example of a convolution kernel with a positive Fourier transform.

  40. Reasons for using kernels in ML • Reformulation of the minimization of the empirical error functional as an inverse problem: in an RKHS, all evaluation functionals are continuous, which is necessary for applying the tools from the theory of inverse problems.
