
Bayesian Learning in Intelligent Autonomous Systems

Learn about the paradigm of Bayesian learning, which combines prior distributions with likelihood terms to form posterior distributions that support data-driven decision-making in intelligent autonomous systems. The framework weighs how likely the observed data is under each parameter setting against prior knowledge.


Presentation Transcript


  1. Learning in Bayesian Networks. Intelligent Autonomous Systems. Based on a lecture by Prof. Geoffrey Hinton, University of Toronto. Janusz A. Starzyk, Wyzsza Szkola Informatyki i Zarzadzania w Rzeszowie

  2. The Bayesian Paradigm • The Bayesian paradigm assumes that we always have a prior distribution for everything. • The prior distribution may be very vague. • When we see some data, we combine our prior distribution with a likelihood term to get a posterior distribution. • The likelihood term evaluates how probable the observed data is given the parameters of the model. • It favors parameter settings that make the data likely. • It fights the prior distribution. • With enough data, the likelihood terms always win.

  3. Bayes' Theorem. Writing the joint probability through conditional probabilities, $p(D)\,p(W \mid D) = p(D, W) = p(W)\,p(D \mid W)$, so $p(W \mid D) = \frac{p(W)\,p(D \mid W)}{p(D)}$, where $p(W)$ is the prior probability of the weight vector $W$, $p(D \mid W)$ is the probability of the data given the weights $W$ (the likelihood term), and $p(W \mid D)$ is the posterior probability of the weight vector $W$ given the training data $D$.

  4. Why do we maximize sums of log probabilities? • We want to maximize the product of the probabilities of the outputs on the training cases. • Assume that the output errors on different training cases, c, are independent, so $p(D \mid W) = \prod_c p(d_c \mid W)$. • Because the log function is monotonic, we can instead maximize the sum of log probabilities, $\log p(D \mid W) = \sum_c \log p(d_c \mid W)$.

  5. Maximum likelihood learning. Minimizing the sum-of-squares error is equivalent to maximizing the log probability of the correct answer under a Gaussian distribution centered on the model's estimate: $p(d \mid y) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(d-y)^2/2\sigma^2}$, where d = the correct answer and y = the model's estimate of the most probable value.

  6. Maximum likelihood learning (ML) • Finding the set of weights, W, that minimizes the sum-of-squares error is exactly the same as finding the W that maximizes the log probability that the model would produce the desired outputs in all the training cases. • We implicitly assume that zero-mean Gaussian noise is added to the model's actual output. • We do not need to know the noise level, because we assume it is the same in all cases; it only scales the sum-of-squares error.

  7. Bayes' Theorem. Writing the joint probability through conditional probabilities, $p(D)\,p(W \mid D) = p(D, W) = p(W)\,p(D \mid W)$, so $p(W \mid D) = \frac{p(W)\,p(D \mid W)}{p(D)}$, where $p(W)$ is the prior probability of the weight vector $W$, $p(D \mid W)$ is the probability of the data given the weights $W$ (the likelihood term), and $p(W \mid D)$ is the posterior probability of the weight vector $W$ given the training data $D$.

  8. Maximum a posteriori learning (MAP) • This trades off the prior probability of the parameters against the probability of the data given the parameters; it looks for the parameters that have the best product of the prior term and the likelihood term. • Minimizing the sum of squared weights is equivalent to maximizing the log probability of the weights under a zero-mean Gaussian prior (maximizing the prior). [Figure: zero-mean Gaussian prior density p(w) over a weight w.]

  9. Maximum a posteriori learning (MAP) • Maximizing the posterior probability is equivalent to minimizing the regularized sum-of-squares error function $E = \frac{1}{2\sigma_D^2}\sum_c (y_c - d_c)^2 + \frac{1}{2\sigma_W^2}\sum_i w_i^2$ • with the regularization parameter $\lambda = \sigma_D^2 / \sigma_W^2$ • or, equivalently, minimizing the cost function $C = \sum_c (y_c - d_c)^2 + \lambda \sum_i w_i^2$. [Figure: zero-mean Gaussian prior density p(w) over a weight w.]

  10. Questions?

  11. The Bayesian Learning (full version)

  12. The Bayesian framework • The Bayesian framework assumes that we always have a prior distribution for everything. • The prior may be very vague. • When we see some data, we combine our prior distribution with a likelihood term to get a posterior distribution. • The likelihood term takes into account how probable the observed data is given the parameters of the model. • It favors parameter settings that make the data likely. • It fights the prior • With enough data the likelihood terms always win.

  13. A coin tossing example • Suppose we know nothing about coins except that each tossing event produces a head with some unknown probability p and a tail with probability 1-p. Our model of a coin has one parameter, p. • Suppose we observe 100 tosses and there are 53 heads. What is p? • The frequentist answer: pick the value of p that makes the observation of 53 heads and 47 tails most probable. The probability of a particular such sequence is $P(D \mid p) = p^{53}(1-p)^{47}$, which is maximized at $p = 53/100 = 0.53$.
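The sketch below (a minimal, assumed setup using NumPy) checks this frequentist answer numerically by evaluating the log likelihood on a grid of p values:

```python
import numpy as np

# Maximum likelihood for the coin: pick the p that maximizes p^53 * (1-p)^47.
heads, tails = 53, 47
p_grid = np.linspace(0.001, 0.999, 999)

# Work in the log domain to avoid underflow on long sequences.
log_lik = heads * np.log(p_grid) + tails * np.log(1 - p_grid)

p_ml = p_grid[np.argmax(log_lik)]
print(p_ml)  # ~0.53, matching the closed form heads / (heads + tails)
```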

  14. Some problems with picking the parameters that are most likely to generate the data • What if we only tossed the coin once and we got 1 head? • Is p=1 a sensible answer? • Surely p=0.5 is a much better answer. • Is it reasonable to give a single answer? • If we don’t have much data, we are unsure about p. • Our computations of probabilities will work much better if we take this uncertainty into account.

  15. Using a distribution over parameter values • Start with a prior distribution over p. In this case we used a uniform distribution. • Multiply the prior probability of each parameter value by the probability of observing a head given that value. • Then scale up all of the probability densities so that their integral comes to 1. This gives the posterior distribution. [Figure: the uniform prior density over p (area = 1), the densities after multiplying by the likelihood of a head, and the renormalized posterior density (area = 1).]

  16. Let's do it again: suppose we get a tail • Start with a prior distribution over p. • Multiply the prior probability of each parameter value by the probability of observing a tail given that value. • Then renormalize to get the posterior distribution. Look how sensible it is! [Figure: the prior density over p (area = 1) and the renormalized posterior density after observing a tail (area = 1).]

  17. Let's do it another 98 times • After 53 heads and 47 tails we get a very sensible posterior distribution that has its peak at 0.53 (assuming a uniform prior). [Figure: the posterior density over p after 100 tosses, peaked at 0.53 (area = 1).]
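As a minimal sketch of the procedure on slides 15-17 (the grid of p values and the toss data here are assumptions), each toss multiplies the current density by the Bernoulli likelihood and renormalizes:

```python
import numpy as np

# Grid-based Bayesian updating for the coin: start from a uniform prior and
# fold in one Bernoulli likelihood per toss, renormalizing to area 1 each time.
p_grid = np.linspace(0.001, 0.999, 999)
dp = p_grid[1] - p_grid[0]
density = np.ones_like(p_grid)           # uniform prior density

tosses = [1] * 53 + [0] * 47             # 1 = head, 0 = tail
for d in tosses:
    density *= p_grid if d == 1 else (1 - p_grid)
    density /= density.sum() * dp        # rescale so the integral comes to 1

print(p_grid[np.argmax(density)])        # posterior peak near 0.53
```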

  18. Bayes' Theorem. Writing the joint probability through conditional probabilities, $p(D)\,p(W \mid D) = p(D, W) = p(W)\,p(D \mid W)$, so $p(W \mid D) = \frac{p(W)\,p(D \mid W)}{p(D)}$, where $p(W)$ is the prior probability of the weight vector $W$, $p(D \mid W)$ is the probability of the observed data given $W$ (the likelihood function), and $p(W \mid D)$ is the posterior probability of the weight vector $W$ given the training data $D$.

  19. A cheap trick to avoid computing the posterior probabilities of all weight vectors • Suppose we just try to find the most probable weight vector. • We can do this by starting with a random weight vector and then adjusting it in the direction that improves p( W | D ). • It is easier to work in the log domain. If we want to minimize a cost we use negative log probabilities: $\text{Cost} = -\log p(W \mid D) = -\log p(W) - \log p(D \mid W) + \log p(D)$, where the last term does not depend on $W$.
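A minimal sketch of this trick under assumed toy choices (a single weight, Gaussian likelihood and prior, hand-picked noise levels and step size):

```python
import numpy as np

# Find the most probable weight by gradient descent on the negative log
# posterior: -log p(w) - log p(D|w), dropping the constant log p(D).
rng = np.random.default_rng(3)
x = rng.normal(size=30)
d = 1.5 * x + 0.5 * rng.normal(size=30)   # toy training data

def grad_neg_log_post(w, sigma_d=0.5, sigma_w=1.0):
    # d/dw of the squared-error term (likelihood) plus weight penalty (prior)
    return -(x * (d - w * x)).sum() / sigma_d**2 + w / sigma_w**2

w = rng.normal()                          # start from a random weight
for _ in range(500):
    w -= 0.005 * grad_neg_log_post(w)     # move in the direction improving p(W|D)
print(w)                                  # settles near the MAP weight
```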

  20. Why we maximize sums of log probs • We want to maximize the product of the probabilities of the outputs on the training cases. • Assume the output errors on different training cases, c, are independent, so $p(D \mid W) = \prod_c p(d_c \mid W)$. • Because the log function is monotonic, we can maximize sums of log probabilities instead: $\log p(D \mid W) = \sum_c \log p(d_c \mid W)$.

  21. An even cheaper trick • Suppose we completely ignore the prior over weight vectors. • This is equivalent to giving all possible weight vectors the same prior probability density. • Then all we have to do is maximize $\sum_c \log p(d_c \mid W)$. • This is called maximum likelihood learning. It is very widely used for fitting models in statistics.

  22. Supervised Maximum Likelihood Learning. Minimizing the squared residuals is equivalent to maximizing the log probability of the correct answer under a Gaussian centered at the model's guess: $p(d \mid y) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(d-y)^2/2\sigma^2}$, so $-\log p(d \mid y) = \frac{(d-y)^2}{2\sigma^2} + \log(\sqrt{2\pi}\,\sigma)$, where y = the model's estimate of the most probable value and d = the correct answer.

  23. Supervised Maximum Likelihood (ML) Learning • Finding a set of weights, W, that minimizes the squared errors is exactly the same as finding a W that maximizes the log probability that the model would produce the desired outputs on all the training cases. • We implicitly assume that zero-mean Gaussian noise is added to the model's actual output. • We do not need to know the variance of the noise because we are assuming it's the same in all cases. So it just scales the squared error.
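A small numerical check of this equivalence (the linear model, data, and noise level below are assumptions): the weight that minimizes the squared error is the same one that maximizes the Gaussian log likelihood, whatever fixed variance we plug in.

```python
import numpy as np

# Compare the squared-error minimizer with the Gaussian log-likelihood
# maximizer for a toy linear model y = w * x.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
d = 2.0 * x + 0.3 * rng.normal(size=50)       # targets with Gaussian noise

w_grid = np.linspace(0.0, 4.0, 4001)
sq_err = np.array([((w * x - d) ** 2).sum() for w in w_grid])

sigma = 0.7                                   # any fixed noise level works
log_lik = -sq_err / (2 * sigma**2) - len(x) * np.log(np.sqrt(2 * np.pi) * sigma)

print(w_grid[np.argmin(sq_err)], w_grid[np.argmax(log_lik)])  # identical w
```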

  24. Bayes' Theorem. Writing the joint probability through conditional probabilities, $p(D)\,p(W \mid D) = p(D, W) = p(W)\,p(D \mid W)$, so $p(W \mid D) = \frac{p(W)\,p(D \mid W)}{p(D)}$, where $p(W)$ is the prior probability of the weight vector $W$, $p(D \mid W)$ is the probability of the observed data given $W$ (the likelihood function), and $p(W \mid D)$ is the posterior probability of the weight vector $W$ given the training data $D$.

  25. Maximum A Posteriori (MAP) Learning • This trades off the prior probabilities of the parameters against the probability of the data given the parameters. It looks for the parameters that have the greatest product of the prior term and the likelihood term. • Minimizing the squared weights is equivalent to maximizing the log probability of the weights under a zero-mean Gaussian prior (maximizing the prior). [Figure: zero-mean Gaussian prior density p(w) over a weight w.]

  26. Maximum A Posteriori (MAP) Learning • Maximizing posterior probabilities is equivalent to minimizing the regularized sum-of-squares error function $E = \frac{1}{2\sigma_D^2}\sum_c (y_c - d_c)^2 + \frac{1}{2\sigma_W^2}\sum_i w_i^2$ • with a regularization parameter $\lambda = \sigma_D^2 / \sigma_W^2$ • or minimizing the cost function $C = \sum_c (y_c - d_c)^2 + \lambda \sum_i w_i^2$. [Figure: zero-mean Gaussian prior density p(w) over a weight w.]
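A minimal sketch of MAP learning for an assumed one-weight linear model: the MAP weight minimizes squared error plus an L2 weight penalty, and matches the ridge-regression closed form.

```python
import numpy as np

# MAP learning as regularized least squares: minimize
# sum_c (w*x_c - d_c)^2 + lam * w^2, where lam plays the role of
# sigma_D^2 / sigma_W^2 from the slide.
rng = np.random.default_rng(1)
x = rng.normal(size=20)
d = 2.0 * x + 0.5 * rng.normal(size=20)

lam = 3.0                                     # regularization parameter
w_grid = np.linspace(-1.0, 4.0, 5001)
cost = np.array([((w * x - d) ** 2).sum() + lam * w**2 for w in w_grid])

w_map = w_grid[np.argmin(cost)]
print(w_map, (x @ d) / (x @ x + lam))         # agrees with the closed form
```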

  27. Full Bayesian Learning • Instead of trying to find the best single setting of the parameters (as in ML or MAP), compute the full posterior distribution over parameter settings. • This is extremely computationally intensive for all but the simplest models (it's feasible for a biased coin). • To make predictions, let each different setting of the parameters make its own prediction and then combine all these predictions by weighting each of them by the posterior probability of that setting of the parameters. • This is also computationally intensive. • The full Bayesian approach allows us to use complicated models even when we do not have much data.

  28. Overfitting: A frequentist illusion? • If you do not have much data, you should use a simple model, because a complex one will overfit. • This is true. But only if you assume that fitting a model means choosing a single best setting of the parameters. • If you use the full posterior over parameter settings, overfitting disappears! • With little data, you get very vague predictions because many different parameter settings have significant posterior probability.

  29. A classic example of overfitting • Which model do you believe? The complicated model fits the data better. But it is not economical and it makes silly predictions. • But what if we start with a reasonable prior over all fifth-order polynomials and use the full posterior distribution? Now we get vague and sensible predictions. • There is no reason why the amount of data should influence our prior beliefs about the complexity of the model. [Figure: the same data points fit by a simple model and by a fifth-order polynomial.]

  30. Approximating full Bayesian learning in a neural network • If the neural net only has a few parameters we could put a grid over the parameter space and evaluate p( W | D ) at each grid-point. • This is expensive, but it does not involve any gradient descent and there are no local optimum issues. • After evaluating each grid point we use all of them to make predictions on test data • This is also expensive, but it works much better than ML learning when the posterior is vague or multimodal (this happens when data is scarce).

  31. An example of full Bayesian learning • Allow each of the 6 weights or biases to have the 9 possible values [-2 : 0.5 : 2]. • So there are 9^6 grid-points in parameter space. • For each grid-point compute the probability of the observed outputs of all the training cases. This is the likelihood term and is explained on the next slide. • Multiply the prior for each grid-point p(Wi) by the likelihood term and renormalize to get the posterior probability for each grid-point p(Wi | D). • Make predictions p(ytest | input, D) by using the posterior probabilities of all grid-points to average the predictions p(ytest | input, Wi) made by the different grid-points; a sketch follows the next slide. [Figure: a neural net with 2 inputs, 1 output, and 6 parameters (weights and biases).]

  32. Computing the likelihood term for a logistic output unit • The output y of the logistic unit is the probability that the network assigns to the answer 1. It assigns the complementary probability 1 - y to the answer 0. • Compute $p(d \mid \text{output}) = y$ if $d = 1$ and $1 - y$ if $d = 0$; equivalently, $p(d \mid \text{output}) = y^d (1-y)^{1-d}$.
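A scaled-down sketch of slides 31-32. To keep the grid tiny, it assumes a 2-parameter logistic unit rather than the 6-parameter net (and made-up training cases); the procedure is identical: score every grid-point by its likelihood, renormalize, then posterior-average the grid-points' predictions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [-1.0, -1.0]])
d = np.array([1, 0, 1, 0])                    # observed binary targets

values = np.arange(-2.0, 2.5, 0.5)            # the 9 values [-2 : 0.5 : 2]
grid = np.array([[w1, w2] for w1 in values for w2 in values])  # 9^2 points

# Likelihood of all training cases at each grid-point: prod of y^d (1-y)^(1-d).
Y = sigmoid(grid @ X.T)                       # shape (81, n_cases)
lik = np.prod(Y**d * (1 - Y)**(1 - d), axis=1)

prior = np.ones(len(grid)) / len(grid)        # uniform prior over grid-points
post = prior * lik
post /= post.sum()                            # posterior p(Wi | D)

x_test = np.array([0.5, 0.5])
p_test = post @ sigmoid(grid @ x_test)        # posterior-averaged prediction
print(p_test)
```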

  33. What can we do if there are too many parameters for a grid to be feasible? • The number of grid points is exponential in the number of parameters. • So we cannot deal with more than a few parameters using a grid. • If there is enough data to make most parameter vectors very unlikely, only a tiny fraction of the grid points make a significant contribution to the predictions. • Maybe we can just evaluate this tiny fraction. • It might be good enough to just sample weight vectors according to their posterior probabilities, i.e. with probability $p(W_i \mid D) \propto p(W_i)\,p(D \mid W_i)$.

  34. One method for sampling weight vectors • In standard backpropagation we keep moving the weights in the direction that decreases the cost • i.e. the direction that increases the log likelihood plus the log prior, summed over all training cases. • Suppose we add some Gaussian noise to the weight vector after each update. • So the weight vector never settles down. • It keeps wandering around, but it tends to prefer low cost regions of the weight space. • Amazing fact: If we use just the right amount of noise, and if we let the weight vector wander around for long enough before we take a sample, we will get a sample from the true posterior over weight vectors. • This is called a “Markov Chain Monte Carlo” method and it makes it feasible to use full Bayesian learning with hundreds or thousands of parameters. • There are related MCMC methods that are more complicated but more efficient (we don’t need to let the weights wander around for so long before we get samples from the posterior).
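A minimal sketch of such a sampler (a Langevin-style MCMC update; the toy model, noise scales, step size, and burn-in length below are all assumptions): each step is a small gradient move on the log posterior plus just the right amount of Gaussian noise.

```python
import numpy as np

# Sample weights from the posterior by adding Gaussian noise to each
# gradient update (a Langevin-style Markov Chain Monte Carlo method).
rng = np.random.default_rng(2)
x = rng.normal(size=30)
d = 1.5 * x + 0.5 * rng.normal(size=30)       # toy training data

def grad_log_post(w, sigma_d=0.5, sigma_w=1.0):
    # gradient of log likelihood plus log prior for a 1-D weight
    return (x * (d - w * x)).sum() / sigma_d**2 - w / sigma_w**2

eps = 1e-3                                    # "just the right amount" of noise
w, samples = 0.0, []
for t in range(20000):
    w += 0.5 * eps * grad_log_post(w) + np.sqrt(eps) * rng.normal()
    if t > 5000:                              # let the weight wander around first
        samples.append(w)

print(np.mean(samples), np.std(samples))      # posterior mean and spread
```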
