mainkoon81/ooo-Bayesian-Modeling-A: PGM (Bayesian Network) & its Inference (with MCMC)
What is PGM ?

: It's a probabilistic model for which a graph expresses the conditional dependence structure between random variables.

: Understanding what conditional independence between variables gives us helps explain reasoning under uncertainty. If x and y are conditionally independent given z, then

  • P(x | y,z) = P(x|z)
  • P(x,y | z) = P(x|z) * P(y|z)
  • P(x, y, z) ∝ ϕ(x,z) * ϕ(y,z) for some factors ϕ (the joint factorizes into pieces that each touch z but never pair x with y)
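The following minimal Python check (not from this repo; the factor values are made up) verifies that a joint of the form ϕ(x,z)·ϕ(y,z) really does satisfy the second bullet, P(x,y|z) = P(x|z)·P(y|z):

```python
from itertools import product
import random

# Build a joint P(x,y,z) proportional to phi_xz(x,z) * phi_yz(y,z) over
# binary variables with arbitrary positive factors, then verify the
# conditional independence of x and y given z.
random.seed(1)
phi_xz = {xz: random.random() for xz in product([0, 1], repeat=2)}
phi_yz = {yz: random.random() for yz in product([0, 1], repeat=2)}
raw = {(x, y, z): phi_xz[(x, z)] * phi_yz[(y, z)]
       for x, y, z in product([0, 1], repeat=3)}
Z = sum(raw.values())
P = {k: v / Z for k, v in raw.items()}

for z in (0, 1):
    pz = sum(P[(x, y, z)] for x in (0, 1) for y in (0, 1))
    for x, y in product([0, 1], repeat=2):
        p_xy_z = P[(x, y, z)] / pz                         # P(x,y|z)
        p_x_z = sum(P[(x, yy, z)] for yy in (0, 1)) / pz   # P(x|z)
        p_y_z = sum(P[(xx, y, z)] for xx in (0, 1)) / pz   # P(y|z)
        assert abs(p_xy_z - p_x_z * p_y_z) < 1e-12
print("P(x,y|z) = P(x|z) * P(y|z) holds")
```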

: PGM allows a tractable inference!

  • Let's say we want to classify data points that are independent of each other (for example, given an image, predict whether it contains a cat or a dog). But consider tagging the words in "I like machine learning": the problem here is that "learning" could be a noun or a verb depending on its context, so the labels depend on each other. A probabilistic graphical model is a powerful framework that can be used to learn such models with dependencies.
  • A PGM consists of a graph structure. Each node of the graph is associated with a random variable, and the edges in the graph are used to encode relations between the random variables. Because of the way the graph structure encodes the parameterization of the probability distribution:
    • It gives us an intuitive and compact data structure for capturing high dimensional probability distributions.
    • It gives us a suite of methods for efficient reasoning, using general purpose algorithms
    • It gives us a reduction in the number of parameters, so we can represent these high-dimensional probability distributions efficiently using a very small number of parameters.
      • Reduce the factors down: in the graph, drop as many arrows as possible based on conditional independence!

Depending on whether the graph is directed or undirected, we can classify graphical models into two categories.

  • [a] Bayesian networks

    • Definition: Given X ~ P(X) and an ordering of the variables as a DAG, we build the joint P(X_1, ..., X_n) = ∏_i P(X_i | Parents(X_i)). Ideally each variable X_i would be independent of the others; if some of them are hopelessly dependent but we can make them conditionally independent given their parents, then we only need these small local factors.

    • It defines probability distributions over graphs of random variables. Instead of enumerating all possibilities of combinations of multiple random variables, it defines probability distributions that are inherent to each individual node. The definition of this joint distribution by using such factors has one great advantage: we can reduce the number of probability values(parameters) required.

    • Key: Factorization of joint distribution

    • P(a,b,c) = P(a|b,c)P(b,c) = P(a|b,c)P(b|c)P(c)

    • P(a,b,c) = P(a,b|c)P(c)

      • P(a|b,C): it's fundamentally a conditional. Since P(a|b) = P(a,b)/P(b), conditioning everything on C gives P(a|b,C) = P(a,b|C)/P(b|C).
      • P(a,b|C): it's fundamentally a joint. Since P(a,b) = P(a|b)P(b), conditioning everything on C gives P(a,b|C) = P(a|b,C) P(b|C).
      • If | is already present, simply add the extra C after it; otherwise add | and then add C. (A quick numeric check of these identities appears after this list.)
    • Example

      • Directed PGM & Bayesian OLS Regression

      • Directed PGM & Bayesian NaiveBayes Classification

  • [b] Markov networks

    • T.B.D
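To ground the conditioning identities listed under [a] above, here is a quick numeric check (not from this repo; the joint is an arbitrary made-up distribution over three binary variables):

```python
from itertools import product
import random

# An arbitrary normalized joint P(a,b,c) over binary variables.
random.seed(0)
raw = {abc: random.random() for abc in product([0, 1], repeat=3)}
Z = sum(raw.values())
P = {abc: v / Z for abc, v in raw.items()}

def marg(a=None, b=None, c=None):
    """Sum P(a,b,c) over any variables left as None."""
    return sum(p for (x, y, z), p in P.items()
               if (a is None or x == a) and (b is None or y == b) and (c is None or z == c))

a, b, c = 1, 0, 1
# P(a|b,c) = P(a,b|c) / P(b|c): "add the extra c on both sides of |"
lhs = P[(a, b, c)] / marg(b=b, c=c)
rhs = (marg(a=a, b=b, c=c) / marg(c=c)) / (marg(b=b, c=c) / marg(c=c))
print(abs(lhs - rhs) < 1e-12)    # True

# P(a,b|c) = P(a|b,c) * P(b|c): the chain rule, conditioned on c
lhs2 = marg(a=a, b=b, c=c) / marg(c=c)
rhs2 = (P[(a, b, c)] / marg(b=b, c=c)) * (marg(b=b, c=c) / marg(c=c))
print(abs(lhs2 - rhs2) < 1e-12)  # True
```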

(A) Representation

A1> Bayesian Network Intro

so what?

Bayesian networks allow us to model causal relationships between variables, compensating for the lack of information in the data. A BN can be used for a wide range of tasks including prediction, anomaly detection, diagnostics, automated insight, reasoning, time series prediction, and decision making under uncertainty. It uses a directed graph as its intrinsic representation. A Bayesian Network is a Directed Acyclic Graph (DAG) whose nodes represent the random variables X1, X2, ... It represents a joint distribution (via the chain rule for Bayesian networks). Bayes-nets explicitly encode the dependencies between variables to model joint distributions. They are particularly useful because they provide a compact representation of practically arbitrary distributions, and efficient algorithms exist to sample from and perform inference over the joint distribution.

  • It takes the idea of uncertainty and marries it with efficient structures, so one can easily see which uncertain variables influence other uncertain variables.
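A tiny illustration of the compactness claim, sketched on the classic burglary/alarm network that also appears in the inference section below: for n binary variables the full joint needs 2^n - 1 numbers, while the Bayes-net only needs one number per CPT row.

```python
# Parents per node in the classic alarm network: B and E are roots,
# Alarm has parents {B, E}, and each phone call has parent {Alarm}.
n_parents = {'Burglary': 0, 'Earthquake': 0, 'Alarm': 2,
             'JohnCalls': 1, 'MaryCalls': 1}

full_joint = 2 ** len(n_parents) - 1                 # 31 free parameters
bayes_net = sum(2 ** k for k in n_parents.values())  # 1+1+4+2+2 = 10
print(full_joint, bayes_net)
```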

A2> Bayesian Network Reasoning Patterns

> Flow of influences...the V-Structure is interesting!

> Find the parameters...

  • (1) single result...(a single likelihood)

  • (2) multiple results...(multiple likelihoods)

  • (3) between results ? P( likelihood_A | likelihood_B ) ... important!!!

    If we know the prior, its likelihoods are independent because of the conditional independence... so?

    • There goes the magic!! If we treat the prior as the hidden latent variable, then dealing with the features at hand becomes super simple, since they are independent of each other!
  • (4) multiple parameters..latent variables?...(multiple priors)

    This is also important. Using this V-Structure, we can make arbitrary features interlace with each other!

> Typical Conditional Independence: the SPLIT

> D-Separation and V-Structure

  • D-Blocking in ordinary structures: knowledge of any node on the influence stream splits the whole structure (the conditional independence gets activated).
  • Activation in V-Structures: The knowledge of "child" renders previously independent variables DEPENDENT (the dependence gets activated).

[Note] We don't have any observations here. We can see the V-Structure, but its child is not observed. D and S: are they really independent because of the inactivated child? A proof uses marginalization...

[Note] Independence assumption in Naive Bayes classification: given the class variable, each observed variable is independent of the other observed variables. In other words, if we observe the class variable, influence cannot flow between any of the other child variables. It's typical d-separation. But think: the class has not yet been observed, so isn't it naive to say all features are independent?

A3> Template Model (Dynamic Bayesian Network or Time Series Network)

  • Hidden Markov model (HMM)
  • Kalman filter (KFM)
  • Time series clustering, etc.

This system evolves over time ?

  • HMM: Although Hidden Markov Models can be viewed as a subclass of dynamic Bayesian networks, they have their own type of structure that makes them particularly useful for a broad range of applications.
    1. "Transition" Probability by time (b/w parents?)
    1. "Emission" Probability (for each child?)

What is a Template Model? As an extension of the language of graphical models, a Template Model is intended to deal with temporal processes where we have many replications over time. So it extends standard Bayesian networks with the concept of time. Template models are a convenient way of representing Bayesian networks that have a large amount of parameter sharing and structure (however, they are merely compact representations of a fully unrolled Bayesian network, so it's not to say they have any additional representational power).

  • Template Variable: these are the variables that we end up replicating many times, both within a single model and across models. The typical example is the locations (trajectory) of a robot over a time period. Conditional probability distributions in template models are often copied many times.
  • Template models are just the dependency models from template variables.
  • Template models can often capture events that occur in a time series.
  • Template models can capture parameter sharing within a model.

A4> Global / Local structure

[Note] Model the repetition! If we toss the same coin again and again, how do we model this repetition? Put a little box around that outcome variable; this box is called a plate (why? because it's a stack of identical pieces). A plate denotes that the outcome variable is indexed. Sometimes, in many models, we will include all parameters explicitly within the model. But often, when a parameter sits outside of all plates, we won't denote it explicitly; we just omit it.
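A minimal generative sketch of the plate idea (the values are made up): one shared parameter outside the plate, N indexed replicates inside it.

```python
import random

theta = 0.6     # shared coin bias: the parameter sitting outside the plate
N = 10          # plate size: the index runs over the repeated tosses

# The plate says the outcome variable is replicated N times, each toss
# drawn with the same shared theta.
tosses = ['H' if random.random() < theta else 'T' for _ in range(N)]
print(tosses)
```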

Plate dependency model has the following characteristics:

  • For a template variable A{X1, X2,...,Xk}indexed by X (ex. A: Intelligence, Xk: each student ID),
  • For each of those template variables, we have a set of template parents.
  • Each of the indices in the template parent variables B{U1, U2, ...} indexed by U, or C{V1, V2, ...} indexed by V, etc., has to be a subset of the indices in the child.

  • From the ground network above, we can see that A and B belong only to plate x, C belongs to x and y, D belongs to x and z, and E belongs to all 3. Moreover, there needs to be a direct edge from A to E. By letting us represent an intricate network of dependencies, these models capture very richly correlated structures in a concise way, encoding correlations across multiple objects and thereby allowing collective inference.

Local Structure ????

No more Tabular representations for CPD!!!!

Instead, there are several local structures (in parametric form) for conditional probability distributions, and we hope to reduce the parameter size using: 1. Deterministic structure, 2. Tree structure, 3. Logistic structure, 4. Noisy (OR/AND) structure, 5. Continuous structure.

This is about the story of crazy V-Structures

... of Conditional Probability Distribution

a) Basic context-specific CPD

<> Context-Specific Independent structure: the independence statement between two variables (X, Y) only holds for particular values of the conditioning variable c. So the dependence only happens in a certain context.

b) Tree Structured CPD

<> Context-Specific Independent tree: the independence statement between two variables (X, Y) only holds for particular values of the conditioning variable C. So the dependence only happens in a certain context.

<> Non-Context-Specific Independent tree (Multiplexer Model): as an additional structure, a multiplexer (as a parent, a selector) activates the V-structure. This dramatically reduces the number of parameters.

  • Application of the Multiplexer Tree
    • The Multiplexer Tree is very useful! It comes up in physical hardware configuration settings. It turns out that all of the troubleshooters that are part of the Microsoft operating system are built on top of Bayesian network technology. The task is to figure out why a printer isn't printing. We have a variable that tells us whether the printer is producing output, and that depends on a variety of factors; one of them is where the printer input is coming from: a local transport or a network transport? Depending on which of those it is, a different set of failures might occur. The variable that serves the role of the selector (multiplexer) is "print data out", and it is the root of the tree used here: depending on whether the print location is local or not, you depend either on properties of the local transport or on properties of the network transport. It turns out that even in this very simple network, the use of tree CPDs reduces the number of parameters from 145 to about 55, and makes the elicitation process much easier.

c) Noisy(Or/And) Structured CPD

What if there are many factors that each contribute something to the probability of exhibiting a phenomenon such as a "cough"?

<> Noisy-OR cpd can capture such interaction.

  • It is a larger graphical model where we break down the dependency of Y on its parents (X1 up to Xk) by introducing a bunch of intervening variables Z1, ..., Zk that act as noisy transmitters or filters. Y becomes true only if at least one of these filters succeeds in making it true.
  • Each Z is a probability assignment machine: even a single Z that turns out to be true can activate Y...
  • Then what is the probability that all of these Z guys fail to turn on the variable Y? When does Y keep sleeping? First, it must not get turned on by the leak (λ0 = P(Y gets activated by itself)), and then none of the Zi may fire: P(Y=0 | X1,...,Xk) = (1 - λ0) * ∏_{i: Xi=1} (1 - λi).

  • **Therefore,... given that Y = 0, we know that all of Z1, ..., Zk are 0; that blocks the trail of influence through the V-structure... so all the parent features are independent!
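A minimal noisy-OR CPD sketch (the λ values are made up for illustration):

```python
def noisy_or_p_y1(x, lam, lam0):
    """P(Y=1 | x) = 1 - (1 - lam0) * prod over active parents of (1 - lam_i).

    lam0 is the leak (Y turns on by itself); lam[i] is the probability
    that an active parent X_i succeeds in turning Y on.
    """
    p_all_fail = 1.0 - lam0
    for xi, li in zip(x, lam):
        if xi:
            p_all_fail *= 1.0 - li
    return 1.0 - p_all_fail

lam = [0.8, 0.5, 0.3]                            # per-cause activation probabilities
print(noisy_or_p_y1([1, 0, 1], lam, lam0=0.01))  # 1 - 0.99*0.2*0.7 ≈ 0.861
```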

<> Noisy-AND (Noisy_Max) cpd can capture such interaction.

  • This is called independence of causal influence because it assumes that you have a bunch of causes and each of them acts independently to affect the truth of that child variable. so, there's no interactions between the different causes.
  • Each Zi has its own separate mechanism, and ultimately everything is aggregated together in a single variable Z, from which the truth of Y is then determined by this aggregate effect.

  • **Therefore,... given that Y = 0, it does not follow that all features are independent ???

d) Logistic Structured CPD

  • In a sigmoid CPD, each discrete Xi induces a continuous Zi which represents Wi*Xi (convert discrete into continuous).

  • Wi parameterizes this edge and tells us how much force Xi is going to exert on making Y true; if Wi is zero, it tells us that Xi exerts no influence whatsoever.

    • (+) Wi: Xi is going to make Y more likely to be true
    • (-) Wi: Xi is going to make Y less likely to be true.
  • the final Z adds up all of these different influences + an additional bias term W0.

  • Then we turn this ultimately into the probability of the target variable Y by passing this continuous quantity Z through a Sigmoid Function (a squashing function).

    • Since e to the power of Z is a positive number, this gives us a number that is always in the interval (0, 1).
  • **Therefore, Y is activated only when enough observed features push the weighted sum past a certain threshold!
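A minimal sigmoid CPD sketch (the weights are made up):

```python
import math

def sigmoid_cpd(x, w, w0):
    """P(Y=1 | x) = sigmoid(w0 + sum_i w_i * x_i)."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))   # Z aggregates the influences
    return 1.0 / (1.0 + math.exp(-z))               # squash into (0, 1)

w = [2.0, -1.0, 0.0]   # (+) pushes Y toward true, (-) away, 0 = no influence
print(sigmoid_cpd([1, 1, 1], w, w0=-0.5))   # sigmoid(0.5) ≈ 0.62
```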

e) Continuous Structured CPD

  • When our network involves continuous variables...
    • Let's imagine that we have a continuous temperature variable and we have a sensor. Now thermometers aren't perfect, so what we would expect is that the sensor is around the right temperature but not quite. And so, one way to capture that is by saying that the sensor follows a normal distribution.
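A one-line linear Gaussian CPD sketch (the temperature and noise level are made-up values):

```python
import random

true_temp = 21.0    # the continuous parent variable
sensor_sd = 0.5     # thermometer noise: made-up standard deviation

# Sensor ~ N(true_temp, sensor_sd^2): around the right temperature, not quite.
reading = random.gauss(true_temp, sensor_sd)
print(reading)
```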

A5> Decision Theory:

Decision making under uncertainty

a) Maximum Expected Utility

  • Expected Utility: E[U]

  • Maximum E[U] Decision Rule

  • T.B.D


(B) Inference for Bayes-Net

  • Q1. How to answer questions such as "Given some inputs, what are the outputs?" The answer is going to be a complete joint distribution over the query variables...we call it posterior. This is what we are after.
  • Q2. Out of all the possible values for all the query variables, which combination of values has the highest probability? Which Q values are most probable given the evidence values?
  • Q3. One great thing about a Bayes-net is that we are not restricted to going only in one direction; we can reverse the causal flow. The Q can become evidence values and the E can become query values. Likewise, one Q and E can become evidence values and the other Q and E can become query values. Any unspecified variable becomes a hidden variable.
    • If Mary has called to report(it's true) that "Hey! the alarm is going off!!!"(don't know if it's true or not), we want to know if there has been a burglary.

?????????????????????????????????????????????????????????????????????????????????

Let's see how to operationalize Bayes-Net to answer such conditional probability queries!

The main issue is...

  • How to Compute Marginals?
  • How to Compute MAP assignments?

  • <1> Using Variable Elimination with Marginalization
  • <2> Using Message passing over graph with MAP assignment (for special purpose?)
    • [Note] MAP inference claims a single coherent assignment of highest probability, which is not the same as maximizing the individual marginal probabilities.
    • Max-product-Belief Propagation
    • ....
  • <3> Using Random Sampling with Marginalization

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

> Inference_01: Enumeration and Variable Elimination

Let's say...Go through all possibilities, add them up and come up with an answer.

  • State the problem and ask the question: "If Mary and John have called to report that the alarm is going off, we want to know if there has been a burglary."

  • First, take a conditional prob(posterior) and rewrite it as unconditional prob(a bunch of jointsss).

  • Next, enumerate all the atomic probabilities and calculate the sum of products.

    • In the numerator, this joint can be determined by enumerating all hidden variable values (making the sum over these hidden variables).

    • To obtain the values of these atomic events, rewrite the equation in a form that corresponds to the conditional probability tables of this Bayes-net. Rewrite it w.r.t. the parents of each of the nodes in the network, so the form of the unfolded equation heavily depends on the graph!

    • The equation can be understood as a function of hidden variables and the answer can be the sum of all these function outputs for all values of the hidden variables. So it's a sum of such terms and each of which is a product of "factors" in the previous equation.

  • Do the same thing with the denominator.

  • Once we finish enumerating all four combinations of the hidden variables (earthquake and alarm), the result says P(burglary = true | John calls, Mary calls) ≈ 0.284.
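Here is a minimal enumeration sketch; the CPTs below are the classic Russell & Norvig alarm-network numbers, which this example appears to follow since they reproduce the same 0.284:

```python
P_B = 0.001                       # P(Burglary)
P_E = 0.002                       # P(Earthquake)
P_A = {(True, True): 0.95,        # P(Alarm | Burglary, Earthquake)
       (True, False): 0.94,
       (False, True): 0.29,
       (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}   # P(JohnCalls | Alarm)
P_M = {True: 0.70, False: 0.01}   # P(MaryCalls | Alarm)

def joint(b, e, a, j=True, m=True):
    """P(b, e, a, j, m) via the chain rule over the DAG."""
    pb = P_B if b else 1 - P_B
    pe = P_E if e else 1 - P_E
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return pb * pe * pa * pj * pm

# Numerator: sum out the hidden variables E and A with B fixed to true.
num = sum(joint(True, e, a) for e in (True, False) for a in (True, False))
# Denominator: additionally sum over B.
den = num + sum(joint(False, e, a) for e in (True, False) for a in (True, False))
print(num / den)   # ≈ 0.284
```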

To speed up, we can eliminate some variables...

  • Step 1> Make a joint

  • Step 2> Marginalize the joint...keep making the "single node" until it gives an answer:

Variable Elimination

pushing terms through the summation..

  • Find a factor (joint) product -> then... marginalize (a minimal sketch of these two operations appears after this list)

  • Graph-based perspective : TBD

  • Elimination ordering? Which variable should be removed first? : TBD
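A minimal sketch of the two factor operations that variable elimination repeats (factor product, then summing a variable out); the representation and example numbers are mine, not the repo's:

```python
from itertools import product

class Factor:
    def __init__(self, variables, table):
        self.variables = variables   # tuple of variable names
        self.table = table           # {assignment tuple: value}

def multiply(f1, f2):
    """Pointwise product of two factors over the union of their variables."""
    variables = f1.variables + tuple(v for v in f2.variables if v not in f1.variables)
    table = {}
    for assignment in product([False, True], repeat=len(variables)):
        row = dict(zip(variables, assignment))
        v1 = f1.table[tuple(row[v] for v in f1.variables)]
        v2 = f2.table[tuple(row[v] for v in f2.variables)]
        table[assignment] = v1 * v2
    return Factor(variables, table)

def sum_out(f, var):
    """Eliminate `var` from a factor by summing over its values."""
    keep = tuple(v for v in f.variables if v != var)
    table = {}
    for assignment, value in f.table.items():
        row = dict(zip(f.variables, assignment))
        key = tuple(row[v] for v in keep)
        table[key] = table.get(key, 0.0) + value
    return Factor(keep, table)

# e.g. P(a) * P(b|a) -> P(a,b), then sum out a -> P(b):
f_a = Factor(('a',), {(False,): 0.4, (True,): 0.6})
f_b_a = Factor(('b', 'a'), {(False, False): 0.9, (True, False): 0.1,
                            (False, True): 0.2, (True, True): 0.8})
print(sum_out(multiply(f_a, f_b_a), 'a').table)   # {(False,): 0.48, (True,): 0.52}
```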

> Inference_02: MAP

: TBD

> Inference_03: Approximation Method by means of "Sampling"

To deal with the joint (or to estimate the joint), we actually run pseudo-experiments?

(+) Sampling has an advantage over inference

  • In that we know a procedure for coming up with at least an approximate value for the Joint.
  • In that although we don't have the conditional probability tables, we can still simulate the process.

"Sampling" means we can get a posterior at the end!

We can do sampling:

Draw samples then average them out. You have a machine dude!

Let's say we have a bunch of inputs X that follow a certain distribution. We want the "expected value of the outputs", E[f(X)], but how? Using a computer, we plug in all the inputs and then average out the outputs f(X). If we know the input distribution, we sample each input from that known distribution and plug them in. But what if we have no knowledge of the input distribution of X?
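The known-distribution case is just sample-and-average; a minimal sketch (f and the distribution are arbitrary choices of mine):

```python
import random

# Estimate E[f(X)] for X ~ N(0, 1) by sampling and averaging.
f = lambda x: x * x   # E[X^2] = 1 for a standard normal, so ~1.0 expected
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]
print(sum(f(x) for x in samples) / len(samples))
```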

  • What we can do is think of some alternative distribution, then draw samples from it. Interestingly, we can later correct the samples from the alternative (wrong) distribution! And sometimes we can even do better than using the real distribution of X at the end of the day! HOW?

  • 1) Importance Sampling... is not really sampling but a variant of the Monte Carlo approximation method.

    • Let's say we have E[f(x)] = ∫ f(x)*P(x) dx, where P(x) is the pdf of the input x.
    • We do something weird with a q(x) that we call the "proposal distribution". And it's entirely your choice.
    • Treat f(x)*P(x) / q(x) as a new objective function that we want to approximate, with q(x) as the pdf under the new objective (so from now on, our samples come from q(x) and not from P(x)). Then we can perform the Monte Carlo approximation on the new objective function, and the result is exactly what we sought in the beginning: E_q[f(x)P(x)/q(x)] = E_P[f(x)].
    • We know f(x), of course. But do we know P(x), where the samples X come from? No! We cannot compute P(x), so we cannot plug the original density into the Monte Carlo computation... perhaps we do not know the exact P(x), but sometimes we know something proportional to P(x), i.e., P(x) without its normalizing constant.

      • Let's write P(x) = P̃(x)/z_p and q(x) = q̃(x)/z_q, where we can evaluate the unnormalized P̃ and q̃.
      • If we put them in, it looks nice, except that we don't know the normalization constants z_p, z_q.
      • So we don't know z_p and z_q, but luckily we can approximate their ratio with another Monte Carlo approximation, using the "importance weight" w(x) = P̃(x)/q̃(x)!
      • Thus this is a Monte Carlo estimation within a Monte Carlo estimation. (A self-normalized sketch appears at the end of this list.)
  • 2) Smirnov's Inverse Transform sampling

    If we have a random variable and want to generate samples from it (or from its pdf), first you need the CDF!

    • a) When CDF of your data distribution is invertible: inverse CDF(unif(0,1)) is your sample!
    • b) When CDF of your data distribution is not invertible: playing with "intervals"..
  • 3) Rejection Sampling

    It also applies in higher dimensions.

    • a) Draw samples using a Uniform distribution on a "set A": the discrete case!

      • The "set A" is very complicated, but we know a certain rectangle boundary that houses our "set A".
      • First, draw a bunch of samples uniformly from the larger known rectangle, then keep only the samples that land inside "set A" and reject the rest.
    • b) Draw samples using a Non-Uniform distribution on a "pdf A": the continuous case!

      • The "pdf A" is very complicated. If we don't know "pdf A", we can first work with an unnormalized version proportional to it, "pdf A_tilde". We additionally choose a proposal distribution "q" for the actual sampling (it gives the boundary for uniform sampling), and "pdf A_tilde" then works as the constraint to reject/accept the samples.
  • 4) MCMC and Gibbs Sampling for multiple parameter inference

    Sampling can help a Neural Network model?
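Before moving on to MCMC, here is a self-normalized importance sampling sketch for item 1) above (the toy target, proposal, and f are my assumptions): dividing by the weight sum is exactly the inner Monte Carlo estimate that replaces the unknown normalizing constant.

```python
import math
import random

p_tilde = lambda x: math.exp(-0.5 * x * x)   # unnormalized N(0,1) target: z_p unknown
q_sd = 2.0                                   # proposal q = N(0, 2), easy to sample
q_pdf = lambda x: math.exp(-0.5 * (x / q_sd) ** 2) / (q_sd * math.sqrt(2 * math.pi))
f = lambda x: x * x                          # we want E_P[f(X)] = 1

xs = [random.gauss(0.0, q_sd) for _ in range(100_000)]
w = [p_tilde(x) / q_pdf(x) for x in xs]      # importance weights P̃/q

# Self-normalized estimate: sum(w * f) / sum(w); the denominator handles
# the unknown z_p (Monte Carlo within Monte Carlo).
print(sum(wi * f(x) for wi, x in zip(w, xs)) / sum(w))   # ≈ 1.0
```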

Markov Chain

When x ~ P(x) with P(x) unknown or intractable, we hypothetically sample from a proposal distribution q(x) (the Monte Carlo part) and let the chain of samples approximate the stationary distribution P(x) (the Markov Chain part). In other words, instead of trying to deal with intractable computations involving the posterior, we can draw samples from some easy distribution and feed them to a known function g(x) that is proportional to P(x) (e.g. the joint). Next, using the ratio of the function outputs (g(x1)/g(x0), the Metropolis-Hastings ratio α) together with the random walk's transition probability distribution, we decide which samples to keep for the nasty P(x).
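A minimal random-walk Metropolis sketch (the unnormalized target and step size are my choices; with a symmetric proposal, the Hastings correction cancels):

```python
import math
import random

g = lambda x: math.exp(-0.5 * x * x)     # unnormalized target, g(x) ∝ P(x)

x, chain = 0.0, []
for _ in range(50_000):
    x_new = x + random.gauss(0.0, 1.0)   # random-walk proposal
    alpha = g(x_new) / g(x)              # Metropolis-Hastings ratio
    if random.random() < alpha:          # accept with probability min(1, alpha)
        x = x_new
    chain.append(x)

burned = chain[5_000:]                   # drop burn-in
print(sum(burned) / len(burned))         # ≈ 0.0, the target mean
```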

Metropolis > Metropolis-Hastings > Gibbs

When we have multiple parameters, P(θ,ϕ|y) ∝ g(θ,ϕ), we first draw θ while pretending we know the value of ϕ, by substituting in its current value from the current iteration of the Markov chain; then we draw ϕ using the θ we just drew. Once we've drawn both θ and ϕ, that completes one iteration, and we begin the next iteration by drawing a new θ. This idea of one-at-a-time updates is what we call Gibbs sampling.
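A minimal Gibbs sketch for two parameters (a toy target of my choosing: a bivariate normal with correlation ρ, whose full conditionals are θ|ϕ ~ N(ρϕ, 1-ρ²) and ϕ|θ ~ N(ρθ, 1-ρ²)):

```python
import random

rho = 0.8
sd = (1 - rho ** 2) ** 0.5

theta, phi, chain = 0.0, 0.0, []
for _ in range(20_000):
    theta = random.gauss(rho * phi, sd)   # draw theta given the current phi
    phi = random.gauss(rho * theta, sd)   # draw phi given the new theta
    chain.append((theta, phi))            # one full iteration completed

burned = chain[2_000:]
print(sum(t * p for t, p in burned) / len(burned))   # E[theta*phi] = rho ≈ 0.8
```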


(C) Learning Bayes-Net from data

00 > The Goal: Knowledge discovery

  • In Structure:

    • Check_1. directionality of edges ?
    • Check_2. independencies ?
    • Check_3. presence and location of hidden variables ?
    • Check_4. structure overfitting ?
      • How to penalize the structure complexity? The more complex the structure, the higher its training likelihood will be.
  • In Prediction on new instance:

    • Check_1. Estimated model parameter overfitting ?
      • it can overfit to random noise in training data
      • Use regularization....

Why Bayes-Net for Knowledge discovery?

  • we can exploit correlations between several predicted variables.
  • In the decision analysis framework, traditional point estimation approaches have often produced faulty risk assessment results because they ignore the associated uncertainty. The key lies in making the uncertainty range (confidence interval) explicit, and this requires us to obtain a full parameter distribution. The Bayesian method allows us to estimate a full parameter distribution, using the joint distribution of the assumed model parameters (prior) and the given observations (likelihood).
  • Incorporate prior knowledge into our model: the Bayesian method derives insight from the synthesis of the assumed knowledge about the parameters (prior) and the new information from the observed data (likelihood), combining observations with experts' knowledge of the most likely parameter values.
  • We can perform multiple tasks using a single model. This makes it an efficient framework for knowledge discovery.

01 > Parameter Learning

Approach 01. MLE:

MLE tries to estimate the best parameter while optimizing the likelihood of the data, given the parameters.

T.B.D

Approach 02. Bayesian Estimation

MLE is flawed...

Limitations of MLE

  • [Case]: "Toss a coin and H came 7 times out of 10 tosses. Then P(H)=0.7" which is the θ.
    • New context:
      • A and B fight a duel. A won 7 out of 10 matches. Then P(A)=0.7 which is the θ.
    • Larger samples:
      • Toss a coin and H came 70,000 times out of 100,000 tosses. Then P(H)=0.7 which is the θ.
  • MLE has absolutely no ability to distinguish between these three scenarios above.
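A small sketch of how Bayesian estimation does distinguish them (assuming a uniform Beta(1,1) prior, my choice): the posterior mean is similar, but the posterior uncertainty is wildly different.

```python
def beta_posterior(heads, tails, a=1.0, b=1.0):
    """Posterior Beta(a+heads, b+tails): return (mean, standard deviation)."""
    a_post, b_post = a + heads, b + tails
    mean = a_post / (a_post + b_post)
    var = a_post * b_post / ((a_post + b_post) ** 2 * (a_post + b_post + 1))
    return mean, var ** 0.5

print(beta_posterior(7, 3))            # mean ≈ 0.667, sd ≈ 0.131  (very uncertain)
print(beta_posterior(70_000, 30_000))  # mean ≈ 0.700, sd ≈ 0.0014 (very sharp)
```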

Bayesian Estimation in PGM offers some better properties.

  • First, θ becomes a random variable. In the Bayesian framework, anything we are uncertain about can be viewed as a random variable over which we have a distribution that is updated over time as data is obtained. This is the heart of the Bayesian formalism.

    • In MLE, θ is fixed (or given), and all tosses (data points) are independent.
    • In BE, θ is unknown, so we cannot assume all tosses are independent... they are marginally dependent through their unknown mommy θ (without being given θ).
    • This gives us a joint probabilistic model (over all data points and θ alike).
    • what about Bayesian prediction on new instances ?
      • For example, let's say we have a multinomial data distribution; it then comes with a Dirichlet prior distribution (its conjugate).
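For the Dirichlet-multinomial pair, the Bayesian prediction on a new instance has a closed form: P(next = k | data) = (α_k + n_k) / (Σα + N). A tiny sketch (the counts and α values are made up):

```python
alpha = [1.0, 1.0, 1.0]   # Dirichlet prior pseudo-counts
counts = [5, 3, 2]        # observed category counts, N = 10

total = sum(alpha) + sum(counts)
predictive = [(a + n) / total for a, n in zip(alpha, counts)]
print(predictive)         # [6/13, 4/13, 3/13]
```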

Bayesian Estimation in Bayes-Net

  • Case 01.

  • Case 02.

In a Bayes-Net, if the model has independent priors, then (given complete data) its posteriors are also independent.


02 > Structure Learning (Model Selection?)

  • Case_01: Answering specific queries by taking data and learning the most significant dependencies...but we don't know the structure.

  • Case_02: Discovering the interrelationships between the variables so that we understand our whole network better... but we don't know the structure.

  • What's the Problem?

  • Solution: Scoring-based Learning (Optimization?)

    • We can define a scoring function that evaluates how well the network structure matches your data.

Scoring Criteria

    1. Likelihood Score
    • Find the (model, θ) pair that maximizes the likelihood.
      • It computes the log-likelihood of the data relative to a certain structure, using the MLE parameters (already optimized) for that structure.
      • It is subject to the overfitting problem for sure...
      • Which structure shows the bigger score?
    2. BIC, Asymptotic Consistency (sketched below)
    3. Bayesian Score
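A sketch of the BIC idea named above (the log-likelihoods and parameter counts below are invented for illustration; in practice they come from fitting each candidate structure): the MLE log-likelihood minus a (log N)/2 penalty per free parameter.

```python
import math

def bic_score(loglik, n_params, n_samples):
    """BIC-style structure score: log-likelihood minus complexity penalty."""
    return loglik - 0.5 * math.log(n_samples) * n_params

# A denser structure usually wins on raw likelihood but pays the penalty:
print(bic_score(loglik=-1200.0, n_params=12, n_samples=1000))  # sparse: ≈ -1241
print(bic_score(loglik=-1180.0, n_params=60, n_samples=1000))  # dense:  ≈ -1387
```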

Searching Structures with Optimization (Exploit Decomposability)

Input:

  • your data
  • scoring criteria selection
  • prior information
  • a set of possible structures

Output:

  • a structure that maximizes the score

03 > Learning with incomplete data

What's the occasion?

  • hidden variables? Unobservable?
  • simply missing values?

The presence of missing values or latent variables raises multiple challenges.

  • (1) Unknown mechanism of missingness
  • (2) Complexity of likelihood & Multiple Global Optima

(1) Unknown mechanism of missingness

With Latent variables and Model Sparsity

Having a latent variable sometimes gives an advantage.

  • Latent variables often give rise to sparser and therefore easier-to-learn models.
  • Latent variables often offer an interesting characterization of structure in the data... such as clusters.

With missing values

Having missing values gives a pain in the ass. H,T,H,?,T,?,?,?,T,H... how do we deal with "?"? If you don't know why these data are missing, you have no idea how to proceed...

  • Missing data mechanism:

    • [Case 01] with Random missing value

      • The observed target value Y depends on X and on O, but there is no interaction between X and O (D-Separation).
    • [Case 02] with Non-Random missing value

      • By comparison, in this case the true value of the X variable affects O (whether it gets observed or not).
  • In the case of missing at random (MAR) we can ignore the missing data mechanism and focus only on the likelihood that we get to observe.

    • The joint over (x, y, o) has the property that the observation variables O are independent of the missing X's. This means that if you tell me the Y values you observe, then the fact that some X values may or may not have been observed carries no additional information.

(2) Complexity of likelihood & Multiple Global Optima

  • Let's say our hidden variable has two possible outcomes; then our likelihood function can have two global maxima. With many hidden variables, there can be an exponential number of global maxima. This can occur with missing data (not only hidden variables) as well.

  • parameters start being correlated with each other....

Ok, then how to go about doing parameter estimation in the context of the missing data?


EM Algorithm (Likelihood Optimization)

In the context of missing data, the likelihood function looks like a nasty, complicated multi-modal function, which is impossible to optimize in closed form. There are two general classes of strategies.

  • The first is to use a generic optimization method, such as gradient ascent. We might start out with a particular point in the parameter space, and we compute the gradient at that point, and we go in the direction of steepest ascent, which might take us to the next point, at which point we compute the gradient, and we continue until we reach a local optimum of the function that we're trying to optimize.

    • To improve the rate of convergence of these algorithms, methods such as line search or conjugate gradients basically get us faster to whatever local optimum we were going to end up at...
    • We need to run inference over each data instance at every iteration, so it is very costly...
  • The second strategy is the EM algorithm, which is specifically geared toward likelihood function optimization, so it's a special-purpose algorithm. The intuition is that if we had the full set of parameters, then computing the probability of the missing data would be easy, in the sense that it can be done using straight probabilistic inference. We have two sets of things that we need to infer or estimate: one is the parameters, and the other is the missing data. We start out with one set of parameters, use it to estimate the values of the missing data, then use that completion to re-estimate the parameters, and so on and so forth until we reach convergence.

    • First, pick a starting point for parameter value.
    • Next iterate:
      • [Step_E]: "complete" the data (soft completion using probabilities), using the current parameters.
      • [Step_M]: Re-estimate the parameters relative to the previous data completion. (A toy sketch appears after this list.)
  • The convergence of the likelihood function is not the same as convergence of the parameters and so we have to decide which one we actually care about.

  • EM achieves only a local optimum of the log-likelihood function. Local optima are unavoidable, and they increase with the amount of missing data... They decrease with the total amount of data, but not if we have hidden variables, at which point you're stuck with them no matter how much data you have.

  • Perhaps the most common use of the EM algorithm is for the problem of Learning with Latent Variables. These are situations where there is a certain set of variables that we never get to observe, but they turn out to be important for capturing some latent structure in the distribution of the data instances that we have. Perhaps the simplest example is in the context of clustering, where we have some kind of latent class variable and a set of observed features, and we're trying to segment the instances into categories, assuming that each category has a similar distribution over the observed features. The features become independent of each other once we're given the class. Latent variables satisfy MAR, so we can use EM.
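A toy EM sketch on the coin sequence from earlier (H,T,H,?,T,?,?,?,T,H): the E-step softly completes each "?" using the current θ, and the M-step re-estimates θ from the completed counts. In this simple missing-at-random case EM just converges to the observed-data MLE (3 heads out of 6 observed tosses = 0.5), which makes it a handy sanity check of the loop.

```python
data = ['H', 'T', 'H', '?', 'T', '?', '?', '?', 'T', 'H']
obs_heads = data.count('H')     # 3 observed heads
n_missing = data.count('?')     # 4 missing tosses

theta = 0.2                     # arbitrary starting point
for _ in range(50):
    expected_heads = obs_heads + n_missing * theta   # E-step: soft completion
    theta = expected_heads / len(data)               # M-step: re-estimate
print(theta)                    # converges to 3/6 = 0.5
```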
