: It's a probabilistic model in which a graph expresses the conditional dependence structure between random variables.
: Understanding what Conditional Independence between variables means helps us explain some uncertainty reasoning. If x and y are conditionally independent given z, then
- P(x | y, z) = P(x | z)
- P(x, y | z) = P(x | z) * P(y | z)
- P(x, y, z) ∝ ϕ(x, z) * ϕ(y, z)
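A quick numerical check of the last identity, as a minimal sketch: build a joint from two factors ϕ(x, z) and ϕ(y, z) (the factor values below are made up), normalize it, and verify that P(x, y | z) factorizes as P(x | z) * P(y | z).

```python
import numpy as np

# Arbitrary positive factors phi(x, z) and phi(y, z) over binary variables.
phi_xz = np.array([[2.0, 1.0],   # phi_xz[x, z]
                   [3.0, 5.0]])
phi_yz = np.array([[4.0, 1.0],   # phi_yz[y, z]
                   [1.0, 2.0]])

# Joint P(x, y, z) proportional to phi(x, z) * phi(y, z).
joint = np.einsum('xz,yz->xyz', phi_xz, phi_yz)
joint /= joint.sum()

for z in (0, 1):
    p_xy_given_z = joint[:, :, z] / joint[:, :, z].sum()
    p_x_given_z = p_xy_given_z.sum(axis=1)
    p_y_given_z = p_xy_given_z.sum(axis=0)
    # Conditional independence: P(x, y | z) factorizes into the marginals.
    assert np.allclose(p_xy_given_z, np.outer(p_x_given_z, p_y_given_z))
```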
: PGMs allow tractable inference!
- Let's say we want to classify data points that are independent of each other (for example, given an image, predict whether it contains a cat or a dog). But consider the words `I`, `like`, `machine`, `learning`... The problem here is that `learning` could be a noun or a verb depending on its context. A probabilistic graphical model is a powerful framework that can be used to learn models with such dependencies.
- A PGM consists of a graph structure. Each `node` of the graph is associated with a random variable, and the `edges` in the graph encode relations between the random variables. Because of the way the graph structure encodes the `parameterization` of the probability distribution:
  - It gives us an intuitive and compact data structure for capturing high-dimensional probability distributions.
  - It gives us a suite of methods for efficient reasoning, using general-purpose algorithms.
  - It gives us a reduction in the number of parameters, so we can represent these high-dimensional probability distributions efficiently using a very small number of parameters (see the sketch below).
- Reduce the factors down. In the graph, drop as many arrows as possible based on conditional independence!
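To make the parameter-count claim concrete, a small sketch under a hypothetical sparsity assumption: a full joint over n binary variables needs 2^n - 1 free numbers, while a Bayes net in which each node has at most k parents needs at most n * 2^k.

```python
# Parameter counts: full joint table vs. a Bayes net where each node
# has at most k parents (a hypothetical sparsity assumption).
def full_joint_params(n: int) -> int:
    return 2 ** n - 1  # one free number per joint assignment

def bayes_net_params(n: int, k: int) -> int:
    # Each node stores P(X_i = 1 | parents): one number per parent assignment.
    return n * 2 ** k

print(full_joint_params(20))    # 1048575
print(bayes_net_params(20, 3))  # 160
```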
Depending on whether the graph is directed or undirected, we can classify graphical models into two categories.
- Directed graphical models: Bayesian Networks
- Undirected graphical models: Markov Networks
Definition: Given X ~ P(X) represented as an ordered DAG, we build a joint distribution P(X1, ..., Xn) = ∏i P(Xi | Parents(Xi)). Ideally each variable Xi would be independent of the others, but if some of them are hopelessly dependent and we can make them conditionally independent, then...
It defines probability distributions over graphs of random variables. Instead of enumerating all possible combinations of multiple random variables, it defines probability distributions that are inherent to each individual node. The definition of this joint distribution by using such factors has one great advantage: we can reduce the number of probability values (parameters) required.
- P(a,b,c) = P(a|b,c) P(b,c) = P(a|b,c) P(b|c) P(c)
- P(a,b,c) = P(a,b|c) P(c)
- P(a|b,C): it's mainly a conditional. From P(a|b) = P(a,b)/P(b), conditioning everything on C gives P(a|b,C) = P(a,b|C)/P(b|C).
- P(a,b|C): it's mainly a joint. From P(a,b) = P(a|b) P(b), conditioning everything on C gives P(a,b|C) = P(a|b,C) P(b|C).
- If | is already present, then simply add the extra C after it; otherwise add | and then add C.
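A quick numerical sanity check of the chain-rule identities above, on an arbitrary random joint:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary joint P(a, b, c) over three binary variables.
P = rng.random((2, 2, 2))
P /= P.sum()

Pc = P.sum(axis=(0, 1))     # P(c)
Pbc = P.sum(axis=0)         # P(b, c)
Pa_given_bc = P / Pbc       # P(a | b, c)
Pb_given_c = Pbc / Pc       # P(b | c)

# Chain rule: P(a,b,c) = P(a|b,c) P(b|c) P(c)
assert np.allclose(P, Pa_given_bc * Pb_given_c * Pc)
```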
- Example: T.B.D
Bayesian networks allow us to model causal relationships between variables, compensating for the lack of information provided by data. BNs can be used for a wide range of tasks including prediction, anomaly detection, diagnostics, automated insight, reasoning, time-series prediction, and decision making under uncertainty. A Bayesian Network uses a directed graph as its intrinsic representation: it is a Directed Acyclic Graph (DAG) whose nodes represent the random variables X1, X2, ... It represents a joint distribution (via the chain rule for Bayesian Networks). Bayes nets explicitly encode the dependencies between variables to model joint distributions. They are particularly useful because they provide a compact representation for practically arbitrary distributions, and efficient algorithms exist to sample and perform inference over the joint distribution.
- It takes the idea of uncertainty and marries it with efficient structures, so one can easily see which uncertain variables influence other uncertain variables.
- If we know the prior, its likelihoods are independent because of the conditional independence... so?
  - There goes magic!! If we see the prior as the hidden latent variable, then dealing with the features at hand becomes super simple, since they are independent of each other!
- This is also important. Using this V-Structure, we can make any features interlaced with each other!
  - D-Blocking in ordinary structures: the knowledge of any node on the influence stream will split the whole structure (the conditional independence gets activated).
  - Activation in V-Structures: the knowledge of the "child" renders previously independent variables DEPENDENT (the dependence gets activated).
[Note] Suppose we have no observation. We can see the V-Structure, but its child is not observed. D and S: are they really independent because of the inactivated child? Yes; the proof uses marginalization: summing the unobserved child out of P(D) P(S) P(child | D, S) leaves P(D) P(S). (A numerical check follows below.)
[Note] Independence assumption in Naive Bayes classification: given the class variable, each observed variable is independent of the other observed variables. In other words, if we observe the class variable, influence cannot flow between any of the other child variables. It's typical d-separation. But think! The class has not yet been observed, so isn't it naive to say all features are independent?
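A tiny numerical illustration of both notes, as a sketch with made-up CPTs for a v-structure A → C ← B: the parents are independent while the child is unobserved, and become dependent ("explaining away") once the child is observed.

```python
import numpy as np

# Hypothetical v-structure A -> C <- B, A and B marginally independent.
p_a = np.array([0.7, 0.3])                 # P(A)
p_b = np.array([0.6, 0.4])                 # P(B)
p_c_given_ab = np.array([[[0.9, 0.1],      # P(C | A=0, B=0)
                          [0.4, 0.6]],     # P(C | A=0, B=1)
                         [[0.3, 0.7],      # P(C | A=1, B=0)
                          [0.1, 0.9]]])    # P(C | A=1, B=1)

joint = p_a[:, None, None] * p_b[None, :, None] * p_c_given_ab  # P(A,B,C)

# Unobserved child: marginalizing C out recovers independence of A and B.
p_ab = joint.sum(axis=2)
assert np.allclose(p_ab, np.outer(p_a, p_b))

# Observed child C=1: A and B become dependent ("explaining away").
p_ab_given_c1 = joint[:, :, 1] / joint[:, :, 1].sum()
p_a_given_c1 = p_ab_given_c1.sum(axis=1)
p_b_given_c1 = p_ab_given_c1.sum(axis=0)
print(np.allclose(p_ab_given_c1,
                  np.outer(p_a_given_c1, p_b_given_c1)))  # False: dependent
```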
- Hidden Markov model (HMM)
- Kalman filter (KFM)
- Time series clustering, etc.
These systems evolve over time.
- HMM: Although Hidden Markov Models can be viewed as a subclass of dynamic Bayesian networks, they have their own type of structure that makes them particularly useful for a broad range of applications.
- "Transition" Probability by time (b/w parents?)
- "Emission" Probability (for each child?)
What is a Template Model? As an extension of the language of graphical models, a Template Model is intended to deal with temporal processes where we have many replications over time. So it extends standard Bayesian networks with the concept of time. Template models are a convenient way of representing Bayesian networks that have a high amount of parameter sharing and structure (however, they are merely compact representations of a fully unrolled Bayesian network, so it's not to say they have additional representational power).
- Template Variable: the variables that we end up replicating many times, within a single model as well as across models. The typical example is the locations (trajectory) of a robot over a given time period. A ConditionalProbabilityDistribution in template models can often be copied many times.
- Template models are just the dependency models built from template variables.
- Template models can often capture events that occur in a time series.
- Template models can capture parameter sharing within a model.
[Note] Model the repetition! If we toss the same coin again and again, how do we model this repetition? Put a little box around that outcome variable; this box is called a plate (why? because it's like a stack of identical pieces). A plate denotes that the outcome variable is indexed. Sometimes we will include all parameters explicitly within the model, but when a parameter sits outside of all plates, we often won't denote it explicitly; we just omit it.
Plate dependency models have the following characteristics:
- For a template variable A{X1, X2, ..., Xk} indexed by X (ex. A: Intelligence, Xk: each student ID), we have a set of template parents.
- Each of the indices in a template parent variable B{U1, U2, ...} indexed by U, or C{V1, V2, ...} indexed by V, etc., has to be a subset of the indices in the child.
- From the ground network above, we can see that A and B belong only to plate x, C belongs to x and y, D belongs to x and z, and E belongs to all 3. Moreover, there needs to be a direct edge from A to E. These models, by allowing us to represent an intricate network of dependencies across multiple objects, let us capture very richly correlated structures in a concise way, which enables collective inference.
No more tabular representations for the CPD!!!!
Instead,... there are several local structures (in a parametric form) of the Conditional Probability Distribution. And... we hope to reduce the parameter size, using: 1. Deterministic Structure?, 2. Tree Structure?, 3. Logistic Structure?, 4. Noisy (Or/And) Structure?, 5. Continuous Structure?
<> Context-Specific Independence structure: the independence statement between two variables (X, Y) only holds for particular values of the conditioning variable C, for example. So the dependence is only happening in a certain context.
<> Context-Specific Independence tree: a tree-structured CPD is a natural way to encode such context-specific independence, branching on the context variables.
<> Non-Context-Specific Independence tree (Multiplexer Model): as an additional structure, the multiplexer (as a parent, a selector) activates the V-structure. This will dramatically reduce the number of parameters.
- Application of the Multiplexer Tree
  - The Multiplexer Tree is very useful! It comes up in physical hardware configuration settings. It turns out that all of the troubleshooters that are part of the Microsoft operating system are built on top of Bayesian network technology. The task is to try and figure out why a printer isn't printing. So we have a variable that tells us whether the printer is producing output, and that depends on a variety of factors, but one of the factors it depends on is where the printer input is coming from: is it coming from a local transport or a network transport? Depending on which of those it's coming from, there's a different set of failures that might occur. So the variable that serves the role of the selector (multiplexer) variable is the variable `print data out`, and that's the root of the tree used here. Depending on whether the print location is local or not, you depend either on properties of the local transport or on properties of the network transport. And it turns out that even in this very, very simple network, the use of tree CPDs reduces the number of parameters from 145 to about 55, and makes the elicitation process much easier.
What if there are so many factors that each contribute something to the probability of exhibiting a phenomenon such as "cough"?
<> A Noisy-OR CPD can capture such interactions.
- It is a larger graphical model where we break down the dependencies of Y on its parents (X1 up to Xk) by introducing a bunch of intervening variables Zi that act as noisy transmitters or filters. So Y becomes true only if one of these filters succeeds in making it true.
- Z is a probability-assignment machine: even a single Z that turns out to be true can activate Y...
- Then what is the probability that all of these Z guys fail to turn on the variable Y? So, when does Y keep sleeping? First of all, when it doesn't get turned on by the leak (λ0 = P(Y gets activated by itself)). So that's (1 - λ0) × P(none of the Zi activate Y):
  - P(Y = 0 | X1, ..., Xk) = (1 - λ0) * ∏_{i : Xi = 1} (1 - λi)
- **Therefore,... given that Y = 0, we know that all the Zi (coming from the Xi) are 0; that blocks the trail of influence between the Xi... so all features are independent!
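A minimal noisy-OR CPD sketch implementing the formula above (the leak and per-cause parameters are invented for illustration):

```python
import numpy as np

def noisy_or(x, lam, leak):
    """P(Y = 1 | x) under a noisy-OR CPD.

    x    : binary vector of parent values X_1..X_k
    lam  : lam[i] = P(Z_i turns Y on | X_i = 1)
    leak : P(Y turns on by itself), i.e. lambda_0
    """
    p_y_off = (1.0 - leak) * np.prod(1.0 - lam[np.asarray(x) == 1])
    return 1.0 - p_y_off

lam = np.array([0.6, 0.3, 0.9])   # hypothetical cause strengths
print(noisy_or([1, 0, 1], lam, leak=0.01))
```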
<> A Noisy-AND (Noisy-Max) CPD can capture such interactions too.
- This is called independence of causal influence, because it assumes that you have a bunch of causes and each of them acts independently to affect the truth of the child variable; there are no interactions between the different causes.
- Each Zi has its own separate mechanism, and ultimately it's all aggregated together in a single variable Z, from which the truth of Y is then determined by this aggregate effect.
- **Therefore,... given that Y = 0, it's not to say all features are independent ???
- In a sigmoid CPD, each discrete Xi induces a continuous Zi which represents Wi*Xi (converting discrete into continuous).
- Wi parameterizes this edge and tells us how much force Xi is going to exert on making Y true; if Wi is zero, it tells us that Xi exerts no influence whatsoever.
  - (+) Wi: Xi is going to make Y more likely to be true.
  - (-) Wi: Xi is going to make Y less likely to be true.
- The final Z adds up all of these different influences plus an additional bias term W0: Z = W0 + Σi Wi*Xi.
- Then we turn this into the probability of the target variable Y by passing the continuous quantity Z through a sigmoid (squashing) function: P(Y = 1) = e^Z / (1 + e^Z).
  - Since e^Z is a positive number, this always gives a number in the interval (0, 1).
- **Therefore, Y gets activated only when the weighted sum of observed features exceeds a certain threshold!
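A sketch of the sigmoid CPD as described (the weights below are chosen arbitrarily):

```python
import numpy as np

def sigmoid_cpd(x, w, w0):
    """P(Y = 1 | x) = sigmoid(w0 + sum_i w_i * x_i)."""
    z = w0 + np.dot(w, x)              # aggregate continuous influence Z
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([2.0, -1.0, 0.5])         # hypothetical edge weights
print(sigmoid_cpd([1, 0, 1], w, w0=-1.0))  # ~0.82
```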
- When our network involves continuous variables...
- Let's imagine that we have a continuous temperature variable and we have a sensor. Now, thermometers aren't perfect, so what we would expect is that the sensor reads around the right temperature, but not quite. One way to capture that is to say that the sensor reading follows a normal distribution centered on the true temperature.
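As a sketch, this is a linear Gaussian CPD: the sensor reading S given the true temperature T is N(T, σ²), with the noise level invented here.

```python
import numpy as np

rng = np.random.default_rng(0)

true_temp = 21.0        # unobserved true temperature
sensor_noise = 0.5      # hypothetical sensor standard deviation

# Sensor model: P(S | T) = Normal(T, sensor_noise^2)
readings = rng.normal(loc=true_temp, scale=sensor_noise, size=5)
print(readings)         # scattered around 21.0
```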
Decision making under uncertainty
- Q1. How to answer questions such as "Given some inputs, what are the outputs?" The answer is going to be a complete joint distribution over the query variables... we call it the posterior. This is what we are after.
- Q2. Out of all the possible values for all the query variables, which combination of values has the highest probability? Which Q values maximize the probability given the evidence values?
- Q3. One great thing about Bayes nets is that we are not restricted to going in only one direction; we can reverse the causal flow. The Q can become evidence values and the E can become the query values. Likewise, one Q and E can become evidence values and the other Q and E can become query values. Any unspecified variable becomes a hidden variable.
  - If Mary has called to report (it's true) that "Hey! the alarm is going off!!!" (we don't know if that's true or not), we want to know if there has been a burglary.
The main issue is...
- How to Compute Marginals?
- How to Compute MAP assignments?
- <1> Using Variable Elimination with Marginalization
- <2> Using Message passing over graph with MAP assignment (for special purpose?)
- <3> Using Random Sampling with Marginalization
Let's say... go through all possibilities, add them up, and come up with an answer.
- State the problem and ask the question: "If Mary and John have called to report that the alarm is going off, we want to know if there has been a burglary."
- First, take the conditional probability (the posterior) and rewrite it as unconditional probabilities (a bunch of joints).
- Next, enumerate all the atomic probabilities and calculate the sum of products.
  - In the numerator, the joint can be determined by enumerating all hidden-variable values (making the sum over these hidden variables).
  - To obtain the values of these atomic events, rewrite the equation in a form that corresponds to the conditional probability tables of this Bayes net: rewrite it w.r.t. the parents of each of the nodes in the network, so the form of the unfolded equation heavily depends on the graph!
  - The equation can be understood as a function of the hidden variables, and the answer is the sum of this function's outputs over all values of the hidden variables. So it's a sum of such terms, each of which is a product of "factors" from the previous equation.
- Do the same thing with the denominator.
- Once we finish enumerating over all four hidden-variable possibilities (the joint assignments of the two hidden variables), we find that P(burglary = true | John, Mary) is 0.284.
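A runnable enumeration sketch. The CPTs below are the standard textbook values for the burglary/alarm network (Russell & Norvig), assumed here because they reproduce the 0.284 quoted above; the hidden variables are Earthquake and Alarm.

```python
import itertools

# Standard textbook CPTs for the burglary network (assumed here).
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=1 | B, E)
P_J = {True: 0.90, False: 0.05}                       # P(J=1 | A)
P_M = {True: 0.70, False: 0.01}                       # P(M=1 | A)

def joint(b, e, a, j, m):
    """Product of factors, one per node given its parents."""
    p = P_B[b] * P_E[e]
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

# Enumerate hidden variables E and A; evidence: J = M = True.
num = sum(joint(True, e, a, True, True)
          for e, a in itertools.product([True, False], repeat=2))
den = sum(joint(b, e, a, True, True)
          for b, e, a in itertools.product([True, False], repeat=3))
print(num / den)   # ~0.284
```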
Pushing terms through the summation...
- Graph-based perspective: TBD
- Elimination ordering? Which variable should be removed first?: TBD
: TBD
To deal with the Joint (or to estimate the Joint), we actually do pseudo-experiments?
(+) Sampling has an advantage over exact inference:
- In that we know a procedure for coming up with at least an approximate value for the Joint.
- In that even when we don't have the conditional probability tables, we can still simulate the process.
Draw samples, then average them out. You have a machine, dude!
Let's say we have a bunch of inputs X that follow a certain distribution. We want the "expected value of the outputs" E[f(X)], but how? Using a computer, we plug in all the inputs and then average the outputs f(X). If we know the input distribution, we sample each input from the known distribution and plug it in. But what if we have no knowledge of the input distribution of X?
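A minimal Monte Carlo sketch of E[f(X)] for the easy case where we can sample X directly (f and the distribution are arbitrary choices here):

```python
import numpy as np

rng = np.random.default_rng(0)

f = lambda x: x ** 2                                     # arbitrary f
samples = rng.normal(loc=1.0, scale=2.0, size=100_000)   # X ~ N(1, 4)

# Monte Carlo estimate of E[f(X)].
print(f(samples).mean())   # ~ E[X^2] = mu^2 + sigma^2 = 5
```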
- What we can do is to think of some alternative distribution, then draw samples from it! Interestingly, we can later correct the samples from the alternative (wrong) distribution! And sometimes we can even do better than using the real distribution of X at the end of the day! HOW?
  - Let's say we have E[f(x)] = ∫ f(x)*P(x) dx, where the pdf P(x) is the probability of the f(x) outcome.
  - We do some weird thing with q(x) that we call the "proposal distribution". And it's your baby.
  - Treat f(x)*P(x)/q(x) as a new objective function that we want to approximate, with q(x) as the pdf our samples now come from (so from now on, our samples come from q(x) and not from P(x)). And we can perform the Monte Carlo approximation on our new objective function! Since E_q[f(x)*P(x)/q(x)] = E_P[f(x)], the result is exactly what we sought in the beginning.
- We know f(x), of course. But do we know P(x), where the samples X come from?? No! We cannot compute P(x). We don't know the original samples to plug into f(x) for the Monte Carlo computation... Perhaps we do not know the exact P(x), but sometimes we know something similar to P(x)... but without the normalizing constant.
  - Let's say P(x) = P̃(x)/z_p and q(x) = q̃(x)/z_q. If we put them in, it looks nice, but let's say we don't know the normalization constants z_p and z_q.
  - So we don't know z_p and z_q, but luckily we can compute the ratio, because we can approximate the normalization-constant ratio with a Monte Carlo approximation, using the "importance weight" w(x) = P̃(x)/q̃(x)!
  - Thus this is a Monte Carlo estimation within a Monte Carlo estimation.
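A self-normalized importance sampling sketch under these assumptions: we can only evaluate an unnormalized target P̃(x), and we sample from a proposal q we fully know. The target and proposal below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target: p(x) ∝ exp(-(x - 3)^2 / 2); pretend z_p is unknown.
p_tilde = lambda x: np.exp(-0.5 * (x - 3.0) ** 2)

# Proposal q = N(0, 3^2): easy to sample, density known exactly.
xs = rng.normal(0.0, 3.0, size=200_000)
q_pdf = np.exp(-0.5 * (xs / 3.0) ** 2) / (3.0 * np.sqrt(2 * np.pi))

w = p_tilde(xs) / q_pdf        # importance weights (unnormalized)
f = lambda x: x                # estimate E_p[X]

# Self-normalization cancels the unknown constants z_p, z_q.
print(np.sum(w * f(xs)) / np.sum(w))   # ~3.0, the mean of the target
```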
- We have a random variable and want to generate samples from that random variable or pdf. First, you need the CDF!
  - a) When the CDF of your data distribution is invertible: inverse CDF(unif(0,1)) is your sample!
  - b) When the CDF of your data distribution is not invertible: play with "intervals"...
- It also applies in higher dimensions.
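An inverse-transform sampling sketch for an exponential distribution, chosen because its CDF inverts in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Exponential(rate): F(x) = 1 - exp(-rate*x), so F^{-1}(u) = -ln(1-u)/rate.
rate = 2.0
u = rng.uniform(0.0, 1.0, size=100_000)   # unif(0,1) samples
samples = -np.log(1.0 - u) / rate         # pushed through the inverse CDF

print(samples.mean())   # ~ 1/rate = 0.5
```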
- a) Draw samples using a Uniform distribution on a "set A": it's discrete!
  - The "set A" is very complicated, but we know a certain rectangular boundary that houses our "set A".
  - First, draw a bunch of samples from the larger set given by the known rectangular boundary, then keep only the ones that land inside "set A".
- b) Draw samples using a Non-Uniform distribution on a "pdf A": it's continuous!
  - The "pdf A" is very complicated. If we don't know "pdf A", we can first think of an unnormalized version (proportional to pdf A), "pdf A_tilda". We additionally choose a proposal distribution "q" for the actual sampling (it gives a boundary for uniform sampling). "pdf A_tilda" then works as the constraint to reject/accept the samples, as in the sketch below.
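A rejection-sampling sketch for the continuous case, with an invented unnormalized target and a uniform proposal over a bounding box:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target on [0, 1]: p_tilde(x) = x^2 * (1 - x)  (made up;
# this is the kernel of a Beta(3, 2) distribution).
p_tilde = lambda x: x ** 2 * (1.0 - x)

# Proposal q = Uniform(0, 1); M must bound p_tilde / q on the support.
M = 0.15   # p_tilde peaks at x = 2/3 with value 4/27 ≈ 0.148

xs = rng.uniform(0.0, 1.0, size=100_000)
us = rng.uniform(0.0, 1.0, size=100_000)
accepted = xs[us < p_tilde(xs) / M]   # accept with prob p_tilde(x)/(M q(x))

print(accepted.mean())   # ~ E[X] = 0.6 for Beta(3, 2)
```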
Can sampling help even a neural network model?
When x ~ an unknown P(x), we hypothetically sample from a proposal distribution q(x) (Monte Carlo), then approximate the stationary distribution E[P(x)] (Markov Chain). In other words, instead of trying to deal with intractable computations involving the posterior, we can get samples from some easy distribution, then feed these samples into our known joint function g(x). Next, using the ratio of the joint function outputs (g(x1)/g(x0) as the Metropolis-Hastings ratio α) as the random walk's transition/acceptance probability, we decide the final samples for the nasty P(x).
Metropolis > Metropolis-Hastings > Gibbs
When we have multiple parameters P(θ,ϕ|y) ∝ g(θ,ϕ), we first draw θ, pretending we know ϕ by substituting in its current value (the current iteration of the Markov chain); then we draw ϕ given that new θ. Once we've drawn both θ and ϕ, that completes one iteration, and we begin the next iteration by drawing a new θ. This idea of one-at-a-time updates is what we call Gibbs sampling.
- Check_1. Directionality of edges?
- Check_2. Independencies?
- Check_3. Presence and location of hidden variables?
- Check_4. Structure overfitting?
  - How to penalize the structure complexity? ... The more complex it is, the higher the training likelihood it will have.
- Check_1. Estimated model parameter overfitting?
  - It can overfit to random noise in the training data.
  - Use regularization...
  - We can exploit correlations between several predicted variables.
- In the decision-analysis framework, traditional point-estimation approaches have often produced faulty risk-assessment results because they ignore the associated uncertainty. One can see that the key lies in making the uncertainty range (confidence interval) manifest, and this requires us to obtain a full parameter distribution. The Bayesian method allows us to estimate a full parameter distribution, using the joint distribution of the assumed model parameters (prior) and the given observations (likelihood).
- Incorporate prior knowledge into our model... i.e., the Bayesian method derives an insight based on the synthesis of the assumed knowledge about the parameters (prior) and the new information from the observed data (likelihood)... combining both observations and experts' knowledge of the most likely parameter values.
- We can perform multiple tasks using a single model. This can be an efficient framework for knowledge discovery.
MLE tries to estimate the best parameter by optimizing the likelihood of the data given the parameters.
T.B.D
MLE is flawed...
Limitations of MLE
- [Case]: "Toss a coin and H came up 7 times out of 10 tosses. Then P(H) = 0.7", which is the θ.
  - New context: A and B fight a duel. A won 7 out of 10 matches. Then P(A) = 0.7, which is the θ.
  - Larger samples: Toss a coin and H came up 70,000 times out of 100,000 tosses. Then P(H) = 0.7, which is the θ.
- MLE has absolutely no ability to distinguish between these three scenarios above.
Bayesian Estimation in PGM offers some better properties.
- First, θ becomes a random variable. In the Bayesian framework, anything we are uncertain about can be viewed as a random variable over which we have a distribution that is updated over time as data is obtained. This is the heart of the Bayesian formalism.
  - In MLE, θ is fixed (or given), and all tosses (data points) are independent.
  - In BE, θ is unknown, so we cannot assume all tosses are independent... they are marginally dependent because of their unknown mommy (without being given θ).
  - This gives us a joint probabilistic model (of all data points and θ, evenly).
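A sketch contrasting the two, using the conjugate Beta-Bernoulli update (the uniform Beta(1,1) prior is an arbitrary choice): the MLE is identical for 7/10 and 70,000/100,000, while the Bayesian posterior sharpens with more data.

```python
from scipy import stats

def posterior(heads, tosses, a=1.0, b=1.0):
    """Beta posterior over theta after the tosses, given a Beta(a,b) prior."""
    return stats.beta(a + heads, b + tosses - heads)

for heads, tosses in [(7, 10), (70_000, 100_000)]:
    post = posterior(heads, tosses)
    mle = heads / tosses
    print(f"{tosses=}: MLE={mle:.3f}, "
          f"posterior mean={post.mean():.3f}, sd={post.std():.4f}")
# Same MLE (0.700) in both cases; the posterior sd shrinks ~100x with more data.
```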
Bayesian Estimation in Bayes-Net
In a Bayes net, if the model has independent priors over the parameters, then (given complete data) the posteriors are also independent.
- Case_01: Answering specific queries by taking data and learning the most significant dependencies... but we don't know the structure.
- Case_02: Discovering the interrelationships between the variables so that we understand our whole network better... but we don't know the structure.
- Solution: Scoring-based Learning (Optimization?)
  - We can define a scoring function that evaluates how well the network structure matches your data.
    - Likelihood Score
    - BIC, Asymptotic Consistency
    - Bayesian Score
Input:
- your data
- scoring criteria selection
- prior information
- a set of possible structures
Output:
- a structure that maximizes the score
What's the occasion?
- Hidden variables? Unobservable?
- Simply missing values?
The presence of missing values or latent variables raises multiple challenges.
- (1) The unknown mechanism of missingness
- (2) The complexity of the likelihood & multiple global optima
Having a latent variable sometimes gives an advantage.
- Latent variables often give rise to sparser and therefore easier-to-learn models.
- Latent variables often offer an interesting characterization of structure in the data... such as clusters...
Having missing values is a pain in the ass. H, T, H, ?, T, ?, ?, ?, T, H... how do we deal with "?"? If you don't know why these data are missing, you have no idea how to proceed...
- In the case of missing at random (MAR), we can ignore the missing-data mechanism and focus only on the likelihood of what we get to observe.
- Let's say our hidden variable has two possible outcomes; then our likelihood function can have two global maxima. With many hidden variables, there can be an exponential number of global maxima. This can occur with missing data (not only hidden variables) as well.
In the context of missing data, the likelihood function looks like a nasty, complicated multi-modal function, which is impossible to optimize in closed form. There are two general classes of strategies.
- The first is to use a generic optimization method, such as gradient ascent. We might start out at a particular point in the parameter space, compute the gradient at that point, and go in the direction of steepest ascent, which takes us to the next point, where we compute the gradient again, and we continue until we reach a local optimum of the function we're trying to optimize.
  - To improve the rate of convergence of these algorithms, methods such as line search or conjugate gradients basically get us faster to whatever local optimum we're going to end up at.
  - We need to run inference over each data instance at every iteration... so it's very costly...
- The second strategy is the EM algorithm, which is specifically geared toward likelihood-function optimization, so it's a special-purpose algorithm. The intuition is that if we had the full set of parameters, then computing the probability of the missing data would be easy, in the sense that it can be done using straight probabilistic inference. We have two sets of things we need to infer or estimate: one is the parameters, and the other is the missing data. We start out with one set of parameters, use that to estimate the values of the missing data, and then use that completion to re-estimate the parameters, and so on, until we reach convergence.
- The convergence of the likelihood function is not the same as convergence of the parameters, so we have to decide which one we actually care about.
- EM achieves only a local optimum of the log-likelihood function. Local optima are unavoidable, and they increase with the amount of missing data... They decrease with the total amount of data, but not if we have hidden variables, at which point you're stuck with them no matter how much data you have.
Perhaps the most common use of the EM algorithm is for the problem of Learning with Latent Variables. These are situations where there is a certain set of variables that we never get to observe, but they turn out to be important for capturing some important latent structure in the distribution of the data instances that we have. Perhaps the simplest example is clustering, where we have some kind of latent class variable and some set of observed features, and we're trying to segment the instances we have into categories, assuming that for each category there is a similar distribution over the observed features. The features become independent of each other once we're given the class. Latent variables satisfy MAR, so we can use EM, as in the sketch below.
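A compact EM sketch for exactly this clustering setup: a two-component 1-D Gaussian mixture with a latent class variable (all data and initializations are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from two hidden clusters.
data = np.concatenate([rng.normal(-2.0, 1.0, 300),
                       rng.normal(3.0, 1.0, 700)])

# Initial guesses: mixture weight of cluster 1, component means, variances.
pi, mu, var = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E-step: responsibility of cluster 1 for each point, by Bayes' rule
    # (the shared 1/sqrt(2*pi) constant cancels in the ratio).
    def lik(k):
        return np.exp(-0.5 * (data - mu[k]) ** 2 / var[k]) / np.sqrt(var[k])
    r1 = pi * lik(1) / (pi * lik(1) + (1 - pi) * lik(0))

    # M-step: re-estimate parameters from the soft completion.
    pi = r1.mean()
    mu = np.array([np.average(data, weights=1 - r1),
                   np.average(data, weights=r1)])
    var = np.array([np.average((data - mu[0]) ** 2, weights=1 - r1),
                    np.average((data - mu[1]) ** 2, weights=r1)])

print(pi, mu, var)   # ~0.7, means near (-2, 3), variances near 1
```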