Lecture 1: Probability and Statistics Part 1

Jason Chen YCHEN148@e.ntu.edu.sg

Copyright Notice

This document is a compilation of original notes by Jason Chen (YCHEN148@e.ntu.edu.sg). Unauthorized reproduction, downloading, or use for commercial purposes is strictly prohibited and may result in legal action. If you have any questions or find any content-related errors, please contact the author at the provided email address. The author will promptly address any necessary corrections.

I. Probability Measure


1. Some Definitions

Sample space: Ω is the set of all possible outcomes in a random phenomenon.

Example 1.1: Throw a coin 3 times, Ω = {hhh, hht, hth, thh, htt, tht, tth, ttt}

Example 1.2: Number of jobs in a print queue, Ω = {0, 1, 2, ...}

Example 1.3: Length of time between successive earthquakes, Ω = {t | t ≥ 0}

Question: What are the differences between Ω’s in these examples?

Answer:

The key differences are that Example 1.1 deals with a finite discrete sample space, Example 1.2 with an infinite discrete sample space, and Example 1.3 with a continuous infinite sample space.


2. Event & Measure

Event: A particular subset of Ω.

Example 1.4: Let A be the event that the total number of heads equals 2 when tossing a coin 3 times. A = {hht, hth, thh}

Example 1.5: Let A be the event that fewer than 5 jobs are in the print queue. A = {0,1,2,3,4}

Probability measure: A function P from subsets of Ω to the real numbers that satisfies the following axioms:

  1. P(Ω) = 1.

  2. If A ⊆ Ω, then P(A) ≥ 0.

  3. If A₁ and A₂ are disjoint, then P(A₁ ∪ A₂) = P(A₁) + P(A₂).

    More generally, if A₁, A₂, ... are mutually disjoint, then P(∪ᵢ Aᵢ) = Σ P(Aᵢ).

Example 1.6: Suppose the coin is fair. For every outcome ω ∈ Ω, P(ω) = 1/8.

Ω = {hhh,hht,hth,thh,htt,tht,tth,ttt}
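
To make the probability measure concrete, here is a small Python sketch (illustrative, not part of the original notes) that enumerates Ω for three tosses of a fair coin and evaluates P on the event of Example 1.4:

```python
# Enumerate the sample space of Example 1.6 and evaluate the probability measure.
from itertools import product

omega = [''.join(w) for w in product('ht', repeat=3)]   # all 8 outcomes
p = {w: 1/8 for w in omega}                             # fair coin: each outcome has probability 1/8

A = [w for w in omega if w.count('h') == 2]             # event of Example 1.4: exactly two heads
print(sum(p[w] for w in A))                             # 0.375 = 3/8
print(sum(p.values()))                                  # axiom 1: P(Omega) = 1
```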


3. Law of Total Probability

Let A and B_1, B_2, ..., B_n be events where the B_i are disjoint, ∪_{i=1}^{n} B_i = Ω, and P(B_i) > 0 for all i. Then

P(A) = Σ_{i=1}^{n} P(A | B_i) P(B_i)

The Law of Total Probability is an important method for calculating the probability of an event: decompose the sample space into mutually exclusive events B_1, ..., B_n, and then sum the probabilities of A occurring given each B_i, weighted by the probabilities of the B_i themselves.


4. Conditional Probability

Definition

If A and B are two events, the conditional probability of A given B, denoted P(A | B), is defined as: P(A | B) = P(A ∩ B) / P(B)

provided that P(B)>0.

The formula can be interpreted as "the probability of A happening, given that B has already happened."

Key Properties


5. Bayes' Rule

Let A and B_1, B_2, ..., B_n be events where the B_i are disjoint, ∪_{i=1}^{n} B_i = Ω, and P(B_i) > 0 for all i. Then P(B_j | A) = P(A | B_j) P(B_j) / Σ_{i=1}^{n} P(A | B_i) P(B_i)

Bayes' Rule is a method to calculate the reverse probability, allowing us to update our estimation of the probability of an initial event given the occurrence of another event. This is crucial for revising probability estimates based on new evidence.

Example 1.8: The three prisoners problem (equivalent to the Monty Hall problem)

Three prisoners A, B, and C are on death row. The governor decides to pardon one of them and chooses the prisoner to pardon at random. He informs the warden of this choice but requests that the name be kept secret for a few days.

A tries to get the warden to tell him who had been pardoned. The warden refuses. A then asks which of B or C will be executed. The warden thinks for a while and tells A that B is to be executed. A then thinks that his chance of being pardoned has risen to 1/2. Is he right?

Step 1: Initial Probabilities

P(A pardoned) = P(B pardoned) = P(C pardoned) = 1/3

Step 2: Conditional Probability After Warden’s Information

If A is actually pardoned, there is a 1/2 chance that the warden will say B is to be executed; if B is actually pardoned, there is a 0% chance that the warden will say B is to be executed; if C is actually pardoned, there is a 100% chance that the warden will say B is to be executed (the warden cannot reveal that C is pardoned, and he was asked only about B or C).

Step 3: Apply Bayes' Theorem

Using Bayes' Theorem, we can calculate the probability that A is pardoned given that the warden says B will be executed:

P(A pardoned | warden says B will be executed) = P(warden says B will be executed | A pardoned) P(A pardoned) / P(warden says B will be executed)

P(warden says B will be executed) = (1/2 × 1/3) + (0 × 1/3) + (1 × 1/3) = 1/6 + 0 + 2/6 = 3/6 = 1/2

Thus: P(A pardoned | warden says B will be executed) = (1/2 × 1/3) / (1/2) = (1/6) × 2 = 1/3. A's chance of being pardoned is still 1/3, so he is not right.
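
A quick numerical check of this calculation (an illustrative sketch, with the warden's likelihoods taken from Step 2):

```python
# Bayes' rule for the three prisoners: posterior probability that A is pardoned
# given that the warden says "B will be executed".
priors = {'A': 1/3, 'B': 1/3, 'C': 1/3}
# P(warden says "B will be executed" | pardoned prisoner)
likelihood = {'A': 1/2, 'B': 0.0, 'C': 1.0}

evidence = sum(likelihood[k] * priors[k] for k in priors)      # law of total probability: 1/2
posterior_A = likelihood['A'] * priors['A'] / evidence
print(evidence, posterior_A)                                   # 0.5 and 1/3
```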


6. Independence

Two events A and B are independent if P(A ∩ B) = P(A) P(B)

A group of events A_1, A_2, ..., A_n is said to be mutually independent if for any subgroup A_{i_1}, ..., A_{i_k}, P(∩_{j=1}^{k} A_{i_j}) = Π_{j=1}^{k} P(A_{i_j})

Independence is a crucial concept in probability theory, implying that the occurrence of one event does not affect the likelihood of the other.


II. Random Variable and Distribution


1. Random Variable

A random variable is a function from the sample space Ω to the real numbers.

Example 1.9:

Importance of Random Variables

Statisticians need random variables because they provide a way to quantify and analyze outcomes of random phenomena. By mapping outcomes to the real line, we can use mathematical tools and techniques to study their distributions, moments, and other properties.


2. Distribution

A random variable has a sample space on the real line. This brings special ways to describe its probability measure.

2.1 Discrete vs. Continuous Random Variables

2.2 Functions for Describing Distributions

2.3 Summary of Notations and Functions

                               Discrete                                             Continuous
Univariate r.v.                Probability Mass Function (pmf)                      Probability Density Function (pdf)
(single random variable)       Cumulative Distribution Function (cdf)               Cumulative Distribution Function (cdf)
                               Moment Generating Function (mgf)                     Moment Generating Function (mgf)
Multivariate r.v.              Joint Probability Mass Function (joint pmf)          Joint Probability Density Function (joint pdf)
(multiple random variables)    Joint Cumulative Distribution Function (joint cdf)   Joint Cumulative Distribution Function (joint cdf)
                               Joint Moment Generating Function (joint mgf)         Joint Moment Generating Function (joint mgf)

3. Distribution Functions

3.1 Cumulative Distribution Function (CDF)

A function F is the cumulative distribution function (CDF) of a random variable X if F(x) = P(X ≤ x), x ∈ ℝ.

3.2 Probability Mass Function (PMF)

A function p(x) is a probability mass function (PMF) if and only if p(x) ≥ 0 for all x and Σ_{-∞ < x < ∞} p(x) = 1.

For a discrete random variable X with PMF p(x), P(X = x) = p(x) and P(X ∈ A) = Σ_{x ∈ A} p(x).

3.3 Probability Density Function (PDF)

A function f(x) is a probability density function (PDF) if and only if f(x) ≥ 0 for all x and ∫_{-∞}^{∞} f(x) dx = 1. For a continuous random variable X with PDF f(x), P(X ∈ A) = ∫_A f(x) dx.

3.4 Relationships Between PDF and CDF

3.5 Interpretation of f(x)

The PDF f(x) does not give the probability that X takes the value x (which is zero for continuous random variables). Instead, it gives the density of probability at x. To find the probability that X falls within an interval [a, b], integrate the PDF over that interval: P(a ≤ X ≤ b) = ∫_a^b f(x) dx.

3.6 Properties of CDF

If F(x) is a CDF of some random variable X, then the following properties hold:

  1. 0 ≤ F(x) ≤ 1.

  2. F(x) is non-decreasing.

  3. For any x ∈ ℝ, F(x) is continuous from the right, i.e., lim_{t → x+} F(t) = F(x).

  4. lim_{x → ∞} F(x) = 1 and lim_{x → -∞} F(x) = 0.

  5. P(X > x) = 1 - F(x) and P(a < X ≤ b) = F(b) - F(a).

  6. For any xR, F(x) has a left limit.

  7. There are at most countably many discontinuity points of F(x).

Conversely, if a function F(x) satisfies properties 2, 3, and 4, then F(x) is a CDF.


4. Joint Distribution

4.1 Why We Need Joint Distribution for Multivariate Random Variables

Joint distributions are necessary to describe the behavior of multiple random variables simultaneously. They provide a comprehensive view of how the random variables interact with each other.

Example 1.10 (continued from Example 1.9)

Given: Ω={hhh,hht,hth,thh,htt,tht,tth,ttt}

We can describe the joint distribution of X1 and X2 by looking at the sample space and counting occurrences.

4.2 Joint CDF and Marginal CDF

Joint CDF

The joint cumulative distribution function (CDF) of X_1, ..., X_n is: F(x_1, ..., x_n) = P(X_1 ≤ x_1, ..., X_n ≤ x_n)

Marginal CDF

The marginal CDF of X_1 is: F_{X_1}(x_1) = P(X_1 ≤ x_1) = lim_{x_2,...,x_n → ∞} F(x_1, x_2, ..., x_n)

4.3 Joint and Marginal Distribution Functions

Discrete:

Continuous:

Interpreting Joint Distributions

4.4 Joint PMF and its Relationship with Joint CDF

Joint PMF Definition

For discrete random variables X and Y, the joint probability mass function (joint PMF) p_{XY}(x, y) is defined as: p_{XY}(x, y) = P(X = x, Y = y).

Joint CDF Definition

The joint cumulative distribution function (joint CDF) F_{XY}(x, y) is defined as: F_{XY}(x, y) = P(X ≤ x, Y ≤ y).

Relationship between Joint PMF and Joint CDF

The joint CDF can be expressed in terms of the joint PMF as follows: F_{XY}(x, y) = Σ_{u ≤ x} Σ_{v ≤ y} p_{XY}(u, v).

Deriving Joint CDF from Joint PMF

Given the joint PMF p_{XY}(x, y):

  1. Definition of Joint CDF: The joint CDF F_{XY}(x, y) is the probability that X is less than or equal to x and Y is less than or equal to y: F_{XY}(x, y) = P(X ≤ x, Y ≤ y).

  2. Expressing in terms of PMF: This can be expressed as the sum of the joint PMF over all u ≤ x and v ≤ y: F_{XY}(x, y) = Σ_{u ≤ x} Σ_{v ≤ y} p_{XY}(u, v).

Example

Let's assume X and Y are discrete random variables with the following joint PMF:

X \ Y     0       1       2
  0       0.1     0.2     0.1
  1       0.1     0.1     0.2
  2       0.1     0.05    0.05

Calculating F_{XY}(1, 1): F_{XY}(1, 1) = Σ_{u ≤ 1} Σ_{v ≤ 1} p_{XY}(u, v) = p_{XY}(0, 0) + p_{XY}(0, 1) + p_{XY}(1, 0) + p_{XY}(1, 1) = 0.1 + 0.2 + 0.1 + 0.1 = 0.5

Calculating F_{XY}(2, 2): F_{XY}(2, 2) = Σ_{u ≤ 2} Σ_{v ≤ 2} p_{XY}(u, v) = p_{XY}(0, 0) + p_{XY}(0, 1) + p_{XY}(0, 2) + p_{XY}(1, 0) + p_{XY}(1, 1) + p_{XY}(1, 2) + p_{XY}(2, 0) + p_{XY}(2, 1) + p_{XY}(2, 2) = 0.1 + 0.2 + 0.1 + 0.1 + 0.1 + 0.2 + 0.1 + 0.05 + 0.05 = 1.0

The joint PMF p_{XY}(x, y) can be recovered by taking differences of the joint CDF (here for integer-valued X and Y): p_{XY}(x, y) = P(X = x, Y = y) = F_{XY}(x, y) - F_{XY}(x-1, y) - F_{XY}(x, y-1) + F_{XY}(x-1, y-1).
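
These calculations can be reproduced mechanically; the Python sketch below (illustrative) sums the table entries to get the joint CDF and then recovers a PMF value by differencing:

```python
# Joint CDF from the joint PMF table above, and a PMF value recovered by differences.
pmf = {(0, 0): 0.1, (0, 1): 0.2, (0, 2): 0.1,
       (1, 0): 0.1, (1, 1): 0.1, (1, 2): 0.2,
       (2, 0): 0.1, (2, 1): 0.05, (2, 2): 0.05}

def joint_cdf(x, y):
    """F_XY(x, y) = sum of p_XY(u, v) over u <= x, v <= y."""
    return sum(p for (u, v), p in pmf.items() if u <= x and v <= y)

print(joint_cdf(1, 1))   # ≈ 0.5
print(joint_cdf(2, 2))   # ≈ 1.0

# p_XY(1, 1) from CDF differences:
p11 = joint_cdf(1, 1) - joint_cdf(0, 1) - joint_cdf(1, 0) + joint_cdf(0, 0)
print(round(p11, 10))    # ≈ 0.1
```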

4.5 Joint PDF and its Relationship with Joint CDF (Continuous Case)

Joint PDF Definition

For continuous random variables X and Y, the joint probability density function (joint PDF) f_{XY}(x, y) is defined such that: P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b ∫_c^d f_{XY}(x, y) dy dx.

Joint CDF Definition

The joint cumulative distribution function (joint CDF) F_{XY}(x, y) is defined as: F_{XY}(x, y) = P(X ≤ x, Y ≤ y).

Relationship between Joint PDF and Joint CDF

The joint CDF can be expressed in terms of the joint PDF as follows: F_{XY}(x, y) = ∫_{-∞}^{x} ∫_{-∞}^{y} f_{XY}(u, v) dv du.

Detailed Derivation

Given the joint PDF f_{XY}(x, y):

  1. Definition of Joint CDF: The joint CDF F_{XY}(x, y) is the probability that X is less than or equal to x and Y is less than or equal to y: F_{XY}(x, y) = P(X ≤ x, Y ≤ y).

  2. Expressing in terms of PDF: This can be expressed as the double integral of the joint PDF over the region (-∞, x] × (-∞, y]: F_{XY}(x, y) = ∫_{-∞}^{x} ∫_{-∞}^{y} f_{XY}(u, v) dv du.

Example

Let's assume X and Y are continuous random variables with the following joint PDF: f_{XY}(x, y) = e^{-x-y} for x ≥ 0, y ≥ 0, and 0 otherwise.

Calculating F_{XY}(x, y) for x ≥ 0 and y ≥ 0: F_{XY}(x, y) = ∫_0^x ∫_0^y e^{-u-v} dv du = ∫_0^x [-e^{-u-v}]_{v=0}^{v=y} du = ∫_0^x (e^{-u} - e^{-u-y}) du = [-e^{-u} + e^{-u-y}]_{u=0}^{u=x} = 1 - e^{-x} - e^{-y} + e^{-x-y} = (1 - e^{-x})(1 - e^{-y})

Deriving Joint PDF from Joint CDF

Given the joint CDF FXY(x,y):

Joint PDF: The joint PDF fXY(x,y) can be found by taking the partial derivatives of the joint CDF: fXY(x,y)=2FXY(x,y)xy.

Example

Assume the joint CDF is given by:

F_{XY}(x, y) = 0 for x < 0 or y < 0, and F_{XY}(x, y) = 1 - e^{-x} - e^{-y} + e^{-x-y} for x ≥ 0, y ≥ 0

To find f_{XY}(x, y) for x ≥ 0, y ≥ 0: f_{XY}(x, y) = ∂²/∂x∂y (1 - e^{-x} - e^{-y} + e^{-x-y}) = ∂/∂y ( ∂/∂x (1 - e^{-x} - e^{-y} + e^{-x-y}) ) = ∂/∂y ( e^{-x} - e^{-x-y} ) = 0 + e^{-x-y} = e^{-x-y}
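
For a symbolic check of the two derivations above (differentiating the joint CDF and integrating the joint PDF back), here is an illustrative sketch using sympy with the same exponential example:

```python
# Check the CDF <-> PDF relationship for F(x, y) = 1 - e^{-x} - e^{-y} + e^{-x-y}.
import sympy as sp

x, y, u, v = sp.symbols('x y u v', positive=True)
F = 1 - sp.exp(-x) - sp.exp(-y) + sp.exp(-x - y)      # joint CDF on x >= 0, y >= 0

f = sp.simplify(sp.diff(F, x, y))                     # joint PDF = d^2 F / (dx dy)
print(f)                                              # exp(-x - y)

# Integrating the PDF back over (0, x] x (0, y] recovers the CDF:
F_back = sp.integrate(f.subs({x: u, y: v}), (v, 0, y), (u, 0, x))
print(sp.simplify(F_back - F))                        # 0
```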

 


5. Marginal Distribution

Marginal distributions are derived from the joint distribution by summing (discrete case) or integrating (continuous case) over the other variables.

5.1 Why We Need Joint Distributions

Joint distributions are essential for understanding the relationships between multiple random variables. They allow us to analyze how variables co-vary and the dependencies between them.

5.2 When We Know the Joint CDF, Can We Obtain Every Marginal CDF?

Yes, if we know the joint CDF, we can derive every marginal CDF by taking the appropriate limits.

5.3 Is the Reverse Statement True?

No, knowing the marginal CDFs does not provide enough information to determine the joint CDF. The joint CDF contains information about the dependencies and interactions between the variables that marginal CDFs alone do not capture.

Proof for the Statement: "When We Know the Joint CDF, We Can Derive Every Marginal CDF by Taking the Appropriate Limits"

For Discrete Random Variables:

Given the joint CDF F_{X_1,X_2,...,X_n}(x_1, x_2, ..., x_n) = P(X_1 ≤ x_1, X_2 ≤ x_2, ..., X_n ≤ x_n), the marginal CDF of X_1 is F_{X_1}(x_1) = P(X_1 ≤ x_1). Because the events {X_2 ≤ x_2, ..., X_n ≤ x_n} increase to the whole sample space as x_2, ..., x_n → ∞, continuity of the probability measure gives F_{X_1}(x_1) = lim_{x_2,...,x_n → ∞} F_{X_1,X_2,...,X_n}(x_1, x_2, ..., x_n). Equivalently, in terms of the joint PMF, F_{X_1}(x_1) = Σ_{u ≤ x_1} Σ_{x_2} ... Σ_{x_n} p(u, x_2, ..., x_n), where the inner sums run over all possible values of X_2, X_3, ..., X_n.

For Continuous Random Variables:

Given the joint CDF F_{X_1,X_2,...,X_n}(x_1, x_2, ..., x_n) = P(X_1 ≤ x_1, X_2 ≤ x_2, ..., X_n ≤ x_n), the marginal CDF of X_1 is F_{X_1}(x_1) = P(X_1 ≤ x_1). As in the discrete case, F_{X_1}(x_1) = lim_{x_2,...,x_n → ∞} F_{X_1,X_2,...,X_n}(x_1, x_2, ..., x_n). Equivalently, in terms of the joint PDF, F_{X_1}(x_1) = ∫_{-∞}^{x_1} ∫_{-∞}^{∞} ... ∫_{-∞}^{∞} f(u, x_2, ..., x_n) dx_n ... dx_2 du, i.e., integrate out all possible values of X_2, X_3, ..., X_n.

Proof for the Statement: "Knowing the Marginal CDFs Does Not Provide Enough Information to Determine the Joint CDF"

To prove this, let's consider two random variables X and Y.

Example 1: Independent Variables

Suppose X and Y are independent random variables with known marginal CDFs F_X(x) and F_Y(y). The joint CDF of X and Y is given by: F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y) = F_X(x) F_Y(y)

Example 2: Dependent Variables

Now consider the case where X and Y are perfectly correlated, such that Y = X. In this case, the marginal CDFs remain the same F_X(x) and F_Y(y), but the joint CDF is different: F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y) = P(X ≤ x, X ≤ y) = P(X ≤ min(x, y)) = F_X(min(x, y))

This demonstrates that knowing the marginal CDFs F_X(x) and F_Y(y) alone is not sufficient to determine the joint CDF F_{X,Y}(x, y), because the joint CDF also captures the dependency structure between the variables.
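
The two cases above can be made concrete with a tiny numerical example; the sketch below (an illustrative construction with Bernoulli(1/2) marginals) shows two joint PMFs with identical marginals but different joint CDFs:

```python
# Same marginals, different joint distributions.
import itertools

# Case 1: X and Y independent, each uniform on {0, 1}.
joint_indep = {(x, y): 0.25 for x, y in itertools.product([0, 1], repeat=2)}
# Case 2: Y = X (perfect dependence), same marginals.
joint_equal = {(0, 0): 0.5, (1, 1): 0.5, (0, 1): 0.0, (1, 0): 0.0}

def marginal_x(joint):
    return {x: sum(p for (u, _), p in joint.items() if u == x) for x in [0, 1]}

def cdf(joint, x, y):
    return sum(p for (u, v), p in joint.items() if u <= x and v <= y)

print(marginal_x(joint_indep), marginal_x(joint_equal))   # identical marginals
print(cdf(joint_indep, 0, 0), cdf(joint_equal, 0, 0))     # 0.25 vs 0.5: different joint CDFs
```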


6. Independent Random Variables

6.1 Definition

Random variables X_1, ..., X_n are said to be independent if their joint CDF factors into the product of their marginal CDFs: F(x_1, ..., x_n) = F_{X_1}(x_1) ... F_{X_n}(x_n) for all x_1, ..., x_n.

Discrete Case

For discrete random variables: F(x_1, ..., x_n) = F_{X_1}(x_1) ... F_{X_n}(x_n) ⟺ p(x_1, ..., x_n) = p_{X_1}(x_1) ... p_{X_n}(x_n), where p(x_1, ..., x_n) is the joint PMF.

Continuous Case

For continuous random variables: F(x_1, ..., x_n) = F_{X_1}(x_1) ... F_{X_n}(x_n) ⟺ f(x_1, ..., x_n) = f_{X_1}(x_1) ... f_{X_n}(x_n), where f(x_1, ..., x_n) is the joint PDF.

6.2 Property 1: Independence of X and Y

Two random variables X and Y are independent if: P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B), i.e., the events {X ∈ A} and {Y ∈ B} are independent.

6.3 Property 2: Function of Independent Variables

If X and Y are independent, then Z=g(X) and W=h(Y) are also independent.

Proof: Given two independent random variables X and Y, we need to show that for any measurable functions g and h, the random variables g(X) and h(Y) are also independent. Specifically, we need to prove that: P(g(X) ≤ z, h(Y) ≤ w) = P(g(X) ≤ z) P(h(Y) ≤ w).

1. Independence Definition for X and Y

By definition, X and Y are independent if for all measurable sets A and B: P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B).

2. Consider the Events for Functions of X and Y

Define the events: A = g^{-1}((-∞, z]) = {x : g(x) ≤ z} and B = h^{-1}((-∞, w]) = {y : h(y) ≤ w}.

Here, g^{-1}((-∞, z]) is the preimage of the interval (-∞, z] under the function g, and similarly for h.

3. Relate the Events to g(X) and h(Y)

The events can be written as: {g(X) ≤ z} = {X ∈ g^{-1}((-∞, z])} and {h(Y) ≤ w} = {Y ∈ h^{-1}((-∞, w])}.

4. Apply the Definition of Independence

Using the independence of X and Y, we know: P(X ∈ g^{-1}((-∞, z]), Y ∈ h^{-1}((-∞, w])) = P(X ∈ g^{-1}((-∞, z])) P(Y ∈ h^{-1}((-∞, w])).

5. Translate Back to the Functions g(X) and h(Y)

This equation translates back to the original functions as: P(g(X) ≤ z, h(Y) ≤ w) = P(g(X) ≤ z) P(h(Y) ≤ w).


7. Conditional Distribution

7.1 Discrete Random Variables

For discrete random variables X and Y with joint PMF p_{XY}(x, y), the conditional PMF of Y given X is: p_{Y|X}(y|x) = P(Y = y | X = x) = P(X = x, Y = y) / P(X = x) = p_{XY}(x, y) / p_X(x)

provided that p_X(x) > 0; the conditional probability is defined to be zero if p_X(x) = 0.

7.2 Continuous Random Variables

For continuous random variables X and Y with joint PDF f_{XY}(x, y), the conditional PDF of Y given X is: f_{Y|X}(y|x) = f_{XY}(x, y) / f_X(x) if 0 < f_X(x) < ∞, and zero otherwise.
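
As a concrete illustration, the sketch below computes the conditional PMF p_{Y|X}(y | x = 0) from the joint PMF table of Section 4.4 and checks that it sums to one:

```python
# Conditional PMF p_{Y|X}(y|x) from a joint PMF table.
pmf = {(0, 0): 0.1, (0, 1): 0.2, (0, 2): 0.1,
       (1, 0): 0.1, (1, 1): 0.1, (1, 2): 0.2,
       (2, 0): 0.1, (2, 1): 0.05, (2, 2): 0.05}

def p_x(x):
    return sum(p for (u, _), p in pmf.items() if u == x)

def p_y_given_x(y, x):
    return pmf[(x, y)] / p_x(x) if p_x(x) > 0 else 0.0

cond = {y: p_y_given_x(y, 0) for y in [0, 1, 2]}
print(cond)                      # ≈ {0: 0.25, 1: 0.5, 2: 0.25}
print(sum(cond.values()))        # ≈ 1: a valid PMF in y for fixed x
```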

7.3 Remarks

  1. For each fixed x:

    p_{Y|X}(y|x) is a PMF in y (discrete case); f_{Y|X}(y|x) is a PDF in y (continuous case).

  2. Multiplication Law: p_{XY}(x, y) = p_{Y|X}(y|x) p_X(x) (discrete); f_{XY}(x, y) = f_{Y|X}(y|x) f_X(x) (continuous)

  3. Law of Total Probability: p_Y(y) = Σ_x p_{Y|X}(y|x) p_X(x) (discrete); f_Y(y) = ∫_{-∞}^{∞} f_{Y|X}(y|x) f_X(x) dx (continuous)

  4. Independence:

    X and Y are independent if and only if: p_{Y|X}(y|x) = p_Y(y) (discrete); f_{Y|X}(y|x) = f_Y(y) (continuous)


8. Functions of Random Variables

Question: For given random variables X1,,Xn, how to derive the distributions of their transformations?

8.1 Method 1: Method of Events

Let X = (X_1, ..., X_n) be random variables and Y = g(X). Then the distribution of Y is determined as follows: for any event B defined in terms of Y, P(Y ∈ B) = P(X ∈ A), where A = g^{-1}(B) is the preimage of B under the function g.

Example 1.11 (Univariate Discrete Random Variable)

Let X be a discrete random variable taking values x_i, i = 1, 2, ..., and Y = g(X). Then Y is also a discrete random variable, taking values y_j, j = 1, 2, .... To determine the PMF of Y, take B = {y_j}; then A = {x_i : g(x_i) = y_j} and hence p_Y(y_j) = P(Y = y_j) = P(X ∈ A) = Σ_{x_i ∈ A} p_X(x_i).

Example 1.12 (Sum of Two Discrete Random Variables)

Let X and Y be random variables with joint PMF p(x,y). Find the distribution of Z=X+Y.

To find the distribution of Z, we sum the joint PMF over all pairs with x + y = z (the convolution of the PMFs when X and Y are independent): p_Z(z) = P(Z = z) = P(X + Y = z) = Σ_x P(X = x, Y = z - x) = Σ_x p_{XY}(x, z - x).

For the exercise Z = Y - X, the distribution can be found similarly: p_Z(z) = P(Z = z) = P(Y - X = z) = Σ_x P(X = x, Y = x + z) = Σ_x p_{XY}(x, x + z).
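
A short sketch of the formula for p_Z(z), applied to the joint PMF table of Section 4.4 (illustrative):

```python
# Distribution of Z = X + Y via p_Z(z) = sum over x of p_XY(x, z - x).
pmf = {(0, 0): 0.1, (0, 1): 0.2, (0, 2): 0.1,
       (1, 0): 0.1, (1, 1): 0.1, (1, 2): 0.2,
       (2, 0): 0.1, (2, 1): 0.05, (2, 2): 0.05}

def p_Z(z):
    # sum p_XY(x, z - x) over the support of X
    return sum(pmf.get((x, z - x), 0.0) for x in [0, 1, 2])

print({z: round(p_Z(z), 4) for z in range(0, 5)})   # ≈ {0: 0.1, 1: 0.3, 2: 0.3, 3: 0.25, 4: 0.05}
```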

8.2 Method 2: Method of CDF

This is a special case of Method 1. Let Y be a function of random variables X1,,Xn.

  1. Find the region {Y ≤ y} in the (x_1, ..., x_n) space.

  2. Find F_Y(y) = P(Y ≤ y) by summing the joint PMF or integrating the joint PDF of X_1, ..., X_n over the region {Y ≤ y}.

  3. For the continuous case, find the PDF of Y by differentiating F_Y(y): f_Y(y) = (d/dy) F_Y(y)

Example for Continuous Case

Let X and Y be continuous random variables with joint PDF fXY(x,y). To find the PDF of Z=X+Y:

  1. Define the region {Z ≤ z}: F_Z(z) = P(X + Y ≤ z) = ∫_{-∞}^{∞} ∫_{-∞}^{z-x} f_{XY}(x, y) dy dx.

  2. Differentiate to find the PDF: f_Z(z) = (d/dz) F_Z(z) = (d/dz) ( ∫_{-∞}^{∞} ∫_{-∞}^{z-x} f_{XY}(x, y) dy dx ).

Using Leibniz's rule for differentiation under the integral sign, we get: f_Z(z) = ∫_{-∞}^{∞} f_{XY}(x, z - x) dx.
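
As a worked instance (an assumption chosen for illustration, not taken from the notes), let X and Y be independent Exp(1) random variables, so f_{XY}(x, y) = e^{-x-y} on x, y ≥ 0; the sympy sketch below evaluates the convolution integral:

```python
# PDF of Z = X + Y for independent Exp(1) variables via the convolution formula.
import sympy as sp

x, z = sp.symbols('x z', positive=True)
f_joint = sp.exp(-x) * sp.exp(-(z - x))          # f_XY(x, z - x), valid on 0 <= x <= z

f_Z = sp.integrate(f_joint, (x, 0, z))           # convolution integral over the support
print(sp.simplify(f_Z))                          # z*exp(-z): the Gamma(2, 1) density
```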

8.3 Leibniz's Rule

Example for Continuous Case: Finding the PDF of Z=X+Y

Given two continuous random variables X and Y with a joint PDF f_{XY}(x, y), we aim to find the PDF of their sum Z = X + Y.

Step-by-Step Derivation

  1. Differentiate to Find the PDF: The PDF f_Z(z) is the derivative of the CDF F_Z(z): f_Z(z) = (d/dz) F_Z(z) = (d/dz) ( ∫_{-∞}^{∞} ∫_{-∞}^{z-x} f_{XY}(x, y) dy dx ). Here, we need to differentiate an integral with respect to z. For this purpose, we apply Leibniz's rule for differentiation under the integral sign.

Applying Leibniz's Rule

Leibniz's rule for differentiation under the integral sign states: (d/dz) ∫_{a(z)}^{b(z)} f(x, z) dx = ∫_{a(z)}^{b(z)} (∂/∂z) f(x, z) dx + f(b(z), z) (d/dz) b(z) - f(a(z), z) (d/dz) a(z).

In our case: F_Z(z) = ∫_{-∞}^{∞} ∫_{-∞}^{z-x} f_{XY}(x, y) dy dx.

By applying Leibniz's rule, we get: f_Z(z) = (d/dz) ( ∫_{-∞}^{∞} ∫_{-∞}^{z-x} f_{XY}(x, y) dy dx ).

Since the lower limit of the inner integral is -∞ (which is constant, so its derivative with respect to z is zero) and the upper limit is z - x, applying the rule to the inner integral gives: f_Z(z) = ∫_{-∞}^{∞} f_{XY}(x, z - x) (d/dz)(z - x) dx.

Given that (d/dz)(z - x) = 1: f_Z(z) = ∫_{-∞}^{∞} f_{XY}(x, z - x) dx.

Leibniz's Rule for Differentiation Under the Integral Sign

Leibniz's rule for differentiation under the integral sign provides a way to differentiate an integral with variable limits and an integrand that depends on the variable of differentiation.

Leibniz's Rule Statement

The rule states: (d/dz) ∫_{a(z)}^{b(z)} f(x, z) dx = ∫_{a(z)}^{b(z)} (∂/∂z) f(x, z) dx + f(b(z), z) (d/dz) b(z) - f(a(z), z) (d/dz) a(z).

Explanation and Proof

  1. Let I(z) = ∫_{a(z)}^{b(z)} f(x, z) dx and differentiate with respect to z: dI/dz = (d/dz) ∫_{a(z)}^{b(z)} f(x, z) dx.

  2. Differentiate inside the integral (where possible): To handle the differentiation, we need to consider three parts:

    • The variation of the integrand f(x,z) with respect to z.

    • The variation of the upper limit b(z) with respect to z.

    • The variation of the lower limit a(z) with respect to z.

  3. Apply the Fundamental Theorem of Calculus and the Chain Rule:

    By the Fundamental Theorem of Calculus and the Chain Rule, we get: (d/dz) ∫_{a(z)}^{b(z)} f(x, z) dx = ∫_{a(z)}^{b(z)} (∂f(x, z)/∂z) dx + f(b(z), z) (db(z)/dz) - f(a(z), z) (da(z)/dz).

    • Integral with respect to z: ∫_{a(z)}^{b(z)} (∂f(x, z)/∂z) dx accounts for how the integrand changes with z while the limits are held constant.

    • Upper limit b(z): f(b(z), z) (db(z)/dz) accounts for the contribution from the changing upper limit.

    • Lower limit a(z): -f(a(z), z) (da(z)/dz) accounts for the contribution from the changing lower limit.

  4. Combine the three components: (d/dz) ∫_{a(z)}^{b(z)} f(x, z) dx = ∫_{a(z)}^{b(z)} (∂f(x, z)/∂z) dx + f(b(z), z) (db(z)/dz) - f(a(z), z) (da(z)/dz).

Application Example

Consider: F_Z(z) = ∫_{-∞}^{∞} ∫_{-∞}^{z-x} f_{XY}(x, y) dy dx.

Using Leibniz's rule: f_Z(z) = (d/dz) F_Z(z) = (d/dz) ( ∫_{-∞}^{∞} ∫_{-∞}^{z-x} f_{XY}(x, y) dy dx ).

  1. Inner integral: ∫_{-∞}^{z-x} f_{XY}(x, y) dy. Its limits of integration run from -∞ to z - x.

  2. Differentiate with respect to z: (d/dz) ( ∫_{-∞}^{∞} ∫_{-∞}^{z-x} f_{XY}(x, y) dy dx ).

Applying Leibniz's rule: f_Z(z) = ∫_{-∞}^{∞} f_{XY}(x, z - x) (∂(z - x)/∂z) dx + (terms from the other limits, which are zero here).

Since ∂(z - x)/∂z = 1, we get: f_Z(z) = ∫_{-∞}^{∞} f_{XY}(x, z - x) dx.

8.4 Method of PDF for Continuous Random Variables with Differentiable, 1-to-1 Transformations

Statement

For a continuous random variable X with PDF f_X(x), let Y = g(X), where g is a differentiable and strictly monotone function. The PDF of Y is given by: f_Y(y) = f_X(g^{-1}(y)) |(d/dy) g^{-1}(y)| for y such that y = g(x) for some x, and f_Y(y) = 0 otherwise.

Proof

  1. Start with the CDF Method: The CDF of Y is F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y).

  2. Transform to a Probability Statement in Terms of X: Since g is strictly monotone, it is either increasing or decreasing. Assume g is increasing (similar steps apply if g is decreasing). For g increasing: P(g(X) ≤ y) = P(X ≤ g^{-1}(y)). Thus: F_Y(y) = F_X(g^{-1}(y)).

  3. Differentiate the CDF of Y to Find its PDF: Differentiate both sides with respect to y: f_Y(y) = (d/dy) F_Y(y) = (d/dy) F_X(g^{-1}(y)). By the chain rule: f_Y(y) = f_X(g^{-1}(y)) (d/dy) g^{-1}(y).

  4. Absolute Value for Monotone Transformations: Since g is strictly monotone, the derivative (d/dy) g^{-1}(y) is either always positive or always negative. To ensure the PDF is non-negative: f_Y(y) = f_X(g^{-1}(y)) |(d/dy) g^{-1}(y)|.

Interpretation of |(d/dy) g^{-1}(y)|

The term |(d/dy) g^{-1}(y)| acts as a scaling factor that adjusts the density function based on the rate of change of the inverse transformation g^{-1}(y).

  1. Rate of Change:

    • If g is increasing, (d/dy) g^{-1}(y) represents how quickly x changes with respect to y.

    • The absolute value ensures the PDF remains non-negative.

  2. Density Adjustment:

    • When g^{-1}(y) changes slowly with respect to y, the density is stretched, resulting in a lower PDF value.

    • When g^{-1}(y) changes rapidly, the density is compressed, resulting in a higher PDF value.

Example Calculation

Consider a transformation Y=g(X)=2X+3:

  1. Find the Inverse Transformation: g^{-1}(y) = (y - 3)/2.

  2. Calculate the Derivative: (d/dy) g^{-1}(y) = (d/dy) ((y - 3)/2) = 1/2.

  3. Apply the Transformation to the PDF: f_Y(y) = f_X((y - 3)/2) · |1/2|.

If X is uniformly distributed over [0, 1], i.e., f_X(x) = 1 for 0 ≤ x ≤ 1, then f_Y(y) = 1/2 for 3 ≤ y ≤ 5 and 0 otherwise; that is, Y is uniformly distributed over [3, 5].
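
A Monte Carlo sketch (illustrative) confirming that Y = 2X + 3 with X uniform on [0, 1] has density 1/2 on [3, 5]:

```python
# Simulate Y = 2X + 3 for X ~ Uniform[0, 1] and inspect its empirical density.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)
y = 2 * x + 3

hist, edges = np.histogram(y, bins=20, range=(3.0, 5.0), density=True)
print(y.min(), y.max())        # ≈ 3 and ≈ 5
print(hist.round(2))           # each bin ≈ 0.5, matching f_Y(y) = 1/2
```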

8.5 Multivariate Case: Transformation of Continuous Random Variables

General Transformation

Given X = (X_1, ..., X_n), multivariate continuous random variables, let Y = g(X) be a one-to-one transformation whose inverse exists and is denoted by x = g^{-1}(y) = w(y) = (w_1(y), ..., w_n(y)).

Assume w has continuous partial derivatives, and let J be the Jacobian determinant of the transformation: J = det [ ∂w_i(y)/∂y_j ]_{i,j=1,...,n}, i.e., the determinant of the n × n matrix whose (i, j) entry is ∂w_i(y)/∂y_j.

The PDF of Y is given by: f_Y(y) = f_X(g^{-1}(y)) |J|, for y such that y = g(x) for some x, and f_Y(y) = 0 otherwise.

Example 1.14: Find the Distribution of Y_1 = X_2 / X_1

Given that X_1 and X_2 are random variables with joint PDF f_{X_1,X_2}(x_1, x_2).

Solution

Let Y_1 = X_2 / X_1 and Y_2 = X_1.

  1. Find the Inverse Transformation: X_1 = Y_2, X_2 = Y_1 Y_2.

  2. Jacobian Determinant: J = det ( ∂x_1/∂y_1  ∂x_1/∂y_2 ; ∂x_2/∂y_1  ∂x_2/∂y_2 ) = det ( 0  1 ; y_2  y_1 ) = -y_2, so |J| = |y_2|.

  3. Joint PDF of Y_1 and Y_2: f_{Y_1,Y_2}(y_1, y_2) = f_{X_1,X_2}(y_2, y_1 y_2) |y_2|.

  4. Marginal PDF of Y_1: Integrate out Y_2: f_{Y_1}(y_1) = ∫_{-∞}^{∞} f_{Y_1,Y_2}(y_1, y_2) dy_2.

Interpretation of the Jacobian Determinant |J|

The Jacobian determinant |J| accounts for the change in volume elements when transforming from the original variables X to the new variables Y. It ensures that the probability density functions remain properly scaled during the transformation process.

Question: The Role of |d g^{-1}(y) / dy|

The term |d g^{-1}(y) / dy| serves as a scaling factor that adjusts the density function based on how the transformation stretches or compresses the space.

  1. Scaling Factor:

    • If g is an increasing function, the derivative d g^{-1}(y) / dy indicates the rate of change of the original variable x with respect to the transformed variable y.

    • The absolute value ensures that the PDF is non-negative.

  2. Adjustment of Density:

    • When d g^{-1}(y) / dy is large, the transformation compresses the space, leading to a higher density in the transformed space.

    • Conversely, when d g^{-1}(y) / dy is small, the transformation stretches the space, leading to a lower density in the transformed space.

By considering these factors, we ensure that the probability density functions are correctly adjusted for the transformations applied.

Application in Probability

When transforming random variables, the joint PDF of the new variables Y is given by:

f_Y(y) = f_X(g^{-1}(y)) |det(J)|

This formula ensures that the probability density is correctly adjusted for the transformation.

Here, f_X(g^{-1}(y)) means that x is rewritten in terms of y_1 and y_2; for example, with y_1 = x_2 / x_1 and y_2 = x_1, we substitute x_1 = y_2 and x_2 = y_1 y_2.

Problem 1: Find the distribution of Y1=X2X1

Step 1: Define the Transformation

We start with two random variables X_1 and X_2 and define the new variable: Y_1 = X_2 / X_1

We also define a secondary variable Y_2 = X_1. Thus, we have the transformations: y_1 = x_2 / x_1, y_2 = x_1

Step 2: Express the Inverse Transformation

From the definitions of y1 and y2, we can express the original variables X1 and X2 in terms of Y1 and Y2:

x_1 = y_2, x_2 = y_1 y_2

Step 3: Compute the Jacobian Determinant

The Jacobian matrix J for the transformation is given by: J = ( ∂x_1/∂y_1  ∂x_1/∂y_2 ; ∂x_2/∂y_1  ∂x_2/∂y_2 ) = ( 0  1 ; y_2  y_1 )

The determinant of this matrix is: det J = 0·y_1 - 1·y_2 = -y_2, so |det J| = |y_2|

Step 4: Use the Transformation Formula

The joint probability density function (PDF) of Y_1 and Y_2 is given by: f_{Y_1,Y_2}(y_1, y_2) = f_{X_1,X_2}(x_1, x_2) |det J|

Substituting x_1 = y_2 and x_2 = y_1 y_2: f_{Y_1,Y_2}(y_1, y_2) = f_{X_1,X_2}(y_2, y_1 y_2) |y_2|

Step 5: Find the Marginal PDF of Y1

To find the marginal PDF of Y_1, integrate out y_2: f_{Y_1}(y_1) = ∫_{-∞}^{∞} f_{Y_1,Y_2}(y_1, y_2) dy_2 = ∫_{-∞}^{∞} f_{X_1,X_2}(y_2, y_1 y_2) |y_2| dy_2

This integral gives the distribution of Y_1.
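
As a worked instance (an illustrative assumption), take X_1 and X_2 independent Exp(1), so f_{X_1,X_2}(x_1, x_2) = e^{-x_1-x_2}; the sympy sketch below evaluates the marginal integral for the ratio Y_1 = X_2 / X_1:

```python
# Marginal PDF of Y1 = X2/X1 for independent Exp(1) variables, via the Jacobian formula.
import sympy as sp

y1, y2 = sp.symbols('y1 y2', positive=True)
f_joint = sp.exp(-y2 - y1 * y2)                       # f_{X1,X2}(y2, y1*y2)

f_Y1 = sp.integrate(f_joint * y2, (y2, 0, sp.oo))     # integrate out y2 with |J| = y2
print(sp.simplify(f_Y1))                              # 1/(y1 + 1)**2
```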

 

Problem 2: Find the distribution of Y2=X1×X2

Step 1: Define the Transformation

For the second problem, define: Y_2 = X_1 X_2. As the auxiliary variable, take Y_1 = X_1.

Step 2: Express the Inverse Transformation

The original variables X_1 and X_2 can be written as: x_1 = y_1, x_2 = y_2 / y_1

Step 3: Compute the Jacobian Determinant

The Jacobian matrix J for this transformation is given by: J = ( ∂x_1/∂y_1  ∂x_1/∂y_2 ; ∂x_2/∂y_1  ∂x_2/∂y_2 ) = ( 1  0 ; -y_2/y_1²  1/y_1 )

The determinant of this matrix is: det J = 1/y_1, so |det J| = 1/|y_1|

Step 4: Use the Transformation Formula

The joint PDF of Y_2 and Y_1 is: f_{Y_2,Y_1}(y_2, y_1) = f_{X_1,X_2}(x_1, x_2) |det J|

Substituting x_1 = y_1 and x_2 = y_2 / y_1: f_{Y_2,Y_1}(y_2, y_1) = f_{X_1,X_2}(y_1, y_2 / y_1) · (1 / |y_1|)

Step 5: Find the Marginal PDF of Y2

To find the marginal PDF of Y_2, integrate out y_1: f_{Y_2}(y_2) = ∫_{-∞}^{∞} f_{Y_2,Y_1}(y_2, y_1) dy_1 = ∫_{-∞}^{∞} f_{X_1,X_2}(y_1, y_2 / y_1) (1 / |y_1|) dy_1

This integral gives the distribution of Y2.


III. Expectation


1. Expectation of a Random Variable

The expectation (or expected value) of a random variable Y=g(X1,X2,,Xn) can be thought of as the "average" or "mean" value that the random variable takes. It provides a single value that summarizes the entire distribution.

1.1 For Discrete Random Variables:

For discrete random variables X1,X2,,Xn, the expectation of Y is given by:

E(Y) = E[g(X_1, ..., X_n)] = Σ_{x_1, ..., x_n} g(x_1, ..., x_n) p(x_1, ..., x_n), where p(x_1, ..., x_n) is the joint probability mass function (pmf) and the sum runs over all possible values of (x_1, ..., x_n).

1.2 For Continuous Random Variables:

For continuous random variables, the expectation is given by:

E(Y) = E[g(X_1, ..., X_n)] = ∫_{-∞}^{∞} ... ∫_{-∞}^{∞} g(x_1, ..., x_n) f(x_1, ..., x_n) dx_1 ... dx_n, where f(x_1, ..., x_n) is the joint probability density function (pdf).

1.3 Decision Making Under Uncertainty Using Expectation

Expectation helps in making decisions under uncertainty by providing a criterion to evaluate the outcomes.

Example 1.15: A game costs 3 to play. With probability 1/36 you win 100; otherwise you win nothing. Should you play?

Solution: E(winnings) = (1/36 × 100) + (35/36 × 0) = 100/36 ≈ 2.78. Since the expected winnings (≈ 2.78) are less than the cost of the ticket (3), you should not play the game.

Question:


2. Various Characteristic Measures

Mean: taking g(x) = x, E[g(X)] = E(X) is called the mean of X, denoted by μ_X.

Variance: taking g(x) = (x - μ_X)², E[g(X)] = E((X - μ_X)²) = E(X²) - μ_X² is called the variance of X, denoted by Var(X) or σ_X².

Covariance: taking g(x, y) = (x - μ_X)(y - μ_Y), E[g(X, Y)] = E((X - μ_X)(Y - μ_Y)) = E(XY) - μ_X μ_Y is called the covariance of X and Y, denoted by Cov(X, Y) or σ_XY.

Correlation Coefficient: ρ_XY = σ_XY / (σ_X σ_Y).
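
The sketch below (illustrative) computes these measures for the joint PMF table of Section 4.4:

```python
# Mean, variance, covariance, and correlation from a joint PMF.
import math

pmf = {(0, 0): 0.1, (0, 1): 0.2, (0, 2): 0.1,
       (1, 0): 0.1, (1, 1): 0.1, (1, 2): 0.2,
       (2, 0): 0.1, (2, 1): 0.05, (2, 2): 0.05}

def expect(g):
    """E[g(X, Y)] = sum of g(x, y) * p_XY(x, y)."""
    return sum(g(x, y) * p for (x, y), p in pmf.items())

mu_x, mu_y = expect(lambda x, y: x), expect(lambda x, y: y)
var_x = expect(lambda x, y: x**2) - mu_x**2
var_y = expect(lambda x, y: y**2) - mu_y**2
cov   = expect(lambda x, y: x * y) - mu_x * mu_y
rho   = cov / math.sqrt(var_x * var_y)

print(mu_x, mu_y, var_x, var_y, cov, rho)
```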


3. Properties


4. Common Mistakes in Using Expectation

  1. Independence and Expectation: E[g(X) h(Y)] = E[g(X)] E[h(Y)] holds if X and Y are independent; it need not hold otherwise.

  2. Linearity of Expectation: expectation is linear, but for a nonlinear g, in general E[g(X)] ≠ g(E(X)).

  3. Independence and Uncorrelatedness: If X and Y are independent, then Cov(X, Y) = 0. The converse is not always true; zero covariance does not imply independence.


5. Moment Generating Function (MGF)

5.1 Definition

The moment generating function M_X(t) of a random variable X is defined as: M_X(t) = E(e^{tX}), t ∈ ℝ, if the expectation exists.
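
As an illustrative example (a standard normal X, chosen here as an assumption), the sympy sketch below computes M_X(t) directly from the definition and reads off the first two moments:

```python
# MGF of X ~ N(0, 1) and its first two moments via derivatives at t = 0.
import sympy as sp

t = sp.symbols('t', real=True)
x = sp.symbols('x', real=True)
pdf = sp.exp(-x**2 / 2) / sp.sqrt(2 * sp.pi)               # N(0, 1) density

M = sp.simplify(sp.integrate(sp.exp(t * x) * pdf, (x, -sp.oo, sp.oo)))
print(M)                                                   # should simplify to exp(t**2/2)
print(sp.diff(M, t, 1).subs(t, 0))                         # E[X]   = 0
print(sp.diff(M, t, 2).subs(t, 0))                         # E[X^2] = 1
```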

5.2 Properties

  1. Existence: The mgf may or may not exist for any particular value of t.

  2. Uniqueness Theorem: If the mgf exists for t in an open interval containing zero, it uniquely determines the probability distribution.

  3. Moments: If the mgf exists in an open interval containing zero, then: M_X^{(k)}(0) = E(X^k)

  4. Linear Transformation: For any constants a, b: M_{a+bX}(t) = e^{at} M_X(bt)

  5. Independence: If X and Y are independent, then: M_{X+Y}(t) = M_X(t) M_Y(t)

5.3 Joint Moment Generating Function (Joint MGF)

For random variables X_1, ..., X_n, their joint mgf is defined as: M_{X_1,...,X_n}(t_1, ..., t_n) = E(e^{t_1 X_1 + ... + t_n X_n}), if the expectation exists.

5.4 Properties of Joint MGF

  1. Marginal MGF: M_{X_1}(t_1) = M_{X_1,...,X_n}(t_1, 0, ..., 0)

  2. Uniqueness Theorem: Similar to the univariate case, the joint mgf uniquely determines the joint distribution if it exists for t in an open interval containing zero.

  3. Independence: X_1, ..., X_n are independent if and only if: M_{X_1,...,X_n}(t_1, ..., t_n) = Π_{i=1}^{n} M_{X_i}(t_i)

  4. Moments: ∂^{r_1+...+r_n} / (∂t_1^{r_1} ... ∂t_n^{r_n}) M_{X_1,...,X_n}(t_1, ..., t_n), evaluated at (0, ..., 0), equals E(X_1^{r_1} ... X_n^{r_n})

Application: Portfolio Returns

Problem Setup

Given: two assets A and B with random returns R_A and R_B, expected returns E(R_A) = μ_A and E(R_B) = μ_B, variances σ_A² and σ_B², and covariance Cov(R_A, R_B) = σ_AB.

Portfolio: a fraction x_A of wealth is invested in asset A and a fraction x_B in asset B.

Portfolio return: R_P = x_A R_A + x_B R_B

Expected Return

E(R_P) = E(x_A R_A + x_B R_B) = x_A E(R_A) + x_B E(R_B) = x_A μ_A + x_B μ_B

Portfolio Variance

Var(R_P) = Var(x_A R_A + x_B R_B)

Using the properties of variance: Var(R_P) = x_A² Var(R_A) + x_B² Var(R_B) + 2 x_A x_B Cov(R_A, R_B) = x_A² σ_A² + x_B² σ_B² + 2 x_A x_B σ_AB

Portfolio Standard Deviation

σ_P = √Var(R_P) = √(x_A² σ_A² + x_B² σ_B² + 2 x_A x_B σ_AB)
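
A numerical sketch of these formulas; the weights, means, standard deviations, and correlation below are hypothetical values chosen purely for illustration:

```python
# Portfolio mean, variance, and standard deviation from the formulas above.
import math

x_A, x_B = 0.6, 0.4                   # hypothetical portfolio weights
mu_A, mu_B = 0.08, 0.05               # hypothetical expected returns
sigma_A, sigma_B = 0.20, 0.10         # hypothetical standard deviations
rho = 0.3                             # hypothetical correlation between R_A and R_B
sigma_AB = rho * sigma_A * sigma_B    # covariance

mean_P = x_A * mu_A + x_B * mu_B
var_P = x_A**2 * sigma_A**2 + x_B**2 * sigma_B**2 + 2 * x_A * x_B * sigma_AB
print(mean_P, var_P, math.sqrt(var_P))
```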


IV. Conditional Expectation


1. Definition

The conditional expectation of h(Y) given X=x is defined as:

Discrete case: E(h(Y) | X = x) = Σ_y h(y) p_{Y|X}(y|x). In particular, E(Y | X = x) = Σ_y y p_{Y|X}(y|x)

Continuous case: E(h(Y) | X = x) = ∫_{-∞}^{∞} h(y) f_{Y|X}(y|x) dy. In particular, E(Y | X = x) = ∫_{-∞}^{∞} y f_{Y|X}(y|x) dy


2. Questions

  1. For joint pdf f(x,y), if we fix x, is f(x,y) a pdf of y?

    No, f(x,y) by itself is not a pdf of y because it does not integrate to 1 over all possible values of y. It is the joint density function evaluated at a fixed x.

  2. Why is f_{Y|X}(y|x) = f(x, y) / f_X(x) a pdf of y?

    fY|X(y|x) is the conditional pdf of Y given X=x. To be a valid pdf, it must satisfy the following:

    • Non-negativity: f_{Y|X}(y|x) ≥ 0

    • Integrates to 1: ∫_{-∞}^{∞} f_{Y|X}(y|x) dy = 1

    Let's verify this: ∫_{-∞}^{∞} f_{Y|X}(y|x) dy = ∫_{-∞}^{∞} (f(x, y) / f_X(x)) dy = (1 / f_X(x)) ∫_{-∞}^{∞} f(x, y) dy. Since f(x, y) is the joint pdf, ∫_{-∞}^{∞} f(x, y) dy = f_X(x). Thus, (1 / f_X(x)) · f_X(x) = 1. Hence, f_{Y|X}(y|x) is a valid pdf of y.

  3. Conditional Expectation Calculation

    • E(Y | X = x): the mean of the conditional density f_{Y|X}(y|x).

    • Doing this for every value x yields a function of x, namely x ↦ E(Y | X = x).

Example Calculation

Given a joint pdf f_{X,Y}(x, y), the conditional expectation is computed as E(Y | X = x) = ∫_{-∞}^{∞} y f_{Y|X}(y|x) dy, where f_{Y|X}(y|x) = f(x, y) / f_X(x). Thus, E(Y | X = x) = ∫_{-∞}^{∞} y (f(x, y) / f_X(x)) dy
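
As a worked instance (an assumed joint pdf, f(x, y) = x + y on the unit square, chosen for illustration), the sympy sketch below carries out exactly these steps:

```python
# Conditional pdf and conditional expectation for f(x, y) = x + y on [0, 1]^2.
import sympy as sp

x, y = sp.symbols('x y', positive=True)
f = x + y                                                   # assumed joint pdf on the unit square

f_X = sp.integrate(f, (y, 0, 1))                            # marginal of X: x + 1/2
f_cond = f / f_X                                            # f_{Y|X}(y|x)

print(sp.simplify(sp.integrate(f_cond, (y, 0, 1))))         # 1: a valid pdf in y
E_Y_given_x = sp.simplify(sp.integrate(y * f_cond, (y, 0, 1)))
print(E_Y_given_x)                                          # (3*x + 2)/(6*x + 3), up to equivalent forms
```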

Important Points

This ensures that f_{Y|X}(y|x) is properly normalized: ∫_{-∞}^{∞} f_{Y|X}(y|x) dy = 1


3. Illustration

Example: Let X be the age (in years), and Y be the height (in cm).

Properties of Conditional Expectation

  1. E_{Y|X}(h(Y) | x) is a function of x and is free of Y.

  2. If X and Y are independent, then E_{Y|X}(h(Y) | x) = E_Y(h(Y)).

  3. E(h(X) | X = x) = h(x).

  4. Let g(x) = E_{Y|X}(h(Y) | x). Then g(X) is a random variable, denoted by E_{Y|X}(h(Y) | X).

  5. Law of Total Expectation: E_X[E_{Y|X}(h(Y) | X)] = E_Y(h(Y))

In particular: E_X[E_{Y|X}(Y | X)] = E_Y(Y)

  6. Variance Decomposition: Var(Y) = Var(E_{Y|X}(Y | X)) + E_X[Var(Y | X)]

Note:

  1. Var(Y) ≥ E_X[Var(Y | X)], with equality if and only if E_{Y|X}(Y | X) = E(Y) with probability one.

  2. Var(Y) ≥ Var(E_{Y|X}(Y | X)), with equality if and only if Var(Y | X) = 0 with probability one, i.e., Y = E_{Y|X}(Y | X) with probability one.


4. δ Method

The δ method is a statistical technique used to approximate the mean and variance of a function of a random variable. The key idea is to use a Taylor series expansion to linearize the function around the mean of the random variable. This method is especially useful when we know the mean and variance of the original random variable but not the full distribution.

Step-by-Step Derivation:

Given: a random variable X with known mean μ_X and variance σ_X², and a twice-differentiable function g; set Y = g(X).

4.1 Taylor Expansion:

We start by expanding g(X) around the mean μX using a Taylor series:

Y = g(X) ≈ g(μ_X) + (X - μ_X) g'(μ_X) + ½ (X - μ_X)² g''(μ_X). This expansion approximates Y as a function of the mean of X and its deviations from the mean.

4.2 Expectation of Y:

To find the expected value of Y, we take the expectation of the Taylor series expansion:

E[Y] = E[g(X)] ≈ E[g(μ_X) + (X - μ_X) g'(μ_X) + ½ (X - μ_X)² g''(μ_X)]

Since E[X - μ_X] = 0, the linear term vanishes: E[Y] ≈ g(μ_X) + ½ E[(X - μ_X)²] g''(μ_X)

Since E[(X - μ_X)²] = σ_X², this simplifies to: E[Y] ≈ g(μ_X) + ½ σ_X² g''(μ_X)

However, for most practical purposes, especially when g(X) is approximately linear around μ_X, the second-order term is small, and we often approximate: E[Y] ≈ g(μ_X)

4.3 Variance of Y:

Next, we compute the variance of Y: Var[Y] = E[(Y - E[Y])²]

Substituting the first-order Taylor expansion into the variance formula: Var[Y] ≈ Var[g(μ_X) + (X - μ_X) g'(μ_X)]

Since g(μ_X) is a constant, it does not contribute to the variance, so we have: Var[Y] ≈ Var[(X - μ_X) g'(μ_X)]

Using the property that Var[aX] = a² Var[X]: Var[Y] ≈ (g'(μ_X))² Var[X]

Substituting Var[X] = σ_X²: Var[Y] ≈ (g'(μ_X))² σ_X²
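
A Monte Carlo check of the two approximations (the choice X ~ Normal(1, 0.1²) and g(x) = eˣ is an illustrative assumption):

```python
# Compare simulated mean and variance of Y = g(X) with the delta-method approximations.
import numpy as np

mu, sigma = 1.0, 0.1
g = np.exp                          # g(x) = e^x, so g'(x) = g''(x) = e^x

rng = np.random.default_rng(1)
x = rng.normal(mu, sigma, size=1_000_000)
y = g(x)

print(y.mean(), np.exp(mu))                          # E[Y] vs the approximation g(mu)
print(y.var(), (np.exp(mu) ** 2) * sigma**2)         # Var[Y] vs g'(mu)^2 * sigma^2
```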

Summary:

Interpretation:

The δ method, often called the delta method, is a statistical technique used primarily to approximate the mean and variance of a function of a random variable. It relies on a first-order Taylor series expansion around the mean of the random variable to linearize the function.

4.4 Prerequisites for Using the Delta Method

  1. Differentiable Function g(X):

    • The function g(X) should be differentiable at the mean μX of the random variable X.

    • Differentiability ensures that we can approximate the function g(X) using a first-order Taylor expansion around the mean μX.

  2. Known Mean and Variance of X:

    • You should know the mean μX and variance σX2 of the random variable X.

    • These parameters are crucial for applying the Taylor expansion and subsequently estimating the mean and variance of Y=g(X).

  3. Small Variance Assumption:

    • The method works best when the variance σX2 of the random variable X is relatively small.

    • If σX2 is large, higher-order terms in the Taylor expansion might become significant, reducing the accuracy of the approximation.

  4. Linear Approximation Validity:

    • The function g(X) should not deviate significantly from linearity around the mean μX.

    • The accuracy of the δ method depends on how well the linear approximation (first-order Taylor expansion) represents the function g(X) near μX.

  5. Random Variable X Centered Around Mean:

    • The random variable X should be well-centered around its mean μX.

    • This ensures that the higher-order terms in the Taylor series expansion are negligible, making the linear approximation (first-order expansion) sufficient.