Sampling

Last updated Nov 20, 2022

Sampling refers to the process of extracting a subset of a population. It is typically done because practical constraints make it infeasible to collect data from the entire population.

Statistics are often used to infer population parameters.

# Terminologies

| Term | Definition |
| --- | --- |
| Population | The entire group of entities from which data can be collected |
| Sample | A subset of a population; a smaller group of entities selected from the population |
| Parameter | A numerical measurement computed from a population |
| Statistic | A numerical measurement computed from a sample |

# Notations

| Representation | Parameter symbol (population) | Statistic symbol (sample) |
| --- | --- | --- |
| Size | $N$ | $n$ |
| Mean | $\mu$ | $\bar X$ |
| Standard deviation | $\sigma$ | $s$ |
| Variance | $\sigma^2$ | $s^2$ |
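To make the notation concrete, here is a minimal Python sketch that computes each symbol for a made-up population. The population values, the seed, and the sample size are assumptions chosen for illustration. It also assumes the common convention that $s$ divides by $n - 1$, which the notes above do not state.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population of N = 1000 measurements (illustrative only).
population = [rng.gauss(50, 10) for _ in range(1000)]
sample = rng.sample(population, 40)  # n = 40, drawn without replacement

N, n = len(population), len(sample)
mu = statistics.fmean(population)      # population mean (parameter)
sigma = statistics.pstdev(population)  # population standard deviation (divides by N)
x_bar = statistics.fmean(sample)       # sample mean (statistic)
s = statistics.stdev(sample)           # sample standard deviation (divides by n - 1)

print(f"N={N}, mu={mu:.2f}, sigma={sigma:.2f}")
print(f"n={n}, x_bar={x_bar:.2f}, s={s:.2f}")
```

The statistics will generally differ a little from the parameters, which is the sampling error discussed later.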

# Sampling methods

# Random sampling

A technique whereby a sample is selected from a population entirely by chance (randomly). Each entity in the population has a known probability of being selected. Reduces the possibility of bias in sampling.

There are three kinds of random sampling:

# Simple random sampling

A random sampling method that ensures that each entity in the population has an equal chance of being included in the sample.
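As a rough sketch, simple random sampling can be done with Python's standard `random.sample`, which draws distinct items with equal probability. The helper name `simple_random_sample` and the entity IDs are hypothetical.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct entities; every entity is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, n)

population = list(range(1, 101))  # hypothetical entity IDs 1..100
sample = simple_random_sample(population, 10, seed=42)
print(sample)
```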

# Stratified random sampling

A random sampling method that selects a sample from different groups in the population, ensuring that a particular group in the population won’t be missed out.
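One possible way to sketch this in Python is to split the population into groups (strata) and draw a simple random sample inside each one. The helper `stratified_sample`, the group labels, and the choice of proportional allocation with at least one entity per group are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(entities, stratum_of, fraction, seed=None):
    """Sample the same fraction from every stratum, with at least one per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for entity in entities:
        strata[stratum_of(entity)].append(entity)

    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # never skip a stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 60 entities in group A, 30 in B, 10 in C.
entities = (
    [("A", i) for i in range(60)]
    + [("B", i) for i in range(30)]
    + [("C", i) for i in range(10)]
)
sample = stratified_sample(entities, stratum_of=lambda e: e[0], fraction=0.1, seed=1)
counts = {g: sum(1 for e in sample if e[0] == g) for g in "ABC"}
print(counts)  # {'A': 6, 'B': 3, 'C': 1}
```

Even the smallest group (C) is represented, which a simple random sample of the same size would not guarantee.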

# Systematic random sampling

A random sampling method in which a starting point is chosen at random and every $k$-th entity after it in the population list is selected. It is easy to implement and reasonably efficient, but bias may exist if the population list follows a pattern that lines up with the interval $k$.
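A minimal sketch of the idea, assuming the population is held in an ordered list and the start is drawn uniformly from the first $k$ positions (the helper name `systematic_sample` is hypothetical):

```python
import random

def systematic_sample(population, k, seed=None):
    """Pick a random start in the first k entities, then take every k-th entity."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # random starting index in 0..k-1
    return population[start::k]

population = list(range(100))  # hypothetical ordered population list
sample = systematic_sample(population, k=10, seed=7)
print(sample)
```

If the list had a cycle of length 10 (for example, every tenth entity sharing some trait), every selected entity would fall at the same point in the cycle, which is the pattern bias mentioned above.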

# Quota sampling

A technique commonly used in marketing research where interviewers are given a quota of interviewees of a certain type to interview. It is not random in nature, as not every entity in the population has a chance of being selected.

There may also be additional bias, as interviewers may favour approachable and helpful interviewees over interviewees from diverse backgrounds.

# Sampling distribution

Refers to the probability distribution of a statistic. More of a theoretical concept than one observed from experiment. As statistics are random variables, each statistic follows a particular distribution.

# Sampling distribution of the sample mean

Refers to the probability distribution of all possible values the sample mean can take when a sample (of size $n$) is taken from a particular population. It is usually treated as a continuous probability distribution.

An important statistic is the sample's mean ($\bar X$), so we often concern ourselves with the sampling distribution of the sample mean.

# Mean and variance

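When a very large number of samples (each of size $n$) is repeatedly and independently drawn from a population, either with replacement or from a population large enough to be treated as infinite, the mean of the resulting sample means approaches the population mean, and their variance approaches the population variance divided by $n$.

Expressed mathematically, for any sample size $n$,

$$ E(\bar X) = \mu $$

$$ Var(\bar X) = \frac {1} {n} \times \sigma^2 = \frac {\sigma^2} {n} $$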

On the other hand, if sampling is done without replacement from a finite population (of size $N$) and $n$ is not a very small fraction of $N$, the finite population correction factor needs to be applied to the variance:

$$ Var(\bar X) = \frac {\sigma^2} {n} \times \frac {N - n} {N - 1} $$
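Both variance formulas can be checked with a small simulation. The sketch below is only an illustration. The population (the integers 1 to 20), the sample size, the seed, and the number of repetitions are all assumed values.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

population = list(range(1, 21))  # hypothetical finite population, N = 20
N = len(population)
mu = statistics.fmean(population)          # 10.5
sigma2 = statistics.pvariance(population)  # 33.25
n = 5
reps = 20_000  # number of repeated samples

# Sample means when drawing with replacement (independent draws).
with_repl = [statistics.fmean(rng.choices(population, k=n)) for _ in range(reps)]
# Sample means when drawing without replacement from the finite population.
without_repl = [statistics.fmean(rng.sample(population, n)) for _ in range(reps)]

theory_with = sigma2 / n                          # sigma^2 / n
theory_without = sigma2 / n * (N - n) / (N - 1)   # with the correction factor

sim_var_with = statistics.pvariance(with_repl)
sim_var_without = statistics.pvariance(without_repl)

print(f"mean of sample means: {statistics.fmean(with_repl):.3f} (mu = {mu})")
print(f"with replacement:     simulated {sim_var_with:.3f}, theory {theory_with:.3f}")
print(f"without replacement:  simulated {sim_var_without:.3f}, theory {theory_without:.3f}")
```

Here $n$ is a quarter of $N$, so the correction factor is far from 1 and the two variances differ noticeably.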

# Sampling error

Sampling error is the difference between a statistic and the parameter it estimates. It is always present, because a sample is never a perfect representation of its population. The standard error of the mean is the standard deviation of the sampling distribution of the mean.

When a very large number of samples (each of size $n$) is repeatedly and independently drawn from a population, either with replacement or from a population large enough to be treated as infinite, the standard error of the mean is:

$$ SE_{mean} = \sqrt {Var(\bar X)} = \frac {\sigma} {\sqrt n} $$

On the other hand, if sampling is done without replacement from a finite population (of size $N$) and $n$ is not a very small fraction of $N$, the finite population correction factor needs to be applied:

$$ SE_{mean} = \sqrt {Var(\bar X)} = \frac {\sigma} {\sqrt n} \times \sqrt {\frac {N - n} {N - 1}} $$
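The two standard error formulas translate directly into a small helper. This is a sketch only: the function name `standard_error_of_mean` and the example values of $\sigma$, $n$, and $N$ are assumptions.

```python
import math

def standard_error_of_mean(sigma, n, N=None):
    """Return sigma / sqrt(n), times the finite population correction if N is given."""
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))  # finite population correction factor
    return se

se_infinite = standard_error_of_mean(sigma=10, n=25)       # 10 / 5 = 2.0
se_finite = standard_error_of_mean(sigma=10, n=25, N=100)  # about 1.74

print(se_infinite, se_finite)
```

The correction factor is always at most 1, so the finite-population standard error is never larger than $\sigma / \sqrt n$.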

# Central limit theorem

If $X_1, X_2, \dots, X_n$ is a random sample (of size $n \geq 30$) taken from a population described by a random variable $X$ (of any kind of distribution) with a finite mean ($\mu$) and variance ($\sigma^2$), the sample mean ($\bar X$) is approximately normally distributed. The threshold $n \geq 30$ is a common rule of thumb rather than a strict requirement.

Expressed mathematically, when $n \geq 30$, approximately,

$$ \bar X \sim N\left(\mu, \frac {\sigma^2} {n}\right) $$
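The theorem can be illustrated with a simulation on a clearly non-normal population. The sketch below assumes an exponential population with rate 1 (so $\mu = 1$ and $\sigma = 1$), a sample size of 30, and 20,000 repetitions. These values are chosen for illustration only.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

n, reps = 30, 20_000
mu, sigma = 1.0, 1.0  # mean and standard deviation of an exponential(1) population

# Draw many samples of size n from the skewed population and keep each sample mean.
means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

se = sigma / math.sqrt(n)
# For a normal distribution, about 95% of values lie within 1.96 standard errors.
within = sum(abs(m - mu) <= 1.96 * se for m in means) / reps

print(f"mean of sample means:     {statistics.fmean(means):.4f} (mu = {mu})")
print(f"variance of sample means: {statistics.pvariance(means):.4f} (sigma^2/n = {sigma**2 / n:.4f})")
print(f"share within 1.96 SE:     {within:.3f}")
```

Although individual exponential values are strongly skewed, the share of sample means within 1.96 standard errors of $\mu$ should come out close to the 95% expected under a normal distribution.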