CM6.1-6 | Biostatistics for Community Medicine — Glossary

Key terms in this module.

Alternative hypothesis (H₁)

The hypothesis that a real difference or association exists in the population; accepted when the null hypothesis is rejected.

ANOVA (Analysis of Variance)

A parametric test comparing means of three or more independent groups; uses the F-statistic (ratio of between-group to within-group variance); a significant result indicates at least one group mean differs.

Arithmetic mean

The sum of all observations divided by the number of observations; appropriate for symmetric, continuous data; sensitive to outliers.

Biostatistics

The application of statistical principles and methods to biological and medical data to collect, organise, analyse, and interpret health information.

Central Limit Theorem (CLT)

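The theorem stating that the distribution of sample means approaches a normal distribution as sample size increases, regardless of the shape of the underlying population distribution (provided its variance is finite); n ≥ 30 is a common rule of thumb for "large enough", not a strict threshold.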
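The theorem can be illustrated with a small simulation. The sketch below is only an illustration under assumed settings: the exponential population, the number of draws, and the function name sample_means are choices made here, not part of the module.

```python
import random
import statistics

def sample_means(n, draws=2000, seed=0):
    """Return `draws` sample means, each computed from n exponential(1) values.

    The exponential distribution is strongly right-skewed, with mean 1 and SD 1.
    """
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(draws)
    ]

for n in (2, 30, 200):
    means = sample_means(n)
    # The centre of the sample means stays near the population mean (1),
    # while their spread shrinks roughly like 1 / sqrt(n).
    print(n, round(statistics.fmean(means), 3), round(statistics.stdev(means), 3))
```

Plotting a histogram of each list would show the shape moving from skewed towards bell-shaped as n grows.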

Chi-square test

A non-parametric test for association between two categorical variables in a contingency table, or for goodness-of-fit; as a rule of thumb, requires expected frequency ≥5 in all cells; χ² = Σ[(O−E)²/E].

Clinical significance

Whether a statistically significant finding is large enough to be meaningful in clinical or public-health practice; determined by effect size and the minimum clinically important difference, NOT by the p-value alone.

Cluster sampling

A sampling method in which naturally occurring groups (clusters) are randomly selected and all (or a random subset of) individuals within selected clusters are surveyed.

Coefficient of variation (CV)

SD expressed as a percentage of the mean (SD/mean × 100%); allows comparison of relative variability across datasets with different units or different means.
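As a minimal sketch of the formula, assuming the sample SD (n − 1 denominator) is intended; the function name and the data are hypothetical:

```python
import statistics

def coefficient_of_variation(values):
    """CV = SD / mean x 100, using the sample SD (n - 1 denominator)."""
    return statistics.stdev(values) / statistics.fmean(values) * 100

# Hypothetical measurements in two different units.
weights_kg = [58, 62, 65, 70, 75]
heights_cm = [150, 155, 160, 165, 170]

# CV is unit-free, so the two spreads can be compared directly.
print(round(coefficient_of_variation(weights_kg), 1))
print(round(coefficient_of_variation(heights_cm), 1))
```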

Confidence interval (CI)

A range of values computed from sample data by a procedure that, across repeated sampling, yields intervals containing the true population parameter at the specified confidence level (e.g. 95% of the time); for a mean from a large sample, 95% CI = mean ± 1.96 × SE.
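The large-sample formula can be sketched as follows. The function name mean_ci95 and the haemoglobin values are hypothetical; for small samples a t-based multiplier would normally replace 1.96, which this sketch does not do.

```python
import math
import statistics

def mean_ci95(values):
    """Large-sample 95% CI for a mean: mean +/- 1.96 x SE, where SE = SD / sqrt(n)."""
    n = len(values)
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(n)  # sample SD, n - 1 denominator
    return mean - 1.96 * se, mean + 1.96 * se

# Hypothetical haemoglobin readings (g/dL), for illustration only.
hb = [11.2, 12.5, 10.8, 13.1, 12.0, 11.7, 12.9, 11.4]
lower, upper = mean_ci95(hb)
print(round(lower, 2), round(upper, 2))
```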

Continuous variable

A variable that can take any value within a range, limited only by measurement precision (e.g. haemoglobin, blood glucose, body weight).

Degrees of freedom (df)

The number of independent values that can vary in a statistical calculation; for a chi-square contingency table, df = (rows − 1)(columns − 1); for a two-sample t-test, df = n₁ + n₂ − 2.

Design effect

The ratio of the actual variance under cluster sampling to the variance under simple random sampling of the same total size; usually greater than 1 because members of the same cluster tend to be more similar to each other than to the general population.

Discrete variable

A variable that can take only distinct whole-number values with no intermediate fractions (e.g. number of children, number of admissions).

Epi Info

Free, CDC-developed statistical software for field epidemiology and public health; supports data entry, frequency analysis, cross-tabulation, chi-square, t-tests, and outbreak investigation.

Expected frequency

The theoretical frequency in a contingency table cell under the null hypothesis of no association; = (row total × column total) / grand total; must be ≥5 in all cells for chi-square to be valid.
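The expected-frequency formula and the chi-square statistic χ² = Σ[(O−E)²/E] can be sketched together. The function names and the 2×2 counts below are hypothetical, and the sketch computes only the statistic and degrees of freedom, not a p-value.

```python
def expected_counts(table):
    """Expected count per cell = (row total x column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    expected = expected_counts(table)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical 2x2 table: rows = exposed / unexposed, columns = diseased / healthy.
observed = [[20, 30], [30, 20]]
print(expected_counts(observed))
print(chi_square(observed))
```

Checking that every value returned by expected_counts is at least 5 before trusting the statistic mirrors the validity condition stated above.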

Fisher's exact test

A non-parametric test for association in a 2×2 contingency table used when any expected frequency is <5; calculates exact probabilities rather than approximating from the chi-square distribution.

Frequency distribution

A tabular arrangement of data showing the frequency (count) of observations falling within each class interval or category.

Histogram

A graphical display for continuous data grouped into class intervals; bars are adjacent (no gaps), and bar area is proportional to frequency.

Interquartile range (IQR)

The difference between the 75th percentile (Q3) and the 25th percentile (Q1); represents the middle 50% of data; robust to outliers.

Interval scale

A scale with equal intervals between values but no true zero point (e.g. temperature in Celsius); ratios between values are not meaningful.

Kruskal-Wallis test

Non-parametric equivalent of one-way ANOVA; tests for differences across ≥3 independent groups using rank sums; used when data are ordinal or not normally distributed.

Mann-Whitney U test

Non-parametric equivalent of the independent two-sample t-test; used for ordinal data or non-normally distributed continuous data comparing two independent groups; uses rank sums.

Median

The middle value of ordered data; appropriate for ordinal or skewed continuous data; not influenced by extreme values.

Mode

The most frequently occurring value or category in a dataset; the only appropriate measure of central tendency for nominal data.

Multistage sampling

A sampling method employing sequential probability sampling at multiple hierarchical levels (e.g. district → village → household → individual).

Nominal scale

A scale of measurement in which data are classified into unordered, mutually exclusive categories (e.g. blood group, sex); no quantitative relationship between categories.

Normal distribution

A symmetrical, bell-shaped continuous probability distribution characterised by its mean and standard deviation; 68.27%, 95.45%, and 99.73% of values fall within ±1, ±2, and ±3 SD of the mean.
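The three percentages can be checked numerically. This sketch relies on the standard identity P(|Z| < k) = erf(k / √2) for a standard normal variable; the function name is hypothetical.

```python
import math

def proportion_within(k):
    """Proportion of a normal distribution lying within +/- k SD of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(proportion_within(k) * 100, 2))
```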

Null hypothesis (H₀)

The assumption that any observed difference or association is due to chance alone — that no real effect exists; the default position that a significance test attempts to disprove.

Ogive

A graph of cumulative frequency (or cumulative percentage) plotted against class boundaries; used to read off percentiles graphically.

OpenEpi

A free, browser-based statistical tool for public health; computes sample sizes, proportions with CIs, chi-square, OR, and RR from 2×2 tables; requires no installation.

Ordinal scale

A scale of measurement with ordered categories whose intervals are not necessarily equal (e.g. disease severity: mild/moderate/severe).

P-value

The probability of observing a result as extreme as, or more extreme than, the one obtained, given that the null hypothesis is true; NOT the probability that the null hypothesis is true.

Paired t-test

A t-test applied to paired data (same subjects measured twice, or matched pairs); operates on within-pair differences; more powerful than independent t-test when subjects are their own controls.
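A minimal sketch of how the test operates on within-pair differences, assuming the usual statistic t = mean(d) / (SD(d) / √n) with n − 1 degrees of freedom. The function name and the blood-pressure readings are hypothetical; the p-value would still need to be read from a t table or statistical software.

```python
import math
import statistics

def paired_t(before, after):
    """Return (t statistic, degrees of freedom) for paired measurements."""
    diffs = [a - b for b, a in zip(before, after)]  # within-pair differences
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)     # SE of the mean difference
    return statistics.fmean(diffs) / se, n - 1

# Hypothetical systolic BP (mmHg) in the same five patients before and after treatment.
before = [150, 142, 160, 155, 148]
after = [144, 140, 151, 150, 147]
print(paired_t(before, after))
```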

PICO

A structured framework for formulating research questions: Population, Intervention/Exposure, Comparison, Outcome.

Ratio scale

A scale with equal intervals and a true zero, where zero represents absence of the quantity (e.g. weight, height, haemoglobin); all arithmetic operations are valid.

Simple random sampling

A sampling method in which every member of the population has an equal and independent probability of being selected, typically using a random number table or lottery.

Standard deviation (SD)

The square root of variance; measures the spread of individual observations around the mean within a sample or population.

Standard error (SE)

SD divided by the square root of n; measures the precision of the sample mean as an estimate of the population mean; decreases as sample size increases.

Statistical power

The probability of correctly rejecting a false null hypothesis; power = 1 − β; conventionally ≥0.80 (80%) is considered adequate; increases with larger sample size.

Statistical significance

A finding is statistically significant when the p-value is below the pre-specified α threshold (usually 0.05), indicating the result is unlikely under the null hypothesis.

Stratified random sampling

A sampling method in which the population is divided into homogeneous subgroups (strata) and a random sample is drawn from each stratum, ensuring representation of all strata.

Systematic sampling

A sampling method in which every kth unit is selected from a sampling frame after a random start; k = population size ÷ sample size.
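The procedure can be sketched as below. The function name is hypothetical, and the sketch assumes the sampling frame is an ordered list and that k is taken as the whole-number part of population size ÷ sample size.

```python
import random

def systematic_sample(frame, sample_size, seed=None):
    """Select every kth unit from `frame` after a random start within the first k units."""
    k = len(frame) // sample_size   # sampling interval
    rng = random.Random(seed)
    start = rng.randrange(k)        # random start: index 0 .. k - 1
    return frame[start::k][:sample_size]

# Hypothetical frame of 100 household numbers, sample of 10, so k = 10.
households = list(range(1, 101))
print(systematic_sample(households, 10, seed=1))
```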

T-test

A parametric significance test for comparing means when population SD is unknown; three forms: one-sample (vs hypothesised value), independent two-sample (two different groups), and paired (same subjects measured twice or matched pairs).

Type I error (α)

The error of rejecting the null hypothesis when it is actually true (false positive); its probability, α, is conventionally set at 0.05 (5%), meaning a 5% acceptable risk of this error.

Type II error (β)

The error of failing to reject the null hypothesis when it is actually false (false negative); its probability is β, and statistical power = 1 − β.

Z-score

A standardised score expressing the number of standard deviations a value is from the mean: z = (x − μ) / σ; used with the standard normal distribution for large-sample inference.
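As a short sketch of the formula, with a hypothetical function name and hypothetical population values; NormalDist from the Python standard library is used here to look up the proportion of the standard normal distribution lying below the resulting z.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """z = (x - mu) / sigma: distance of x from the mean in SD units."""
    return (x - mu) / sigma

# Hypothetical example: haemoglobin 10 g/dL in a population with mean 13 and SD 1.5.
z = z_score(10, 13, 1.5)
proportion_below = NormalDist().cdf(z)  # area under the standard normal curve left of z
print(z, round(proportion_below, 4))
```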

Z-test

A parametric significance test for comparing a sample mean to a population mean when population SD is known or sample size is large (n ≥ 30); test statistic follows the standard normal distribution.

48 terms in this module