Biostatistics & Bioinformatics
Biostatistics applies statistical principles to biological and medical research. Bioinformatics uses computational tools to analyze biological data. Both are increasingly essential in modern medicine.
Key Biostatistical Concepts
- Types of Data: Nominal (categories, no order), Ordinal (ranked categories), Interval (equal intervals, no true zero), Ratio (equal intervals + true zero — most lab values)
- Measures of Central Tendency: Mean (average), Median (middle value — better for skewed data), Mode (most frequent)
- Measures of Dispersion: Range, Variance (mean of squared deviations from the mean; the sample variance divides by n-1 rather than n), SD (square root of variance), SE of the mean (SD/√n), Coefficient of Variation (CV = SD/mean × 100%)
- Normal Distribution: Bell-shaped, symmetric about the mean; Mean ±1 SD ≈ 68%, ±2 SD ≈ 95% (exactly 95% at ±1.96 SD), ±3 SD ≈ 99.7%
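The measures of central tendency and dispersion listed above can be computed with Python's standard library. This is a minimal sketch: the glucose values are a made-up sample chosen to be right-skewed, not real data, and the variable names are illustrative.

```python
import math
import statistics

# Hypothetical sample: fasting glucose in mg/dL (illustrative values only).
values = [82, 88, 90, 91, 95, 95, 102, 110, 145]

mean = statistics.mean(values)        # arithmetic average
median = statistics.median(values)    # middle value of the sorted data
mode = statistics.mode(values)        # most frequent value
value_range = max(values) - min(values)

sample_variance = statistics.variance(values)  # sample variance (divides by n - 1)
sample_sd = statistics.stdev(values)           # square root of the sample variance
standard_error = sample_sd / math.sqrt(len(values))  # SE of the mean = SD / sqrt(n)
cv_percent = sample_sd / mean * 100            # coefficient of variation in %

print(f"mean={mean:.1f} median={median} mode={mode} range={value_range}")
print(f"SD={sample_sd:.1f} SE={standard_error:.1f} CV={cv_percent:.1f}%")
```

The single high value (145) pulls the mean above the median, which is why the median is the better summary for skewed data.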
Diagnostic Test Parameters
- Sensitivity = TP/(TP+FN): Ability to detect disease when present. High sensitivity → low FN. Best for ruling OUT disease (SnNout). Screening test.
- Specificity = TN/(TN+FP): Ability to correctly identify people without the disease. High specificity → low FP. Best for ruling IN disease (SpPin). Confirmatory test.
- PPV = TP/(TP+FP): Probability that a patient with a positive test truly has the disease. Depends on PREVALENCE (↑prevalence → ↑PPV).
- NPV = TN/(TN+FN): Probability that a patient with a negative test truly does not have the disease. Also depends on prevalence (↑prevalence → ↓NPV).
- LR+ = Sensitivity/(1-Specificity): How much a positive test increases the odds of disease (post-test odds = pre-test odds × LR+)
- ROC curve: Plot of Sensitivity vs (1-Specificity) across all cutoffs. AUC (Area Under Curve) summarizes overall discrimination; 0.5 = no better than chance, 1.0 = perfect.
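The formulas above can be checked on a 2x2 table. The sketch below uses hypothetical counts (100 diseased and 900 non-diseased people), and the function names `diagnostic_metrics` and `ppv_at_prevalence` are illustrative, not from any library. The second function applies Bayes' theorem to show how PPV moves with prevalence while sensitivity and specificity stay fixed.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute test parameters from the four cells of a 2x2 table."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "lr_positive": sensitivity / (1 - specificity),
    }


def ppv_at_prevalence(sensitivity, specificity, prevalence):
    """PPV for a given prevalence, from Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical counts: prevalence is 100 / 1000 = 10%.
m = diagnostic_metrics(tp=90, fp=90, fn=10, tn=810)
for name, value in m.items():
    print(f"{name}: {value:.3f}")

# Same test, different prevalence: PPV falls as the disease gets rarer.
for prevalence in (0.01, 0.10, 0.50):
    ppv = ppv_at_prevalence(0.9, 0.9, prevalence)
    print(f"prevalence={prevalence:.2f} -> PPV={ppv:.3f}")
```

With 90% sensitivity and 90% specificity, half of all positives are false at 10% prevalence, which is the practical reason a positive screening result is followed by a confirmatory test.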
Hypothesis Testing
- Null hypothesis (H₀): No difference (e.g., treatment has no effect)
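- p-value: Probability of observing a result at least as extreme as the one obtained, assuming H₀ is true. It is not the probability that H₀ is true. p < 0.05 → reject H₀ (statistically significant by convention)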
- Type I error (α): Reject H₀ when it is true (false positive). Controlled by α level (0.05).
- Type II error (β): Fail to reject H₀ when it is false (false negative). Power = 1-β.
- Common tests: t-test (compare 2 means), ANOVA (compare >2 means), Chi-squared (association between categorical variables), Mann-Whitney U (non-parametric comparison of 2 independent groups)
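As one worked instance of the testing logic above, the sketch below runs a Pearson chi-squared test (without continuity correction) on a 2x2 table using only the standard library. The trial counts are hypothetical and the function name `chi_squared_2x2` is illustrative. For 1 degree of freedom the p-value has the closed form erfc(√(χ²/2)), which avoids needing a statistics package.

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom.
    No continuity correction is applied.
    """
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )
    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical trial: 30 of 100 treated patients had the event,
# versus 50 of 100 controls. H0: event rate is the same in both arms.
stat, p = chi_squared_2x2(30, 70, 50, 50)
reject_h0 = p < 0.05  # alpha = 0.05

print(f"chi2={stat:.2f}, p={p:.4f}, reject H0: {reject_h0}")
```

Rejecting H₀ here still carries a Type I error risk of α; with small expected cell counts, an exact test would be the more usual choice.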
Bioinformatics
- Sequence Alignment: BLAST (Basic Local Alignment Search Tool) — rapidly aligns query sequence to database; finds homologs. CLUSTAL — multiple sequence alignment.
- Databases: GenBank/NCBI (DNA sequences), UniProt/SwissProt (proteins), PDB (3D protein structures), OMIM (Online Mendelian Inheritance in Man — genetic diseases)
- Genome Browsers: UCSC, Ensembl — visualize genome, annotations, variants
- Variant Annotation: ClinVar, dbSNP, COSMIC (cancer somatic mutations)
- Protein Structure: AlphaFold2 (DeepMind), a deep-learning model that predicts 3D protein structure from amino acid sequence, in many cases with near-experimental accuracy. It has substantially changed structural biology.
- Pathway Analysis: KEGG, Reactome — map gene sets to biological pathways; used in RNA-seq data interpretation
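To make the idea of local alignment from the Sequence Alignment entry concrete, here is a minimal Smith-Waterman scoring sketch. This is not how BLAST is implemented: BLAST uses a faster seed-and-extend heuristic, whereas Smith-Waterman is the exact dynamic-programming method for the same problem. The scoring values (match=2, mismatch=-1, gap=-2) are arbitrary assumptions for illustration.

```python
def smith_waterman_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between two sequences (linear gap penalty)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # score[i][j] = best score of a local alignment ending at seq_a[i-1], seq_b[j-1]
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq_a[i - 1] == seq_b[j - 1]
            diagonal = score[i - 1][j - 1] + (match if same else mismatch)
            up = score[i - 1][j] + gap      # gap in seq_b
            left = score[i][j - 1] + gap    # gap in seq_a
            # Flooring at 0 lets an alignment restart anywhere (local, not global).
            score[i][j] = max(0, diagonal, up, left)
            best = max(best, score[i][j])
    return best


# Two hypothetical DNA fragments sharing the motif "ACGT".
print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

The shared four-base motif dominates the score even though the flanking bases differ, which is the behavior that makes local alignment useful for finding homologous regions.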
Epidemiology Measures
- Incidence: New cases per population at risk per unit time
- Prevalence: Existing cases per population at a time point
- RR (Relative Risk): Risk in exposed / Risk in unexposed (cohort study)
- OR (Odds Ratio): Odds of exposure in cases / Odds of exposure in controls (case-control study; approximates RR when the disease is rare)
- NNT (Number Needed to Treat): 1/ARR, where ARR (Absolute Risk Reduction) = event rate in controls - event rate in treated
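The measures above reduce to simple arithmetic on counts. The sketch below uses hypothetical cohort counts and event rates, and the function names are illustrative. The second cohort shows the OR approaching the RR when the outcome is rare.

```python
def cohort_measures(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Relative risk and odds ratio from cohort counts."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    odds_exposed = exposed_cases / (exposed_total - exposed_cases)
    odds_unexposed = unexposed_cases / (unexposed_total - unexposed_cases)
    return {
        "relative_risk": risk_exposed / risk_unexposed,
        "odds_ratio": odds_exposed / odds_unexposed,
    }


def number_needed_to_treat(control_event_rate, treated_event_rate):
    """NNT = 1 / ARR, where ARR = control rate - treated rate."""
    arr = control_event_rate - treated_event_rate
    return 1 / arr


# Hypothetical common outcome: 30/100 exposed vs 10/100 unexposed.
common = cohort_measures(30, 100, 10, 100)
# Hypothetical rare outcome: 2/1000 exposed vs 1/1000 unexposed.
rare = cohort_measures(2, 1000, 1, 1000)
# Hypothetical trial: 20% event rate in controls, 15% in treated.
nnt = number_needed_to_treat(0.20, 0.15)

print(f"common: RR={common['relative_risk']:.2f}, OR={common['odds_ratio']:.2f}")
print(f"rare:   RR={rare['relative_risk']:.3f}, OR={rare['odds_ratio']:.3f}")
print(f"NNT={nnt:.0f}")
```

For the common outcome the OR overstates the RR noticeably, while for the rare outcome the two nearly coincide.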