Biostatistics & Bioinformatics

Biostatistics applies statistical principles to biological and medical research. Bioinformatics uses computational tools to analyze biological data. Both are increasingly essential in modern medicine.

Key Biostatistical Concepts

  • Types of Data: Nominal (categories, no order), Ordinal (ranked categories), Interval (equal intervals, no true zero), Ratio (equal intervals + true zero — most lab values)
  • Measures of Central Tendency: Mean (average), Median (middle value — better for skewed data), Mode (most frequent)
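  • Measures of Dispersion: Range, Variance (mean of squared deviations from the mean; the sample variance divides by n−1 instead of n), SD (square root of variance), SE (standard error of the mean = SD/√n), Coefficient of Variation (CV = SD/mean × 100%)
  • Normal Distribution: Bell-shaped, symmetric about the mean; Mean ±1 SD ≈ 68% of values, ±2 SD ≈ 95%, ±3 SD ≈ 99.7% (exactly 95% lies within ±1.96 SD)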
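
The summary measures above can be computed with Python's standard library. This is a minimal sketch: the `describe` helper and the lab values are hypothetical, and the sample (n−1) forms of variance and SD are assumed.

```python
import math
import statistics

def describe(values):
    """Return common summary statistics for a list of numbers."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (divides by n - 1)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # sample variance
        "sd": sd,
        "se": sd / math.sqrt(len(values)),  # standard error of the mean
        "cv_percent": sd / mean * 100,
    }

# Hypothetical lab values with one high outlier (right-skewed data).
lab_values = [4.1, 4.5, 4.7, 5.0, 5.2, 5.2, 9.8]
summary = describe(lab_values)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

In this made-up sample the single outlier pulls the mean (5.5) above the median (5.0), which is why the median is preferred for skewed data.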

Diagnostic Test Parameters

  • Sensitivity = TP/(TP+FN): Ability to detect disease when present. High sensitivity → low FN. Best for ruling OUT disease (SnNout). Screening test.
  • Specificity = TN/(TN+FP): Ability to correctly give a negative result when disease is absent. High specificity → low FP. Best for ruling IN disease (SpPin). Confirmatory test.
  • PPV = TP/(TP+FP): Probability that a person with a positive test truly has the disease. Depends on PREVALENCE (↑prevalence → ↑PPV).
  • NPV = TN/(TN+FN): Probability that a person with a negative test truly does not have the disease. Also depends on prevalence (↑prevalence → ↓NPV).
  • LR+ = Sensitivity/(1-Specificity): Factor by which a positive test multiplies the odds of disease (post-test odds = pre-test odds × LR+)
  • ROC curve: Plot of Sensitivity vs (1-Specificity) at all cutoffs. AUC (Area Under Curve) summarizes overall discriminative ability; 0.5 = no better than chance, 1.0 = perfect.
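
The formulas above can be applied to a 2×2 table as in the following sketch. The function names and the counts are hypothetical. The second function uses Bayes' theorem to show how PPV changes with prevalence when sensitivity and specificity are held fixed.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute test characteristics from the four cells of a 2x2 table."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "lr_positive": sensitivity / (1 - specificity),
    }

def ppv_at_prevalence(sensitivity, specificity, prevalence):
    """PPV from Bayes' theorem for a given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical counts: 100 diseased and 900 healthy people tested.
metrics = diagnostic_metrics(tp=90, fp=50, fn=10, tn=850)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

# Same test (sensitivity 0.9, specificity 0.9) at rising prevalence.
for prevalence in (0.01, 0.10, 0.50):
    ppv = ppv_at_prevalence(0.9, 0.9, prevalence)
    print(f"prevalence {prevalence:.2f} -> PPV {ppv:.3f}")
```

With these assumed numbers the PPV rises from roughly 8% at 1% prevalence to 90% at 50% prevalence, even though the test itself is unchanged.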

Hypothesis Testing

  • Null hypothesis (H₀): No difference (e.g., treatment has no effect)
  • p-value: Probability of observing a result at least as extreme as the one obtained, assuming H₀ is true. By convention, p < 0.05 → reject H₀ (statistically significant)
  • Type I error (α): Reject H₀ when it is true (false positive). Controlled by α level (0.05).
  • Type II error (β): Fail to reject H₀ when it is false (false negative). Power = 1-β.
  • Common tests: t-test (compare means of 2 groups), ANOVA (compare means of 3 or more groups), Chi-squared (association between categorical variables), Mann-Whitney U (non-parametric alternative to the unpaired t-test)
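
As one worked case of the workflow above, the sketch below runs a Pearson chi-squared test on a 2×2 table using only the standard library. The function name and the counts are hypothetical, and no continuity correction is applied. For 1 degree of freedom the p-value equals `erfc(sqrt(chi2 / 2))`, which avoids needing a statistics package.

```python
import math

def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for the table
        [[a, b],
         [c, d]]
    Returns (chi2 statistic, p-value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Shortcut formula, equivalent to summing (observed - expected)^2 / expected.
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical trial: improved / not improved in treatment vs control.
chi2, p_value = chi_squared_2x2(30, 10, 15, 25)
alpha = 0.05
print(f"chi2 = {chi2:.3f}, p = {p_value:.5f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Here H₀ is that improvement is independent of treatment group; with these made-up counts the p-value falls below α = 0.05, so H₀ would be rejected.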

Bioinformatics

  • Sequence Alignment: BLAST (Basic Local Alignment Search Tool) — rapidly aligns query sequence to database; finds homologs. CLUSTAL — multiple sequence alignment.
  • Databases: GenBank/NCBI (nucleotide sequences), UniProt/SwissProt (protein sequences and annotation), PDB (3D protein structures), OMIM (Online Mendelian Inheritance in Man; genetic diseases)
  • Genome Browsers: UCSC, Ensembl — visualize genome, annotations, variants
  • Variant Annotation: ClinVar, dbSNP, COSMIC (cancer somatic mutations)
  • Protein Structure: AlphaFold2 (DeepMind): AI model that predicts 3D protein structure from amino acid sequence, often with accuracy approaching that of experimental methods.
  • Pathway Analysis: KEGG, Reactome — map gene sets to biological pathways; used in RNA-seq data interpretation
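
To illustrate what local sequence alignment computes, the sketch below scores the best local alignment between two sequences with Smith-Waterman dynamic programming. This is not how BLAST is implemented: BLAST uses faster heuristics to approximate this kind of local alignment against large databases. The function name and the scoring values (match +2, mismatch −1, gap −2) are assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of sequences a and b (Smith-Waterman,
    linear gap penalty). Only the score is returned, not the alignment."""
    cols = len(b) + 1
    prev = [0] * cols  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                            # start a new local alignment
                prev[j - 1] + substitution,   # align a[i-1] with b[j-1]
                prev[j] + gap,                # gap in b
                curr[j - 1] + gap,            # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best

# Two hypothetical DNA fragments sharing the core "ACGT".
print(local_alignment_score("TTACGTTT", "GGACGTGG"))
```

Because cell scores are floored at zero, unrelated flanking regions do not penalize the shared core, which is the property that makes local alignment suitable for finding homologous regions.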

Epidemiology Measures

  • Incidence: New cases per population at risk per unit time
  • Prevalence: Existing cases per population at a time point
  • RR (Relative Risk): Risk in exposed / Risk in unexposed (cohort study)
  • OR (Odds Ratio): Odds of exposure in cases / Odds of exposure in controls (case-control study; approximates RR when disease is rare)
  • NNT (Number Needed to Treat): 1/ARR, where ARR (Absolute Risk Reduction) = risk in control group − risk in treated group
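
The measures above can be computed from a 2×2 table as in the following sketch. The function names and all counts are hypothetical. The odds ratio is computed as the cross-product ad/bc, which gives the same value whether odds are taken across exposure or across outcome.

```python
def relative_risk(a, b, c, d):
    """a, b = exposed with / without disease; c, d = unexposed with / without."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """Cross-product ratio of the same 2x2 table."""
    return (a * d) / (b * c)

def number_needed_to_treat(control_events, control_n, treated_events, treated_n):
    """NNT = 1 / ARR, with ARR = control risk - treated risk."""
    arr = control_events / control_n - treated_events / treated_n
    return 1 / arr

# Hypothetical cohort with a rare disease: 1% risk in exposed, 0.5% in unexposed.
rr = relative_risk(10, 990, 5, 995)
or_ = odds_ratio(10, 990, 5, 995)
print(f"RR = {rr:.3f}, OR = {or_:.3f}")

# Hypothetical trial: 20/100 events on control, 10/100 on treatment.
nnt = number_needed_to_treat(20, 100, 10, 100)
print(f"NNT = {nnt:.1f}")
```

In the rare-disease example the OR (about 2.01) is close to the RR (2.0), illustrating the approximation noted above. In the trial example an ARR of 0.10 gives an NNT of 10.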