Overview
- The Results section reports the statistical findings of the study
- Most studies use some common statistical tests (see “General statistics” below)
- Specific types of research study also have unique statistical tests (see “Statistics in specific types of study” below)
General statistics
- Types of variables
- Continuous: a variable that can take on an infinite number of values, which can be further divided into:
- Interval: a scale with a numerical value (for example, Celsius or Fahrenheit)
- Ratio: a type of interval variable in which 0 indicates that there is nothing; can divide into meaningful ratios (for example, height or weight)
- Categorical/discrete: a variable that takes on a limited number of values, which can be further divided into:
- Nominal: no intrinsic order (for example, dogs, cats, hamsters)
- Ordinal: has an order, but unlike continuous variables, the increments may not be equal (for example, 1 to 5 on a pain scale, with 1 being least painful and 5 being most painful)
- Binary/dichotomous: two categories that are mutually exclusive (for example, yes/no)
- Continuous
- The following are different ways that studies can report continuous variables (a short code sketch follows this list)
- Measures of central tendency
- Mean: average of a dataset; sum of the values divided by the number of values
- Central limit theorem: as the number of observations in a sample increases, the distribution of sample means converges to a normal distribution (values cluster around the mean in a “bell-shaped” curve)
- Outliers: values that differ significantly from other values
- Outliers can skew the mean (the distribution may no longer be normal)
- Median: middle value of a dataset, aka 50th percentile
- More resistant to outliers and thus often preferred when the data are skewed or the sample size is small
- Mode: most common value of a dataset; more resistant to outliers
- Measures of variability/dispersion
- Standard deviation (SD): measure of variation around the mean in a population
- Larger SD suggests larger variation
- Standard error (SE): the SD of the sample mean across repeated samples (not the SD of the whole population); it reflects how precisely the sample mean estimates the population mean
- The SE depends on the sample size; larger sample size leads to smaller SE
- Interquartile range (IQR): measure of variation often used with medians
- Equal to the difference between the 75th and 25th percentiles
- Range: difference between the largest and smallest value in a dataset
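- A minimal sketch of these measures (Python is an illustrative choice here; the data values are hypothetical, with 21 as a deliberate outlier):

```python
import statistics as st

data = [2, 3, 3, 4, 5, 6, 21]  # hypothetical values; 21 is an outlier

mean = st.mean(data)                  # pulled upward by the outlier
median = st.median(data)              # 50th percentile; resistant to the outlier
mode = st.mode(data)                  # most common value (3)
sd = st.stdev(data)                   # sample standard deviation
se = sd / len(data) ** 0.5            # standard error = SD / sqrt(n)
q1, _, q3 = st.quantiles(data, n=4)   # 25th, 50th, 75th percentiles
iqr = q3 - q1                         # interquartile range
value_range = max(data) - min(data)   # range

print(mean, median, mode, sd, se, iqr, value_range)
```

- Note how the outlier pulls the mean (about 6.3) well above the median (4)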
- Categorical
- Count and percent
- Unlike continuous variables, categorical variables are usually reported in studies as the number of participants (count) and percentage of participants (percent)
- Hypothesis testing
- Statistical tests are used to test a hypothesis
- Statistical tests compare the means (or other measures of central tendency) of different groups and determine whether their values are significantly different
- Parametric tests compare means between groups and assume the data follow a particular distribution (usually normal), while non-parametric tests compare ranks or medians and make no such distributional assumption
- There are many different statistical tests for different purposes (see Table below for some common examples, and the sketch at the end of this section)
- Various statistical software packages (for example, Stata and SPSS) are used to conduct statistical testing on data
- Statistical significance can be expressed in two ways: p-value and 95% confidence interval
- These two ways use different explanations but produce consistent results
- p-value: the probability of getting a result at least as extreme as the results observed, given that the null hypothesis is correct
- Loosely, it answers the question: how likely is a result this extreme if the groups are truly not different?
- One- vs two-tailed tests
- One-tailed tests assume directionality (for example, is drug X better than drug Y for treating the disease?)
- Two-tailed tests do not (for example, are drug X and drug Y different for treating the disease?)
- Most studies use two-tailed tests
- The level of statistical significance is conventionally set at alpha (α) = 0.05, which accepts a 5% risk of concluding that the groups differ when they actually do not (see “Type I error” below)
- Therefore, p < 0.05 is statistically significant
- There is less than a 5% chance of seeing a difference this large if the groups were truly the same (so a real difference is likely)
- 95% confidence interval (CI): if a study is repeated an infinite number of times, 95% of the estimated 95% confidence intervals will contain the true population measure
- A 95% CI range that does not contain the reference value is significant
- For example, under the null hypothesis of no difference between two groups, their difference should be 0, which is the reference value; a CI that includes 0 is therefore not statistically significant (it would be significant if it excluded 0)
- If statistically significant, then the null hypothesis is rejected
- Type I (alpha or α) error: incorrectly rejecting the null hypothesis; concluding that a difference is real when it is not real (false positive)
- This is the same alpha (α) used in the level of statistical significance above
- Type II (beta or β) error: incorrectly not rejecting the null hypothesis; concluding that a difference is not real when it is real (false negative)
- Statistical power (1 – β): probability of correctly rejecting the null hypothesis when the alternate hypothesis is true (true positive)
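- A minimal sketch of a two-tailed comparison of two groups (Python with scipy is an assumed toolchain; the group values are hypothetical):

```python
from scipy import stats

group_a = [5.1, 4.9, 6.0, 5.5, 5.8, 5.2]   # hypothetical measurements, drug X
group_b = [4.2, 4.8, 4.5, 4.9, 4.4, 4.6]   # hypothetical measurements, drug Y

# Parametric: two-tailed independent-samples t-test (compares means)
t_stat, p = stats.ttest_ind(group_a, group_b)
print(p < 0.05)  # True -> reject the null hypothesis at alpha = 0.05

# Non-parametric alternative: Mann-Whitney U test (compares ranks)
u_stat, p_np = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
```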
Statistics in specific types of study
- Correlation
- Models relationship between two continuous variables
- Correlation coefficient (r): measures the strength of association between the two variables
- Positive r: positive correlation; when one variable increases, the other increases too
- Negative r: negative correlation; when one variable increases, the other decreases
- r ranges from -1 to 1: values close to 0 have weak correlation, while those closer to -1 or 1 have strong correlation
- In other words, larger magnitude suggests stronger correlation
- Coefficient of determination (r²): amount of variation in the dependent variable that can be predicted from the independent variable; calculated by squaring the value of r
- Higher r² indicates better model prediction (see the sketch below)
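- A minimal sketch (assuming scipy; the paired x/y values are hypothetical):

```python
from scipy import stats

x = [1, 2, 3, 4, 5, 6]                 # hypothetical independent variable
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]    # hypothetical dependent variable

r, p = stats.pearsonr(x, y)   # correlation coefficient and its p-value
r_squared = r ** 2            # coefficient of determination
print(r, r_squared)           # r close to 1 -> strong positive correlation
```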
- Effect modification (interaction) vs confounding
- Both involve three or more variables
- Both are relevant in regression analysis, in which multiple independent variables can be entered into the model
- Effect modification (interaction): the effect of an independent variable on a dependent variable depends on a different independent variable
- Effect modification is useful in research, because two or more causes may interact to determine an effect
- Confounding: a third variable that explains the association between an independent and dependent variable
- Confounding is undesirable in research, because the third variable creates a spurious association between the independent and dependent variable
- Ways to control confounding in observational studies (see “Methods” section for definitions of different types of studies)
- A limitation of all the ways to control confounding below is that they all rely on known confounding variables; unknown confounding variables cannot be controlled and may still cause problems (called residual confounding)
- This is a key shortcoming of observational studies and a reason why RCTs are considered gold standard
- Stratification: categorize participants based on the confounding variable
- More limited than regression because only 3 variables can be compared at a time (independent, dependent, and confounding variable)
- Matching
- Propensity score matching (PSM): an alternative to regression
- PSM matches the treatment and comparison groups based on confounding variables
- After matching, both groups will have similar characteristics and can be directly compared
- Regression
- Multivariable linear regression: models the relationship between multiple independent variables (binary or continuous) and one continuous dependent variable
- Goals
- To determine which independent variables significantly predict the dependent variable
- Can determine statistical significance using p-values or 95% confidence intervals
- To assess which variables have a larger effect (see “Effect size” below)
- Can control for confounding by including other variables in the model (for example, age and body mass index) to produce “adjusted” results as opposed to the unadjusted or “crude” results
- Effect size
- Like correlation, larger magnitude indicates stronger association
- Like correlation, pay attention to the sign
- Positive sign = positive correlation
- Negative sign = negative correlation
- Unstandardized coefficient (B): often just called “coefficient,” this number is the unit change in the dependent variable given one unit increase in the independent variable
- Pay attention to the units of the variables (for example, kg or cm)
- If the independent variable is binary (yes/no), then B is the unit change in the dependent variable when the independent variable is “yes” (relative to “no”)
- Standardized coefficient (Beta or β): B standardized using standard deviations; it is the SD change in the dependent variable per one-SD increase in the independent variable
- β is unitless; can directly compare independent variables (unlike B)
- For example, if the β of one independent variable is greater in magnitude than the β of another independent variable, then the former’s effect size is greater (assuming that the dependent variable is constant/does not change)
- Note: the standardized coefficient β is unrelated to the type II error β; to make things more confusing, the coefficients B and β are sometimes used interchangeably in articles, so pay attention to how the authors define the coefficients (unstandardized vs standardized); a worked regression sketch follows below
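- A minimal multivariable linear regression sketch (assuming the statsmodels and pandas libraries; all variable names and simulated values are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate hypothetical data: a continuous outcome predicted by one
# continuous and one binary independent variable
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(50, 10, 200),
    "treated": rng.integers(0, 2, 200),
})
df["blood_pressure"] = 110 + 0.5 * df["age"] - 6 * df["treated"] + rng.normal(0, 8, 200)

model = smf.ols("blood_pressure ~ age + treated", data=df).fit()
print(model.params)      # unstandardized coefficients (B) per independent variable
print(model.pvalues)     # p-value for each independent variable
print(model.conf_int())  # 95% confidence intervals
```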
- Mixed regression model: accounts for repeated measures of the same participants over time (not cross-sectional)
- Multivariable logistic regression: models the relationship between multiple independent variables (binary or continuous) and one binary dependent variable
- Similar goals as linear regression, except the dependent variable is binary, not continuous
- Effect size
- Odds ratio (OR): odds of exposure to a risk factor among participants with an outcome, divided by odds of that same exposure among those without that same outcome
- OR is approximately equal to relative risk (see below) when an outcome is rare; however, OR is used in cross-sectional and case-control studies while RR is used in cohort studies and clinical trials
- In simple terms, OR and RR estimate how many times more likely participants with an exposure are to have an outcome compared with those without the exposure (see the sketch below)
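- A minimal logistic regression sketch (again assuming statsmodels; the simulated exposure/outcome data are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate hypothetical data: a binary outcome, a binary exposure, and age
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "exposure": rng.integers(0, 2, 500),
    "age": rng.normal(50, 10, 500),
})
log_odds = -3 + 0.8 * df["exposure"] + 0.03 * df["age"]
df["outcome"] = (rng.random(500) < 1 / (1 + np.exp(-log_odds))).astype(int)

model = smf.logit("outcome ~ exposure + age", data=df).fit()
print(np.exp(model.params))      # exponentiated coefficients = adjusted odds ratios
print(np.exp(model.conf_int()))  # 95% CIs; an interval excluding 1 is significant
```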
- Cross-sectional study
- Prevalence
- Number of existing cases of a disease (at a certain time), divided by the population observed (at that time)
- Often expressed as a percentage to show how many people have a disease at one point in time (see “Incidence” below for differences)
- Factors that change prevalence
- Number of new cases
- Number of deaths
- Number of recoveries
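- A tiny worked example of the prevalence formula above (hypothetical counts):

```python
existing_cases = 50     # people who have the disease right now
population = 2_000      # people observed at the same point in time

prevalence = existing_cases / population
print(f"{prevalence:.1%}")  # 2.5% of the population has the disease today
```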
- Case-control study
- Odds ratio (OR): same concept as the one used in logistic regression; OR is odds of exposure to a risk factor among participants with an outcome, divided by odds of that same exposure among those without that same outcome
- Remember that cross-sectional studies do not have a temporal relationship while in case-control studies, the exposure precedes the outcome
- Prospective studies: cohort study and randomized controlled trial
- Mortality rate: total number of deaths divided by total population over a specified period of time
- Case-fatality rate: total number of deaths due to a disease divided by number of people diagnosed with that disease over a specified period of time
- Incidence (risk): number of new cases of a disease during a time period, divided by the population at risk during that time period
- Different from prevalence, which is at a single point in time (a snapshot or cross-section)
- Prevalence ≈ Incidence x Average duration of disease
- Prevalence changes when incidence or disease duration changes
- Incidence rates are often expressed per “person-years” of follow-up
- Participants sometimes drop out of prospective studies
- Instead of discarding their data from the study, the number of years when they were in the study is included in the final analysis
- Risk ratio or relative risk (RR): incidence of an outcome among participants exposed to a risk factor, divided by the incidence of that same outcome among participants not exposed to that same risk factor
- RR is the relative difference in risk
- RR = 1 indicates no difference in risk between treatment and control groups (null hypothesis)
- RR < 1 indicates decreased risk in the treatment group compared with the control group
- RR > 1 indicates increased risk in the treatment group compared with the control group
- Risk difference or attributable risk (AR): incidence of an outcome among participants exposed to a risk factor, minus the incidence of that same outcome among participants not exposed to that same risk factor
- AR is the absolute difference in risk
- AR = 0 indicates no difference in risk between treatment and control groups (null hypothesis)
- AR < 0 indicates decreased risk in the treatment group compared with the control group
- AR > 0 indicates increased risk in the treatment group compared with the control group
- Number needed to treat (NNT) or harm (NNH): number of patients who must be treated over a certain period of time to achieve a defined effect
- Inverse of the AR (NNT = 1/AR; see the sketch below)
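- A minimal sketch tying these risk measures together, using a hypothetical 2x2 table:

```python
# Hypothetical cohort/trial counts
#                outcome   no outcome
# exposed          a=20        b=80
# unexposed        c=10        d=90
a, b, c, d = 20, 80, 10, 90

risk_exposed = a / (a + b)            # incidence in the exposed/treatment group
risk_unexposed = c / (c + d)          # incidence in the unexposed/control group

rr = risk_exposed / risk_unexposed    # relative risk = 2.0 (doubled risk)
ar = risk_exposed - risk_unexposed    # attributable risk = 0.10
nnt = 1 / abs(ar)                     # number needed to treat/harm = 10
odds_ratio = (a * d) / (b * c)        # odds ratio = 2.25 (case-control analogue)
print(rr, ar, nnt, odds_ratio)
```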
- Kaplan-Meier (KM) survival analysis: estimates, over time, the fraction of patients who have reached (or remain free of) a predefined outcome after starting treatment (a survival-analysis sketch follows this block)
- Log-rank test: compares the outcome between the treatment and control groups for statistical significance
- Cox proportional hazards model: type of multivariable regression model that accounts for time to an outcome; can control for confounding
- Hazard: probability of an outcome occurring during an instant in time, conditional upon the participant having made it to that instant without having already reached the outcome
- Hazard ratio (HR): hazard of an outcome among participants exposed to a risk factor, divided by the hazard of that same outcome among participants not exposed to that same risk factor
- HR removes people who have already reached the outcome at every instant in time
- Similar to RR in interpretation
- HR = 1 indicates no difference in risk between treatment and control groups (null hypothesis)
- HR < 1 indicates decreased risk in the treatment group compared with the control group
- HR > 1 indicates increased risk in the treatment group compared with the control group
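- A minimal KM/Cox sketch (the lifelines library is an assumption, since the text does not name a tool; the follow-up data are hypothetical):

```python
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter

# Hypothetical follow-up data: time in years, event (1 = outcome, 0 = censored)
df = pd.DataFrame({
    "time":    [1.2, 2.5, 3.1, 0.8, 4.0, 2.2, 3.8, 1.5],
    "event":   [1,   0,   1,   1,   0,   1,   0,   1],
    "treated": [1,   1,   1,   1,   0,   0,   0,   0],
})

km = KaplanMeierFitter().fit(df["time"], df["event"])  # event-free fraction over time
print(km.survival_function_)

cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
print(cph.hazard_ratios_)  # hazard ratio for `treated`; 1 = no difference
```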
- Meta-analysis
- Quality of a meta-analysis depends on the quality of the studies it included
- Check which types of studies are included: in general, case-control < cohort < clinical trials in quality
- Forest plot: graph that combines the results of multiple published studies to give an overall pooled effect (whose significance is assessed with a Z-test)
- The overall effect for clinical trials is the mean difference (MD), for cohort studies is the relative risk (RR), and for case-control studies is the odds ratio (OR)
- Fixed effects model: assumes one true effect; accounts only for within study variability
- Random effects model: assumes many different effects; accounts for within and between study variability
- Measures of heterogeneity or variation between studies
- Low heterogeneity is preferred; high heterogeneity suggests that the included studies vary a lot (for example, participants, study design, results)
- I-squared (I²): larger I² suggests larger heterogeneity (for example, I² = 0% is no heterogeneity, 50% is moderate heterogeneity, and 100% is extreme heterogeneity)
- Similarly, larger tau-squared (τ²) and chi-squared (χ²) (p < 0.05) indicate larger heterogeneity (see the pooling sketch below)
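- A minimal fixed-effects pooling sketch with Cochran's Q and I² (the per-study effects and variances are hypothetical):

```python
import numpy as np

# Hypothetical per-study effect sizes (e.g., mean differences) and variances
effects = np.array([0.30, 0.45, 0.10, 0.52])
variances = np.array([0.04, 0.09, 0.02, 0.12])

weights = 1 / variances                        # inverse-variance (fixed effects) weights
pooled = np.sum(weights * effects) / np.sum(weights)

q = np.sum(weights * (effects - pooled) ** 2)  # Cochran's Q
dof = len(effects) - 1
i_squared = max(0.0, (q - dof) / q) * 100      # % of variation between studies
print(pooled, i_squared)
```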
- Meta-regression: like linear regression on individual participants, meta-regression is linear regression on included research studies
- Performed when the heterogeneity is large
- Goal is to determine which factors/variables explain the heterogeneity
- Funnel plot: graph that checks for publication bias, which occurs when studies with unfavorable results are not published and thus not included in the meta-analysis
- Low bias can be visually confirmed by left-right symmetry, and a narrow top with wide bottom (triangle shape)
- Asymmetry should raise alarm for bias
- Individual participant data (IPD) meta-analysis: considered the gold standard of meta-analysis
- Individual participant data from each study are analyzed rather than the aggregate data of all the participants (for example, the mean) from each study
- Other common analyses
- Screening and diagnostic tests
- Types of prevention
- Primary: prevents disease from occurring (for example, vaccination, exercise)
- Secondary: detects disease early, when it is present but not yet symptomatic (for example, Pap screening test)
- Tertiary: prevent symptomatic disease from getting worse (for example, taking diabetes medications)
- Research studies on screening and diagnostic tests
- Purpose of these studies
- Compare the efficacy of a new screening/diagnostic test with an existing, gold-standard test for diagnosing a disease
- Below are some common terms used in screening tests
- Sensitivity and specificity
- Sensitivity: proportion of people with disease who test positive (true positive rate)
- Specificity: proportion of people without disease who test negative (true negative rate)
- They are intrinsic properties of the tests and do not change based on the prevalence of disease
- Receiver operating characteristic (ROC) curve: graph showing the relationship between the sensitivity (true positive rate) and 1 minus specificity (false positive rate)
- Best ROC curve has the highest true positive rate and lowest false positive rate (upper left corner on graph)
- Predictive value
- Positive predictive value (PPV): proportion of positive tests that are correct
- Negative predictive value (NPV): proportion of negative tests that are correct
- High prevalence of disease leads to high PPV and low NPV, while low prevalence of disease leads to low PPV and high NPV
- In other words, if a disease is very common, a positive test is more likely to be truly positive, while a negative test is more likely to be falsely negative
- Accuracy: sum of the number of true positives and true negatives divided by the total population screened
- Likelihood ratio
- Positive likelihood ratio (LR+): probability of a positive test result in people with disease divided by probability of a positive test result in people without disease
- Negative likelihood ratio (LR-): probability of a negative test result in people with disease divided by probability of a negative test result in people without disease
- Inter-observer reliability (kappa or κ): proportion of potential agreement beyond chance between different observers
- κ > 0.7 is good, 0.4 to 0.7 is medium, and < 0.4 is poor
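- A minimal sketch computing the metrics above from a hypothetical screening 2x2 table:

```python
# Hypothetical counts vs. a gold-standard diagnosis
#            disease    no disease
# test +       tp=90        fp=30
# test -       fn=10        tn=870
tp, fp, fn, tn = 90, 30, 10, 870

sensitivity = tp / (tp + fn)              # true positive rate = 0.90
specificity = tn / (tn + fp)              # true negative rate ~ 0.97
ppv = tp / (tp + fp)                      # positive predictive value = 0.75
npv = tn / (tn + fn)                      # negative predictive value ~ 0.99
accuracy = (tp + tn) / (tp + fp + fn + tn)
lr_pos = sensitivity / (1 - specificity)  # positive likelihood ratio = 27
lr_neg = (1 - sensitivity) / specificity  # negative likelihood ratio ~ 0.10
```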
- Screening test biases
- Lead time bias: screening tests allow a disease to be diagnosed earlier, making survival after diagnosis appear longer without actually delaying death
- Length time bias: screening tests may be more likely to catch people with milder forms of a disease because they live longer
- People with severe disease die faster and are less likely to be screened
- Volunteer bias: type of selection bias in which people who want (and thus more likely) to get screened are different from the target population studied
- Cost-effectiveness analysis (CEA): type of economic analysis that weighs the costs and outcomes of different actions
- Incremental cost-effectiveness ratio (ICER): difference in cost between two interventions divided by the difference in their effectiveness
- Unit used is often dollar per quality adjusted life year ($/QALY)
- There are many assumptions about cost, patient preferences, and treatment efficacy in CEA, so pay attention to sensitivity analyses (a worked ICER example follows)
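- A tiny worked ICER example (all costs and QALYs are hypothetical):

```python
cost_new, cost_standard = 52_000, 40_000   # dollars per patient
qaly_new, qaly_standard = 6.5, 6.0         # quality-adjusted life years

icer = (cost_new - cost_standard) / (qaly_new - qaly_standard)
print(icer)  # 24000.0 -> the new treatment costs $24,000 per QALY gained
```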
Tips for reading the Results
- Analyzing tables
- Start with the table title: what is the purpose of this table?
- Examine the table headings: what are each of the rows and columns about?
- After familiarizing yourself with the table’s organization, begin interpreting the statistics in the table
- Analyzing figures
- Start with the figure title: what is the purpose of this figure?
- Read the figure legend (text after the title): what are the authors trying to convey?
- Examine the figure axes (if applicable): what are the x and y axes about, and what are their units?
- After familiarizing yourself with the figure’s organization, begin interpreting the statistics in the figure
- Look at Table 1 to determine generalizability of the findings
- What are the demographics of participants included in the study (for example, age, sex, body mass index, any medical conditions)?
- Is the study applicable to everyone or only a specific group of people?
- Are the results statistically and clinically significant (practical importance of an effect)?
- For cohort studies and clinical trials, pay attention to drop-outs and loss to follow-up
- Figure 1 is often a flow diagram of the study design
- How many patients stopped using the treatment (dropped out)?
- How many patients were lost to follow up?
- Why did people drop out or leave the study?
- High drop-out or loss to follow-up rates (>20%) should raise concerns about selection bias