Overview

  • The Results section reports the statistical findings of the study
  • Most studies use some common statistical tests (see “General statistics” below)
  • Specific types of research study also have unique statistical tests (see “Statistics in different types of study” below)

General statistics

  • Types of variables
    • Continuous: a variable that can take on an infinite number of values, which can be further divided into:
      • Interval: a scale with a numerical value (for example, Celsius or Fahrenheit)
      • Ratio: a type of interval variable in which 0 indicates the absence of the quantity; values can be divided into meaningful ratios (for example, height or weight)
    • Categorical/discrete: a variable that takes on a limited number of values, which can be further divided into:
      • Nominal: no intrinsic order (for example, dogs, cats, hamsters)
      • Ordinal: has an order, but unlike continuous variables, the increments may not be equal (for example, 1 to 5 on a pain scale, with 1 being least painful and 5 being most painful)
      • Binary/dichotomous: two categories that are mutually exclusive (for example, yes/no)
  • Continuous
    • The following are common ways that studies report continuous variables (a short code sketch follows this list)
    • Measures of central tendency
      • Mean: average of a dataset; sum of the values divided by the number of values
        • Central limit theorem: with many observations, the distribution of sample means converges to a normal distribution (values cluster around the mean in a “bell-shaped” curve)
        • Outliers: values that differ significantly from other values
          • Outliers can skew the mean (no longer a normal distribution)
      • Median: middle value of a dataset, aka 50th percentile
        • More resistant to outliers and thus often preferred when the data are skewed or the sample size is small
      • Mode: most common value of a dataset; more resistant to outliers
    • Measures of variability/dispersion
      • Standard deviation (SD): measure of variation around the mean in a population
        • Larger SD suggests larger variation
      • Standard error (SE): the SD of a sample statistic (for example, the sample mean) across repeated samples; estimates how far the sample mean is likely to fall from the true population mean
        • The SE depends on the sample size; larger sample size leads to smaller SE
      • Interquartile range (IQR): measure of variation often used with medians
        • Equal to the difference between the 75th and 25th percentiles
      • Range: difference between the largest and smallest value in a dataset
The mean, median, and mode are all 3 in this data set, but they are calculated differently.
In a normal distribution, one standard deviation (SD or σ) contains 68% of the data, two SD contain 95% of the data, and three SD contain over 99% of the data. Image courtesy of M.W. Toews.
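To make these measures concrete, here is a minimal sketch in Python with numpy and scipy (the language, libraries, and data are assumptions for illustration; the guide itself only mentions Stata and SPSS):

```python
import numpy as np
from scipy import stats

data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 15])  # made-up values; 15 is an outlier

mean = data.mean()                             # 4.2 — pulled upward by the outlier
median = np.median(data)                       # 3.0 — 50th percentile, resistant to the outlier
mode = stats.mode(data, keepdims=False).mode   # 3 — most common value

sd = data.std(ddof=1)                  # sample standard deviation
se = stats.sem(data)                   # standard error = SD / sqrt(n)
q75, q25 = np.percentile(data, [75, 25])
iqr = q75 - q25                        # interquartile range (75th - 25th percentile)
value_range = data.max() - data.min()  # range (largest - smallest)

print(mean, median, mode, sd, se, iqr, value_range)
```

Note how the single outlier shifts the mean well above the median and mode.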
  • Categorical
    • Count and percent
      • Unlike continuous variables, categorical variables are usually reported in studies as the number of participants (count) and percentage of participants (percent)
  • Hypothesis testing
    • Statistical tests are used to test a hypothesis
    • Statistical tests compare the means (or other measures of central tendency) of different groups and determine whether their values are significantly different
      • Parametric tests compare means between groups and assume a normal distribution, while non-parametric tests compare ranks or medians and do not require a normal distribution
      • There are many different statistical tests for different purposes (see Table below for some common examples)
      • Various statistical software packages (for example, Stata and SPSS) are used to conduct statistical testing on data
    • Statistical significance can be expressed in two ways: p-value and 95% confidence interval (a worked example follows this list)
      • These two ways use different explanations but produce consistent results
      • p-value: the probability of getting a result at least as extreme as the results observed, given that the null hypothesis is correct
        • It answers the question: if the groups were truly no different, how likely would a result at least this extreme arise by chance?
        • One- vs two-tailed tests
          • One-tailed tests assume directionality (for example, is drug X better than drug Y for treating the disease?)
          • Two-tailed tests do not (for example, are drug X and drug Y different for treating the disease?)
          • Most studies use two-tailed tests
        • The level of statistical significance is conventionally set at alpha (α) = 0.05: a result is called significant if it would occur less than 5% of the time when the null hypothesis is true (see “Type I error” below)
        • Therefore, p < 0.05 is statistically significant
          • The observed difference would be unlikely (less than a 5% chance) if the groups were truly no different, so a real difference is likely
      • 95% confidence interval (CI): if a study is repeated an infinite number of times, 95% of the estimated 95% confidence intervals will contain the true population measure
        • A 95% CI range that does not contain the reference value is significant
        • For example, if there is no difference between two groups (assuming that the null hypothesis is correct), their difference should be 0, which is the reference value; so, a CI that includes 0 is not statistically significant (would be significant if it did not include 0)
      • If statistically significant, then the null hypothesis is rejected
        • Type I (alpha or α) error: incorrectly rejecting the null hypothesis; concluding that a difference is real when it is not real (false positive)
          • This is the same alpha (α) used in the level of statistical significance above
        • Type II (beta or β) error: incorrectly not rejecting the null hypothesis; concluding that a difference is not real when it is real (false negative)
        • Statistical power (1 – β): probability of correctly rejecting the null hypothesis when the alternate hypothesis is true (true positive)
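As a worked example of the p-value and the 95% CI giving consistent answers, here is a minimal sketch assuming Python with scipy (the drugs, effects, and data are fictional):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
drug_x = rng.normal(loc=120, scale=10, size=50)  # fictional outcome under drug X
drug_y = rng.normal(loc=126, scale=10, size=50)  # fictional outcome under drug Y

# Two-tailed, two-sample t-test (a common parametric test)
t_stat, p_value = stats.ttest_ind(drug_x, drug_y)

# 95% CI for the difference in means, using the pooled-variance formula
n1, n2 = len(drug_x), len(drug_y)
diff = drug_x.mean() - drug_y.mean()
pooled_var = ((n1 - 1) * drug_x.var(ddof=1) + (n2 - 1) * drug_y.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
ci = (diff - t_crit * se, diff + t_crit * se)

print(f"p = {p_value:.4f}, 95% CI for the difference = ({ci[0]:.2f}, {ci[1]:.2f})")
# The two criteria agree: p < 0.05 exactly when the 95% CI excludes 0
```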

Statistics in different types of study

  • Correlation
    • Models relationship between two continuous variables
    • Correlation coefficient (r): measures the strength of association between the two variables
      • Positive r: positive correlation; when one variable increases, the other increases too
      • Negative r: negative correlation; when one variable increases, the other decreases
      • r ranges from -1 to 1: values close to 0 have weak correlation, while those closer to -1 or 1 have strong correlation
        • In other words, larger magnitude suggests stronger correlation
    • Coefficient of determination (r²): amount of variation in the dependent variable that can be predicted from the independent variable; calculated by squaring the value of r
      • Higher r² indicates better model prediction (illustrated in the sketch below)
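A minimal sketch of r and r², assuming Python with scipy (the data are simulated):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=100)          # independent variable
y = 2 * x + rng.normal(size=100)  # dependent variable, positively related to x

r, p = stats.pearsonr(x, y)       # correlation coefficient and its p-value
print(f"r = {r:.2f}, r² = {r**2:.2f}, p = {p:.3g}")  # r close to +1: strong positive correlation
```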
  • Effect modification (interaction) vs confounding
    • Both involve three or more variables
    • Both are relevant in regression analysis, in which multiple independent variables can be entered into the model
    • Effect modification (interaction): the effect of an independent variable on a dependent variable depends on a different independent variable
      • Effect modification is useful in research, because two or more causes may interact to determine an effect
    • Confounding: a third variable that explains the association between an independent and dependent variable
      • Confounding is undesirable in research, because the third variable creates a spurious association between the independent and dependent variables
  • Ways to control confounding in observational studies (see “Methods” section for definitions of different types of studies)
    • A limitation of all the ways to control confounding below is that they all rely on known confounding variables; unknown confounding variables cannot be controlled and may still cause problems (called residual confounding)
      • This is a key shortcoming of observational studies and a reason why RCTs are considered gold standard
    • Stratification: categorize participants based on the confounding variable
      • More limited than regression because only 3 variables can be compared at a time (independent, dependent, and confounding variable)
    • Matching
      • Propensity score matching (PSM): an alternative to regression
        • PSM matches the treatment and comparison groups based on confounding variables
        • After matching, both groups will have similar characteristics and can be directly compared
    • Regression
      • Multivariable linear regression: models the relationship between multiple independent variables (binary or continuous) and one continuous dependent variable
        • Goals
          • To determine which independent variables significantly predict the dependent variable
          • Can determine statistical significance using p-values or 95% confidence intervals
          • To assess which variables have a larger effect (see “Effect size” below)
          • Can control for confounding by including other variables in the model (for example, age and body mass index) to produce “adjusted” results as opposed to the unadjusted or “crude” results
        • Effect size
          • Like correlation, larger magnitude indicates stronger association
          • Like correlation, pay attention to the sign
            • Positive sign = positive correlation
            • Negative sign = negative correlation
          • Unstandardized coefficient (B): often just called “coefficient,” this number is the unit change in the dependent variable given one unit increase in the independent variable
            • Pay attention to the units of the variables (for example, kg or cm)
            • If the independent variable is binary (yes/no), then B is the unit change in the dependent variable when the independent variable is “yes” (relative to “no”)
          • Standardized coefficient (Beta or β): B standardized using the standard deviations of the variables
            • β is unitless; can directly compare independent variables (unlike B)
            • For example, if the β of one independent variable is greater in magnitude than the β of another independent variable, then the former’s effect size is greater (assuming that the dependent variable is constant/does not change)
            • Note: the standardized coefficient β is unrelated to the type II error β; to make things more confusing, the coefficients B and β are sometimes used interchangeably in articles, so pay attention to how the authors define the coefficients (unstandardized vs standardized)
        • Mixed regression model: accounts for repeated measures of the same participants over time (not cross-sectional)
      • Multivariable logistic regression: models the relationship between multiple independent variables (binary or continuous) and one binary dependent variable
        • Similar goals as linear regression, except the dependent variable is binary, not continuous
        • Effect size
          • Odds ratio (OR): odds of exposure to a risk factor among participants with an outcome, divided by odds of that same exposure among those without that same outcome
          • OR is approximately equal to relative risk (see below) when an outcome is rare; however, OR is used in cross-sectional and case-control studies while RR is used in cohort studies and clinical trials
          • In simple terms, OR and RR indicate roughly how many times more likely participants with an exposure are to have the outcome compared with those without the exposure (a sketch of both regressions follows this list)
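To show what these two regressions look like in practice, here is a minimal sketch assuming Python with the statsmodels package (the variable names sbp, age, bmi, treated, and event, and all data, are fictional):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 200
df = pd.DataFrame({
    "age": rng.normal(50, 10, n),       # continuous independent variable
    "bmi": rng.normal(27, 4, n),        # continuous independent variable
    "treated": rng.integers(0, 2, n),   # binary independent variable (yes/no)
})
# Fictional outcomes generated only for illustration
df["sbp"] = 100 + 0.5 * df["age"] + 0.8 * df["bmi"] - 5 * df["treated"] + rng.normal(0, 8, n)
df["event"] = (rng.random(n) < 1 / (1 + np.exp(-(0.05 * df["age"] - 3)))).astype(int)

# Multivariable linear regression: continuous dependent variable
linear = smf.ols("sbp ~ age + bmi + treated", data=df).fit()
print(linear.params)       # unstandardized coefficients (B), adjusted for the other variables
print(linear.conf_int())   # 95% CIs; an interval excluding 0 is statistically significant

# Multivariable logistic regression: binary dependent variable
logistic = smf.logit("event ~ age + bmi + treated", data=df).fit()
print(np.exp(logistic.params))      # exponentiated coefficients = odds ratios (OR)
print(np.exp(logistic.conf_int()))  # 95% CIs for the ORs; an interval excluding 1 is significant
```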
The correlation coefficient (r) of various data sets. When the points line up perfectly in a line, then the r = 1 or -1. Note that r is not the slope of the line. Image courtesy of Kiatdd.

Example of effect modification (interaction).

Example of confounding.

Example of simple linear regression. Image courtesy of Sewagu.
Similarities between simple linear regression and the more generalized multivariable linear regression.
Example of a standard logistic function. Image courtesy of Qef.
Relationship between the logistic function and odds ratio.
  • Cross-sectional study
    • Prevalence
      • Number of existing cases of a disease (at a certain time), divided by the population observed (at that time)
      • Often expressed as a percentage to show how many people have a disease at one point in time (see “Incidence” below for differences)
      • Factors that change prevalence
        • Number of new cases
        • Number of deaths
        • Number of recoveries
  • Case-control study
    • Odds ratio (OR): same concept as the one used in logistic regression; OR is odds of exposure to a risk factor among participants with an outcome, divided by odds of that same exposure among those without that same outcome (a worked sketch follows the caption below)
    • Remember that cross-sectional studies do not have a temporal relationship while in case-control studies, the exposure precedes the outcome
Example of odds ratio calculation in a fictional case-control study.
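The kind of calculation that caption refers to can be sketched in a few lines (Python assumed; all counts are invented):

```python
import numpy as np

# Invented 2x2 table for a fictional case-control study
#              cases   controls
# exposed        30        10
# unexposed      20        40
a, b = 30, 10  # exposed: cases, controls
c, d = 20, 40  # unexposed: cases, controls

# OR = (odds of exposure among cases) / (odds of exposure among controls)
#    = (a / c) / (b / d) = (a * d) / (b * c)
odds_ratio = (a * d) / (b * c)  # 6.0

# 95% CI from the standard error of the log odds ratio
se_log_or = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci = np.exp(np.log(odds_ratio) + np.array([-1.96, 1.96]) * se_log_or)
print(f"OR = {odds_ratio:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```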
  • Prospective studies: cohort study and randomized controlled trial
    • Mortality rate: total number of deaths divided by total population over a specified period of time
    • Case-fatality rate: total number of deaths due to a disease divided by number of people diagnosed with that disease over a specified period of time
    • Incidence (risk): number of new cases of a disease during a time period, divided by the population at risk during that time period
      • Different from prevalence, which is at a single point in time (a snapshot or cross-section)
      • Prevalence ≈ Incidence × Duration
        • Prevalence changes when incidence or disease duration changes
      • Unit for incidence is often “person years”
        • Participants sometimes drop out of prospective studies
        • Instead of discarding their data from the study, the number of years when they were in the study is included in the final analysis
    • Risk ratio or relative risk (RR): incidence of an outcome among participants exposed to a risk factor, divided by the incidence of that same outcome among participants not exposed to that same risk factor
      • RR is the relative difference in risk
      • RR = 1 indicates no difference in risk between treatment and control groups (null hypothesis)
      • RR < 1 indicates decreased risk in the treatment group compared with the control group
      • RR > 1 indicates increased risk in the treatment group compared with the control group
    • Risk difference or attributable risk (AR): incidence of an outcome among participants exposed to a risk factor, minus the incidence of that same outcome among participants not exposed to that same risk factor
      • AR is the absolute difference in risk
      • AR = 0 indicates no difference in risk between treatment and control groups (null hypothesis)
      • AR < 0 indicates decreased risk in the treatment group compared with the control group
      • AR > 0 indicates increased risk in the treatment group compared with the control group
    • Number needed to treat (NNT) or harm (NNH): number of patients who must be treated over a certain period of time to achieve a defined effect
      • Inverse of the absolute value of the AR (NNT = 1/|AR|; see the worked sketch after the captions below)
    • Kaplan-Meier (KM) survival analysis: estimates the fraction of patients who remain free of a predefined outcome (for example, death) over time after starting treatment (see the survival-analysis sketch below)
      • Log-rank test: compares the outcome between the treatment and control groups for statistical significance
    • Cox proportional hazards model: type of multivariable regression model that accounts for time to an outcome; can control for confounding
      • Hazard: probability of an outcome occurring during an instant in time, conditional upon the participant having made it to that instant without having already reached the outcome
      • Hazard ratio (HR): hazard of an outcome among participants exposed to a risk factor, divided by the hazard of that same outcome among participants not exposed to that same risk factor
        • HR removes people who have already reached the outcome at every instant in time
        • Similar to RR in interpretation
          • HR = 1 indicates no difference in risk between treatment and control groups (null hypothesis)
          • HR < 1 indicates decreased risk in the treatment group compared with the control group
          • HR > 1 indicates increased risk in the treatment group compared with the control group
Example of relative risk, attributable risk, and number needed to harm calculations in a fictional cohort study.
Example of relative risk, attributable risk, and number needed to treat calculations in a fictional clinical trial.
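The RR, AR, and NNT/NNH arithmetic described in those captions can be sketched as follows (Python assumed; the cohort counts are fictional):

```python
# Fictional cohort: 100 exposed and 100 unexposed participants
exposed_events, exposed_total = 30, 100      # outcomes among the exposed
unexposed_events, unexposed_total = 10, 100  # outcomes among the unexposed

risk_exposed = exposed_events / exposed_total        # incidence in the exposed: 0.30
risk_unexposed = unexposed_events / unexposed_total  # incidence in the unexposed: 0.10

rr = risk_exposed / risk_unexposed  # relative risk: 3.0 (relative difference)
ar = risk_exposed - risk_unexposed  # attributable risk: 0.20 (absolute difference)
nnh = 1 / abs(ar)                   # number needed to harm: 5 (exposure increases risk)

print(f"RR = {rr:.1f}, AR = {ar:.2f}, NNH = {nnh:.0f}")
```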

Example of a Kaplan-Meier analysis. Figure courtesy of https://doi.org/10.1371/journal.pone.0232043
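A minimal survival-analysis sketch, assuming Python with the third-party lifelines package (not mentioned in the guide) and simulated data:

```python
import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(3)
n = 100
df = pd.DataFrame({"treated": np.repeat([1, 0], n // 2)})
# Fictional survival times: the treated group survives longer on average
df["time"] = rng.exponential(np.where(df["treated"] == 1, 24.0, 12.0))
df["event"] = (rng.random(n) < 0.8).astype(int)  # 0 = censored (for example, dropped out)

treated = df[df["treated"] == 1]
control = df[df["treated"] == 0]

# Kaplan-Meier estimate of the survival curve for the treated group
kmf = KaplanMeierFitter()
kmf.fit(treated["time"], event_observed=treated["event"])
print(kmf.survival_function_.tail())

# Log-rank test comparing the two curves
lr = logrank_test(treated["time"], control["time"],
                  event_observed_A=treated["event"], event_observed_B=control["event"])
print(f"log-rank p = {lr.p_value:.4f}")

# Cox proportional hazards model; exponentiated coefficient = hazard ratio (HR)
cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
print(cph.hazard_ratios_)  # HR < 1 here: treatment reduces the hazard
```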
  • Meta-analysis
    • Quality of a meta-analysis depends on the quality of the studies it included
      • Check which types of studies are included: in general, case-control < cohort < clinical trials in quality
    • Forest plot: graph that combines the results of multiple published studies to give an overall (pooled) effect, typically tested for significance with a Z-score
      • The overall effect for clinical trials is the mean difference (MD), for cohort studies is the relative risk (RR), and for case-control studies is the odds ratio (OR)
      • Fixed effects model: assumes one true effect; accounts only for within study variability
      • Random effects model: assumes many different effects; accounts for within and between study variability
    • Measures of heterogeneity or variation between studies
      • Low heterogeneity is preferred; high heterogeneity suggests that the included studies vary a lot (for example, participants, study design, results)
      • I-squared (I²): larger I² suggests larger heterogeneity (for example, I² = 0% is no heterogeneity, 50% is moderate heterogeneity, and 100% is extreme heterogeneity)
      • Similarly, a larger tau-squared (τ²) and a significant chi-squared (χ²) test (p < 0.05) indicate larger heterogeneity (see the pooling sketch after this list)
    • Meta-regression: like linear regression on individual participants, meta-regression is linear regression on included research studies
      • Performed when the heterogeneity is large
      • Goal is to determine which factors/variables explain the heterogeneity
    • Funnel plot: graph that checks for publication bias, which occurs when studies with unfavorable results are not published and thus not included in the meta-analysis
      • Low bias can be visually confirmed by left-right symmetry, and a narrow top with wide bottom (triangle shape)
      • Asymmetry should raise alarm for bias
    • Individual participant data (IPD) meta-analysis: considered the gold standard of meta-analysis
      • Individual participant data from each study are analyzed rather than the aggregate data of all the participants (for example, means) from each study
Example of a forest plot in a meta-analysis. Figure courtesy of https://doi.org/10.1371/journal.pone.0121187
Example of a funnel plot from the same meta-analysis. Note the asymmetry on the bottom right that suggests publication bias. Figure courtesy of https://doi.org/10.1371/journal.pone.0121187
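The fixed-effect pooling and the I² heterogeneity statistic behind a forest plot can be sketched directly (Python assumed; the per-study effects are invented):

```python
import numpy as np

# Invented per-study effect estimates (log odds ratios) and their standard errors
log_or = np.array([0.40, 0.10, 0.55, 0.25])
se = np.array([0.20, 0.15, 0.30, 0.10])

# Fixed-effect (inverse-variance) pooling: weight each study by 1 / SE^2
w = 1 / se**2
pooled = np.sum(w * log_or) / np.sum(w)
pooled_se = np.sqrt(1 / np.sum(w))
z = pooled / pooled_se  # Z-score for the overall effect

# Cochran's Q and I² for heterogeneity between studies
q = np.sum(w * (log_or - pooled) ** 2)
dof = len(log_or) - 1
i2 = max(0.0, (q - dof) / q) * 100

print(f"pooled OR = {np.exp(pooled):.2f}, Z = {z:.2f}, I² = {i2:.0f}%")
```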
  • Other common analyses
    • Screening and diagnostic tests
      • Types of prevention
        • Primary: prevents disease from occurring (for example, vaccination, exercise)
        • Secondary: detects and treats disease early, before symptoms appear (for example, Pap screening test)
        • Tertiary: prevent symptomatic disease from getting worse (for example, taking diabetes medications)
      • Research studies on screening and diagnostic tests
        • Purpose of these studies
          • Compare the efficacy of a new screening/diagnostic test with an existing, gold-standard test for diagnosing a disease
          • Below are some common terms used in screening tests
        • Sensitivity and specificity
          • Sensitivity: proportion of people with disease who test positive (true positive rate)
          • Specificity: proportion of people without disease who test negative (true negative rate)
          • They are intrinsic properties of the tests and do not change based on the prevalence of disease
          • Receiver operating characteristic (ROC) curve: graph showing the relationship between the sensitivity (true positive rate) and 1 minus specificity (false positive rate)
            • Best ROC curve has the highest true positive rate and lowest false positive rate (upper left corner on graph)
        • Predictive value
          • Positive predictive value (PPV): proportion of positive tests that are correct
          • Negative predictive value (NPV): proportion of negative tests that are correct
          • High prevalence of disease leads to high PPV and low NPV, while low prevalence of disease leads to low PPV and high NPV
            • In other words, if a disease is very common, a positive test is more likely to be a true positive, and a negative test is less likely to be a true negative (these quantities are worked through in the sketch at the end of this section)
        • Accuracy: sum of the number of true positives and true negatives divided by the total population screened
        • Likelihood ratio
          • Positive likelihood ratio (LR+): probability of a positive test result in people with disease divided by probability of a positive test result in people without disease
          • Negative likelihood ratio (LR-): probability of a negative test result in people with disease divided by probability of a negative test result in people without disease
        • Inter-observer reliability (kappa or κ): proportion of potential agreement beyond chance between different observers
          • κ > 0.7 is good, 0.4 to 0.7 is medium, and < 0.4 is poor
        • Screening test biases
          • Lead time bias: screening allows a disease to be diagnosed earlier, making the person appear to live longer with the disease, without actually increasing survival time
          • Length time bias: screening tests may be more likely to catch people with milder forms of a disease because they live longer
            • People with severe disease die faster and are less likely to be screened
          • Volunteer bias: type of selection bias in which people who want (and thus more likely) to get screened are different from the target population studied
    • Cost-effectiveness analysis (CEA): type of economic analysis that weighs the costs and outcomes of different actions
      • Incremental cost-effectiveness ratio (ICER): difference in cost between two interventions divided by the difference in their effectiveness
        • Unit used is often dollar per quality adjusted life year ($/QALY)
      • There are many assumptions about cost, patient preferences, and treatment efficacy in CEA so pay attention to sensitivity analyses
Relationship between sensitivity and specificity. Image courtesy of Blue64701.

Example of screening test calculations.

Example of a receiver operating characteristic (ROC) curve. Image courtesy of CMG Lee.
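All of the screening-test quantities above reduce to arithmetic on a 2x2 table, as in this sketch (Python assumed; the counts are invented):

```python
# Invented 2x2 table for a fictional screening test
tp, fp = 90, 50   # test positive: with disease (true positive), without disease (false positive)
fn, tn = 10, 850  # test negative: with disease (false negative), without disease (true negative)

sensitivity = tp / (tp + fn)  # true positive rate: 0.90
specificity = tn / (tn + fp)  # true negative rate: ~0.94
ppv = tp / (tp + fp)          # proportion of positive tests that are correct
npv = tn / (tn + fn)          # proportion of negative tests that are correct
accuracy = (tp + tn) / (tp + fp + fn + tn)
lr_pos = sensitivity / (1 - specificity)  # positive likelihood ratio (LR+)
lr_neg = (1 - sensitivity) / specificity  # negative likelihood ratio (LR-)

print(f"Sens = {sensitivity:.2f}, Spec = {specificity:.2f}, PPV = {ppv:.2f}, "
      f"NPV = {npv:.2f}, Accuracy = {accuracy:.2f}, LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")
```

Recomputing PPV and NPV after making the disease more common (for example, multiplying tp and fn by 10) shows the prevalence effect described above: PPV rises and NPV falls.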

Tips for reading the Results

  • Analyzing tables
    • Start with the table title: what is the purpose of this table?
    • Examine the table headings: what are each of the rows and columns about?
    • After familiarizing yourself with the table’s organization, begin interpreting the statistics in the table
  • Analyzing figures
    • Start with the figure title: what is the purpose of this figure?
    • Read the figure legend (text after the title): what are the authors trying to convey?
    • Examine the figure axes (if applicable): what are the x and y axes about, and what are their units?
    • After familiarizing yourself with the figure’s organization, begin interpreting the statistics in the figure
  • Look at Table 1 to determine generalizability of the findings
    • What are the demographics of participants included in the study (for example, age, sex, body mass index, any medical conditions)?
    • Is the study applicable to everyone or only a specific group of people?
  • Are the results statistically and clinically significant (practical importance of an effect)?
  • For cohort studies and clinical trials, pay attention to drop-outs and loss to follow-up
    • Figure 1 is often a flow diagram of the study design
    • How many patients stopped using the treatment (dropped out)?
    • How many patients were lost to follow up?
    • Why did people drop out or leave the study?
    • High drop-out or loss to follow-up rates (>20%) should raise concerns about selection bias
