Statistics 101 by Borman

Ref: David Borman (2018). Statistics 101. Simon & Schuster.

___________________________________________________________________________

Summary

  • Statistics: The measuring and interpreting of data (especially large groups of numbers) to prove or disprove a point. Statistics measures the frequency, distribution, randomness, and cause/effect relationship of data points in studies. Statistics is used to determine measures of center, spread, and relative frequency, and to create models used for predicting outcomes.

  • Probability: A measure of the chance of an event happening.

___________________________________________________________________________

Fundamentals

  • With statistics, the goal is usually a 95% confidence level before a researcher can conclude that a result supports the hypothesis.

  • Law of Large Numbers: As the number of observations grows, results converge toward their expected values; statistical tests are therefore best done with large samples, ideally groups of >100.

  • Sampling

    • Convenience Sampling: Sampling at your own convenience.

    • Cluster Sampling: Sampling at randomly chosen locations.

    • Stratified Sampling: Separating sample sets into categories before systematic sampling; e.g. men & women.

    • Systematic Sampling: Testing one out of every specified number of samples, regardless of the order they are in.

  • Charts, Graphs, Plots

    • Dot Plots: X-Y axis data point plotting; the most basic form of statistics graphs.

    • Bar Charts: Similar to dot plots, but the data is represented as solid, vertical bars.

      • Histogram: A bar chart in which each bar touches another to form a solid mass of bars.

    • Frequency Polygons: A line graph that visually represents the distribution of data by plotting the midpoints of class intervals on the x-axis against their corresponding frequencies on the y-axis, and connecting the points as straight lines.

    • Time Series: A graph depicting event frequency (Y-Axis) per time (X-axis). 

    • Box & Whisker: A plot depicting the five-number summary: min, first quartile, median, third quartile, max.

  • Mean, Median, Mode

    • Mean: The average; sum all the amounts in the data set and divide by the number of data points in that set.

    • Median: The point that lies in the middle of the data.

    • Mode: The point that appears the most frequently in the data set.
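
    These three measures of center can be computed directly with Python's standard statistics module (the data set here is purely illustrative):

    ```python
    import statistics

    data = [2, 3, 3, 5, 7, 10]

    mean = statistics.mean(data)      # (2 + 3 + 3 + 5 + 7 + 10) / 6 = 5
    median = statistics.median(data)  # middle of sorted data: (3 + 5) / 2 = 4.0
    mode = statistics.mode(data)      # most frequent value: 3
    ```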

  • Deviation

    • Standard Deviation (s): A measure of the typical distance between the data values and the mean of the data; the square root of the variance.

      • Variance (s²): The sum of the squared differences between each data point and the mean, divided by the number of data points.

      • t-distribution: A distribution used in place of the normal distribution when working with small samples; its heavier tails account for the extra uncertainty.
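
    A minimal sketch of the relationship between variance and standard deviation, using Python's statistics module (the sample values are illustrative):

    ```python
    import statistics

    data = [4, 8, 6, 5, 3, 7]
    mu = statistics.mean(data)               # 5.5

    # Population variance: mean of the squared deviations from the mean
    var_p = statistics.pvariance(data, mu)   # 17.5 / 6
    # Sample variance divides by n - 1 instead of n (Bessel's correction)
    var_s = statistics.variance(data, mu)    # 17.5 / 5 = 3.5

    # Standard deviation is the square root of the variance
    sd_p = statistics.pstdev(data, mu)
    ```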

  • Probability: The measure of how likely something is to happen; the ratio of the number of ways an event can occur to the number of possible outcomes.

    • Subjective Probability: Basing the probability of an event on past personal experience.

    • Empirical Probability: Basing answers to the probability of an event on statistical analysis.

    • Rules of Probability

      • The chance of an event happening must be between 0 (0%) and 1 (100%).

      • If the chance that an event happens is x, then the chance that it doesn’t happen is 1 - x.

      • If the chance that an event happens is 1, then it will happen 100% of the time.

      • If the chance that an event happens is 0, then it will never happen.

      • In a group of mutually exclusive outcomes covering every possibility, the event probabilities must add up to 1 (100%).

      • If two separate events can occur together, this is a compound event; when drawing from a deck of cards, the ace of spades is both an ace and a spade.

      • If there is an event that happens in either/or two separate groups, this is a union of events.

      • If an event happens in both A & B, it is called an intersection of events.

      • If two events cannot both occur (either A or B, but not both), they are called mutually exclusive events.

      • An event consisting of everything outside set A is called the complement of event A.

    • Conditional Probability: Measuring the chance that something will happen if something else has already happened.

    • The probability of several independent events all occurring is the product of their individual probabilities; rolling a 6 three times in a row = 1/6 x 1/6 x 1/6 = 1/216.
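
    The complement, multiplication, and conditional rules above can be sketched with exact fractions (the die and card counts are the standard ones):

    ```python
    from fractions import Fraction

    # P(rolling a six on a fair die)
    p_six = Fraction(1, 6)

    # Complement rule: P(not six) = 1 - P(six)
    p_not_six = 1 - p_six                  # 5/6

    # Independent events multiply: three sixes in a row
    p_three_sixes = p_six ** 3             # 1/216

    # Conditional probability by counting: of the 13 spades in a
    # 52-card deck, exactly one is the ace
    p_ace_given_spade = Fraction(1, 13)
    ```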

  • Distribution

    • Data collected from many natural processes, when plotted, tends to form a bell curve.

    • Normal (Gaussian, Laplace-Gauss) Distribution: Graphed as the bell curve; the mean, median, and mode are all at the same point and the bell curve is symmetric about this point. The standard normal distribution has a mean equal to 0 and a standard deviation equal to 1. The total area under the normal distribution is 1.

      • Whale Tail: A bell curve in which the outliers dominate. 

      • Degrees of Freedom: The number of values in a calculation that are free to vary; used to determine how closely a sample distribution approaches the normal bell curve.

    • Pafnuty Chebyshev (1821-1894) showed that for any distribution, at least 1 - (1/k²) of the data will lie within k standard deviations of the mean.

    • Central Limit Theorem: As the size of each sample gets large enough, the sampling distribution of the mean can be approximated by the normal distribution no matter what the distribution of the individual data might be.

    • The central limit theorem tells us that the distribution of the sample statistic will be normally distributed, and the empirical rule tells us that 99.7% of the data will lie within three standard deviations of the mean.
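
    The empirical-rule percentages and Chebyshev's bound can both be checked numerically with statistics.NormalDist (Python 3.8+). Note that Chebyshev's guarantee holds for any distribution, so it is much weaker than the normal-specific figures:

    ```python
    from statistics import NormalDist

    z = NormalDist(mu=0, sigma=1)  # standard normal distribution

    # Fraction of a normal distribution within k standard deviations
    within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}
    # approximately {1: 0.683, 2: 0.954, 3: 0.997}

    # Chebyshev's bound for ANY distribution: at least 1 - 1/k^2
    chebyshev = {k: 1 - 1 / k**2 for k in (2, 3)}  # {2: 0.75, 3: 0.888...}
    ```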

  • Associations in Data

    • P-value: A measure of how small the significance level needs to be to make you reject the null hypothesis; the probability of an observed result assuming the null hypothesis is true.

      • P-value > .1: little or no evidence against the null hypothesis.

      • .05 < P-value < .1: some evidence to reject the null hypothesis.

      • .01 < P-value < .05: good evidence to reject the null hypothesis.

      • .001 < P-value < .01: excellent evidence to reject the null hypothesis.

      • P-value < .001: the strongest evidence to reject the null hypothesis.

    • Pearson Correlation Coefficient (r): Measures the strength of the linear relationship between two variables. The range of r is between -1 and 1. A value of 0 indicates no association; a negative value of r indicates that as one variable increases, the other decreases, and a positive value indicates that the two variables increase together.

    • Regression: Measures the predictable relationship between two strongly proportional variables.

      • Multiple Regression: A method of finding the relationships between how events or observations affect one another.

    • Coefficient of Determination (r²): Tells you what percentage of the variation in the output variable is explained by the input variable.
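
    A hand-rolled sketch of r, r², and the least-squares regression line (the small data set is illustrative):

    ```python
    import math

    xs = [1, 2, 3, 4, 5]
    ys = [2, 4, 5, 4, 5]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Pearson r: covariance divided by the product of the spreads
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)

    # Coefficient of determination: share of variation in y explained by x
    r_squared = r ** 2                     # 0.6 for this data

    # Least-squares regression line y = a + b*x
    b = sxy / sxx                          # slope
    a = mean_y - b * mean_x                # intercept
    ```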

[Figure: Standard Deviation]

___________________________________________________________________________

Research

  • Clearly state the question to be studied. 

  • Come up with a possible answer (null hypothesis). 

  • Determine a design for collecting data.

  • Estimate the population size.

  • Discover the ideal sample size. 

  • Collect the data in a random manner. 

  • Use statistics to test the results against your null hypothesis.

  • Reject, or fail to reject, your null hypothesis.
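
The steps above can be sketched as a one-sample z-test (all numbers here are hypothetical, and the population standard deviation is assumed to be known):

```python
from statistics import NormalDist
import math

# Null hypothesis: the population mean is 100.
# Hypothetical sample: n = 36, observed mean 106, known sigma = 15.
mu0, sigma, n, xbar = 100, 15, 36, 106

# Standard error of the mean shrinks with sample size (CLT)
se = sigma / math.sqrt(n)              # 15 / 6 = 2.5
z = (xbar - mu0) / se                  # 2.4

# Two-sided p-value: chance of a result at least this extreme
# if the null hypothesis is true
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

reject_null = p_value < 0.05           # the usual 95% threshold
```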

___________________________________________________________________________

Misc Quotes

  • “In 2017 there were 434 players in the NBA.”

  • “There are three kinds of lies: lies, damned lies, and statistics” (attributed to British PM Benjamin Disraeli).

  • “We usually think of positive things as predictable and mundane, and we think of things as random when they turn negative or against our best hopes. The facts are, most individual events in life have a large element of randomness attached to them.”

___________________________________________________________________________

Terminology

  • Big Data: Data sets with millions or billions of data points. 

  • Binomial: Events in which there are only two possible outcomes; e.g. flipping a coin.

  • Bivariate: Two variables.

  • Black Swan Events (Nassim Taleb): Major events that were seemingly unpredictable and occur out of the blue.

  • Hypothesis: The question that you are trying to prove or disprove.

    • Hypothesis Testing: State the question you want to study; state the assumed answer (null hypothesis); state what the answer is if the null hypothesis is proved wrong (alternate hypothesis); choose the statistical test to use; and determine the rules under which the null hypothesis would be rejected.

  • Inference: An educated, statistically supported “guess” about a group of data.

  • Inferential Statistics: Making educated guesses, testing theories, modeling observations’ relationships, and predicting outcomes with data analysis.

  • Interquartile Range (IQR): The range of the middle 50% of the data (a truer range that excludes outliers).

    • IQR Rule of Thumb: A data point is an outlier if it lies more than 1.5 x IQR below the first quartile or above the third quartile.

  • Primary Information: Data you collect yourself.

  • Six Sigma: A data-driven process that strives to eliminate defects in production and processes. The goal is that manufacturing outcomes lie within the specification limits, placed six standard deviations from the mean (hence the name).

  • Quantitative Data: Numerical based data. 

  • Qualitative Data: Words-based data. 
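
The IQR rule of thumb above can be sketched with statistics.quantiles (Python 3.8+; the data set is illustrative):

```python
import statistics

data = [5, 7, 8, 9, 10, 11, 12, 13, 40]

# statistics.quantiles cuts the data into four equal parts
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                              # 12.5 - 7.5 = 5.0

# Outliers lie more than 1.5 * IQR outside the middle 50%
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]   # [40]
```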

___________________________________________________________________________

Chronology

  • 1654: The correspondence between Blaise Pascal (1623-1662) and Pierre de Fermat (1601-1665) concerning a problem posed by Antoine Gombaud lays the foundation for probability theory (Statistics by Borman).

___________________________________________________________________________