
Statistics and Data Management



Statistics
- Regarded as a branch of science rather than of mathematics, since some
procedures in statistics do not agree with those in math, e.g. (1) math does
not generalize from particular cases, while (2) statistics can draw
conclusions using samples only.
- Deals with collection, organization, analysis, and presentation of pertinent
data under statistical experiment.
History of Statistics
- Statisticum Collegium (Latin) meaning council of state.
- Statistics began in surveying lands, demography, resources, etc.
- Statistik was first introduced by the German philosopher Gottfried Achenwall in
1749; he defined it as the science of the state.
- In the 18th century, the term “statistics” was defined as the systematic
collection of demographic and economic data by states. This is now
termed demography.
- In the 19th century, collection intensified and the meaning of statistics
broadened with the collection, summary, and analysis of data.
- In the 19th century, Statistics was extended to many fields of a scientific
(research) or commercial nature.
- In gambling, gamblers sought a systematic way to determine the probability
of winning and losing.
- Birth of mathematical statistics and probability theory
- Nowadays, more sciences are developed related to statistics, e.g., actuarial
science and data science.
Branches of Statistics
(1) Descriptive Statistics
- Concerned w/ description and summarization of data.
(2) Inferential Statistics
- Concerned w/ drawing conclusions from data.

In terms of terminologies:
Population
- Total collection of all the elements that we are interested in.
- Used in Descriptive
Sample
- Subset of the population that will be studied in detail.
- Used in Inferential
In terms of collection of data:
Census
- Process of collecting information from a population.
- Used in Descriptive
Sampling Design
- Process of collecting information from a sample. Also called ‘sampling’.
- Used in Inferential
Types of Sampling Design
Probability Sampling
- A sampling process where each unit in the population has a known
nonzero probability of being included in the sample (everyone has a
chance to be selected in the sample).
- Has a systematic way of collecting data
- Types of Probability Sampling
1. Pure Random Sampling
- Every item or element in the population is given an equal nonzero
chance of being selected for the sample. This can be done either w/
or w/o replacement. (Lottery sampling)
- Pros: very easy and simple to use.
- Cons: samples chosen may be distributed over a wide geographic
area.
- When to use:
a. If the population is not widely spread geographically.
b. If the population is more or less homogeneous with respect to
the characteristics of interest.
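The lottery idea above can be sketched in Python as follows. This is a minimal illustration, not a prescribed procedure: the function name `simple_random_sample`, the `seed` argument, and the list of 50 student IDs are all assumptions made for the example.

```python
import random

def simple_random_sample(population, n, replace=False, seed=None):
    """Draw n units so that every unit has the same chance of selection."""
    rng = random.Random(seed)  # seed is optional; it only makes the draw repeatable
    if replace:
        # With replacement: the same unit may be drawn more than once.
        return [rng.choice(population) for _ in range(n)]
    # Without replacement: each unit appears at most once.
    return rng.sample(population, n)

students = list(range(1, 51))  # hypothetical IDs for a population of 50 students
sample = simple_random_sample(students, 10, seed=1)
sample_wr = simple_random_sample(students, 10, replace=True, seed=1)
```

Drawing without replacement is the usual choice for surveys, since interviewing the same unit twice adds no information.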
2. Systematic Random Sampling
- Samples are selected by going down an arranged list of elements in
the population using a fixed interval, k; the k-th unit after
the last selected unit is also selected for the sample.
- k = population size / sample size, e.g. with 50 students and a sample
of 10, k = 50/10 = 5, so every 5th student in the list is selected.
- Pros
1. easy to administer in the field
2. Sample is spread evenly over the population
- Cons
1. May give poor precision when unsuspected periodicity is
present in the population.
- When to use?
a. If the ordering of the population is essentially random.
b. When stratification with numerous strata is used.
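A minimal Python sketch of the fixed-interval selection described above, using the 50-student example. The random starting point within the first interval is an assumption added here (the notes only state that every k-th unit is taken), and the function name `systematic_sample` is illustrative.

```python
import random

def systematic_sample(units, n, seed=None):
    """Select every k-th unit from an ordered list, where k = N // n."""
    k = len(units) // n                       # sampling interval, e.g. 50 // 10 = 5
    start = random.Random(seed).randrange(k)  # assumed: random start among the first k units
    return units[start::k][:n]

students = list(range(1, 51))  # hypothetical ordered list of 50 students
sample = systematic_sample(students, 10, seed=3)
gaps = [b - a for a, b in zip(sample, sample[1:])]
```

Every gap between consecutive selected students equals k, which is why hidden periodicity in the list ordering can hurt precision.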
3. Stratified Random Sampling
- Population is divided into strata or groupings.
- Samples from each stratum are drawn independently from the
samples from other strata.
- Samples from each stratum may be randomly drawn using simple
random sampling techniques.
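- Use this formula to find the sample size per stratum:
$n_i = \frac{n N_i}{N}$
where n is the total sample size, $N_i$ is the size of stratum i, and N is
the population size.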
- Pros:
1. Stratification of respondents is advantageous in terms of
precision of the estimates of the characteristics of the
population.
2. Sampling designs may vary by stratum to adjust for the
difference in the conditions across strata.
- Cons:
1. Values of the stratification variable may not be easily available for
all units in the population, especially if the characteristic of
interest is homogeneous.
2. It is possible that one or two strata end up with no
representatives.
- When to use?
a. If precise estimates are desired for stratified parts of the
population.
b. If sampling problems differ in the various strata of the
population.
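The per-stratum formula (sample size for a stratum = n times the stratum size, divided by N) can be sketched as below. The strata names and sizes are made-up numbers for illustration, and rounding to whole units is an assumption; with other sizes the rounded allocations may not add up exactly to n.

```python
def proportional_allocation(strata_sizes, n):
    """Allocate a total sample of n across strata in proportion to stratum size."""
    N = sum(strata_sizes.values())  # population size
    return {name: round(n * N_i / N) for name, N_i in strata_sizes.items()}

# Hypothetical population of 500 students split into four year levels.
strata = {"first year": 200, "second year": 150, "third year": 100, "fourth year": 50}
allocation = proportional_allocation(strata, n=50)
# first year: 50 * 200 / 500 = 20, second: 15, third: 10, fourth: 5
```

Within each stratum, the allocated number of units can then be drawn independently, for example by simple random sampling.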
Other samplings:
Cluster Sampling and Multistage Sampling.

Non-probability Sampling
- Probabilities of selection are not specified for the individual units in the
population, so individuals do not have a known or equal chance of being selected.
- Types of Non-Probability Sampling
1. Quota Sampling
- The main concern here is to come up w/ the desired number of
samples no matter how they are selected
- E.g. To test a product on different people, you
specify a quota for each skin type.
2. Judgement Sampling
- Expert selects a representative sample according to his own
subjective judgment.
- Here, who and what is included in the sample is based on the
standard of the researcher.
3. Purposive Sampling
- Researcher selects those who can best help explain or give
information based on his own judgment. Subjects are not
randomly selected.
4. Convenience Sampling
- Samples are taken from whoever is easiest to reach. Used, for
example, by companies to monitor their performance.
In terms of data produced:
Parameter
- A numerical characteristic of the population.
- Population size: N, Population Mean: µ, Population standard deviation: σ
Statistic
- A numerical value that describes a sample.
- Sample size: n, Sample Mean: x̄ , Sample Standard Deviation: s
In terms of data and measurement:
Data
- Characteristics or attributes of persons or objects which assume different
values for different objects under consideration.
Measurement
- Process of determining the value or label of a particular individual or object
on which a variable is measured.
Types of Data:
Qualitative Data
- Data yields categorical responses
- E.g. occupation, gender, civil status, political affiliation, etc.
Quantitative Data
- Takes on numerical values representing an amount or quantity
- E.g. height, salary, number of children, etc.
Discrete and Continuous
Discrete Variable
- Finite or at most, countably infinite numbers of values; usually measured by
counting or enumeration
- E.g. the number (quantity) of students, professors, mayors, children,
priests, doctors, etc.
Continuous Variable
- Infinitely many possible values, corresponding to a line interval
- E.g. time, money, weight, amount, etc.
Levels of Measurement
1. Nominal Level
- Have no numerical value.
- Also called categorical scales or categorical data
- It classifies persons or objects into two or more categories
- Remark:
● Categories may be referred to as true or artificial. True
categories are categories into which persons or objects
naturally fall
● Example of Nominal Level
a. Data which involve true categories include: Sex, Number
of siblings, employment status, educational attainment,
type of school
b. Data which involve artificial categories:
- Indicate your current status. Encircle one
Single 1
Married 2
Separated 3
Widowed 4
2. Ordinal Level
- Not only classifies subjects but also ranks them in terms of the
degree to which they possess characteristics of interest. Ex. Highest
to lowest or most to least.
- Example: Ranking in car racing competition. Question about how you
feel.
3. Interval Level
- Has the properties of both the nominal and ordinal scales. In addition,
it is based upon predetermined equal intervals.
- Does not have a true zero point
- E.g. IQ, Temperature, University Grading system.
4. Ratio Level
- Represents the highest, most precise, level of measurement.
- Has a meaningful true zero point.
- E.g. Height, weight, distance, etc.
Summation Notation

- ∑: the capital Greek letter sigma, used as the symbol for summation.
- $\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \dots + x_{n-1} + x_n$
i := index of summation
x := the variable of summation
1 := the lower limit
n := the upper limit
Properties of Summation
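1. $\sum_{i=1}^{n} c = nc$   Ex. $\sum_{i=1}^{4} 5 = 5 + 5 + 5 + 5 = 4 \cdot 5 = 20$
2. $\sum_{i=1}^{n} c\,x_i = c \sum_{i=1}^{n} x_i$
3. $\sum_{i=1}^{n} (x_i \pm y_i) = \sum_{i=1}^{n} x_i \pm \sum_{i=1}^{n} y_i$
4. $\sum_{j=1}^{m} \sum_{i=1}^{n} x_{ij} = \sum_{i=1}^{n} \sum_{j=1}^{m} x_{ij}$
Remarks: in general,
$\sum_{i=1}^{n} x_i y_i \neq \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)$ and $\sum_{i=1}^{n} \frac{x_i}{y_i} \neq \frac{\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} y_i}$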
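The summation properties and the two remarks can be checked numerically. The sketch below uses small made-up lists `x`, `y` and a 2-by-3 `grid`; it illustrates the rules on one data set and is not a proof.

```python
x = [1, 2, 3, 4]   # hypothetical data values
y = [2, 4, 6, 8]
c = 5
n = len(x)

# Property 1: summing a constant n times gives n * c (here 4 * 5 = 20).
constant_sum = sum(c for _ in range(n))

# Property 2: a constant factor can be pulled out of the sum.
scaled_sum = sum(c * xi for xi in x)

# Property 3: the sum of sums equals the sum of the separate sums.
paired_sum = sum(xi + yi for xi, yi in zip(x, y))

# Property 4: a double sum can be taken in either order.
grid = [[1, 2, 3], [4, 5, 6]]
rows_first = sum(sum(row) for row in grid)
cols_first = sum(sum(row[j] for row in grid) for j in range(3))

# Remarks: products and quotients do NOT distribute over sums.
sum_of_products = sum(xi * yi for xi, yi in zip(x, y))  # 2 + 8 + 18 + 32 = 60
product_of_sums = sum(x) * sum(y)                       # 10 * 20 = 200
sum_of_ratios = sum(xi / yi for xi, yi in zip(x, y))    # 0.5 * 4 = 2.0
ratio_of_sums = sum(x) / sum(y)                         # 10 / 20 = 0.5
```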
Data Management:
Measures of Central Tendency
- Single value that describes the way in which a group of data
cluster around a central value.
- Types of Central Tendency
a. Mean or Arithmetic Mean
- Sum of the data values divided by the number of
data values.
- Weighted Mean
● Let $x_1, x_2, \dots, x_n$ be data values w/ weights
$w_1, w_2, \dots, w_n$, respectively. Then, the weighted
mean of the set of data values is
$\frac{\sum_{i=1}^{n} x_i w_i}{\sum_{i=1}^{n} w_i}$

b. Median
- Middle observation when the data is sorted.
- When the size n is even, the median is the average of
two middle values
c. Mode
- Value of a variable that occurs most frequently in a
distribution
- Number of modes: 0 = no mode, 1 = unimodal, 2 = bimodal, more
than 2 = multimodal

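Population Mean: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
Sample Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Median (same rule for a population or a sample, with the n data values sorted as $X_1 \le X_2 \le \dots \le X_n$):
Median $= X_{(n+1)/2}$ if n is odd
Median $= \frac{X_{n/2} + X_{n/2+1}}{2}$ if n is even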
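The measures of central tendency can be computed with Python's standard `statistics` module, as sketched below. The data values, grades, and weights are made up for illustration, and `weighted_mean` is a hypothetical helper written for this example.

```python
import statistics

data = [2, 4, 4, 6, 9, 11]          # hypothetical data values

mean = statistics.mean(data)         # 36 / 6 = 6
median = statistics.median(data)     # n is even, so (4 + 6) / 2 = 5.0
modes = statistics.multimode(data)   # [4]: one mode, so the data set is unimodal

def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)

grades = [90, 80, 70]                # hypothetical grades
units = [2, 1, 1]                    # hypothetical weights (e.g. course units)
wm = weighted_mean(grades, units)    # (180 + 80 + 70) / 4 = 82.5
```

`statistics.multimode` returns every most-frequent value, so the length of its result distinguishes unimodal, bimodal, and multimodal data.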
Measures of Position
a. Percentiles
- Divide the set of data into 100 equal parts.
- Steps (n is the sample size and p is a proportion with 0 < p < 1):
1. Arrange the data in increasing order.
2. If np is not an integer, determine the smallest integer
greater than np. The data value in that position is
the sample 100p percentile.
3. If np is an integer, then the average of the values in
positions np and np+1 is the sample 100p percentile.
b. Quartiles
- Divide the set of data into 4 equal parts.
c. Deciles
- Divide the set of data into 10 equal parts.
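The percentile steps can be sketched as follows. To avoid floating-point surprises when testing whether np is an integer, this sketch takes the percentile as a whole number k (so p = k/100) and uses integer arithmetic; that interface, the function name `sample_percentile`, and the data are assumptions for illustration.

```python
def sample_percentile(data, k):
    """Return the k-th sample percentile (k = 1..99) using the np rule."""
    xs = sorted(data)             # step 1: arrange in increasing order
    n = len(xs)
    q, r = divmod(n * k, 100)     # np = q + r/100, computed exactly
    if r != 0:
        # Step 2: np is not an integer; take the smallest integer above np,
        # which is position q + 1 (index q when counting from zero).
        return xs[q]
    # Step 3: np is an integer; average positions np and np + 1.
    return (xs[q - 1] + xs[q]) / 2

scores = [12, 15, 17, 20, 22, 25, 28, 30, 33, 40]  # hypothetical data, n = 10
p25 = sample_percentile(scores, 25)  # np = 2.5, so position 3: 17
p50 = sample_percentile(scores, 50)  # np = 5, so (22 + 25) / 2 = 23.5
p90 = sample_percentile(scores, 90)  # np = 9, so (33 + 40) / 2 = 36.5
```

Quartiles and deciles follow from the same function: the first quartile is the 25th percentile and the ninth decile is the 90th percentile.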

Measures of Dispersion
- Values that describe the variability of a given set of values.
a. Range
- Difference of highest and lowest value.
Range= HV -LV
b. Population Standard Deviation
- Population variance, denoted by $\sigma^2$, is given by
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
- Population standard deviation, denoted by $\sigma$, is the
positive square root of $\sigma^2$, that is,
$\sigma = \sqrt{\sigma^2}$
c. Sample Standard Deviation
- Sample variance, denoted by $s^2$, is given by
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
- Sample standard deviation, denoted by s, is the
positive square root of $s^2$, that is,
$s = \sqrt{s^2}$
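The range and the two variance formulas can be sketched directly from their definitions. The data set below is made up, and the helper names are illustrative; Python's standard `statistics` module offers `pvariance`, `variance`, `pstdev`, and `stdev` for the same quantities.

```python
import math

def population_variance(xs):
    """Average squared deviation from the mean, dividing by N."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def sample_variance(xs):
    """Sum of squared deviations from the sample mean, dividing by n - 1."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical data; the mean is 5

data_range = max(data) - min(data)   # HV - LV = 9 - 2 = 7
sigma2 = population_variance(data)   # 32 / 8 = 4.0
sigma = math.sqrt(sigma2)            # 2.0
s2 = sample_variance(data)           # 32 / 7
s = math.sqrt(s2)
```

The same data give a larger sample variance than population variance because the divisor is n - 1 rather than N.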
Use the following example for practice:
