# 2. Data: A Closer Look

## Introduction

• Knowledge about data is vital for successful data mining
• It is especially useful in the data preprocessing step
• It is difficult for a human to understand data by looking at the raw values
• We'll learn how data can be described and visualized to help us understand it

## Statistical descriptors for data

Two common categories of descriptors

• Given an attribute, where do most of its values fall? → Measures of central tendency
• How are data spread out? → Measures of dispersion

## Statistical descriptors for data

Measures of central tendency

• Mean
• Median
• Mode

## Measures of central tendency

Mean

$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1+x_2+...+x_N}{N}$

## Measures of central tendency

Example

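| age | gender | salary |
|---|---|---|
| young | m | 58 |
| old | f | 33 |
| teenager | m | 12 |
| young | f | 86 |
| young | m | 75 |
| young | f | 71 |
| young | m | 86 |
| old | m | 45 |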

$mean(salary) = \frac{58+33+12+86+110+75+71+90+86}{9} = 69$
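As a quick check of the formula, the following minimal Python sketch recomputes the mean of the nine salary values summed above. The variable names are chosen here for illustration only.

```python
from statistics import mean

# The nine salary values summed in the example above.
salaries = [58, 33, 12, 86, 110, 75, 71, 90, 86]

# Direct translation of the formula: sum of the values divided by N.
manual_mean = sum(salaries) / len(salaries)

# The standard library computes the same quantity.
library_mean = mean(salaries)

print(manual_mean)                  # 69.0
print(library_mean == manual_mean)  # True
```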

## Measures of central tendency

Problem with the mean: sensitivity to extreme (outlier) values

• Trimmed mean: remove a fraction of the data from the top and bottom (e.g. 2%) before averaging
• Median: the middle value in the ordered list

Example: median(salary) = 75, the 6th of the 11 ordered values

12, 33, 45, 58, 71, 75, 86, 86, 86, 90, 110
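The effect of an outlier on these descriptors can be sketched in Python. The helper `trimmed_mean`, the 10% trimming fraction, and the extreme value 1100 are assumptions made for this illustration; they are not part of the original example.

```python
from statistics import mean, median

# Ordered salary values from the median example.
salaries = [12, 33, 45, 58, 71, 75, 86, 86, 86, 90, 110]

def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values cut from each end
    return mean(ordered[k:len(ordered) - k])

print(median(salaries))              # 75
print(trimmed_mean(salaries, 0.10))  # 70 (12 and 110 are dropped)

# Replace the largest value with a hypothetical extreme one.
with_outlier = salaries[:-1] + [1100]

print(mean(salaries), mean(with_outlier))  # the plain mean shifts strongly
print(median(with_outlier))                # 75, unchanged
print(trimmed_mean(with_outlier, 0.10))    # 70, unchanged
```

With only 11 values, a 2% trim would remove nothing, which is why a larger fraction is used in this sketch.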

## Measures of central tendency

Mode: the value that occurs most frequently

Example: mode(salary) = 86

Mode can be applied to nominal data, and both mode and median can be applied to ordinal data

mode(gender) = m

median(age) = young
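These three results can be reproduced with a short Python sketch over the rows of the example table. The ordering teenager < young < old is an assumption needed to treat age as ordinal, and the variable names are illustrative.

```python
from statistics import median_low, mode

# Rows of the example table as (age, gender, salary).
records = [
    ("young", "m", 58), ("old", "f", 33), ("teenager", "m", 12),
    ("young", "f", 86), ("young", "m", 75), ("young", "f", 71),
    ("young", "m", 86), ("old", "m", 45),
]

ages = [r[0] for r in records]
genders = [r[1] for r in records]
salaries = [r[2] for r in records]

print(mode(salaries))  # 86
print(mode(genders))   # m

# Assumed ordering of the ordinal attribute: teenager < young < old.
age_order = ["teenager", "young", "old"]
ranks = [age_order.index(a) for a in ages]

# median_low always returns an actual data value, so the rank maps back to a label.
median_age = age_order[median_low(ranks)]
print(median_age)  # young
```

Using `median_low` avoids averaging two ranks, which would have no meaning for an ordinal attribute.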

## Measures of central tendency

Midrange: the average of the largest and smallest value

$midrange(salary) = \frac{110+12}{2} = 61$
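A minimal Python sketch of the midrange follows, using the ordered salary values from the median example. The function name `midrange` and the extreme value 1100 are assumptions made for illustration.

```python
salaries = [12, 33, 45, 58, 71, 75, 86, 86, 86, 90, 110]

def midrange(values):
    """Average of the largest and smallest value."""
    return (max(values) + min(values)) / 2

print(midrange(salaries))  # 61.0

# Only the two extremes matter, so one hypothetical outlier moves the result a lot.
print(midrange(salaries + [1100]))  # 556.0
```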

## Measures of central tendency

Where the descriptors fall within the distribution of data

## Statistical descriptors for data

Measures of dispersion

• Range
• Quantile
• Five-number summary (Boxplot)

## Measures of dispersion

Range: The difference between the largest and smallest value

$range(salary) = 110-12 = 98$

## Measures of dispersion

Quantile: points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets

• The 2-quantile is the data point dividing the lower and upper halves of the data distribution (the median)
• The 4-quantiles are the three data points that split the data distribution into four equal parts (the quartiles)
• The 100-quantiles are more commonly known as percentiles
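The cut points can be computed with `statistics.quantiles` from the Python standard library (Python 3.8 or later). This is only a sketch: libraries and textbooks use slightly different interpolation conventions, so values may differ a little on other datasets. The salary list is the one from the median example.

```python
from statistics import median, quantiles

salaries = [12, 33, 45, 58, 71, 75, 86, 86, 86, 90, 110]

# One cut point: the median.
halves = quantiles(salaries, n=2)
print(halves)  # [75.0]

# Three cut points: the quartiles Q1, Q2, Q3.
q1, q2, q3 = quantiles(salaries, n=4)
print(q1, q2, q3)  # 45.0 75.0 86.0

# Ninety-nine cut points: the percentiles.
percentiles = quantiles(salaries, n=100)
print(len(percentiles))  # 99
```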

## Measures of dispersion

IQR (Inter-Quartile Range)

• A simple measure of dispersion
• Difference between third and first quartile
• Example:

$IQR(salary) = Q3(salary) - Q1(salary) = 86-45 = 41$

A rule of thumb for identifying outliers:

Values falling at least $1.5 \times IQR$ above the third quartile or below the first quartile
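The rule of thumb can be sketched in Python as follows. The function name `iqr_outliers`, its parameter `k`, and the extra value 200 are assumptions introduced for this illustration.

```python
from statistics import quantiles

salaries = [12, 33, 45, 58, 71, 75, 86, 86, 86, 90, 110]

def iqr_outliers(values, k=1.5):
    """Values lying at least k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v <= lower or v >= upper]

# IQR = 86 - 45 = 41, so the limits are -16.5 and 147.5.
print(iqr_outliers(salaries))          # []

# A hypothetical salary of 200 lies beyond the upper limit.
print(iqr_outliers(salaries + [200]))  # [200]
```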

## Measures of dispersion

Five-number summary

• Five numbers together: minimum, Q1, median (Q2), Q3, and maximum
• Idea: a single value cannot adequately describe the spread of data, especially for skewed data
• Boxplots are usually used to visualize this
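The five numbers for the salary example can be computed with a short Python sketch. The function name `five_number_summary` is illustrative, and the quartiles follow the default convention of `statistics.quantiles`.

```python
from statistics import quantiles

salaries = [12, 33, 45, 58, 71, 75, 86, 86, 86, 90, 110]

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)

summary = five_number_summary(salaries)
print(summary)  # (12, 45.0, 75.0, 86.0, 110)
```

In a boxplot, the box spans Q1 to Q3, a line inside the box marks the median, and the whiskers extend toward the minimum and maximum.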

## Measures of dispersion

Variance and standard deviation

A low standard deviation means that the data observations tend to be close to the mean

Variance = $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})^2 = \left(\frac{1}{N}\sum_{i=1}^{N}x_i^2\right)-\bar{x}^2$

Standard deviation = $\sigma$
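Both forms of the variance formula can be checked numerically. The sketch below uses the salary values from the median example and compares the results with `pvariance` and `pstdev` from the Python standard library, which divide by $N$ as the formula above does (the sample versions `variance` and `stdev` divide by $N-1$ instead).

```python
from math import isclose, sqrt
from statistics import pstdev, pvariance

salaries = [12, 33, 45, 58, 71, 75, 86, 86, 86, 90, 110]
n = len(salaries)
x_bar = sum(salaries) / n

# First form: mean of the squared deviations from the mean.
var_definition = sum((x - x_bar) ** 2 for x in salaries) / n

# Second form: mean of the squares minus the square of the mean.
var_shortcut = sum(x * x for x in salaries) / n - x_bar ** 2

sigma = sqrt(var_definition)

print(isclose(var_definition, var_shortcut))         # True
print(isclose(var_definition, pvariance(salaries)))  # True
print(isclose(sigma, pstdev(salaries)))              # True
```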

## Data visualization

Histogram

• Summarizes the distribution of a given attribute
• For categorical data, the term bar chart is used instead
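The counting step behind a histogram can be sketched in plain Python. The bin width of 20 is an assumption chosen for this illustration, and the text output only stands in for a real plot.

```python
from collections import Counter

salaries = [12, 33, 45, 58, 71, 75, 86, 86, 86, 90, 110]
bin_width = 20  # assumed bin width, chosen for illustration

# Map each value to the lower edge of its bin, then count values per bin.
counts = Counter((x // bin_width) * bin_width for x in salaries)

# Print one text bar per non-empty bin.
for lower in sorted(counts):
    print(f"[{lower:3d}, {lower + bin_width:3d}) {'#' * counts[lower]}")
```

The tallest bar belongs to the bin from 80 to 100, which holds four of the eleven values.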

## Data visualization

Scatter plot

• Plots the values of a pair of attributes against each other
• Helps to see the relationships between attributes (Correlations)
• Helps to see outliers and clusters

## Data visualization

Scatter plots are an effective means of observing correlations

Correlation: a measure of linear relationship between two attributes

$corr(X,Y) = \rho_{X,Y} = \frac{cov(X,Y)}{\sigma_X\sigma_Y} = \frac{ \sum_{i=1}^N{(x_i-\bar{x})(y_i-\bar{y})} } { \sqrt{ \sum_{i=1}^N{(x_i-\bar{x})^2} \sum_{i=1}^N{(y_i-\bar{y})^2} } }$

## Correlation

Can be negative, positive, or zero
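The correlation formula translates directly into Python. In the sketch below, the function name `pearson` and the small datasets are hypothetical; they are chosen so that the three cases (positive, negative, zero) are easy to verify by hand.

```python
from math import sqrt

def pearson(xs, ys):
    """Correlation coefficient of two equally long sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sqrt(sum((x - x_bar) ** 2 for x in xs)
                       * sum((y - y_bar) ** 2 for y in ys))
    return numerator / denominator

xs = [1, 2, 3, 4]  # hypothetical attribute values

print(pearson(xs, [2, 4, 6, 8]))    # 1.0  (y grows with x)
print(pearson(xs, [8, 6, 4, 2]))    # -1.0 (y shrinks as x grows)
print(pearson(xs, [1, -1, -1, 1]))  # 0.0  (no linear relationship)
```

The last case shows that a correlation of zero rules out only a linear relationship: the values still follow a clear U-shaped pattern.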

## Data visualization

Are scatter plots possible for datasets with more than 2 dimensions?

Example: Iris dataset

## Data visualization

Scatter plot for 3 or 4 dimensions → Use a 3-d graph, and encode the 4th dimension as color

## Data visualization

Scatter plot for n dimensions → Use a scatter-plot matrix of $n(n-1)$ 2-d plots, one for each ordered pair of distinct attributes
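The count $n(n-1)$ covers every ordered pair of distinct attributes, so each pair appears twice with its axes swapped. A small Python sketch for the four Iris attributes makes this concrete; the variable names are illustrative.

```python
from itertools import combinations, permutations

# The four numeric attributes of the Iris dataset.
attributes = ["sepal length", "sepal width", "petal length", "petal width"]

# Every (x-axis, y-axis) choice with two different attributes.
panels = list(permutations(attributes, 2))

# Pairs that differ by more than a swap of the axes.
distinct_pairs = list(combinations(attributes, 2))

print(len(panels))          # 12 = n(n-1) for n = 4
print(len(distinct_pairs))  # 6  = n(n-1)/2
```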

## Data visualization

• Scatter plots lose effectiveness as dimensionality increases
• Alternatively, we can use parallel coordinate plots:
  • Draw n parallel lines, one for each of the n dimensions
  • Connect the points of each record across the lines

## Data visualization

Sometimes, the parallel coordinates are scaled to equal ranges

The problem with parallel coordinate plots: they lose readability when there are too many records
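Scaling the coordinates to equal ranges is commonly done with min-max scaling, which is one possible choice and is assumed here. In the Python sketch below, the function name `min_max_scale` and the records are hypothetical; each scaled row gives the heights at which a record's line crosses the parallel axes.

```python
def min_max_scale(records):
    """Rescale every attribute (column) of `records` to the range [0, 1]."""
    columns = list(zip(*records))
    scaled_columns = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        # A constant attribute has no spread; place it at 0.0 to avoid dividing by zero.
        scaled_columns.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_columns)]

# Hypothetical records: (age in years, salary, years of experience).
records = [(25, 58, 2), (47, 33, 20), (16, 12, 0), (31, 86, 8)]

scaled = min_max_scale(records)
for row in scaled:
    print([round(v, 2) for v in row])
```

After scaling, every axis runs from 0 to 1, so an attribute with a large numeric range no longer dominates the plot.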

## Chernoff faces

Idea: The human mind can recognize small differences in facial characteristics and assimilate many facial characteristics at once

Each dimension is mapped to a facial characteristic (eye size, nose length, ...)