©2016 Hamid Fadishei, fadishei@ce.ub.ac.ir

- Knowledge about the data is vital for successful data mining
- It is especially useful in the data preprocessing step
- It is difficult for a human to understand data just by looking at the raw values
- We'll learn how data can be described and visualized to help us understand it

Two common categories of descriptors

- Given an attribute, where do most of its values fall? → **Measures of central tendency**
- How are the data spread out? → **Measures of dispersion**

Measures of central tendency

- Mean
- Median
- Mode

Mean

$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1+x_2+...+x_N}{N}$

Example

| age | gender | salary |
|---|---|---|
| young | m | 58 |
| old | f | 33 |
| teenager | m | 12 |
| young | f | 86 |
| adult | m | 110 |
| young | m | 75 |
| young | f | 71 |
| adult | f | 90 |
| young | m | 86 |
| adult | f | 86 |
| old | m | 45 |

$mean(salary) = \frac{58+33+12+86+110+75+71+90+86+86+45}{11} = \frac{752}{11} \approx 68.4$
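As a quick check of the formula, the mean can be computed over all eleven salary values in the table. This is a minimal sketch using only the Python standard library; the variable names are illustrative.

```python
from statistics import mean

# Salary column from the example table (all 11 records).
salary = [58, 33, 12, 86, 110, 75, 71, 90, 86, 86, 45]

# Direct translation of the formula: sum of the values divided by N.
manual_mean = sum(salary) / len(salary)

# The standard library computes the same quantity.
library_mean = mean(salary)

print(round(manual_mean, 2))   # 68.36
print(round(library_mean, 2))  # 68.36
```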

Problem with the mean: sensitivity to extreme (outlier) values

Instead, we can use...

- Trimmed mean
  - Remove a fraction of the data from the top and bottom (e.g. 2%)
- Median
  - Middle value in the ordered list

Example: median(salary) = 75

12, 33, 45, 58, 71, **75**, 86, 86, 86, 90, 110
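The sensitivity of the mean, and the robustness of the trimmed mean and the median, can be sketched as follows. The helper `trimmed_mean`, the 10% trimming fraction, and the extra value 1000 are assumptions made for illustration; with only eleven records, a 2% trim would remove nothing.

```python
from statistics import mean, median

def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end of the sorted list."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values dropped at each end
    return mean(ordered[k:len(ordered) - k])

salary = [58, 33, 12, 86, 110, 75, 71, 90, 86, 86, 45]
with_outlier = salary + [1000]  # hypothetical extreme value, not in the table

# The mean jumps when one extreme value is added...
print(round(mean(salary), 2))  # 68.36
print(mean(with_outlier))      # 146

# ...while the median and the trimmed mean move only slightly.
print(median(salary))                   # 75
print(median(with_outlier))             # 80.5
print(trimmed_mean(salary, 0.1))        # 70
print(trimmed_mean(with_outlier, 0.1))  # 74
```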

Mode: the value that occurs most frequently

Example: mode(salary) = 86

The mode can be applied to any categorical data; the median also requires the values to be ordered (ordinal data)

mode(gender) = m

median(age) = young
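These two results can be reproduced with a short sketch. The ordering of the age categories (teenager < young < adult < old) is an assumption; the table does not state it explicitly.

```python
from statistics import mode

# Columns from the example table, in row order.
gender = ["m", "f", "m", "f", "m", "m", "f", "f", "m", "f", "m"]
age = ["young", "old", "teenager", "young", "adult", "young",
       "young", "adult", "young", "adult", "old"]

# Assumed ordering of the age categories, lowest to highest.
age_order = ["teenager", "young", "adult", "old"]

# The mode only needs counting, so no ordering is required.
print(mode(gender))  # m

# The median needs an order: sort by rank and take the middle element.
ranked = sorted(age, key=age_order.index)
median_age = ranked[len(ranked) // 2]
print(median_age)  # young
```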

Midrange: the average of the largest and smallest value

$midrange(salary) = \frac{110+12}{2} = 61$

Where the descriptors fall within the distribution of data

Measures of dispersion

- Range
- Quantile
- Five-number summary (Boxplot)

Range: The difference between the largest and smallest value

$range(salary) = 110-12 = 98$

Quantile: points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets

- The 2-quantile is the data point dividing the lower and upper halves of the data distribution (the median)
- The 4-quantiles (quartiles) are the three data points that split the data distribution into four equal parts
- The 100-quantiles are also known as percentiles
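A minimal sketch of these cut points using `statistics.quantiles` (Python 3.8+) on the salary column. Note that several conventions for computing quantiles exist; the default "exclusive" method is used here, and other methods (such as linear interpolation over the inclusive range) can give different values on small samples.

```python
from statistics import quantiles

salary = [58, 33, 12, 86, 110, 75, 71, 90, 86, 86, 45]

# n=2 gives one cut point: the median.
print(quantiles(salary, n=2))  # [75.0]

# n=4 gives the three quartiles Q1, Q2, Q3.
print(quantiles(salary, n=4))  # [45.0, 75.0, 86.0]

# n=100 gives the 99 percentile cut points.
percentiles = quantiles(salary, n=100)
print(len(percentiles))  # 99
```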

IQR (Inter-Quartile Range)

- A simple measure of dispersion
- Difference between third and first quartile
- Example:

$IQR(salary) = Q3(salary) - Q1(salary) = 86-45 = 41$

A rule of thumb for identifying outliers:

Values falling at least 1.5 × IQR above the third quartile or below the first quartile
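The rule of thumb can be sketched as a small filter. The function name `iqr_outliers` and the extra value 1000 are hypothetical; the quartiles come from `statistics.quantiles` with its default method, which is one convention among several.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return the values lying at least 1.5 * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v <= low or v >= high]

salary = [58, 33, 12, 86, 110, 75, 71, 90, 86, 86, 45]

# For the table data the fences are -16.5 and 147.5, so nothing is flagged.
print(iqr_outliers(salary))  # []

# A hypothetical extreme salary is flagged.
print(iqr_outliers(salary + [1000]))  # [1000]
```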

Five-number summary

- Five numbers together: minimum, Q1, median (Q2), Q3, and maximum
- Idea: a single value is not enough to describe the spread of the data, especially for skewed data
- Boxplots are usually used to visualize this
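A minimal sketch of the five-number summary for the salary column. The helper name `five_number_summary` is illustrative, and the quartiles again depend on the quantile convention (the default of `statistics.quantiles` is assumed here).

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)

salary = [58, 33, 12, 86, 110, 75, 71, 90, 86, 86, 45]
print(five_number_summary(salary))  # (12, 45.0, 75.0, 86.0, 110)
```

These are the five values a boxplot draws: the box spans Q1 to Q3, with a line at the median.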

Variance and standard deviation

A low standard deviation means that the data observations tend to be close to the mean

Variance = $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})^2 = \left(\frac{1}{N}\sum_{i=1}^{N}x_i^2\right)-\bar{x}^2$

Standard deviation = $\sigma$
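The two forms of the variance formula can be checked numerically against each other. This sketch uses the population variance (division by N, as in the formula above), which corresponds to `statistics.pvariance` rather than the sample variance `statistics.variance`.

```python
from math import sqrt
from statistics import pstdev, pvariance

salary = [58, 33, 12, 86, 110, 75, 71, 90, 86, 86, 45]
n = len(salary)
x_bar = sum(salary) / n

# Definition: mean squared deviation from the mean.
var_definition = sum((x - x_bar) ** 2 for x in salary) / n

# Shortcut form: mean of the squares minus the square of the mean.
var_shortcut = sum(x * x for x in salary) / n - x_bar ** 2

# Standard deviation is the square root of the variance.
std_dev = sqrt(var_definition)

print(var_definition, var_shortcut, pvariance(salary))
print(std_dev, pstdev(salary))
```

The shortcut form needs only one pass over the data, but it can lose precision when the mean is large compared with the spread.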

Histogram

- Summarizes the distribution of a given attribute
- For categorical data, the term bar chart is used
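The counts behind a histogram and a bar chart can be sketched without a plotting library. The bin width of 25 is an arbitrary choice for illustration, and the text rendering stands in for a real chart.

```python
from collections import Counter

salary = [58, 33, 12, 86, 110, 75, 71, 90, 86, 86, 45]
gender = ["m", "f", "m", "f", "m", "m", "f", "f", "m", "f", "m"]

bin_width = 25  # hypothetical bin width

# Histogram: map each value to the lower edge of its bin, then count per bin.
histogram = Counter((s // bin_width) * bin_width for s in salary)
for lower in sorted(histogram):
    print(f"{lower:3d}-{lower + bin_width - 1:3d} | {'#' * histogram[lower]}")

# Bar chart: for categorical data, count each category directly.
bar_chart = Counter(gender)
for category, count in sorted(bar_chart.items()):
    print(f"{category} | {'#' * count}")
```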

Scatter plot

- Plots the values of a pair of attributes against each other
- Helps to see the relationships between attributes (Correlations)
- Helps to see outliers and clusters

Scatter plots are an effective means of observing correlations

Correlation: a measure of linear relationship between two attributes

$corr(X,Y) = \rho_{X,Y} = \frac{cov(X,Y)}{\sigma_X\sigma_Y} = \frac{ \sum_{i=1}^N{(x_i-\bar{x})(y_i-\bar{y})} } { \sqrt{ \sum_{i=1}^N{(x_i-\bar{x})^2} \sum_{i=1}^N{(y_i-\bar{y})^2} } }$

Can be negative, positive, or zero
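The three cases can be illustrated with a direct translation of the formula. The function name `pearson` and the small data sets are made up for this sketch; they do not come from the example table.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient, computed term by term from the formula."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sqrt(sum((x - x_bar) ** 2 for x in xs)
                       * sum((y - y_bar) ** 2 for y in ys))
    return numerator / denominator

x = [1, 2, 3, 4, 5]  # hypothetical attribute values

print(pearson(x, [2, 4, 6, 8, 10]))   # 1.0  (perfect positive linear relationship)
print(pearson(x, [10, 8, 6, 4, 2]))   # -1.0 (perfect negative linear relationship)
print(pearson(x, [1, -1, 0, -1, 1]))  # 0.0  (no linear relationship)
```

The last pair of attributes is clearly related (the values form a symmetric pattern), yet the coefficient is zero, because correlation only captures linear relationships.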

Are scatter plots possible for datasets with more than two dimensions?

Example: Iris dataset

Scatter plot for 3 or 4 dimensions → Use a 3-d graph. Add a color depth for the 4th dimension

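Scatter plot for n dimensions → Use a matrix of $n(n-1)$ 2-d plots (one per ordered pair of attributes, i.e. $n(n-1)/2$ distinct pairs, each drawn in both orientations)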
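A quick count of the required plots for the Iris dataset, assuming its four usual numeric attributes. Ordered pairs correspond to the off-diagonal panels of a scatter-plot matrix; unordered pairs are the distinct attribute combinations.

```python
from itertools import combinations, permutations

# The four numeric attributes of the Iris dataset.
attributes = ["sepal length", "sepal width", "petal length", "petal width"]
n = len(attributes)

# Every ordered pair (x-axis, y-axis): n(n-1) panels.
ordered_pairs = list(permutations(attributes, 2))
print(len(ordered_pairs))  # 12

# Distinct pairs ignoring axis order: n(n-1)/2 panels.
distinct_pairs = list(combinations(attributes, 2))
print(len(distinct_pairs))  # 6
```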

- Scatter plots lose effectiveness as dimensionality increases
- Alternatively, we can use **parallel coordinate plots**
  - Draw n parallel lines for n dimensions
  - Connect the points of each record across the lines

Sometimes, the parallel coordinates are scaled to equal ranges
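One common way to scale the coordinates to equal ranges is min-max normalization, which maps each attribute to [0, 1]. The source does not say which scaling is meant, so this is an assumed choice; the function name `min_max_scale` is illustrative.

```python
def min_max_scale(column):
    """Rescale a numeric column linearly to the range [0, 1].

    Assumes the column has at least two distinct values.
    """
    lowest, highest = min(column), max(column)
    return [(value - lowest) / (highest - lowest) for value in column]

salary = [58, 33, 12, 86, 110, 75, 71, 90, 86, 86, 45]
scaled = min_max_scale(salary)

print(min(scaled), max(scaled))  # 0.0 1.0
```

After scaling, every axis of the plot covers the same interval, so attributes with large numeric ranges no longer dominate the picture.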

The problem with parallel coordinate plots: they lose readability when there are too many records

Chernoff faces. Idea: the human mind is able to recognize small differences in facial characteristics and to assimilate many facial characteristics at once

Each dimension is mapped to a facial characteristic (eye size, nose length, ...)