Fundamentals of Data Mining

2. Data: A Closer Look

Hamid Fadishei, Assistant Professor

CE Department, University of Bojnord

fadishei@yahoo.com, http://www.fadishei.ir

Introduction

  • Knowledge about data is vital for successful data mining
  • It is especially useful in the data preprocessing step
  • It is difficult for humans to understand data just by looking at raw values
  • We'll learn how data can be described and visualized to help us understand it

Statistical descriptors for data

Two common categories of descriptors

  • Given an attribute, where do most of its values fall? → Measures of central tendency
  • How are data spread out? → Measures of dispersion

Statistical descriptors for data

Measures of central tendency

  • Mean
  • Median
  • Mode

Measures of central tendency

Mean

$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1+x_2+...+x_N}{N}$

Measures of central tendency

Example

age       gender  salary
young     m       58
old       f       33
teenager  m       12
young     f       86
adult     m       110
young     m       75
young     f       71
adult     f       90
young     m       86
adult     f       86
old       m       45

$mean(salary) = \frac{58+33+12+86+110+75+71+90+86+86+45}{11} = \frac{752}{11} \approx 68.4$

Measures of central tendency

Problem with the mean: sensitivity to extreme (outlier) values

Instead, we can use...

  • Trimmed mean
    • Remove a fraction of the data from the top and bottom (e.g. 2%) before averaging
  • Median
    • Middle value in the ordered list

Example: median(salary) = 75, the middle (6th) of the 11 ordered values:

12, 33, 45, 58, 71, 75, 86, 86, 86, 90, 110
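The effect of trimming and of taking the median can be checked on the eleven salary values listed above. The following is a minimal sketch using only the Python standard library. The helper name `trimmed_mean` and the 10% trimming fraction are illustrative assumptions (with only 11 values, a 2% trim would remove nothing).

```python
from statistics import mean, median

# Salary column of the example table (11 records).
salary = [58, 33, 12, 86, 110, 75, 71, 90, 86, 86, 45]


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end.

    Hypothetical helper for illustration; assumes fraction < 0.5.
    """
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values cut per side
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


plain_mean = mean(salary)
trimmed = trimmed_mean(salary, 0.10)  # 10% is an illustrative choice
middle = median(salary)

print(f"mean         = {plain_mean:.2f}")
print(f"trimmed mean = {trimmed:.2f}")
print(f"median       = {middle}")
```

Here the 10% trim drops one value from each end (12 and 110), so the extremes no longer influence the result.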

Measures of central tendency

Mode: the value that occurs most frequently

Example: mode(salary) = 86

The mode can also be applied to nominal data, and the median to ordinal data

mode(gender) = m

median(age) = young
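These three results can be reproduced with a short sketch. The ordering teenager < young < adult < old used for the age labels is an assumption made here for illustration; the slides do not state it explicitly.

```python
from statistics import mode, median_low

# Columns of the example table, in row order.
age = ["young", "old", "teenager", "young", "adult", "young",
       "young", "adult", "young", "adult", "old"]
gender = ["m", "f", "m", "f", "m", "m", "f", "f", "m", "f", "m"]
salary = [58, 33, 12, 86, 110, 75, 71, 90, 86, 86, 45]

# Assumed ordering of the ordinal age labels (not given in the slides).
age_order = ["teenager", "young", "adult", "old"]

salary_mode = mode(salary)
gender_mode = mode(gender)  # mode needs no ordering, so nominal data works

# Median of ordinal data: rank the labels, take the middle rank,
# then map the rank back to its label.
ranks = [age_order.index(a) for a in age]
age_median = age_order[median_low(ranks)]

print(salary_mode, gender_mode, age_median)
```

`median_low` is used instead of `median` so that the result is always an existing rank, even for an even number of records.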

Measures of central tendency

Midrange: the average of the largest and smallest value

$midrange(salary) = \frac{110+12}{2} = 61$

Measures of central tendency

Where the descriptors fall within the distribution of data

Statistical descriptors for data

Measures of dispersion

  • Range
  • Quantile
  • Five-number summary (Boxplot)

Measures of dispersion

Range: The difference between the largest and smallest value

$range(salary) = 110-12 = 98$

Measures of dispersion

Quantile: Points taken at regular intervals of a data distribution, dividing it into essentially equal-sized consecutive sets

  • The 2-quantile is the data point dividing the lower and upper halves of the data distribution (the median)
  • The 4-quantiles (quartiles) are the three data points that split the data distribution into four equal parts
  • The 100-quantiles are also known as percentiles
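The cut points described above can be computed with `statistics.quantiles` from the Python standard library. This is only a sketch: the function's default `method="exclusive"` is one of several conventions, and other tools may interpolate differently and return slightly different cut points.

```python
from statistics import quantiles

salary = [58, 33, 12, 86, 110, 75, 71, 90, 86, 86, 45]

halves = quantiles(salary, n=2)     # one cut point: the median
quartiles = quantiles(salary, n=4)  # three cut points: Q1, Q2, Q3
deciles = quantiles(salary, n=10)   # nine cut points

print("2-quantile :", halves)
print("4-quantiles:", quartiles)
print("number of 10-quantile cut points:", len(deciles))
```

In general, splitting into n parts yields n - 1 cut points.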

Measures of dispersion

IQR (Inter-Quartile Range)

  • A simple measure of dispersion
  • Difference between third and first quartile
  • Example:

$IQR(salary) = Q3(salary) - Q1(salary) = 86-45 = 41$

A rule of thumb for identifying outliers:

Values falling at least $1.5 \times IQR$ above the third quartile or below the first quartile
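The rule of thumb can be sketched as follows. The function name `iqr_outliers` and the extra salary of 250 are hypothetical, added only to show a value being flagged. Quartiles are taken from `statistics.quantiles` with its default method, which gives Q1 = 45 and Q3 = 86 for the salary data; other quartile conventions can give different fences.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (outliers, (low_fence, high_fence)) using the k x IQR rule."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    # "At least k x IQR" beyond a quartile, so the fences are inclusive.
    flagged = [v for v in values if v <= low or v >= high]
    return flagged, (low, high)


salary = [58, 33, 12, 86, 110, 75, 71, 90, 86, 86, 45]

outliers, fences = iqr_outliers(salary)
print(outliers, fences)

# Hypothetical extra record, not part of the original table.
outliers_with_extra, _ = iqr_outliers(salary + [250])
print(outliers_with_extra)
```

For the original eleven salaries the fences are wide enough that no value is flagged.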

Measures of dispersion

Five-number summary

  • Five numbers together: minimum, Q1, median (Q2), Q3, and maximum
  • Idea: a single value is often not enough to describe the spread of data, especially for skewed data
  • Boxplots are usually used to visualize this
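As a rough illustration, the five numbers for the salary attribute can be collected and drawn as a one-line text boxplot. The helper `text_boxplot` and its 50-character width are assumptions made for this sketch; a plotting library would normally be used instead.

```python
from statistics import quantiles

salary = [58, 33, 12, 86, 110, 75, 71, 90, 86, 86, 45]

q1, q2, q3 = quantiles(salary, n=4)
summary = (min(salary), q1, q2, q3, max(salary))


def text_boxplot(lo, q1, med, q3, hi, width=50):
    """Draw whiskers (-), box (=) and median (M) on a single text line."""
    span = (hi - lo) or 1  # avoid division by zero for constant data

    def col(v):
        return round((v - lo) / span * (width - 1))

    row = ["-"] * width
    for i in range(col(q1), col(q3) + 1):
        row[i] = "="          # the box spans Q1..Q3
    row[0] = "|"              # minimum
    row[width - 1] = "|"      # maximum
    row[col(med)] = "M"       # median
    return "".join(row)


plot = text_boxplot(*summary)
print(summary)
print(plot)
```

The box sits to the right of center and the median is off-center within it, which hints at the skew in the salary values.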

Measures of dispersion

Variance and standard deviation

A low standard deviation means that the data observations tend to be close to the mean

Variance = $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})^2 = \left(\frac{1}{N}\sum_{i=1}^{N}x_i^2\right)-\bar{x}^2$

Standard deviation = $\sigma$
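The two forms of the variance formula can be checked against each other numerically. The sketch below uses the salary values of the earlier example and the population variance (division by $N$), as in the formula above; the sample variance, which divides by $N-1$, would give a slightly larger number.

```python
from math import isclose, sqrt
from statistics import pvariance

salary = [58, 33, 12, 86, 110, 75, 71, 90, 86, 86, 45]

n = len(salary)
mean = sum(salary) / n

# Definition: average squared deviation from the mean.
var_definition = sum((x - mean) ** 2 for x in salary) / n

# Shortcut: mean of the squares minus the square of the mean.
var_shortcut = sum(x * x for x in salary) / n - mean ** 2

std_dev = sqrt(var_definition)

print(f"variance (definition) = {var_definition:.2f}")
print(f"variance (shortcut)   = {var_shortcut:.2f}")
print(f"standard deviation    = {std_dev:.2f}")
print("forms agree:", isclose(var_definition, var_shortcut))
```

The shortcut form needs only one pass over the data, but it can lose precision when the mean is large relative to the spread.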

Data visualization

Histogram

  • Summarizes the distribution of a given attribute
  • For categorical data, the term bar chart is used
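The counting behind both chart types can be sketched in a few lines of text output. The bin width of 25 is an arbitrary choice for this illustration; different widths can change the apparent shape of the distribution.

```python
from collections import Counter

salary = [58, 33, 12, 86, 110, 75, 71, 90, 86, 86, 45]
gender = ["m", "f", "m", "f", "m", "m", "f", "f", "m", "f", "m"]

bin_width = 25  # assumed bin width, chosen only for illustration

# Histogram: map each numeric value to the lower edge of its bin.
bins = Counter((s // bin_width) * bin_width for s in salary)
for start in sorted(bins):
    print(f"[{start:3d}, {start + bin_width:3d})  {'#' * bins[start]}")

# Bar chart: one bar per category, no binning needed.
gender_counts = Counter(gender)
for label, count in sorted(gender_counts.items()):
    print(f"{label}  {'#' * count}")
```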

Data visualization

Scatter plot

  • Plots the values of a pair of attributes against each other
  • Helps to see the relationships between attributes (correlations)
  • Helps to see outliers and clusters

Data visualization

Scatter plots are effective means for observing correlations

Correlation: a measure of linear relationship between two attributes

$corr(X,Y) = \rho_{X,Y} = \frac{cov(X,Y)}{\sigma_X\sigma_Y} = \frac{ \sum_{i=1}^N{(x_i-\bar{x})(y_i-\bar{y})} } { \sqrt{ \sum_{i=1}^N{(x_i-\bar{x})^2} \sum_{i=1}^N{(y_i-\bar{y})^2} } }$

Correlation

Can be negative, positive, or zero
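A direct transcription of the correlation formula shows all three cases. The function name `pearson` and the small data series are made up for illustration and are not taken from the slides.

```python
from math import sqrt


def pearson(xs, ys):
    """Correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sqrt(
        sum((x - mean_x) ** 2 for x in xs)
        * sum((y - mean_y) ** 2 for y in ys)
    )
    return numerator / denominator


# Hypothetical data series.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # grows with x
falling = [10, 8, 6, 4, 2]   # shrinks as x grows
v_shaped = [2, 1, 0, 1, 2]   # depends on x, but not linearly

r_pos = pearson(x, rising)
r_neg = pearson(x, falling)
r_zero = pearson(x, v_shaped)

print(r_pos, r_neg, r_zero)
```

The V-shaped series has a correlation of about zero even though it clearly depends on `x`, which is why correlation is described as a measure of linear relationship only.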

Data visualization

Are scatter plots possible for datasets with more than two dimensions?

Example: Iris dataset

(Part of) Iris dataset

Data visualization

Scatter plot for 3 or 4 dimensions → Use a 3-d graph, and encode the 4th dimension as color

Data visualization

Scatter plot for n dimensions → Use $n(n-1)$ 2-d plots, one per ordered pair of attributes (only $n(n-1)/2$ of them are distinct up to swapping the axes)
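The number of panels can be checked by enumerating attribute pairs. The sketch below uses the four Iris attribute names and only lists the panels; drawing them would require a plotting library, which is left out here.

```python
from itertools import combinations, permutations

# The four numeric attributes of the Iris dataset.
attributes = ["sepal length", "sepal width", "petal length", "petal width"]
n = len(attributes)

# Every ordered (x-axis, y-axis) pair of distinct attributes.
panels = list(permutations(attributes, 2))

# Pairs that differ only by swapping the axes are counted once.
unique_pairs = list(combinations(attributes, 2))

print(len(panels), "panels in total")
print(len(unique_pairs), "distinct attribute pairs")
for x_attr, y_attr in unique_pairs:
    print(f"  {x_attr} vs {y_attr}")
```

For the four Iris attributes this gives 12 panels, of which 6 show distinct pairs.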

Data visualization

  • Scatter plots lose effectiveness as dimensionality increases
  • Alternatively, we can use parallel coordinate plots
  • Draw n parallel lines for n dimensions
  • Cross-connect the points of each record on the lines together

Data visualization

Sometimes, the parallel coordinates are scaled to equal ranges
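One common way to bring all axes to the same range is min-max scaling of each attribute to [0, 1]. This is a sketch of that idea, not necessarily the scaling used for the plots in these slides. The helper `min_max_scale` and the three records are hypothetical.

```python
def min_max_scale(column):
    """Linearly rescale a sequence of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant attribute carries no spread; place it at 0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical records with four numeric attributes each.
records = [
    (5.0, 3.0, 1.5, 0.2),
    (6.5, 2.8, 4.5, 1.5),
    (7.0, 3.2, 6.0, 2.2),
]

# Scale attribute by attribute (column-wise), then rebuild the records.
columns = list(zip(*records))
scaled_columns = [min_max_scale(col) for col in columns]
scaled_records = list(zip(*scaled_columns))

for rec in scaled_records:
    print(tuple(round(v, 2) for v in rec))
```

Each scaled record then gives the heights at which its line crosses the n parallel axes.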

The problem with parallel coordinate plots: they become unreadable when there are too many records

Chernoff faces

Idea: The human mind can recognize small differences in facial characteristics and take in many facial characteristics at once

Each dimension is mapped to a facial characteristic (eye size, nose length, ...)