Fundamentals of Data Mining

1. Introduction

Hamid Fadishei, Assistant Professor

CE Department, University of Bojnord

fadishei@yahoo.com, http://www.fadishei.ir

The deluge of data

Today,

  1. Almost every automated system generates some sort of data
  2. Data storage is very cheap (If not free)

Let's see some examples...

Data example #1: WWW

A huge number of sites (~1 billion) and documents (multi-billion) which makes a diverse source of data in various forms

  • Text: news, articles, blogs
  • Tiny text: tweets, comments, search phrases
  • Multimedia: audio/video posted on YouTube, Facebook
  • Graph: intra-website hyperlinks, friendships
  • Sequence: user access logs

Data example #2: Financial Transactions

Record of payments and purchases performed by people on ATMs or websites

Note: separate sources of data can be "joined" together to make an even richer data set:

  • Users' financial transactions, plus...
  • Their purchase history, plus...
  • Their social profiles (friend list)

Data example #3: Sensors and IoT

We (people, cars, buildings, cities, etc) are surrounded by sensors!

  • Accelerometer, GPS, light, proximity and health sensors on smartphones and wearable devices
  • Weather and air quality monitoring stations
  • Presence and passage sensors
  • etc

IoT enables sensor devices to communicate with the cloud and store their readings

Data example #4: Scientific projects

  • NASA Earth observation satellites generate a terabyte of data every day
  • The Human Genome project is storing thousands of bytes for each of several billion genetic bases

The world is data rich but information poor!

Data Mining

What is data mining?

Data mining is the study of collecting, cleaning, processing, analyzing, and gaining useful insights from data.

Textbook

Administrative details

You are expected to...

  • Attend classes regularly
  • Put away your cellphones and other electronic devices
  • Not cheat on your exams, projects and homework
  • Not randomly walk in and out during the class

Administrative details

Any kind of / any size of cheating will be strongly punished by failing all involved students

DELIVER YOUR OWN WORK ONLY!

Hit the threshold of 3/16 absences and you'll be banned from the final exam and receive a zero

STUDY THE EDUCATIONAL REGULATIONS CAREFULLY!

The penalty of cheating and excessive absences will be applied with no prior warnings and no compromises

Administrative details

The course will be presented in slides

  • You'll get a mini-break of a few seconds roughly every 15 slides, to watch a clip
  • It is recommended to bring a hard copy of the slides to class. Print multiple pages per sheet to save paper

The course calls for intensive effort and you'll face...

  • A challenging final exam
  • A challenging programming project
  • Lots of homework

We've already seen examples of data

Now let's see some example applications of mining that data!

Application example #1: Store product placement

Given

  • A merchant with d products
  • A history of the previous customer shopping baskets (items bought together)

The merchant would like to know where to put the items on the shelves so that items frequently bought together end up on adjacent shelves

This is a problem of "frequent pattern mining"
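
As a rough illustration of the idea, the sketch below counts how often each pair of items appears in the same basket and keeps the pairs that reach a minimum count. The baskets, the item names, and the function name frequent_pairs are made up for this example; real frequent pattern mining algorithms are more elaborate than this brute-force count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; the item names are made up for illustration.
baskets = [
    {"milk", "bread", "cheese"},
    {"milk", "bread"},
    {"bread", "cheese"},
    {"milk", "bread", "eggs"},
]


def frequent_pairs(baskets, min_support):
    """Return item pairs that occur together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort the items so each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


pairs = frequent_pairs(baskets, min_support=2)
print(pairs)
```

Pairs that survive the threshold are candidates for adjacent shelves.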

Application example #1: Store product placement

A photo of an Amazon.com inventory

Application example #2: Loan risk assessment

Given

  • A table of a bank's previous loan customers (borrowers)
  • Their age, income, gender, mortgage, etc.
  • Whether each of them was a good borrower or not

The bank wants to make a decision for a new loan request by predicting whether the customer will repay on schedule or not

This is a problem of "classification"

(Mention the concept of similarity)

Structure of a dataset

Look at this sample dataset

Name      Age  Gender  Employed  Income  Home  Loan Amount  Repaying
Ramin S.  32   M       Y         40      OWN   3.7          ONSCHED
Mina F.   27   F       Y         12      RENT  5.5          LATE
Babak J.  37   M       N         10      OWN   7.0          NEVER
Nima D.   20   M       N         8       RENT  1.0          ONSCHED

  • This is a multidimensional dataset
  • It contains a set of 4 records (AKA data point, instance, example, entity, tuple, row, and feature vector)
  • Each record consists of 8 fields (AKA attribute, feature, dimension, column)
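
One possible way to hold such a dataset in code, assuming plain Python structures rather than any particular library: each record becomes a dict keyed by field name. The variable names are illustrative only.

```python
# The eight fields (columns) of the sample loan dataset.
fields = ["Name", "Age", "Gender", "Employed", "Income", "Home",
          "Loan Amount", "Repaying"]

# The four records (rows) of the sample loan dataset.
rows = [
    ("Ramin S.", 32, "M", "Y", 40, "OWN", 3.7, "ONSCHED"),
    ("Mina F.", 27, "F", "Y", 12, "RENT", 5.5, "LATE"),
    ("Babak J.", 37, "M", "N", 10, "OWN", 7.0, "NEVER"),
    ("Nima D.", 20, "M", "N", 8, "RENT", 1.0, "ONSCHED"),
]

# One dict per record, mapping field name to value.
records = [dict(zip(fields, row)) for row in rows]

n_records = len(records)            # number of data points
n_fields = len(fields)              # number of attributes (dimensions)
ages = [r["Age"] for r in records]  # a single attribute across all records

print(n_records, n_fields, ages)
```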

Another example of a dataset

Car evaluation dataset

buying  maint  doors  persons  lug_boot  safety  accpt
vhigh   vhigh  2      2        small     med     unacc
low     high   2      4        small     med     acc
low     high   4      4        big       low     unacc
low     low    4      4        med       med     good
low     low    4      4        med       high    vgood
...

Mini-break #1

Types of attributes

Attributes

  • Categorical (values are symbols)
    • Nominal
    • Ordinal
  • Numeric (values are numbers)
    • Interval-scaled
    • Ratio-scaled

Types of attributes

Categorical attribute values represent some kind of category, code, state, etc

Nominal attributes do not have a meaningful order

Ordinal attributes imply an order

Examples:

  • pizza size (pico, mini, family)
  • marital status (single, married, divorced, widowed)
  • hair color (black, white, red, blond, ...)
  • gender (male, female)
  • grade (A+, A, A-, B+, ...)

Types of attributes

Numeric attributes are quantitative and measurable, represented by integer or real numbers

For interval-scaled values, we can quantify the difference

For ratio-scaled values, we can also quantify the ratio

Examples:

  • Temperature in Celsius (interval-scaled)
  • Temperature in Kelvin (ratio-scaled)
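
A small worked example may help here. It assumes the interval-scaled temperature is measured in degrees Celsius; the two readings are arbitrary. Differences agree on both scales, but only the Kelvin ratio is physically meaningful, because Kelvin has a true zero.

```python
def celsius_to_kelvin(c):
    """Convert a temperature from degrees Celsius to Kelvin."""
    return c + 273.15


a_c, b_c = 10.0, 20.0  # two arbitrary readings in Celsius

# Differences are meaningful on both scales and have the same size.
diff_c = b_c - a_c
diff_k = celsius_to_kelvin(b_c) - celsius_to_kelvin(a_c)

# The Celsius ratio is 2.0, but 20 C is not "twice as hot" as 10 C.
ratio_c = b_c / a_c
# The Kelvin ratio (about 1.035) is the meaningful one.
ratio_k = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)

print(diff_c, diff_k, ratio_c, ratio_k)
```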

Types of attributes

Attributes

  • Discrete: can only take particular, separate values (there can be an infinite number of those values)
  • Continuous: not restricted to separate defined values

Discrete attribute examples: age (18, 9, 56), hair color (black, gray, red, etc)

Continuous attribute example: temperature (27.05, 31.0, -2.4, ...)

Types of attributes

The significance of attribute types

  • The type of data matters in data mining: some operations only make sense on certain data types
  • Comparison over ordinal attributes
  • Mean and variance calculation over numeric data
  • Entropy calculation over categorical data
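
The three operations can be sketched as follows, using the Income and Repaying columns of the loan table and the pizza sizes from the earlier examples. The ordering pico < mini < family and the helper names are assumptions of this sketch; entropy here means Shannon entropy in bits.

```python
import math
from collections import Counter
from statistics import mean, pvariance

incomes = [40, 12, 10, 8]                            # numeric attribute
repaying = ["ONSCHED", "LATE", "NEVER", "ONSCHED"]   # categorical attribute
size_order = ["pico", "mini", "family"]              # ordinal attribute (assumed order)


def is_smaller(a, b):
    """Compare two ordinal values by their position in the assumed order."""
    return size_order.index(a) < size_order.index(b)


def entropy(values):
    """Shannon entropy (in bits) of the value distribution."""
    total = len(values)
    counts = Counter(values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


income_mean = mean(incomes)           # mean makes sense for numeric data
income_var = pvariance(incomes)       # population variance
label_entropy = entropy(repaying)     # entropy only needs category counts

print(is_smaller("pico", "family"), income_mean, income_var, label_entropy)
```

Note that taking the mean of the Repaying column would be meaningless, while its entropy is well defined.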

Types of attributes

Sometimes, data can be transformed from one type to another. For example...

  • Numeric age can be discretized to a categorical attribute with possible values of child, young, adult, old
  • Numeric RGB color codes can be used instead of categorical color names

Name      Age
Ramin S.  9
Mina F.   27
Babak J.  56
Nima D.   31

Name      Age
Ramin S.  child
Mina F.   adult
Babak J.  old
Nima D.   adult
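
A minimal sketch of such a discretization is given below. The slides do not state the cut points, so the thresholds (13, 25, 50) are assumptions chosen only so that the result agrees with the example values above.

```python
def discretize_age(age):
    """Map a numeric age to a category. The cut points are assumed, not given."""
    if age < 13:
        return "child"
    if age < 25:
        return "young"
    if age < 50:
        return "adult"
    return "old"


ages = {"Ramin S.": 9, "Mina F.": 27, "Babak J.": 56, "Nima D.": 31}
age_groups = {name: discretize_age(age) for name, age in ages.items()}

print(age_groups)
```

Discretization loses information: after the transformation, ages 27 and 31 can no longer be told apart.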

Types of attributes

Binary type

Binary data can be considered as a special case of categorical data (true/false, male/female, ...)

Binary data can also be considered as a special case of numeric data (0/1)

Dependency-oriented data

Remember the previous examples (car and loan datasets)

  • The data instances in those examples are not related to each other
  • Each data record can be treated separately

Sometimes, this is not the case, and data instances are inter-related in some way

  • An example is time-series data, where each record has a time attribute
  • The topic of dependency-oriented data is out of the scope of this course

Dependency-oriented data

Example: Air quality of Tehran

Date        CO  O3  PM10
11/22/2014  42  34  52
11/23/2014  44  22  46
11/24/2014  46  29  40
11/25/2014  31  33  44
...

Beyond basic data types

Sometimes an attribute's type goes beyond the basic types mentioned so far. Examples include:

  • Set data
  • Text data

Complex types are usually represented and used as a mix of the basic data types

Beyond basic data types

A set data feature can be represented as multiple binary features

Example: the items in the shopping basket

age  gender  basket
19   female  milk, tissue, cookie
71   male    milk, shampoo
38   male    chips, cigarette, lemonade

This can be represented as:

age  gender  lemonade  milk  cookie  chips  shampoo  tissue  cigarette
19   female  n         y     y       n      n        y       n
71   male    n         y     n       n      y        n       n
38   male    y         n     n       y      n        n       y
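
The conversion from a set-valued feature to binary features can be sketched like this. The column order is alphabetical here, which is a choice of this sketch and differs from the order in the table, and 1/0 is used in place of y/n.

```python
# The three shopping baskets from the example.
baskets = [
    {"milk", "tissue", "cookie"},
    {"milk", "shampoo"},
    {"chips", "cigarette", "lemonade"},
]

# One binary column per distinct item, in alphabetical order.
items = sorted(set().union(*baskets))

# 1 if the basket contains the item, 0 otherwise.
binary_rows = [[1 if item in basket else 0 for item in items]
               for basket in baskets]

print(items)
for row in binary_rows:
    print(row)
```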

Beyond basic data types

Text data can be represented in different ways depending on the requirements

  • Small texts can usually be converted to categorical types (example: movie genres)
  • Longer text segments can be represented by a bag of words (a set)
  • Can also be represented by the frequency of each word in the text
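
The last two representations can be sketched in a few lines. The sample sentence and the very simple tokenizer (lowercase letters and apostrophes only) are assumptions of this example; real text mining pipelines usually tokenize more carefully.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase words (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


text = "Data mining turns data into insight"  # made-up sample sentence
tokens = tokenize(text)

bag_of_words = set(tokens)   # which words occur (a set)
word_freq = Counter(tokens)  # how often each word occurs

print(sorted(bag_of_words))
print(word_freq)
```

The bag of words can then be turned into binary features, and the frequencies into numeric features.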

Beyond basic data types

Such diversity in the possible data types makes mining them usually...

  1. A challenging task
  2. An application-specific task

Thus you often hear application-specific terms for data mining

  • Text mining
  • Web mining
  • Time-series mining
  • Mining social networks
  • And so on...

The data mining process

Data collection

  • The first step of the process
  • May require hardware/software tools or manual labor (Examples)
  • Critically important (How it affects the data mining process)
  • After collection, data is stored somewhere (in a database or warehouse or simply a file)

The data mining process

Data preprocessing

  • The goal of this step is to convert data to a format suitable for the data mining algorithm
  • Collected data is usually not in such format (We need to deal with free-form texts, data items mixed together, erroneous and missing data, ...)

The data mining process

The data preprocessing step itself may involve up to three sub-steps:

  • Feature extraction: Extracting features which are relevant to the data mining process.
  • Data cleaning: Estimation of the missing values, correction of the erroneous ones
  • Data integration: Joining the datasets together (Example)

The output of this step is usually a multidimensional dataset

The data mining process

Data preprocessing example: Web access logs

  • A retailer has a product website
  • Data source 1: The web server generates a text log each time a page is accessed
  • Data source 2: User profiles of the customers
  • Different format of data sources (integration is required)
  • No user name in the log. Just IP (extraction is required)
  • Some noisy log records may not be useable (cleaning is required)
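
The three sub-steps for this example might look roughly like the sketch below. Everything concrete in it is hypothetical: the log line format ("<ip> <path>"), the sample lines, the user ids, and the assumption that profiles can be looked up by IP address. Real web server logs and profile stores differ.

```python
import re

# Hypothetical access log lines in the form "<ip> <path>".
raw_log = [
    "10.0.0.1 /products/42",
    "10.0.0.2 /products/7",
    "??? corrupted line",
    "10.0.0.1 /cart",
]

# Hypothetical profile data: IP address -> user id.
profiles = {"10.0.0.1": "u1", "10.0.0.2": "u2"}

LINE = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3}) (/\S*)$")


def preprocess(raw_log, profiles):
    """Turn raw log lines into records with ip, user and path fields."""
    rows = []
    for line in raw_log:
        match = LINE.match(line)
        if match is None:
            continue                   # cleaning: drop unparseable records
        ip, path = match.groups()      # extraction: pull fields out of the text
        user = profiles.get(ip)        # integration: join with the profile data
        rows.append({"ip": ip, "user": user, "path": path})
    return rows


visits = preprocess(raw_log, profiles)
for visit in visits:
    print(visit)
```

The result is a small multidimensional dataset with one record per usable page access.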

The data mining process

Analytical processing

  • At this step, the prepared data is analyzed in order to reach the final goal
  • Each data mining problem is unique (Thus a challenging task)
  • Requires the skill of breaking the problem down into the "known data mining super-problems" or common data mining tasks

Common data mining tasks

  • Classification
  • Clustering
  • Numerical prediction
  • Association rules mining
  • Outlier detection

Classification

A task that occurs frequently in everyday life

  • A hospital classifies patients into those who are at high, medium or low risk of acquiring a certain illness
  • An email system classifies incoming emails as spam/nonspam

Classification is the problem of identifying to which of a set of categories (classes) a new observation belongs

Classification targets a specific column of the dataset usually referred to as "label"

Classification

Example: predicting acceptability of a car based on its features

  • We build a classification model using the car dataset (How? we'll see later)
  • The model may be in the form of some rules or a decision tree
  • Each new car can be evaluated based on its features

buying  maint  doors  persons  lug_boot  safety  accpt
vhigh   vhigh  2      2        small     med     unacc
low     high   2      4        small     med     acc
low     high   4      4        big       low     unacc
low     low    4      4        med       med     good
low     low    4      4        med       high    vgood
...

Is this car acceptable: (buying=low, maint=high, doors=4, persons=4, lug_boot=small, safety=med)?
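
One very simple way to answer the question, not necessarily the model used later in the course, is a nearest-neighbor lookup: find the known car that matches the new car on the most features and copy its label. The sketch below uses only the five rows shown (the full dataset is larger) and counts matching feature values as its similarity measure, which is an assumption of this example rather than the rules or decision tree mentioned above.

```python
# The five labeled cars shown in the table: (features, label).
# Feature order: buying, maint, doors, persons, lug_boot, safety.
training = [
    (("vhigh", "vhigh", 2, 2, "small", "med"), "unacc"),
    (("low", "high", 2, 4, "small", "med"), "acc"),
    (("low", "high", 4, 4, "big", "low"), "unacc"),
    (("low", "low", 4, 4, "med", "med"), "good"),
    (("low", "low", 4, 4, "med", "high"), "vgood"),
]


def similarity(a, b):
    """Number of features on which two cars have the same value."""
    return sum(x == y for x, y in zip(a, b))


def classify(car):
    """Return the label of the most similar training car (1-nearest neighbor)."""
    _, label = max(training, key=lambda rec: similarity(rec[0], car))
    return label


new_car = ("low", "high", 4, 4, "small", "med")
predicted = classify(new_car)
print(predicted)
```

With these five rows the closest match is the second car (five of six features agree), so the sketch answers "acc". A model built from the full dataset could answer differently.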

Mini-break #2

Numerical prediction

  • In classification, the target of prediction is a categorical attribute
  • If we want to predict a numeric attribute, then the problem is a numerical prediction AKA regression
  • Example: predicting the net profit of a movie based on features such as genre, critic score, budget, etc
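
As a minimal sketch of numerical prediction, the code below fits a straight line by ordinary least squares to a single numeric feature. The budget and profit figures are made up (they follow profit = 2 * budget + 1 exactly) and the function name fit_line is illustrative; a real movie profit model would use many features and real data.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data in arbitrary units.
budgets = [1.0, 2.0, 3.0, 4.0]
profits = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(budgets, profits)

# Predict the numeric target for an unseen budget.
predicted_profit = slope * 5.0 + intercept
print(slope, intercept, predicted_profit)
```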

Clustering

Clustering aims at finding groups of items that are similar
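
To make this concrete, here is a small sketch of one well-known clustering method, k-means, restricted to one-dimensional data. The ages, the two starting centers, and the fixed iteration count are all assumptions of the example; similarity is taken to mean closeness in value.

```python
def kmeans_1d(points, centers, iterations=10):
    """Basic k-means on 1-D data; returns final centers and clusters."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


ages = [18, 20, 22, 55, 58, 61]  # made-up customer ages
centers, clusters = kmeans_1d(ages, centers=[18.0, 61.0])
print(centers, clusters)
```

Unlike classification, no label column is involved: the groups are discovered from the data alone.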

Association rules mining

Aims at finding relationships that exist among the values of variables

IF cheese AND milk THEN bread (probability = 0.7)
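
The number attached to such a rule can be estimated from the baskets. The sketch below assumes the "probability" is what is usually called the confidence of the rule: among baskets that contain cheese and milk, the fraction that also contain bread. The twelve baskets are made up so that this fraction comes out close to 0.7.

```python
# Made-up baskets: 7 with cheese, milk and bread; 3 with cheese and milk only;
# 2 with bread only.
baskets = (
    [{"cheese", "milk", "bread"}] * 7
    + [{"cheese", "milk"}] * 3
    + [{"bread"}] * 2
)


def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)


rule_confidence = confidence({"cheese", "milk"}, {"bread"})
print(round(rule_confidence, 2))
```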

Outlier detection

Aims at finding instances that do not comply with the general behavior or model of the data

Example: finding fraudulent credit card transactions
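
A very simple sketch of the idea: flag values that lie far from the mean, measured in standard deviations. The transaction amounts and the threshold of 2 standard deviations are assumptions of this example, and real fraud detection relies on far richer features than the amount alone.

```python
from statistics import mean, pstdev

# Made-up transaction amounts; one is very different from the rest.
amounts = [20, 25, 22, 19, 24, 21, 23, 500]


def find_outliers(values, threshold=2.0):
    """Return values more than threshold standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]


outliers = find_outliers(amounts)
print(outliers)
```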