CS595
Machine Learning and Social Media
Lecture 3: Machine Learning Overview
Aron Culotta
Assistant Professor
Computer Science
Illinois Institute of Technology
- Dietterich: "Machine Learning"
- Domingos: "A few useful things to know about machine learning"
What is machine learning?
"Study of methods for programming computers to learn."
-- Dietterich
What is machine learning?
Study of systems that "automatically learn programs from data"
-- Domingos
What is machine learning?
A problem-solving technique that solves future problem instances based on
patterns found in past problem instances
What is machine learning?
Often relies on Big Data and statistics
Examples
Notation
- $\vec{x} \in \mathcal{X}$ instance, example, input
- $y \in \mathcal{Y}$ target, class, label, output
- e.g., $y=1$: spam ; $y=0$: not spam
- $f: \mathcal{X} \to \mathcal{Y}$ hypothesis, learner, model, classifier
- e.g., if $\vec{x}$ contains the word "free", then $y = 1$.
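To make the notation concrete, here is a minimal sketch of such a hypothesis $f$ written as a Python function. The tokenization (lowercasing, splitting on whitespace, stripping a few punctuation marks) is an assumption made for illustration; the slide only states the rule itself.

```python
def f(x: str) -> int:
    """Toy hypothesis: predict spam (1) if the email contains the word 'free'."""
    # Lowercase and strip surrounding punctuation so "Free!" still matches "free".
    words = [w.strip("!?.,") for w in x.lower().split()]
    return 1 if "free" in words else 0


print(f("Free money!"))            # spam according to the rule
print(f("How are you?"))           # not spam according to the rule
print(f("Are you free tonight?"))  # also flagged, since the rule only looks for "free"
```

A hand-written rule like this is a valid $f$, but it mislabels any legitimate email that happens to use the word "free", which is one motivation for learning $f$ from data instead.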
Problem types
- Classification
- $\vec{x}$: image of a person ; $y$: gender
- Regression
- $\vec{x}$: image of a person ; $y$: age
- Clustering
- $\vec{x}$: images of people ; $y$: cluster id of people that look similar
- Structured classification
- $\vec{x}$: image of a person ; $\vec{y}$: location of their eyes and ears
- $X$: sequence of images of people ; $Y$: subsequences containing people running
Workflow
- Collect raw data: emails
- Manually categorize them: spam or not
- Vectorize: email -> word counts [features]
- Train / Fit: learn $f$ from the labeled feature vectors
- Collect new raw data
- Predict: compute $f(x)$ for new $x$
Example: Spam Classification
Steps 1 & 2: Collect and categorize
Spam:
Free credit report!
Free money!
Not spam:
Are you free tonight?
How are you?
Step 3: Vectorize
'Free money!'
becomes
free: 1
money: 1
!: 1
?: 0
credit: 0
...
Representation: "Feature engineering is the key" -- Domingos
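The vectorization step can be sketched with the standard library. This is one possible bag-of-words tokenizer, not necessarily the one used in the course: the regular expression and the choice to lowercase are assumptions, and `vectorize` is a name introduced here.

```python
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Map an email to counts of its lowercase words and punctuation marks."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation character.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return Counter(tokens)


x = vectorize("Free money!")
# A Counter returns 0 for features that never occurred, such as "?" and "credit".
print(x["free"], x["money"], x["!"], x["?"], x["credit"])
```

Storing only the nonzero counts is a sparse representation: features absent from an email are implicitly zero.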
Step 4: Train/Fit
Which model to use?
- Naive Bayes
- Logistic Regression
- Decision Tree
- K-Nearest Neighbors
- Support Vector Machines
- ... many many more
Steps 5-6: Predict on new data
Free vacation!
Spam
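As a rough end-to-end illustration of the train and predict steps, the sketch below fits a small multinomial Naive Bayes model to the four example emails and classifies the new one. Naive Bayes is just one of the models listed; the add-one smoothing, the tokenizer, and the decision to ignore words never seen in training are assumptions of this sketch.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text.lower())


def train(emails, labels):
    """Count tokens per class and emails per class."""
    token_counts = {label: Counter() for label in set(labels)}
    class_counts = Counter(labels)
    for email, label in zip(emails, labels):
        token_counts[label].update(tokenize(email))
    vocab = set().union(*token_counts.values())
    return token_counts, class_counts, vocab


def predict(model, email):
    """Return the label with the highest log-probability score."""
    token_counts, class_counts, vocab = model
    n_emails = sum(class_counts.values())
    scores = {}
    for label, counts in token_counts.items():
        total = sum(counts.values())
        score = math.log(class_counts[label] / n_emails)  # log prior
        for tok in tokenize(email):
            if tok in vocab:  # skip tokens never seen in training
                # Add-one (Laplace) smoothing avoids log(0).
                score += math.log((counts[tok] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


emails = ["Free credit report!", "Free money!", "Are you free tonight?", "How are you?"]
labels = ["spam", "spam", "not spam", "not spam"]
model = train(emails, labels)
print(predict(model, "Free vacation!"))
```

With this training set, "free" and "!" are both more frequent in the spam emails, so the new email scores higher under the spam class.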
How do you know if it works?
Simplest machine learning algorithm:
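f = dict()

def train(X, Y):
    # Memorize the label of every training example.
    for x, y in zip(X, Y):
        f[x] = y

def predict(x):
    # Look up the memorized label; raises KeyError for any x not seen in training.
    return f[x]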
Second simplest machine learning algorithm:
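f = dict()

def train(X, Y):
    # Memorize the label of every training example.
    for x, y in zip(X, Y):
        f[x] = y

def predict(x):
    # find_most_similar is not defined on the slide; it should return
    # the training example closest to x.
    x_closest = find_most_similar(x)
    return f[x_closest]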
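One way to fill in `find_most_similar` is sketched below. The similarity measure (Jaccard overlap between word sets) and the helper names `words` and `similarity` are assumptions chosen for illustration; any other notion of similarity could be substituted.

```python
f = {}


def words(x):
    """Lowercase word set of a string, ignoring '!' and '?'."""
    return set(x.lower().replace("!", " ").replace("?", " ").split())


def similarity(a, b):
    """Jaccard similarity: shared words divided by total distinct words."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def train(X, Y):
    for x, y in zip(X, Y):
        f[x] = y


def find_most_similar(x):
    # Search the memorized training examples for the closest one.
    return max(f, key=lambda seen: similarity(x, seen))


def predict(x):
    return f[find_most_similar(x)]


train(["Free credit report!", "Free money!", "Are you free tonight?", "How are you?"],
      [1, 1, 0, 0])
print(predict("Free vacation!"))  # nearest training example is "Free money!"
```

Unlike the pure lookup table, this version returns a prediction for inputs it has never seen, which is the simplest form of generalization.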
Generalization
How accurate will I be on a new, unobserved example?
How do you know if it works?
- Train on data ${\mathcal D_1}$
- Predict on data ${\mathcal D_2}$
- Compute accuracy on ${\mathcal D_2}$.
- Why not ${\mathcal D_1}$?
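A small experiment suggests why accuracy on ${\mathcal D_1}$ is misleading. The sketch below uses a lookup-table learner with a default label for unseen inputs; the default and the two ${\mathcal D_2}$ examples are made up for illustration.

```python
def train(X, Y):
    """Memorize every training example."""
    return dict(zip(X, Y))


def predict(f, x, default=0):
    # Unseen inputs get a default label instead of raising KeyError.
    return f.get(x, default)


def accuracy(f, X, Y):
    return sum(predict(f, x) == y for x, y in zip(X, Y)) / len(Y)


D1_X = ["Free credit report!", "Free money!", "Are you free tonight?", "How are you?"]
D1_Y = [1, 1, 0, 0]
# Hypothetical held-out examples, not from the slides.
D2_X = ["Free vacation!", "See you tonight?"]
D2_Y = [1, 0]

f = train(D1_X, D1_Y)
print("accuracy on D1:", accuracy(f, D1_X, D1_Y))
print("accuracy on D2:", accuracy(f, D2_X, D2_Y))
```

The memorizer is perfect on the data it was trained on and no better than its default guess on new data, so only the ${\mathcal D_2}$ score says anything about generalization.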
How do you know if it works?
- Train on data ${\mathcal D_1}$
- Predict on data ${\mathcal D_2}$
- Compute accuracy on ${\mathcal D_2}$.
- Tweak algorithm / representation
- Repeat
How many times can I do this?
Measuring Generalization
- Cross-validation
- train on 90%, test on 10%, repeat 10 times
- each example appears in a test set exactly once
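The 10-fold procedure can be sketched in plain Python as follows. The function names and the `train_fn` / `predict_fn` interface are assumptions of this sketch, and it assumes there are at least `k` examples so that no fold is empty.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each index is in exactly one test fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def cross_val_accuracy(X, Y, train_fn, predict_fn, k=10):
    """Average test accuracy over k train/test splits."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = train_fn([X[i] for i in train_idx], [Y[i] for i in train_idx])
        correct = sum(predict_fn(model, X[i]) == Y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k
```

Here `train_fn(X, Y)` is expected to return a model and `predict_fn(model, x)` a label, so any of the learners discussed earlier could be plugged in.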
Experimental Design
- Collect data
- Build model
- Compute cross-validation accuracy
- Tune model
- Repeat
- Report accuracy on new data
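One way to have "new data" available for the final report is to set some aside before any tuning begins. The sketch below shows such a hold-out split; the function name, the 20% fraction, and the fixed seed are assumptions, not values from the lecture.

```python
import random


def holdout_split(X, Y, test_fraction=0.2, seed=0):
    """Split data into a development set (for tuning) and a final test set."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(X) * test_fraction)
    test_idx, dev_idx = idx[:n_test], idx[n_test:]

    def pick(data, ids):
        return [data[i] for i in ids]

    return (pick(X, dev_idx), pick(Y, dev_idx)), (pick(X, test_idx), pick(Y, test_idx))


# Toy data: ten numbered examples with alternating labels.
X = list(range(10))
Y = [i % 2 for i in X]
(dev_X, dev_Y), (test_X, test_Y) = holdout_split(X, Y)
# Build, cross-validate, and tune using dev_X / dev_Y only;
# touch test_X / test_Y once, at the very end, to report accuracy.
print(len(dev_X), len(test_X))
```

Because the test portion plays no part in the tune-and-repeat loop, its accuracy is not inflated by the repeated tweaking.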
Discussion Questions
- What is the difference between machine learning and statistics?
Discussion Questions
- What is the impact of making an i.i.d. assumption among examples?
- independent and identically distributed
Discussion Questions
- What is overfitting? How do you know it is happening? How do you fix it?
Discussion Questions
- What is the hypothesis space and why does its size matter?
Discussion Questions
- What is the curse of dimensionality?