CS595

Machine Learning and Social Media

Lecture 3: Machine Learning Overview


Aron Culotta

Assistant Professor

Computer Science

Illinois Institute of Technology

What is machine learning?

"Study of methods for programming computers to learn."

-- Dietterich

What is machine learning?

Study of systems that "automatically learn programs from data"

-- Domingos

What is machine learning?

A problem-solving technique that solves future problem instances based on patterns found in past problem instances

What is machine learning?

Often relies on Big Data and statistics

Examples

spam

money

Notation

Problem types

Workflow

  1. Collect raw data: emails
  2. Manually categorize them: spam or not
  3. Vectorize: email -> word counts [features]
  4. Train / Fit: create $f(x)$
  5. Collect new raw data
  6. Predict: compute $f(x)$ for new $x$

Example: Spam Classification

Steps 1 & 2: Collect and categorize

Spam:

Free credit report!

Free money!

Not spam:

Are you free tonight?

How are you?

Step 3: Vectorize

'Free money!'

becomes

free: 1
money: 1
!: 1
?: 0
credit: 0
...
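
A minimal sketch of the vectorize step, assuming a tokenizer that lowercases the text and treats each punctuation mark as its own token. The names `tokenize`, `vectorize`, and `vocabulary` are illustrative and do not come from the slides:

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def vectorize(text, vocabulary):
    """Map text to a dict of feature -> count over a fixed vocabulary."""
    counts = Counter(tokenize(text))
    return {feature: counts[feature] for feature in vocabulary}

# Assumed vocabulary: just the five features listed on the slide.
vocabulary = ["free", "money", "!", "?", "credit"]
features = vectorize("Free money!", vocabulary)
print(features)  # {'free': 1, 'money': 1, '!': 1, '?': 0, 'credit': 0}
```

Features that never occur in the email still appear with a count of 0, so every email maps to a vector of the same length.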

Representation: "Feature engineering is the key" -- Domingos

Step 4: Train/Fit

Which model to use?

Steps 5-6: Predict on new data

Free vacation!

Spam

How do you know if it works?

Simplest machine learning algorithm:

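f = dict()

def train(X, Y):
    # Memorize the label of every training example.
    for x, y in zip(X, Y):
        f[x] = y

def predict(x):
    # Look up the memorized label; raises KeyError if x was never seen in training.
    return f[x]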

Second simplest machine learning algorithm:

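f = dict()

def train(X, Y):
    # Memorize every training example, as before.
    for x, y in zip(X, Y):
        f[x] = y

def find_most_similar(x):
    # Assumption: x is a string and similarity is the number of shared
    # lowercase words; any other similarity measure could be swapped in.
    words = set(x.lower().split())
    return max(f, key=lambda seen: len(words & set(seen.lower().split())))

def predict(x):
    # Predict the label of the closest memorized example (nearest neighbor).
    x_closest = find_most_similar(x)
    return f[x_closest]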

http://www.scholarpedia.org/article/K-nearest_neighbor
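
As a rough illustration, the nearest-neighbor idea can be applied to the spam example by comparing word-count vectors. This sketch assumes cosine similarity as the similarity measure and a single nearest neighbor. The slides do not specify either choice, and the helper names are hypothetical:

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Word and punctuation counts for one email."""
    return Counter(re.findall(r"\w+|[^\w\s]", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(X, Y):
    """Store each vectorized training example with its label."""
    return [(vectorize(x), y) for x, y in zip(X, Y)]

def predict(model, x):
    """Return the label of the most similar stored example."""
    query = vectorize(x)
    _, label = max(model, key=lambda pair: cosine_similarity(query, pair[0]))
    return label

# Training emails and labels from the spam classification example.
X = ["Free credit report!", "Free money!", "Are you free tonight?", "How are you?"]
Y = ["spam", "spam", "not spam", "not spam"]
model = train(X, Y)
print(predict(model, "Free vacation!"))  # spam
```

Unlike a plain dictionary lookup, this version returns a prediction for an email it has never seen, because it only needs the closest training example rather than an exact match.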

Generalization

How accurate will I be on a new, unobserved example?

How do you know if it works?

  1. Train on data ${\mathcal D_1}$
  2. Predict on data ${\mathcal D_2}$
  3. Compute accuracy on ${\mathcal D_2}$.
    • Why not ${\mathcal D_1}$?
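
One way to see why accuracy on ${\mathcal D_1}$ is misleading is to score a pure memorizer on both sets. This is a sketch under stated assumptions: the fallback label for unseen emails and the second email in ${\mathcal D_2}$ are invented here for illustration:

```python
def train(X, Y):
    """Memorize the label of every training example."""
    return dict(zip(X, Y))

def predict(f, x, default="not spam"):
    """Look up x; fall back to a default label (an assumption) if x is unseen."""
    return f.get(x, default)

def accuracy(f, X, Y):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(predict(f, x) == y for x, y in zip(X, Y))
    return correct / len(X)

# D1: training data from the spam example.
D1_X = ["Free credit report!", "Free money!", "Are you free tonight?", "How are you?"]
D1_Y = ["spam", "spam", "not spam", "not spam"]
# D2: held-out emails; the second one is hypothetical.
D2_X = ["Free vacation!", "See you tonight?"]
D2_Y = ["spam", "not spam"]

f = train(D1_X, D1_Y)
print(accuracy(f, D1_X, D1_Y))  # 1.0: every training example was memorized
print(accuracy(f, D2_X, D2_Y))  # 0.5: the memorizer only guesses on unseen emails
```

The perfect score on ${\mathcal D_1}$ says nothing about new emails, which is why accuracy is measured on ${\mathcal D_2}$.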

How do you know if it works?

  1. Train on data ${\mathcal D_1}$
  2. Predict on data ${\mathcal D_2}$
  3. Compute accuracy on ${\mathcal D_2}$.
  4. Tweak algorithm / representation
  5. Repeat

How many times can I do this?

Measuring Generalization

Experimental Design

  1. Collect data
  2. Build model
  3. Compute cross-validation accuracy
  4. Tune model
  5. Repeat
  6. Report accuracy on new data
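
Step 3 can be sketched in plain Python as follows. This assumes k contiguous folds without shuffling, and uses a hypothetical majority-class baseline as a stand-in for the model. The function names and the toy data are illustrative only:

```python
from collections import Counter

def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

def cross_val_accuracy(X, Y, fit, k):
    """Average held-out accuracy over k folds; fit(X, Y) returns a predict function."""
    accuracies = []
    for train_idx, test_idx in k_fold_splits(len(X), k):
        predict = fit([X[i] for i in train_idx], [Y[i] for i in train_idx])
        correct = sum(predict(X[i]) == Y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k

def fit_majority(X, Y):
    """Hypothetical baseline: always predict the most common training label."""
    majority = Counter(Y).most_common(1)[0][0]
    return lambda x: majority

# Placeholder data: six emails, five of them spam.
X = ["email %d" % i for i in range(6)]
Y = ["spam", "spam", "spam", "spam", "not spam", "spam"]
score = cross_val_accuracy(X, Y, fit_majority, k=3)
print(score)
```

Each example is used for testing exactly once, so the tuning loop in steps 4 and 5 can reuse the same data. The accuracy reported in step 6 should still come from data that was never touched during tuning.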

Discussion Questions

http://scott.fortmann-roe.com/docs/BiasVariance.html
