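Welcome to week one of this specialization. This week you will learn about logistic regression. Concretely, you will implement logistic regression for sentiment analysis on tweets: given a tweet, you will decide whether its sentiment is positive or negative. Specifically, you will extract features from tweets, implement logistic regression from scratch, train it with gradient descent, evaluate it on a test set, and perform error analysis.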
Before submitting your assignment to the AutoGrader, please make sure you are not doing the following:
Adding print statement(s) in the assignment.
If you do any of the following, you will get a Grader not found (or similarly unexpected) error upon submitting your assignment. Before asking for help or debugging the errors in your assignment, check for these first. If this is the case and you don't remember the changes you have made, you can get a fresh copy of the assignment by following these instructions.
Let's get started!
We will be using a data set of tweets. Hopefully you will get more than 99% accuracy.
Run the cell below to load in the packages.
# run this cell to import nltk
import nltk
from os import getcwd
import w1_unittest
nltk.download('twitter_samples')
nltk.download('stopwords')
Download the data needed for this assignment. Check out the documentation for the twitter_samples dataset.
twitter_samples: if you're running this notebook on your local computer, you will need to download it using:
nltk.download('twitter_samples')
stopwords: if you're running this notebook on your local computer, you will need to download it using:
nltk.download('stopwords')
filePath = f"{getcwd()}/../tmp2/"
nltk.data.path.append(filePath)
import numpy as np
import pandas as pd
from nltk.corpus import twitter_samples
from utils import process_tweet, build_freqs
twitter_samples contains subsets of five thousand positive_tweets, five thousand negative_tweets, and the full set of 10,000 tweets.
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
# split the data into two pieces, one for training and one for testing (validation set)
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]
train_x = train_pos + train_neg
test_x = test_pos + test_neg
# combine positive and negative labels
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)
# Print the shapes of the train and test label sets
print("train_y.shape = " + str(train_y.shape))
print("test_y.shape = " + str(test_y.shape))
for y, tweet in zip(ys, tweets):
    for word in process_tweet(tweet):
        pair = (word, y)
        if pair in freqs:
            freqs[pair] += 1
        else:
            freqs[pair] = 1
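The loop above can be wrapped into a small self-contained function. The sketch below is illustrative only: `build_freqs_sketch` and the whitespace tokenizer `simple_tokenize` are hypothetical stand-ins, not the `build_freqs` and `process_tweet` provided in `utils`, and the labels are passed as a plain list rather than an (m, 1) array.

```python
def simple_tokenize(tweet):
    # Hypothetical stand-in for process_tweet: lowercase and split on whitespace.
    return tweet.lower().split()


def build_freqs_sketch(tweets, ys, tokenize=simple_tokenize):
    # Map each (word, label) pair to the number of times it occurs.
    freqs = {}
    for y, tweet in zip(ys, tweets):
        for word in tokenize(tweet):
            pair = (word, y)
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs


# Toy data, made up for illustration.
toy_tweets = ["happy happy day", "sad day"]
toy_labels = [1.0, 0.0]
toy_freqs = build_freqs_sketch(toy_tweets, toy_labels)
print(toy_freqs)
```

Note that the same word is counted separately for each label: "day" appears once under label 1.0 and once under label 0.0.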
# create frequency dictionary
freqs = build_freqs(train_x, train_y)
# check the output
print("type(freqs) = " + str(type(freqs)))
print("len(freqs) = " + str(len(freqs.keys())))
type(freqs) = <class 'dict'>
len(freqs) = 11436
The given function 'process_tweet' tokenizes the tweet into individual words, removes stop words and applies stemming.
# test the function below
print('This is an example of a positive tweet: \n', train_x[0])
print('\nThis is an example of the processed version of the tweet: \n', process_tweet(train_x[0]))
This is an example of a positive tweet:
#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
This is an example of the processed version of the tweet:
['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']
Implement the sigmoid function.
# UNQ_C1 GRADED FUNCTION: sigmoid
def sigmoid(z):
    '''
    Input:
        z: is the input (can be a scalar or an array)
    Output:
        h: the sigmoid of z
    '''
    ### START CODE HERE ###
    # calculate the sigmoid of z
    h = 1/(1+np.exp(-z))
    ### END CODE HERE ###
    return h
# Testing your function
if (sigmoid(0) == 0.5):
    print('SUCCESS!')
else:
    print('Oops!')
# compare floats with a tolerance rather than exact equality
if np.isclose(sigmoid(4.92), 0.9927537604041685):
    print('CORRECT!')
else:
    print('Oops again!')
# Test your function
w1_unittest.test_sigmoid(sigmoid)
Logistic regression takes a regular linear regression, and applies a sigmoid to the output of the linear regression.
Regression: $$z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_N x_N$$ Note that the $\theta$ values are "weights". The Deep Learning Specialization referred to the weights with the 'w' vector. This course uses a different variable, $\theta$, to refer to the weights.
Logistic regression: $$ h(z) = \frac{1}{1+e^{-z}}$$ $$z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_N x_N$$ We will refer to 'z' as the 'logits'.
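As a small numerical illustration of the two formulas above, the sketch below computes the logit $z$ as a dot product and then applies the sigmoid. The weight and feature values are made up for the example, and `sigmoid` is repeated here only to keep the block self-contained.

```python
import numpy as np


def sigmoid(z):
    # Logistic function: maps any real-valued logit into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical weights and features (x_0 = 1 is the bias term).
theta = np.array([[0.0], [0.5], [-0.5]])
x = np.array([[1.0, 4.0, 2.0]])

z = np.dot(x, theta)  # 0.0*1 + 0.5*4 - 0.5*2 = 1.0
h = sigmoid(z)        # probability assigned to the positive class
print(z, h)
```

A logit of 0 corresponds to a probability of exactly 0.5, positive logits give probabilities above 0.5, and negative logits give probabilities below 0.5.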
The cost function used for logistic regression is the average of the log loss across all training examples:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right]\tag{5} $$
The loss function for a single training example is $$ Loss = -1 \times \left( y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right)$$
# verify that when the model predicts close to 1, but the actual label is 0, the loss is a large positive value
-1 * (1 - 0) * np.log(1 - 0.9999) # loss is about 9.2
# verify that when the model predicts close to 0 but the actual label is 1, the loss is a large positive value
-1 * np.log(0.0001) # loss is about 9.2
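The single-example checks above extend to the averaged cost $J$. The following is a minimal sketch, assuming predictions and labels are stored as (m, 1) arrays; `log_loss_cost` is a hypothetical helper name and the values are made up.

```python
import numpy as np


def log_loss_cost(h, y):
    # Average log loss over m examples; h and y are (m, 1) arrays.
    m = y.shape[0]
    return float(-np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m)


y = np.array([[1.0], [0.0]])
confident_right = np.array([[0.9999], [0.0001]])  # predictions agree with the labels
confident_wrong = np.array([[0.0001], [0.9999]])  # predictions contradict the labels

cost_right = log_loss_cost(confident_right, y)
cost_wrong = log_loss_cost(confident_wrong, y)
print(cost_right, cost_wrong)
```

Confident correct predictions give a cost near zero, while confident wrong predictions give a cost of about 9.2, matching the single-example values above.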
To update your weight vector $\theta$, you will apply gradient descent to iteratively improve your model's predictions.
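The gradient of the cost function $J$ with respect to one of the weights $\theta_j$ is: $$\nabla_{\theta_j}J(\theta) = \frac{1}{m} \sum_{i=1}^m \left(h(z(\theta)^{(i)})-y^{(i)}\right)x^{(i)}_j$$
Here 'j' is the index of the weight $\theta_j$, so $x^{(i)}_j$ is the feature associated with weight $\theta_j$.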
To update the weight $\theta_j$, we adjust it by subtracting a fraction of the gradient determined by $\alpha$: $$\theta_j = \theta_j - \alpha \times \nabla_{\theta_j}J(\theta) $$
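One way to sanity-check a gradient formula and the update rule is to compare the analytic gradient of the log-loss cost against a finite-difference estimate, then confirm that a single update lowers the cost. The sketch below does this on made-up data; the helper names `cost` and `gradient`, the data, and the learning rate are assumptions for illustration, not part of the assignment.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cost(x, y, theta):
    # Average log loss J(theta).
    h = sigmoid(np.dot(x, theta))
    m = x.shape[0]
    return float(-np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m)


def gradient(x, y, theta):
    # Vectorized gradient of J: (1/m) * X^T (h - y), one entry per weight.
    h = sigmoid(np.dot(x, theta))
    m = x.shape[0]
    return np.dot(x.T, h - y) / m


# Made-up data: 5 examples, a bias column plus 2 random features.
rng = np.random.default_rng(0)
x = np.hstack([np.ones((5, 1)), rng.random((5, 2))])
y = np.array([[1.0], [0.0], [1.0], [1.0], [0.0]])
theta = np.zeros((3, 1))

analytic = gradient(x, y, theta)

# Central finite differences, one weight at a time.
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.shape[0]):
    step = np.zeros_like(theta)
    step[j, 0] = eps
    numeric[j, 0] = (cost(x, y, theta + step) - cost(x, y, theta - step)) / (2 * eps)

# One gradient descent update with a hypothetical learning rate.
alpha = 0.1
theta_new = theta - alpha * analytic
print(cost(x, y, theta), cost(x, y, theta_new))
```

If the two gradients agree closely and the cost drops after the update, the formula and the sign of the update are consistent.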
Implement the gradient descent function.
# UNQ_C2 GRADED FUNCTION: gradientDescent
def gradientDescent(x, y, theta, alpha, num_iters):
    '''
    Input:
        x: matrix of features which is (m,n+1)
        y: corresponding labels of the input matrix x, dimensions (m,1)
        theta: weight vector of dimension (n+1,1)
        alpha: learning rate
        num_iters: number of iterations you want to train your model for
    Output:
        J: the final cost
        theta: your final weight vector
    Hint: you might want to print the cost to make sure that it is going down.
    '''
    ### START CODE HERE ###
    # get 'm', the number of rows in matrix x
    m = x.shape[0]
    for i in range(0, num_iters):
        # get z, the dot product of x and theta
        z = np.dot(x,theta)
        # get the sigmoid of z
        h = sigmoid(z)
        # calculate the cost function
        J = -1./m*(np.dot(y.transpose(),np.log(h))+np.dot(1.-y.transpose(),np.log(1-h)))
        # update the weights theta
        theta = theta - (alpha/m) * np.dot(x.transpose(),(h-y))
    ### END CODE HERE ###
    # J is a (1,1) array; squeeze it before converting to a Python float
    J = float(np.squeeze(J))
    return J, theta
# Check the function
# Construct a synthetic test case using numpy PRNG functions
np.random.seed(1)
# X input is 10 x 3 with ones for the bias terms
tmp_X = np.append(np.ones((10, 1)), np.random.rand(10, 2) * 2000, axis=1)
# Y Labels are 10 x 1
tmp_Y = (np.random.rand(10, 1) > 0.35).astype(float)
# Apply gradient descent
tmp_J, tmp_theta = gradientDescent(tmp_X, tmp_Y, np.zeros((3, 1)), 1e-8, 700)
print(f"The cost after training is {tmp_J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(tmp_theta)]}")
The cost after training is 0.67094970.
The resulting vector of weights is [4.1e-07, 0.00035658, 7.309e-05]
# Test your function
w1_unittest.test_gradientDescent(gradientDescent)
Implement the extract_features function.
Process the tweet using the process_tweet function and save the list of tweet words.
# UNQ_C3 GRADED FUNCTION: extract_features
def extract_features(tweet, freqs, process_tweet=process_tweet):
    '''
    Input:
        tweet: a string containing one tweet
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
    Output:
        x: a feature vector of dimension (1,3)
    '''
    # process_tweet tokenizes, stems, and removes stopwords
    word_l = process_tweet(tweet)
    # 3 elements in the form of a 1 x 3 vector
    x = np.zeros((1, 3))
    # bias term is set to 1
    x[0,0] = 1
    ### START CODE HERE ###
    # loop through each word in the list of words
    for word in word_l:
        # increment the word count for the positive label 1
        x[0,1] += freqs.get((word, 1.),0)
        # increment the word count for the negative label 0
        x[0,2] += freqs.get((word, 0.),0)
    ### END CODE HERE ###
    assert(x.shape == (1, 3))
    return x
# Check your function
# test 1
# test on training data
tmp1 = extract_features(train_x[0], freqs)
print(tmp1)
[[1.00e+00 3.02e+03 6.10e+01]]
# test 2:
# check for when the words are not in the freqs dictionary
tmp2 = extract_features('blorb bleeeeb bloooob', freqs)
print(tmp2)
[[1. 0. 0.]]
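To see how the three features come together, the sketch below applies the same counting logic to a tiny hand-made dictionary. `toy_freqs` and `features_from_words` are hypothetical names used only for this illustration, and the input is an already-processed word list so that the block does not depend on `process_tweet`.

```python
import numpy as np

# Hypothetical (word, label) counts, not taken from the real freqs dictionary.
toy_freqs = {("happi", 1.0): 3, ("happi", 0.0): 1, ("sad", 0.0): 2}


def features_from_words(words, freqs):
    # Build [bias, sum of positive counts, sum of negative counts].
    x = np.zeros((1, 3))
    x[0, 0] = 1
    for word in words:
        x[0, 1] += freqs.get((word, 1.0), 0)  # 0 if the pair was never seen
        x[0, 2] += freqs.get((word, 0.0), 0)
    return x


x_toy = features_from_words(["happi", "sad", "blorb"], toy_freqs)
print(x_toy)  # [[1. 3. 3.]]
```

Because the loop runs over every word occurrence, a word repeated in a tweet contributes its counts once per occurrence, and unseen words such as "blorb" contribute nothing.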
# Test your function
w1_unittest.test_extract_features(extract_features, freqs)
To train the model, stack the features for all training examples into a matrix X and call gradientDescent, which you've implemented above.
This section is given to you. Please read it for understanding and run the cell.
# collect the features 'x' and stack them into a matrix 'X'
X = np.zeros((len(train_x), 3))
for i in range(len(train_x)):
    X[i, :] = extract_features(train_x[i], freqs)
# training labels corresponding to X
Y = train_y
# Apply gradient descent
J, theta = gradientDescent(X, Y, np.zeros((3, 1)), 1e-9, 1500)
print(f"The cost after training is {J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(theta)]}")
Expected Output:
The cost after training is 0.22522315.
The resulting vector of weights is [6e-08, 0.00053818, -0.0005583]
It is time for you to test your logistic regression function on some new input that your model has not seen before.
Implement predict_tweet.
Predict whether a tweet is positive or negative.
# UNQ_C4 GRADED FUNCTION: predict_tweet
def predict_tweet(tweet, freqs, theta):
    '''
    Input:
        tweet: a string
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
        theta: (3,1) vector of weights
    Output:
        y_pred: the probability that the tweet is positive
    '''
    ### START CODE HERE ###
    # extract the features of the tweet and store it into x
    x = extract_features(tweet,freqs)
    # make the prediction using x and theta
    y_pred = sigmoid(np.dot(x,theta))
    ### END CODE HERE ###
    return y_pred
# Run this cell to test your function
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.', 'great', 'great great', 'great great great', 'great great great great']:
    print( '%s -> %f' % (tweet, predict_tweet(tweet, freqs, theta)))
Expected Output:
I am happy -> 0.519275
I am bad -> 0.494347
this movie should have been great. -> 0.515979
great -> 0.516065
great great -> 0.532096
great great great -> 0.548062
great great great great -> 0.563929
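The steadily increasing probabilities for repeated words follow from how the features are built: each repetition adds the word's counts to the feature sums again, which scales the logit. The sketch below illustrates this with the rounded weights from the expected training output and made-up counts for a single word; the counts are assumptions, not values from the real `freqs` dictionary.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Weights rounded as in the expected training output above.
theta = np.array([[6e-08], [0.00053818], [-0.0005583]])

# Hypothetical counts for one word: 120 positive and 10 negative occurrences.
pos_count, neg_count = 120, 10

probs = []
for repeats in range(1, 5):
    # Repeating the word multiplies both count features by the number of repeats.
    x = np.array([[1.0, repeats * pos_count, repeats * neg_count]])
    probs.append(float(sigmoid(np.dot(x, theta))[0, 0]))

print(probs)
```

Since the positive count outweighs the negative count here, the logit grows with each repetition and the predicted probability moves further above 0.5.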
# Feel free to check the sentiment of your own tweet below
my_tweet = 'It is a beautiful day'
predict_tweet(my_tweet, freqs, theta)
# Test your function
w1_unittest.test_predict_tweet(predict_tweet, freqs, theta)
After training your model using the training set above, check how your model might perform on real, unseen data, by testing it against the test set.
Implement test_logistic_regression.
# UNQ_C5 GRADED FUNCTION: test_logistic_regression
def test_logistic_regression(test_x, test_y, freqs, theta, predict_tweet=predict_tweet):
    """
    Input:
        test_x: a list of tweets
        test_y: (m, 1) vector with the corresponding labels for the list of tweets
        freqs: a dictionary with the frequency of each pair (or tuple)
        theta: weight vector of dimension (3, 1)
    Output:
        accuracy: (# of tweets classified correctly) / (total # of tweets)
    """
    ### START CODE HERE ###
    # the list for storing predictions
    y_hat = []
    for tweet in test_x:
        # get the label prediction for the tweet
        y_pred = predict_tweet(tweet, freqs, theta)
        if y_pred > 0.5:
            # append 1.0 to the list
            y_hat.append(1.0)
        else:
            # append 0.0 to the list
            y_hat.append(0.0)
    # With the above implementation, y_hat is a list, but test_y is (m,1) array
    # convert both to one-dimensional arrays in order to compare them using the '==' operator
    accuracy = (np.asarray(y_hat) == np.squeeze(test_y)).sum() / len(test_x)
    ### END CODE HERE ###
    return accuracy
tmp_accuracy = test_logistic_regression(test_x, test_y, freqs, theta)
print(f"Logistic regression model's accuracy = {tmp_accuracy:.4f}")
0.9950
Pretty good!
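The accuracy computation can also be written without an explicit loop once the predicted probabilities are available as an array. The sketch below uses made-up probabilities and labels, and also shows why the shapes must match before comparing with `==`.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels, both of shape (m, 1).
probs = np.array([[0.9], [0.2], [0.6], [0.4]])
labels = np.array([[1.0], [0.0], [0.0], [0.0]])

# Threshold at 0.5 to get hard predictions, then average the matches.
preds = (probs > 0.5).astype(float)
accuracy = float(np.mean(preds == labels))
print(accuracy)  # 0.75

# Pitfall: comparing a (m,) array with a (m, 1) array broadcasts to (m, m).
mismatched = preds.ravel() == labels
print(mismatched.shape)  # (4, 4)
```

This broadcasting behavior is the reason the graded function squeezes test_y to one dimension before the comparison.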
# Test your function
w1_unittest.unittest_test_logistic_regression(test_logistic_regression, freqs, theta)
In this part you will see some tweets that your model misclassified. Why do you think the misclassifications happened? Specifically what kind of tweets does your model misclassify?
# Some error analysis done for you
print('Label Predicted Tweet')
for x,y in zip(test_x,test_y):
    y_hat = predict_tweet(x, freqs, theta)
    if np.abs(y - (y_hat > 0.5)) > 0:
        print('THE TWEET IS:', x)
        print('THE PROCESSED TWEET IS:', process_tweet(x))
        print('%d\t%0.8f\t%s' % (y, y_hat, ' '.join(process_tweet(x)).encode('ascii', 'ignore')))
Later in this specialization, we will see how we can use deep learning to improve the prediction performance.
# Feel free to change the tweet below
my_tweet = 'This is a ridiculously bright movie. The plot was terrible and I was sad until the ending!'
print(process_tweet(my_tweet))
y_hat = predict_tweet(my_tweet, freqs, theta)
print(y_hat)
if y_hat > 0.5:
    print('Positive sentiment')
else:
    print('Negative sentiment')