CSE258 Homework 1

Name: Yi Rong
PID: REDACTED
Email: yrong@ucsd.edu

Preparation

  • Activate the virtualenv
  • Install the necessary dependencies for this homework
  • Import packages
  • Global functions
# Setup virtual env
!source venv/bin/activate
!pip3 install scipy sklearn matplotlib
Requirement already satisfied: scipy in ./venv/lib/python3.8/site-packages (1.7.1)
Requirement already satisfied: sklearn in ./venv/lib/python3.8/site-packages (0.0)
Requirement already satisfied: matplotlib in ./venv/lib/python3.8/site-packages (3.4.3)
Requirement already satisfied: numpy<1.23.0,>=1.16.5 in ./venv/lib/python3.8/site-packages (from scipy) (1.21.2)
Requirement already satisfied: scikit-learn in ./venv/lib/python3.8/site-packages (from sklearn) (1.0)
Requirement already satisfied: pillow>=6.2.0 in ./venv/lib/python3.8/site-packages (from matplotlib) (8.3.2)
Requirement already satisfied: cycler>=0.10 in ./venv/lib/python3.8/site-packages (from matplotlib) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in ./venv/lib/python3.8/site-packages (from matplotlib) (1.3.2)
Requirement already satisfied: pyparsing>=2.2.1 in ./venv/lib/python3.8/site-packages (from matplotlib) (2.4.7)
Requirement already satisfied: python-dateutil>=2.7 in ./venv/lib/python3.8/site-packages (from matplotlib) (2.8.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in ./venv/lib/python3.8/site-packages (from scikit-learn->sklearn) (3.0.0)
Requirement already satisfied: joblib>=0.11 in ./venv/lib/python3.8/site-packages (from scikit-learn->sklearn) (1.1.0)
Requirement already satisfied: six in ./venv/lib/python3.8/site-packages (from cycler>=0.10->matplotlib) (1.16.0)
import numpy as np
import dateutil.parser
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn import linear_model
import matplotlib.pyplot as plt
 
# Each line of the data file is a Python dict literal, so eval() parses it
def parse_data(fname):
    for line in open(fname):
        yield eval(line)
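
Since eval() executes arbitrary code, a safer variant (just a sketch; parse_data_safe is a hypothetical name, and we assume each line is a Python dict literal, as the use of eval suggests) could use ast.literal_eval:

import ast

def parse_data_safe(fname):
    # literal_eval only accepts Python literals, so a malformed
    # or malicious line raises an error instead of executing code
    for line in open(fname):
        yield ast.literal_eval(line)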

Regression (week 1)

First, using the book review data, let’s see whether ratings can be predicted as a function of review length, or by using temporal features associated with a review.

Regression helper functions

# Fit features (X) and labels (Y) with least-squares regression; return theta and MSE
def linear_regression(X, Y):
    theta, residuals, rank, s = np.linalg.lstsq(X, Y, rcond=None)  # rcond=None silences the warning
    mse = np.mean((np.asarray(Y) - np.asarray(X) @ theta) ** 2)
    return theta, mse
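
For reference, np.linalg.lstsq returns the least-squares solution; when the columns of X are linearly independent it coincides with the normal-equation solution:

$$\theta = (X^\top X)^{-1} X^\top Y, \qquad \mathrm{MSE} = \frac{1}{N} \left\lVert Y - X\theta \right\rVert_2^2$$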

Q2

book_reviews = list(parse_data("fantasy_10000.json"))
book_reviews[0]
{'user_id': '8842281e1d1347389f2ab93d60773d4d',
 'book_id': '18245960',
 'review_id': 'dfdbb7b0eb5a7e4c26d59a937e2e5feb',
 'rating': 5,
 'review_text': 'This is a special book. It started slow for about the first third, then in the middle third it started to get interesting, then the last third blew my mind. This is what I love about good science fiction - it pushes your thinking about where things can go. \n It is a 2015 Hugo winner, and translated from its original Chinese, which made it interesting in just a different way from most things I\'ve read. For instance the intermixing of Chinese revolutionary history - how they kept accusing people of being "reactionaries", etc. \n It is a book about science, and aliens. The science described in the book is impressive - its a book grounded in physics and pretty accurate as far as I could tell. Though when it got to folding protons into 8 dimensions I think he was just making stuff up - interesting to think about though. \n But what would happen if our SETI stations received a message - if we found someone was out there - and the person monitoring and answering the signal on our side was disillusioned? That part of the book was a bit dark - I would like to think human reaction to discovering alien civilization that is hostile would be more like Enders Game where we would band together. \n I did like how the book unveiled the Trisolaran culture through the game. It was a smart way to build empathy with them and also understand what they\'ve gone through across so many centuries. And who know a 3 body problem was an unsolvable math problem? But I still don\'t get who made the game - maybe that will come in the next book. \n I loved this quote: \n "In the long history of scientific progress, how many protons have been smashed apart in accelerators by physicists? How many neutrons and electrons? Probably no fewer than a hundred million. Every collision was probably the end of the civilizations and intelligences in a microcosmos. In fact, even in nature, the destruction of universes must be happening at every second--for example, through the decay of neutrons. Also, a high-energy cosmic ray entering the atmosphere may destroy thousands of such miniature universes...."',
 'date_added': 'Sun Jul 30 07:44:10 -0700 2017',
 'date_updated': 'Wed Aug 30 00:00:26 -0700 2017',
 'read_at': 'Sat Aug 26 12:05:52 -0700 2017',
 'started_at': 'Tue Aug 15 13:23:18 -0700 2017',
 'n_votes': 28,
 'n_comments': 1}
data = [d for d in book_reviews if 'review_text' in d]
X = [[1, len(d['review_text'])] for d in data]
Y = [d['rating'] for d in data]
list(zip(X, Y))[:10] # test print 10 pairs
[([1, 2086], 5),
 ([1, 1521], 5),
 ([1, 1519], 5),
 ([1, 1791], 4),
 ([1, 1762], 3),
 ([1, 470], 5),
 ([1, 823], 5),
 ([1, 532], 5),
 ([1, 616], 4),
 ([1, 548], 5)]
theta, mse = linear_regression(X, Y)
print("Theta0 = %.6f, Theta1 = %.8f" % (theta[0], theta[1]))
print("MSE = %.6f" % mse)
Theta0 = 3.685681, Theta1 = 0.00006874
MSE = 1.552209

Q3

date_added = [dateutil.parser.parse(d['date_added']) for d in data]
weekdays = [t.weekday() for t in date_added]
years = [t.year for t in date_added]
set(weekdays), set(years)
({0, 1, 2, 3, 4, 5, 6},
 {2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017})
w_enc = OneHotEncoder(drop="first", dtype=int)
ohe_weekdays = w_enc.fit_transform([[w] for w in weekdays]).toarray()
y_enc = OneHotEncoder(drop="first", dtype=int)
ohe_years = y_enc.fit_transform([[y] for y in years]).toarray()
 
X = []
for i in range(len(data)):
    X.append([1, len(data[i]['review_text']), *ohe_weekdays[i], *ohe_years[i]])
print("Feature vectors for the first two examples:")
X[:2]
Feature vectors for the first two examples:

[[1, 2086, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
 [1, 1521, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]]
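
To see what drop="first" does, here is a minimal sketch on toy weekday values (not the homework data); the dropped first category becomes the all-zeros reference level:

demo_enc = OneHotEncoder(categories=[list(range(7))], drop="first", dtype=int)
print(demo_enc.fit_transform([[0], [1], [6]]).toarray())
# [[0 0 0 0 0 0]   <- weekday 0 maps to the all-zeros reference
#  [1 0 0 0 0 0]
#  [0 0 0 0 0 1]]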

Q4

X_direct = []
for i in range(len(data)):
    X_direct.append([1, len(data[i]['review_text']), weekdays[i], years[i]])
_, mse = linear_regression(X_direct, Y)
print("Using weekday and year directly as features:")
print("MSE = %.6f" % mse)
Using weekday and year directly as features:
MSE = 1.536774
X_ohe = X
_, mse = linear_regression(X_ohe, Y)
print("Using One-Hot Encoding from Q3:")
print("MSE = %.6f" % mse)
Using One-Hot Encoding from Q3:
MSE = 1.512358
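
The one-hot encoding fits better because it gives each weekday and each year its own coefficient, instead of forcing ratings to vary linearly with the weekday index or the year number.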

Q5

print("Using weekday and year directly as features:")
X_train, X_test, Y_train, Y_test = train_test_split(X_direct, Y, test_size=0.5, random_state=258)
_, mse_train = linear_regression(X_train, Y_train)
_, mse_test = linear_regression(X_test, Y_test)
print("MSE on train set = %.6f" % mse_train)
print("MSE on test set = %.6f" % mse_test)
Using weekday and year directly as features:
MSE on train set = 1.526952
MSE on test set = 1.545680
print("One-Hot Encoding from Q3:")
X_train, X_test, Y_train, Y_test = train_test_split(X_ohe, Y, test_size=0.5, random_state=258)
_, mse_train = linear_regression(X_train, Y_train)
_, mse_test = linear_regression(X_test, Y_test)
print("MSE on train set = %.6f" % mse_train)
print("MSE on test set = %.6f" % mse_test)
One-Hot Encoding from Q3:
MSE on train set = 1.499324
MSE on test set = 1.518546

Q6

For the trivial predictor $\hat{y} = \theta_0$:

$$\mathrm{MAE}(\theta_0) = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \theta_0 \right|$$

Taking the derivative of MAE w.r.t. $\theta_0$ (defined wherever $\theta_0 \neq y_i$):

$$\frac{\partial\, \mathrm{MAE}}{\partial \theta_0} = \frac{1}{N} \left( \left|\{ i : y_i < \theta_0 \}\right| - \left|\{ i : y_i > \theta_0 \}\right| \right)$$

Setting the derivative to 0 to find the best value for $\theta_0$, we get the condition that the two sets $\{ i : y_i < \theta_0 \}$ and $\{ i : y_i > \theta_0 \}$ should have the same size/cardinality in order for the derivative to be zero, i.e.

$$\left|\{ i : y_i < \theta_0 \}\right| = \left|\{ i : y_i > \theta_0 \}\right|$$

Therefore, by definition $\theta_0$ should be the median of the labels $y$, in which case the MAE value is minimized.
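
As a quick numerical check (a toy sketch; y_toy is made-up data, not the homework labels), the MAE is indeed minimized at the median:

y_toy = np.array([1, 2, 2, 4, 5])
thetas = np.linspace(0, 6, 601)
maes = [np.mean(np.abs(y_toy - t)) for t in thetas]
print(thetas[np.argmin(maes)], np.median(y_toy))  # both ≈ 2.0, the median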

Classification (week 2)

In this question, using the beer review data, we’ll try to predict ratings (positive or negative) based on characteristics of beer reviews. Load the 50,000 beer review dataset, and construct a label vector by considering whether a review score is four or above.

Q7

beer_reviews = list(parse_data("beer_50000.json"))
beer_reviews[0]
{'review/appearance': 2.5,
 'beer/style': 'Hefeweizen',
 'review/palate': 1.5,
 'review/taste': 1.5,
 'beer/name': 'Sausa Weizen',
 'review/timeUnix': 1234817823,
 'beer/ABV': 5.0,
 'beer/beerId': '47986',
 'beer/brewerId': '10325',
 'review/timeStruct': {'isdst': 0,
  'mday': 16,
  'hour': 20,
  'min': 57,
  'sec': 3,
  'mon': 2,
  'year': 2009,
  'yday': 47,
  'wday': 0},
 'review/overall': 1.5,
 'review/text': 'A lot of foam. But a lot.\tIn the smell some banana, and then lactic and tart. Not a good start.\tQuite dark orange in color, with a lively carbonation (now visible, under the foam).\tAgain tending to lactic sourness.\tSame for the taste. With some yeast and banana.',
 'user/profileName': 'stcules',
 'review/aroma': 2.0}
data = [d for d in beer_reviews if 'review/overall' in d and 'review/text' in d]
X = [[1, len(d['review/text'])] for d in data]
Y = [1 if d['review/overall'] >= 4 else 0 for d in data]
list(zip(X, Y))[:10] # test print 10 pairs
[([1, 262], 0),
 ([1, 338], 0),
 ([1, 396], 0),
 ([1, 401], 0),
 ([1, 1145], 1),
 ([1, 728], 0),
 ([1, 471], 0),
 ([1, 853], 0),
 ([1, 472], 1),
 ([1, 1035], 1)]
lr = linear_model.LogisticRegression(C=1.0, class_weight='balanced')
lr.fit(X, Y)
y_pred = lr.predict(X)
TP = sum(np.logical_and(y_pred, Y))
FP = sum(np.logical_and(y_pred, np.logical_not(Y)))
TN = sum(np.logical_and(np.logical_not(y_pred), np.logical_not(Y)))
FN = sum(np.logical_and(np.logical_not(y_pred), Y))
 
TPR = TP / (TP + FN)
TNR = TN / (TN + FP)
FPR = 1 - TNR
FNR = 1 - TPR
 
BER = 1 - 0.5 * (TP / (TP + FN) + TN / (TN + FP))
 
print("True Positive: %d\nTrue Negative: %d\nFalse Positive: %d\nFalse Negative: %d\n" % (TP, FP, TN, FN))
print("TPR: %.6f\nTNR: %.6f\nFPR: %.6f\nFNR: %.6f\n" % (TPR, FPR, TNR, FNR))
print("Balanced Error Rate: %.6f" % BER)
True Positive: 14201
True Negative: 5885
False Positive: 10503
False Negative: 19411

TPR: 0.422498
TNR: 0.359104
FPR: 0.640896
FNR: 0.577502

Balanced Error Rate: 0.468303
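
Equivalently, the balanced error rate averages the two per-class error rates:

$$\mathrm{BER} = \frac{1}{2}\left(\mathrm{FPR} + \mathrm{FNR}\right) = 1 - \frac{1}{2}\left(\mathrm{TPR} + \mathrm{TNR}\right)$$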

Q8

proba = lr.predict_proba(X)
proba = list(zip(proba, Y))
proba.sort(key=lambda x: x[0][1], reverse=True)  # sort by P(positive), descending
proba[:10]
[(array([0.19459931, 0.80540069]), 1),
 (array([0.19643684, 0.80356316]), 1),
 (array([0.20622631, 0.79377369]), 1),
 (array([0.21202257, 0.78797743]), 1),
 (array([0.21655241, 0.78344759]), 1),
 (array([0.22127384, 0.77872616]), 1),
 (array([0.22724752, 0.77275248]), 0),
 (array([0.23156561, 0.76843439]), 1),
 (array([0.23498476, 0.76501524]), 1),
 (array([0.23606837, 0.76393163]), 0)]
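
With reviews sorted by the predicted probability of the positive class, precision@K is the fraction of the top K that are actually labeled positive:

$$\mathrm{Precision@}K = \frac{1}{K}\,\left|\{\, i \le K : y_{(i)} = 1 \,\}\right|$$
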
K = np.array([i for i in range(1, 10000, 200)])
prec_at_k = np.array([sum([x[1] for x in proba[:i]]) / i for i in K])
plt.plot(K, prec_at_k, marker=".", markersize=3)
plt.xlabel('K')
plt.ylabel('Precision @ K')
plt.show()
[Figure: Precision @ K vs. K]

Q9
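
Here we instead rank reviews by the classifier’s confidence |P(y=1) - 0.5| and report, among the K most confident predictions, the fraction that are correct, plotted against the Q8 curve.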

proba = lr.predict_proba(X)
pred_y = [1 if p[1] > p[0] else 0 for p in proba]
proba = list(zip(proba, Y, pred_y))
proba.sort(key=lambda x: abs(x[0][1] - 0.5), reverse=True)  # most confident predictions first
plt.plot(K, prec_at_k, label='Q8', marker=".", markersize=3)  # prec_at_k still holds the Q8 curve here
prec_at_k = np.array([sum(map(lambda x: 1 if x[1] == x[2] else 0, proba[:i])) / i for i in K])
plt.plot(K, prec_at_k, label='Q9', marker=".", markersize=3)
plt.xlabel('K')
plt.ylabel('Precision @ K')
plt.legend()
plt.show()

[Figure: Precision @ K, Q8 vs. Q9]

prec_at_k = np.array([sum(map(lambda x : 1 if x[1] == x[2] else 0, proba[:i])) / i for i in [1, 100, 10000]])
print("Precision @ K ∈ {1, 100, 10000}:\nK=1: %.6f\nK=100: %.6f\nK=10000: %.6f\n" % (prec_at_k[0], prec_at_k[1], prec_at_k[2]))
Precision @ K ∈ {1, 100, 10000}:
K=1: 1.000000
K=100: 0.750000
K=10000: 0.619600