CAP 5625: Programming Assignment 3

Due on Canvas by Friday, November 10, 2023 at 11:59pm

Preliminary instructions

You may consult with other students currently taking CAP 5625 in your section at FAU on this programming assignment. If you do consult with others, then you must indicate this by providing their names with your submitted assignment. However, all analyses must be performed independently, all source code must be written independently, and all students must turn in their own independent assignment. Note that for this assignment, you may choose to pair up with one other student in your section of CAP 5625 and submit a joint assignment. If you choose to do this, then both your names must be associated with the assignment and you will each receive the same grade.

Though it should be unnecessary to state in a graduate class, I am reminding you that you may not turn in code (partial or complete) that is written or inspired by others, including code from other students, websites, past code that I release from prior assignments in this class or from past semesters in other classes I teach, or any other source that would constitute an academic integrity violation. All instances of academic integrity violations will receive a zero on the assignment and will be referred to the Department Chair and College Dean for further administrative action. A second offense could lead to dismissal from the University and any offense could result in ineligibility for Departmental Teaching Assistant and Research Assistant positions.

You may choose to use whatever programming language you want. However, you must provide clear instructions on how to compile and/or run your source code. I recommend using a modern language, such as Python, R, or MATLAB, as learning these languages can help you if you were to enter the machine learning or artificial intelligence field in the future. All analyses performed and algorithms run must be written from scratch. That is, you may not use a library that can perform coordinate descent, cross validation, elastic net, least squares regression, optimization, etc. to successfully complete this programming assignment (though you may reuse your relevant code from Programming Assignments 1 and 2). The goal of this assignment is not to learn how to use particular libraries of a language, but to understand how key methods in statistical machine learning are implemented. With that stated, I will provide 5% extra credit if you additionally implement the assignment using built-in statistical or machine learning libraries (see Deliverable 6 at end of the document). Note that credit for deliverables that request graphs, discussion of results, or specific values will not be given if the instructor must run your code to obtain these graphs, results, or specific values.

Brief overview of assignment

In this assignment you will still be analyzing the same credit card data from $N = 400$ training observations that you examined in Programming Assignment 1. The goal is to fit a model that can predict credit balance based on $p = 9$ features describing an individual, which include an individual’s income, credit limit, credit rating, number of credit cards, age, education level, gender, student status, and marriage status. Specifically, you will perform a penalized (regularized) least squares fit of a linear model using elastic net, with the model parameters obtained by coordinate descent. Elastic net will permit you to provide simultaneous parameter shrinkage (tuning parameter $\lambda \geq 0$) and feature selection (tuning parameter $\alpha \in [0,1]$). The two tuning parameters $\lambda$ and $\alpha$ will be chosen using five-fold cross validation, and the best-fit model parameters will be inferred on the training dataset conditional on an optimal pair of tuning parameters.

Data

Data for these observations are given in Credit_N400_p9.csv, with individuals labeled on each row (rows 2 through 401), and input features and response given on the columns (with the first row representing a header for each column). There are six quantitative features, given by columns labeled “Income”, “limit”, “Rating”, “Cards”, “Age”, and “Education”, and three qualitative features with two levels labeled “Gender”, “Student”, and “Married”.

Detailed description of the task

Recall that the task of performing an elastic net fit to training data $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ is to minimize the cost function

$$J(\beta, \lambda, \alpha) = \sum_{i=1}^{N} \left( y_i - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2 + \lambda \left( \alpha \sum_{j=1}^{p} \beta_j^2 + (1 - \alpha) \sum_{j=1}^{p} |\beta_j| \right)$$

where $y_i$ is a centered response and where the $p$ input features are standardized (i.e., centered and divided by their standard deviation). Note that we cannot use gradient descent to minimize this cost function, as the component $\sum_{j=1}^{p} |\beta_j|$ of the penalty is not differentiable. Instead, we use coordinate descent, where we update each parameter $k$, $k = 1, 2, \ldots, p$, in turn, keeping all other parameters constant, and using sub-gradient rather than gradient calculations. To implement this algorithm, depending on whether your chosen language can quickly compute vectorized operations, you may implement coordinate descent using either Algorithm 1 or Algorithm 2 below (choose whichever you are more comfortable implementing). Note that in languages like R, Python, or Matlab, Algorithm 2 (which would be implemented by several nested loops) may be much slower than Algorithm 1. Also note that if you are implementing Algorithm 1 using Python, use numpy arrays instead of Pandas data frames for computational speed. For this assignment, assume that we will reach the minimum of the cost function within a fixed number of steps, with the number of iterations being 1000.
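As a rough illustration of the setup above, the following Python sketch standardizes the inputs, centers the response, and evaluates the cost function for a given parameter vector. The function names `standardize` and `elastic_net_cost` are hypothetical, and the use of the population standard deviation (`ddof=0`) is an assumption, since the text does not say which estimator to use.

```python
import numpy as np

def standardize(X, y):
    """Center y; center each column of X and divide by its standard deviation."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Assumption: population standard deviation (ddof=0).
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    y_ctr = y - y.mean()
    return X_std, y_ctr

def elastic_net_cost(X, y, beta, lam, alpha):
    """Evaluate J(beta, lambda, alpha) for standardized X and centered y."""
    residual = y - X @ beta
    rss = np.sum(residual ** 2)
    penalty = lam * (alpha * np.sum(beta ** 2)
                     + (1 - alpha) * np.sum(np.abs(beta)))
    return rss + penalty
```

Tracking this cost across iterations is one way to check that a coordinate descent implementation is behaving sensibly, since each coordinate update should not increase it.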

Algorithm 1 (vectorized):

Step 1. Fix tuning parameters $\lambda$ and $\alpha$

Step 2. Generate $N$-dimensional centered response vector $\mathbf{y}$ and $N \times p$ standardized (centered and scaled to have unit standard deviation) design matrix $\mathbf{X}$

Step 3. Precompute $b_k$, $k = 1, 2, \ldots, p$, as

$$b_k = \sum_{i=1}^{N} x_{ik}^2$$

Step 4. Randomly initialize the parameter vector $\beta = [\beta_1, \beta_2, \ldots, \beta_p]$

Step 5. For each $k$, $k = 1, 2, \ldots, p$: compute

$$a_k = \mathbf{x}_k^T (\mathbf{y} - \mathbf{X}\beta + \mathbf{x}_k \beta_k)$$

and set

$$\beta_k = \frac{\operatorname{sign}(a_k) \left( |a_k| - \frac{\lambda(1 - \alpha)}{2} \right)_+}{b_k + \lambda\alpha}$$

Step 6. Repeat Step 5 for 1000 iterations or until convergence (vector $\beta$ does not change)

Step 7. Set the last updated parameter vector as $\hat{\beta} = [\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_p]$
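The steps of Algorithm 1 might be sketched in Python with numpy as follows. This is only one possible reading of the algorithm: the names `soft_threshold` and `coordinate_descent`, the `seed` argument, and the numerical tolerance `tol` used to decide that the parameter vector "does not change" are all assumptions made for illustration. It expects `X` to be standardized and `y` to be centered already.

```python
import numpy as np

def soft_threshold(a, t):
    """Return sign(a) * (|a| - t)_+ with the convention sign(0) = 1."""
    sign = -1.0 if a < 0 else 1.0
    return sign * max(abs(a) - t, 0.0)

def coordinate_descent(X, y, lam, alpha, n_iter=1000, tol=1e-8, seed=0):
    """Elastic net fit by coordinate descent (sketch of Algorithm 1)."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    b = np.sum(X ** 2, axis=0)                # Step 3: b_k = sum_i x_ik^2
    beta = rng.uniform(-1.0, 1.0, size=p)     # Step 4: small random start
    for _ in range(n_iter):                   # Step 6: at most n_iter sweeps
        beta_old = beta.copy()
        for k in range(p):                    # Step 5: update each coordinate
            x_k = X[:, k]
            a_k = x_k @ (y - X @ beta + x_k * beta[k])
            beta[k] = (soft_threshold(a_k, lam * (1 - alpha) / 2)
                       / (b[k] + lam * alpha))
        # Assumption: treat a change below tol as "does not change".
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta                               # Step 7: final estimate
```

Two sanity checks follow from the cost function: with $\lambda = 0$ the result should agree with ordinary least squares, and with $\alpha = 1$ it should agree with the closed-form ridge solution $(\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$.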

Algorithm 2 (non-vectorized):

Step 1. Fix tuning parameters $\lambda$ and $\alpha$

Step 2. Generate $N$-dimensional centered response vector $\mathbf{y}$ and $N \times p$ standardized (centered and scaled to have unit standard deviation) design matrix $\mathbf{X}$

Step 3. Precompute $b_k$, $k = 1, 2, \ldots, p$, as

$$b_k = \sum_{i=1}^{N} x_{ik}^2$$

Step 4. Randomly initialize the parameter vector $\beta = [\beta_1, \beta_2, \ldots, \beta_p]$

Step 5. For each $k$, $k = 1, 2, \ldots, p$: compute

$$a_k = \sum_{i=1}^{N} x_{ik} \left( y_i - \sum_{\substack{j=1 \\ j \neq k}}^{p} x_{ij} \beta_j \right)$$

and set

$$\beta_k = \frac{\operatorname{sign}(a_k) \left( |a_k| - \frac{\lambda(1 - \alpha)}{2} \right)_+}{b_k + \lambda\alpha}$$

Step 6. Repeat Step 5 for 1000 iterations or until convergence (vector $\beta$ does not change)

Step 7. Set the last updated parameter vector as $\hat{\beta} = [\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_p]$
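The two algorithms differ only in how $a_k$ is computed in Step 5. A small Python sketch, with hypothetical function names `a_k_loops` and `a_k_vectorized`, shows the nested-loop form of Algorithm 2 next to the vectorized form of Algorithm 1 so the two can be compared on the same inputs.

```python
import numpy as np

def a_k_loops(X, y, beta, k):
    """Algorithm 2, Step 5: a_k from explicit sums over i and over j != k."""
    N, p = X.shape
    a_k = 0.0
    for i in range(N):
        partial_fit = 0.0
        for j in range(p):
            if j != k:                      # leave out the k-th feature
                partial_fit += X[i, j] * beta[j]
        a_k += X[i, k] * (y[i] - partial_fit)
    return a_k

def a_k_vectorized(X, y, beta, k):
    """Algorithm 1, Step 5: the same quantity as one matrix expression."""
    x_k = X[:, k]
    return x_k @ (y - X @ beta + x_k * beta[k])
```

The vectorized form adds $\mathbf{x}_k \beta_k$ back to the full residual, which is equivalent to excluding the $k$th feature from the inner sum.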

Note that we define

$$\operatorname{sign}(x) = \begin{cases} -1 & \text{if } x < 0 \\ 1 & \text{if } x \geq 0 \end{cases}$$

$$x_+ = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } x \geq 0 \end{cases}$$

and we use the notation $\mathbf{x}_k$ as the $k$th column of the design matrix $\mathbf{X}$ (the $k$th feature vector). This vector by definition is an $N$-dimensional column vector. When randomly initializing the parameter vector, I would make sure that the parameters start at small values. A good strategy here may be to randomly initialize each of the $\beta_k$, $k = 1, 2, \ldots, p$, parameters from a uniform distribution between $-1$ and $1$.

Effect of tuning parameter on inferred regression coefficients

You will consider a discrete grid of nine tuning parameter values $\lambda \in \{10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}, 10^{4}, 10^{5}, 10^{6}\}$ where the tuning parameter is evaluated across a wide range of values on a log scale, as well as six tuning parameter values $\alpha \in \{0, \frac{1}{5}, \frac{2}{5}, \frac{3}{5}, \frac{4}{5}, 1\}$.

For each tuning parameter value pair, you will use coordinate descent to infer the best-fit model. Note that when $\alpha = 0$, we obtain the lasso estimate, and when $\alpha = 1$, we obtain the ridge regression estimate.
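One way to set up the grid of tuning parameter pairs in Python is sketched below. The variable names `lambdas`, `alphas`, and `grid` are illustrative choices, not part of the assignment.

```python
import itertools
import numpy as np

# Nine lambda values on a log scale: 10^-2, 10^-1, ..., 10^6.
lambdas = 10.0 ** np.arange(-2, 7)

# Six alpha values: 0, 1/5, 2/5, 3/5, 4/5, 1.
alphas = np.arange(6) / 5

# All 6 * 9 = 54 (alpha, lambda) pairs, one model fit per pair.
grid = list(itertools.product(alphas, lambdas))
```

Each pair in `grid` would then be passed to the coordinate descent routine, and the resulting coefficient vectors stored by pair for later plotting.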

Deliverable 1: Illustrate the effect of the tuning parameter on the inferred elastic net regression coefficients by generating six plots (one for each $\alpha$ value) of nine lines (one for each of the $p = 9$ features), with the $y$-axis as $\hat{\beta}_j$, $j = 1, 2, \ldots, 9$, and the $x$-axis the corresponding log-scaled tuning parameter value $\log_{10}(\lambda)$ that generated the particular $\hat{\beta}_j$. Label both axes in all six plots. Without the log scaling of the tuning parameter $\lambda$, the plots will look distorted.

Choosing the best tuning parameter

You will consider a discrete grid of nine tuning parameter values $\lambda \in \{10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}, 10^{4}, 10^{5}, 10^{6}\}$ where the tuning parameter is evaluated across a wide range of values on a log scale, as well as six tuning parameter values $\alpha \in \{0, \frac{1}{5}, \frac{2}{5}, \frac{3}{5}, \frac{4}{5}, 1\}$.

For each tuning parameter value pair, perform five-fold cross validation and choose the pair of $\lambda$ and $\alpha$ values that give the smallest

$$\text{CV}_{(5)} = \frac{1}{5} \sum_{m=1}^{5} \text{MSE}_m$$