R Programming Model Validation L4

For this assignment you will write an R program to complete the tasks given below. You will hand in two files for this assignment.

A file with your R program. This file should contain only the code (no output) and must have the standard .R extension. No other file extensions will be accepted, because the assignment will be graded on your R code and not on the output file; the output file will be used only to verify the code commands. Please also make sure that all comments, discussion, and conclusions regarding results are included as annotations in your code.
A PDF/DOC file with your output. You have flexibility in how you present your output (tables, plots, etc.): you can either use an RMD file that combines code, narrative text, and plots, or a Word document with output copied and pasted from the R platform you are using. However, remember that all output (tables, plots, comments, conclusions, etc.) shown in this file must be generated by the same R code that you submit. This is important! Output generated by separate code, or output not supported by the submitted code, will not be graded. Screenshots will not be accepted.

R Data Set: HMEQ_Scrubbed.csv (in the zip file attached).
Data Dictionary: included in the zip file.

Note: The HMEQ_Scrubbed.csv file is a simply scrubbed file from the previous week's homework. If you did more advanced data scrubbing last week, you may use your own data file instead. You might get better accuracy! If you decide to use your own version of HMEQ_Scrubbed.csv, please hand it in along with the other deliverables.

This assignment is an extension of the Week 3 assignment. The difference is that this assignment will now incorporate model validation by using training and testing data sets.

Step 1: Read in the Data

Read the data into R
List the structure of the data (str)
Execute a summary of the data
Print the first six records
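
The four tasks in Step 1 can be sketched as follows. This is a minimal sketch, not the lecture code: the path assumes the CSV has been extracted from the zip into the working directory, and the fallback data frame (with a hypothetical LOAN predictor) exists only so the sketch runs when the file is absent.

```r
# Assumption: the CSV was extracted from the zip into the working directory.
csv_path <- "HMEQ_Scrubbed.csv"

if (file.exists(csv_path)) {
  df <- read.csv(csv_path)
} else {
  # Fallback so the sketch runs without the file: a small synthetic stand-in.
  # The LOAN column is hypothetical and does not describe the real file layout.
  set.seed(1)
  n <- 200
  bad <- rbinom(n, 1, 0.2)
  loan <- round(runif(n, 5000, 50000))
  df <- data.frame(
    TARGET_BAD_FLAG = bad,
    TARGET_LOSS_AMT = ifelse(bad == 1, round(loan * 0.5), 0),
    LOAN = loan
  )
}

str(df)             # structure: column names, types, first few values
print(summary(df))  # per-column summary statistics
print(head(df))     # first six records (head() defaults to n = 6)
```

The explicit print() calls make the output appear even when the script is run with source(), which does not auto-print by default.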

Step 2: Classification Decision Tree

Using the code discussed in the lecture, split the data into training and testing data sets.
Use the rpart library to predict the variable TARGET_BAD_FLAG
Develop two decision trees, one using Gini and the other using Entropy, using the training and testing data sets.
All other parameters such as tree depth are up to you.
Do not use TARGET_LOSS_AMT to predict TARGET_BAD_FLAG.
Plot both decision trees
List the important variables for both trees
Using the training data set, create a ROC curve for both trees
Using the testing data set, create a ROC curve for both trees
Write a brief summary of the decision trees discussing whether the trees are optimal, overfit, or underfit.
Rerun with different training and testing data at least three times.
Determine which of the two models performed better and why you believe this
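
One possible shape for Step 2 is sketched below. Several things here are assumptions rather than the lecture's method: the data are a synthetic stand-in with hypothetical predictors (LOAN, DEBTINC, DELINQ), the 70/30 random split may differ from the split code shown in the lecture, and the ROC curve is computed with a small base R helper (roc_curve, a hypothetical name) instead of a dedicated ROC package. In rpart, the Entropy criterion is requested with split = "information".

```r
library(rpart)

# Synthetic stand-in for HMEQ_Scrubbed.csv; predictor columns are hypothetical.
set.seed(42)
n <- 1000
df <- data.frame(
  LOAN    = round(runif(n, 5000, 50000)),
  DEBTINC = rnorm(n, 35, 8),
  DELINQ  = rpois(n, 0.5)
)
df$TARGET_BAD_FLAG <- rbinom(n, 1, plogis(-2.5 + 2 * df$DELINQ + 0.1 * (df$DEBTINC - 35)))
df$TARGET_LOSS_AMT <- ifelse(df$TARGET_BAD_FLAG == 1, round(df$LOAN * runif(n, 0.2, 0.8)), 0)

# Drop TARGET_LOSS_AMT so it can never be used to predict TARGET_BAD_FLAG.
df_flag <- df[, names(df) != "TARGET_LOSS_AMT"]

# ROC points and trapezoidal AUC from scores and 0/1 labels (base R only).
roc_curve <- function(score, label) {
  thr <- sort(unique(score), decreasing = TRUE)
  tpr <- c(0, sapply(thr, function(t) mean(score[label == 1] >= t)))
  fpr <- c(0, sapply(thr, function(t) mean(score[label == 0] >= t)))
  auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
  list(fpr = fpr, tpr = tpr, auc = auc)
}

# One train/test experiment; the seed controls which rows land in each set.
run_once <- function(seed, show_plots = FALSE) {
  set.seed(seed)
  in_train <- sample(c(TRUE, FALSE), nrow(df_flag), replace = TRUE, prob = c(0.7, 0.3))
  train <- df_flag[in_train, ]
  test  <- df_flag[!in_train, ]
  ctrl  <- rpart.control(maxdepth = 4)  # tree depth is a free choice

  fits <- list(
    gini    = rpart(TARGET_BAD_FLAG ~ ., data = train, method = "class",
                    parms = list(split = "gini"), control = ctrl),
    entropy = rpart(TARGET_BAD_FLAG ~ ., data = train, method = "class",
                    parms = list(split = "information"), control = ctrl)
  )

  rows <- lapply(names(fits), function(name) {
    fit <- fits[[name]]
    # Column 2 of the probability matrix is the probability of class "1".
    roc_tr <- roc_curve(predict(fit, train, type = "prob")[, 2], train$TARGET_BAD_FLAG)
    roc_te <- roc_curve(predict(fit, test,  type = "prob")[, 2], test$TARGET_BAD_FLAG)
    if (show_plots) {
      plot(fit, margin = 0.1, main = paste("Tree:", name))
      text(fit, use.n = TRUE)
      print(fit$variable.importance)  # important variables
      plot(roc_tr$fpr, roc_tr$tpr, type = "l", xlab = "False positive rate",
           ylab = "True positive rate", main = paste("ROC:", name))
      lines(roc_te$fpr, roc_te$tpr, lty = 2)  # dashed line = testing data
      abline(0, 1, col = "grey")
    }
    data.frame(seed = seed, tree = name, auc_train = roc_tr$auc, auc_test = roc_te$auc)
  })
  do.call(rbind, rows)
}

# Rerun with three different training/testing splits; plot only the first run.
results <- do.call(rbind, lapply(c(1, 2, 3), function(s) run_once(s, show_plots = (s == 1))))
print(results)
```

A training AUC well above the testing AUC across reruns suggests overfitting, while low AUC on both suggests underfitting; comparing the two rows per seed is one way to judge Gini against Entropy.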

Step 3: Regression Decision Tree

Using the code discussed in the lecture, split the data into training and testing data sets.
Use the rpart library to predict the variable TARGET_LOSS_AMT
Do not use TARGET_BAD_FLAG to predict TARGET_LOSS_AMT.
Develop two decision trees, one using anova and the other using poisson
All other parameters such as tree depth are up to you.
Plot both decision trees
List the important variables for both trees
Using the training data set, calculate the Root Mean Square Error (RMSE) for both trees
Using the testing data set, calculate the Root Mean Square Error (RMSE) for both trees
Write a brief summary of the decision trees discussing whether the trees are optimal, overfit, or underfit.
Rerun with different training and testing data at least three times.
Determine which of the two models performed better and why you believe this
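
Step 3 could be approached along the following lines. This is a sketch under stated assumptions: the data are a synthetic stand-in with hypothetical predictors (LOAN, DEBTINC, DELINQ), the 70/30 random split may differ from the lecture's split code, and rmse is a helper defined here. The poisson method in rpart requires a non-negative response, which holds for a loss amount that is zero when there is no default.

```r
library(rpart)

# Synthetic stand-in for HMEQ_Scrubbed.csv; predictor columns are hypothetical.
set.seed(42)
n <- 1000
df <- data.frame(
  LOAN    = round(runif(n, 5000, 50000)),
  DEBTINC = rnorm(n, 35, 8),
  DELINQ  = rpois(n, 0.5)
)
df$TARGET_BAD_FLAG <- rbinom(n, 1, plogis(-2.5 + 2 * df$DELINQ + 0.1 * (df$DEBTINC - 35)))
df$TARGET_LOSS_AMT <- ifelse(df$TARGET_BAD_FLAG == 1, round(df$LOAN * runif(n, 0.2, 0.8)), 0)

# Drop TARGET_BAD_FLAG so it can never be used to predict TARGET_LOSS_AMT.
df_loss <- df[, names(df) != "TARGET_BAD_FLAG"]

# Root Mean Square Error between actual and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# One train/test experiment; the seed controls which rows land in each set.
run_once <- function(seed, show_plots = FALSE) {
  set.seed(seed)
  in_train <- sample(c(TRUE, FALSE), nrow(df_loss), replace = TRUE, prob = c(0.7, 0.3))
  train <- df_loss[in_train, ]
  test  <- df_loss[!in_train, ]
  ctrl  <- rpart.control(maxdepth = 4)  # tree depth is a free choice

  fits <- list(
    anova   = rpart(TARGET_LOSS_AMT ~ ., data = train, method = "anova",   control = ctrl),
    poisson = rpart(TARGET_LOSS_AMT ~ ., data = train, method = "poisson", control = ctrl)
  )

  rows <- lapply(names(fits), function(name) {
    fit <- fits[[name]]
    if (show_plots) {
      plot(fit, margin = 0.1, main = paste("Tree:", name))
      text(fit)
      print(fit$variable.importance)  # important variables
    }
    data.frame(
      seed       = seed,
      tree       = name,
      rmse_train = rmse(train$TARGET_LOSS_AMT, predict(fit, train)),
      rmse_test  = rmse(test$TARGET_LOSS_AMT,  predict(fit, test))
    )
  })
  do.call(rbind, rows)
}

# Rerun with three different training/testing splits; plot only the first run.
results <- do.call(rbind, lapply(c(1, 2, 3), function(s) run_once(s, show_plots = (s == 1))))
print(results)
```

A testing RMSE much larger than the training RMSE points to overfitting; similar but large values on both point to underfitting.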

Step 4: Probability / Severity Model Decision Tree (Push Yourself!)

Using the code discussed in the lecture, split the data into training and testing data sets.
Use the rpart library to predict the variable TARGET_BAD_FLAG
Use the rpart library to predict the variable TARGET_LOSS_AMT using only records where TARGET_BAD_FLAG is 1.
Plot both decision trees
List the important variables for both trees
Using your models, predict the probability of default and the loss given default.
Multiply the two values together for each record.
Calculate the RMSE value for the Probability / Severity model.
Rerun at least three times to be assured that the model is optimal and not overfit or underfit.
Comment on how this model compares to the model from Step 3. Which one would you recommend using?
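
The probability / severity idea in Step 4 can be sketched as below: expected loss per record is the predicted probability of default times the predicted loss given default. The sketch makes several assumptions: the data are a synthetic stand-in with hypothetical predictors (LOAN, DEBTINC, DELINQ), the 70/30 split may differ from the lecture's split code, the severity tree uses the anova method, RMSE is computed on the testing data, and a direct anova tree is included only as a stand-in for the Step 3 comparison.

```r
library(rpart)

# Synthetic stand-in for HMEQ_Scrubbed.csv; predictor columns are hypothetical.
set.seed(42)
n <- 1000
df <- data.frame(
  LOAN    = round(runif(n, 5000, 50000)),
  DEBTINC = rnorm(n, 35, 8),
  DELINQ  = rpois(n, 0.5)
)
df$TARGET_BAD_FLAG <- rbinom(n, 1, plogis(-2.5 + 2 * df$DELINQ + 0.1 * (df$DEBTINC - 35)))
df$TARGET_LOSS_AMT <- ifelse(df$TARGET_BAD_FLAG == 1, round(df$LOAN * runif(n, 0.2, 0.8)), 0)

# Root Mean Square Error between actual and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# One train/test experiment; the seed controls which rows land in each set.
run_once <- function(seed, show_plots = FALSE) {
  set.seed(seed)
  in_train <- sample(c(TRUE, FALSE), nrow(df), replace = TRUE, prob = c(0.7, 0.3))
  train <- df[in_train, ]
  test  <- df[!in_train, ]
  no_loss <- names(train) != "TARGET_LOSS_AMT"
  no_flag <- names(train) != "TARGET_BAD_FLAG"

  # Probability model: predicts TARGET_BAD_FLAG without TARGET_LOSS_AMT.
  prob_fit <- rpart(TARGET_BAD_FLAG ~ ., data = train[, no_loss], method = "class",
                    control = rpart.control(maxdepth = 4))

  # Severity model: predicts TARGET_LOSS_AMT using only records that defaulted.
  defaulted <- train[train$TARGET_BAD_FLAG == 1, no_flag]
  sev_fit <- rpart(TARGET_LOSS_AMT ~ ., data = defaulted, method = "anova",
                   control = rpart.control(maxdepth = 4))

  if (show_plots) {
    plot(prob_fit, margin = 0.1, main = "Probability of default"); text(prob_fit)
    plot(sev_fit,  margin = 0.1, main = "Loss given default");     text(sev_fit)
    print(prob_fit$variable.importance)
    print(sev_fit$variable.importance)
  }

  # Expected loss = P(default) * loss given default, record by record.
  p_default          <- predict(prob_fit, test, type = "prob")[, 2]
  loss_given_default <- predict(sev_fit, test)
  expected_loss      <- p_default * loss_given_default

  # Comparison point: a single anova tree on all training records.
  direct_fit <- rpart(TARGET_LOSS_AMT ~ ., data = train[, no_flag], method = "anova",
                      control = rpart.control(maxdepth = 4))

  data.frame(
    seed               = seed,
    rmse_prob_severity = rmse(test$TARGET_LOSS_AMT, expected_loss),
    rmse_direct        = rmse(test$TARGET_LOSS_AMT, predict(direct_fit, test))
  )
}

# Rerun with three different training/testing splits; plot only the first run.
results <- do.call(rbind, lapply(c(1, 2, 3), function(s) run_once(s, show_plots = (s == 1))))
print(results)
```

Comparing the two RMSE columns across the reruns gives a basis for the recommendation the last task asks for; results on the real data may differ from those on the stand-in data.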

Essential Activities:

Watch all the training videos
Execute the example code while watching the training videos.

Notes:

This assignment is due Sunday at 11:59 PM EST

HMEQ_Scrubbed.zip

February 4 2026, 3:13 PM
