For this assignment you will write an R program to complete the tasks given below. You will hand in two files for this assignment.
- A file with your R program. This file should contain only the code (no output) and must have the standard .R extension. No other file extensions will be accepted. The assignment will be graded on your R code, not on the output file; the output file will be used only to verify the code. Also, please make sure that all comments, discussion, and conclusions regarding results are included as comments in your code.
- A PDF/DOC file with your output. We are giving you more flexibility regarding how you want to present your output (tables, plots, etc.). You can either use RMD files that combine code, narrative text, and plots, or you can use a Word document with output copied and pasted from the R platform you are using. However, please remember that all output (tables, plots, comments, conclusions, etc.) shown in this file has to be generated by the same R code that you submit. This is important! Output generated by separate code, or output that is not supported by the submitted code, will not be graded. Screenshots will not be accepted.
Provided files:
- R Data Set: HMEQ_Scrubbed.csv (in the zip file attached).
- The Data Dictionary in the zip file.
Note: The HMEQ_Scrubbed.csv file is a simply scrubbed file from the previous week's homework. If you did more advanced data scrubbing last week, you may use your own data file instead. You might get better accuracy! If you decide to use your own version of HMEQ_Scrubbed.csv, please hand it in along with the other deliverables.
This assignment is an extension of the Week 6 assignment. The difference is that this assignment will now incorporate PCA and tSNE analysis.
Step 1: Use the Code from Week 6 as a Starting Point
In this assignment, we will not repeat all of the earlier analysis, but much of the code from Week 6 can be used as a starting point. For this assignment, do not be concerned with splitting the data into training and test sets. In the real world, you would do that, but for this exercise it would only be an unnecessary complication.
Step 2: PCA Analysis
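A minimal sketch of what a PCA step could look like in base R, using `prcomp`. It runs on simulated data so that it is self-contained; the column names (`loan`, `value`, `debt_ratio`, `default`) are hypothetical stand-ins and are not taken from the data dictionary. Keeping two components is also only an assumption for illustration.

```r
# Simulated stand-in for the numeric inputs of HMEQ_Scrubbed.csv.
# All column names here are hypothetical.
set.seed(1)
n <- 300
df <- data.frame(
  loan       = rnorm(n, 18000, 5000),
  value      = rnorm(n, 100000, 25000),
  debt_ratio = rnorm(n, 34, 8),
  default    = rbinom(n, 1, 0.2)   # target flag, kept out of the PCA
)

# Drop the target so it does not leak into the components.
inputs <- df[, setdiff(names(df), "default")]

# Center and scale so variables on large scales do not dominate.
pca <- prcomp(inputs, center = TRUE, scale. = TRUE)

summary(pca)                      # proportion of variance per component
screeplot(pca, type = "lines")    # scree plot of the component variances

# Scores on the first two components, for use in the clustering steps.
pc_scores <- as.data.frame(pca$x[, 1:2])
head(pc_scores)
```

With the real file, `df` would instead come from `read.csv("HMEQ_Scrubbed.csv")`, and the columns to exclude would be whichever target variables the data dictionary lists.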
Step 3: Cluster Analysis – Find the Number of Clusters
- Use the principal components from Step 2 for this step.
- Using the methods presented in the lectures, complete a KMeans cluster analysis for N=1 to at least N=10. Feel free to take the number higher.
- Print a scree plot of the clusters and determine how many clusters would be optimum. Justify your decision.
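The search for the number of clusters can be sketched as follows, assuming the principal component scores are held in a data frame called `pc_scores` (a hypothetical name). Simulated scores with three loose groups stand in for the real ones, and the total within-cluster sum of squares is used as the quantity plotted in the scree (elbow) plot.

```r
# Simulated stand-in for the first two principal component scores.
set.seed(1)
pc_scores <- data.frame(
  PC1 = c(rnorm(100, -3), rnorm(100, 0), rnorm(100, 3)),
  PC2 = c(rnorm(100, 0),  rnorm(100, 3), rnorm(100, 0))
)

max_k <- 10
wss <- numeric(max_k)

# Fit KMeans for k = 1..max_k and record the total within-cluster sum of squares.
for (k in 1:max_k) {
  fit <- kmeans(pc_scores, centers = k, nstart = 20, iter.max = 50)
  wss[k] <- fit$tot.withinss
}

# Scree plot: look for the "elbow" where adding clusters stops helping much.
plot(1:max_k, wss, type = "b",
     xlab = "Number of clusters",
     ylab = "Total within-cluster sum of squares",
     main = "KMeans scree plot")
```

Setting `nstart` above 1 reruns each fit from several random starts, which makes the curve less sensitive to an unlucky initialization.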
Step 4: Cluster Analysis
- Using the number of clusters from Step 3, perform a cluster analysis using the principal components from Step 2.
- Print the number of records in each cluster.
- Print the cluster center points for each cluster.
- Convert the KMeans clusters into “flexclust” clusters.
- Print the barplot of the clusters. Describe the clusters from the barplot.
- Score the data using the flexclust clusters. In other words, determine which cluster each record is in.
- Perform a scatter plot using the first two Principal Components. Color the plot by the cluster membership.
- Add a legend to the plot.
- Determine if the clusters predict loan default.
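The steps above can be sketched as follows. This is an illustration under stated assumptions, not a reference solution: the scores and the `default` flag are simulated, the names `pc_scores`, `default`, and `cluster_id` are hypothetical, and `k <- 3` is simply assumed. The flexclust part runs only if that package is installed; otherwise the sketch falls back to the KMeans assignments.

```r
# Simulated stand-ins for the PC scores and the default flag.
set.seed(1)
pc_scores <- data.frame(
  PC1 = c(rnorm(100, -3), rnorm(100, 0), rnorm(100, 3)),
  PC2 = c(rnorm(100, 0),  rnorm(100, 3), rnorm(100, 0))
)
default <- rbinom(300, 1, rep(c(0.1, 0.2, 0.5), each = 100))

k  <- 3   # assumed number of clusters from the scree plot
km <- kmeans(pc_scores, centers = k, nstart = 20)

print(km$size)      # number of records in each cluster
print(km$centers)   # cluster center points

# Convert to a flexclust object, draw its barplot, and score the data.
cluster_id <- km$cluster
if (requireNamespace("flexclust", quietly = TRUE)) {
  suppressPackageStartupMessages(library(flexclust))
  kf <- as.kcca(km, as.matrix(pc_scores))
  barplot(kf)
  cluster_id <- predict(kf, as.matrix(pc_scores))
}

# Scatter plot of the first two components, colored by cluster, with a legend.
plot(pc_scores$PC1, pc_scores$PC2, col = cluster_id, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Clusters on the first two PCs")
legend("topright", legend = paste("Cluster", 1:k), col = 1:k, pch = 19)

# Do the clusters relate to default? Compare default rates across clusters.
print(tapply(default, cluster_id, mean))
print(table(cluster = cluster_id, default = default))
```

If the default rate differs clearly between clusters, cluster membership carries some information about default; a roughly equal rate suggests it does not.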
Step 5: Describe the Clusters Using Decision Trees
- Using the original data from Step 2, predict cluster membership using a Decision Tree.
- Display the Decision Tree.
- Using the Decision Tree plot, describe or tell a story of each cluster. Comment on whether the clusters make sense.
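One way to sketch this step is with the `rpart` package, which is an assumption here since the assignment does not name a tree library. The data are simulated and the column names are hypothetical; the tree is fitted on the original (unrotated) variables so that its splits can be read in business terms.

```r
library(rpart)

# Simulated stand-in for the original inputs, with three loose groups.
set.seed(1)
grp <- rep(1:3, each = 100)
df <- data.frame(
  loan       = rnorm(300, c(10000, 20000, 30000)[grp], 3000),
  value      = rnorm(300, c(80000, 150000, 100000)[grp], 15000),
  debt_ratio = rnorm(300, 34, 8)
)

# Recreate the cluster labels: PCA on the inputs, then KMeans on the scores.
pca <- prcomp(df, center = TRUE, scale. = TRUE)
km  <- kmeans(pca$x[, 1:2], centers = 3, nstart = 20)

# Predict cluster membership from the original variables.
tree_data <- df
tree_data$cluster <- factor(km$cluster)
tree <- rpart(cluster ~ ., data = tree_data, method = "class")

# Display the tree; each path from root to leaf describes one cluster.
plot(tree, margin = 0.1)
text(tree, use.n = TRUE, cex = 0.8)
print(tree)
```

Reading the splits from the root down gives a rule-based description of each cluster, for example a cluster defined by a low value of one variable and a high value of another.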
Step 6: Comment
- Discuss how you might use these clusters in a corporate setting.
Essential Activities:
- Watch all the training videos
- Execute the example code while watching the training videos.
Notes:
- This assignment is due Saturday at 11:59 PM EST
February 4 2026, 3:13 PM