Category: R

  • Exploring Probability Distributions

    Overview

    For your final project, you will use R to solve problems about probability distributions. Specifically, you will make use of the d, p, q, and r function families built into R for working with probability distributions. In most cases, you will need to identify the type of probability distribution being described and then use R to compute a numerical answer.
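    As a quick illustration of these four families, here is a minimal sketch using the normal distribution; the same d/p/q/r pattern applies to the binomial, Poisson, and the other built-in distributions:

    # d = density, p = cumulative probability, q = quantile, r = random draws
    dnorm(0)                      # density of N(0, 1) at x = 0
    pnorm(1.96)                   # P(X <= 1.96) for a standard normal
    qnorm(0.975)                  # the 97.5th percentile (inverse of pnorm)
    set.seed(42)                  # make the random draws reproducible
    rnorm(5, mean = 10, sd = 2)   # five random draws from N(10, 2^2)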

    Instructions

    To complete this assignment, you will produce and submit two files: an R script and report containing answers and visualizations. Detailed instructions for both the script and report are specified in the downloadable assignment sheet linked in the Guidelines tab. To successfully complete this project, carefully follow all instructions. Pay special attention to formatting guidelines.

    Submission Guidelines

    Follow the attached instruction set to build an R script inside of RStudio.

    Make use of the included files as directed in the instructions.

    project6_tests.R

    Complete all assigned tasks on the instruction sheet.

    Use the testing script to verify your script for accuracy, including all specified variable names.

    Review the R Project Module 6 Rubric to guide your final submission.

    Review all work prior to Canvas upload. Specifically, do the following tasks:

    Remove any lines in your code that use install.packages.

    Remove any lines in your code that use the View() function.

    Include the environment reset code as the first thing in your script.

    Submit two (2) files under the assignment in Canvas:

    The R script file named Lastname_Project6_Script.R

    The report named Lastname_Project6_Report.pdf

    Requirements: incase

  • R Question

    Apply the fundamental R programming concepts, including vectors, data frames, subsetting, and basic visualization, to analyze a dataset of housing values in the suburbs of Boston.

    The assignment uses an external CSV file that contains some missing values. A key part of the task will be identifying and handling these data issues before performing your analysis.
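    As a hedged starting point (the file and column names below are assumptions, not the actual assignment files), identifying and handling missing values in R might look like:

    housing <- read.csv("boston_housing.csv")  # hypothetical file name
    colSums(is.na(housing))                    # count missing values per column
    housing_complete <- na.omit(housing)       # option 1: drop rows with any NA
    # Option 2: impute a numeric column's NAs with its median
    # ("medv" is an assumed column name)
    housing$medv[is.na(housing$medv)] <- median(housing$medv, na.rm = TRUE)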

    Download three files to get started: two main files and a template into which screenshots and written answers are to be pasted.

    Requirements: Template, R Script – 2 files.

    Requirements: Should answer all questions

  • US_Housing_Prices & Affordability

    Overview

    In this project, you will use R to solve counting and probability problems. To gain the most benefit from this project, avoid calculating numeric values by hand and entering them into R; instead, use R to do all necessary calculations. This project analyzes trends in U.S. housing prices and affordability using open-source data. The goal is to answer seven analytical questions using charts, graphs, and written explanations, culminating in a 2-3-slide presentation and a 1-2 page report.

    Data Sources

  • Zillow Housing Data: home prices over time by state/city
  • U.S. Census Data: median household income and demographic context

    Instructions

    To complete this assignment, you will produce and submit two files: a PowerPoint report containing answers and visual results produced by your R scripts, and a Word document addressing the questions. To successfully complete this project, carefully follow all instructions. Pay special attention to formatting guidelines.

    Submission Guidelines

    Build an R script inside of RStudio.

    Make use of the included files as directed in the instructions.

    US_Housing_Prices & Affordability – final_housing_prices_affordability

    Complete all assigned tasks on the instruction sheet.

    Use the testing script to verify your script for accuracy, including all specified variable names.

    Review all work prior to Canvas upload. Specifically, do the following tasks:

    Use R to complete the following tasks and create visualizations:

    Regional Comparison Analyst: Analyze state/city and regional differences; create bar charts/heat maps; write findings.
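    A minimal sketch of this analysis, assuming hypothetical column names (state, region, home_price) rather than the actual Zillow/Census fields:

    library(dplyr)
    library(ggplot2)

    prices <- read.csv("final_housing_prices_affordability.csv")  # assumed file name
    by_state <- prices %>%
      group_by(state, region) %>%
      summarize(median_price = median(home_price, na.rm = TRUE), .groups = "drop")
    head(arrange(by_state, desc(median_price)), 10)   # highest median prices (Question 1)
    ggplot(by_state, aes(reorder(state, median_price), median_price, fill = region)) +
      geom_col() +
      coord_flip() +                                  # regional comparison (Question 2)
      labs(x = "State", y = "Median home price")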

    Analytical Questions

    Answer the following two questions and write them into the Doc file:

    1. Which states or cities have the highest median home prices?

    2. How do housing prices differ across regions (e.g., Northeast vs. Midwest)?

    Submit two (2) files under the assignment in Canvas:

    o The R script file named Lastname_Project_Script.R

    o The report named Lastname_Project_Report.doc

    Requirements: 2-3-slide presentation and a 1-2 page report

  • R Question

    The zip file contains two CSV datasets, the assignment instructions in a Word document, and a template for the assignment in R.

    Please read the instructions in the Word document, use the two CSV datasets for the assignment in R, and follow the template in RStudio to answer all the questions in the instructions.

    Requirements: 1 page

  • HW HELP ASAP (PART 1 NEED ASAP, WILL GIVE MORE TIME FOR OTHER…

    PART ONE

    For this assignment you will write an R program to complete the tasks given below. You will hand in two files for this assignment.

    • A file with your R program. This file should contain only the code (no output) and must have the typical .R extension. No other file extensions will be accepted. The reason is that the assignment will be graded based on your R code, not the output file; the output file will be used only to verify the code commands. Also, please make sure that all comments, discussion, and conclusions regarding results are annotated as part of your code.
    • A PDF/DOC file with your output. We are giving you more flexibility regarding how you want to present your output (tables, plots, etc.). You can either use RMD files that combine code, narrative text, and plots, or you can use a Word document with copy and paste from the R platform you are using. However, please remember that all output (tables, plots, comments, conclusions, etc.) shown in this file has to be generated by the same R code that you submit. This is important! Output generated using separate code, or output not supported by the submitted code, will not be graded. Screenshots will not be accepted.

    Use the following file

    • R Data Set: HMEQ_Scrubbed.csv (in the zip file attached).
    • The Data Dictionary in the zip file.

    Note: The HMEQ_Scrubbed.csv file is a simple scrubbed file from the previous week's homework. If you did more advanced scrubbing of the data last week, you may use your own data file instead. You might get better accuracy! If you decide to use your own version of HMEQ_Scrubbed.csv, please hand it in along with the other deliverables.

    This assignment is an extension of the Week 3 assignment. The difference is that this assignment will now incorporate model validation by using training and testing data sets.

    Step 1: Read in the Data

    • Read the data into R
    • List the structure of the data (str)
    • Execute a summary of the data
    • Print the first six records

    Step 2: Classification Decision Tree

    • Using the code discussed in the lecture, split the data into training and testing data sets.
    • Use the rpart library to predict the variable TARGET_BAD_FLAG
    • Develop two decision trees, one using Gini and the other using Entropy using the training and testing data
    • All other parameters such as tree depth are up to you.
    • Do not use TARGET_LOSS_AMT to predict TARGET_BAD_FLAG.
    • Plot both decision trees
    • List the important variables for both trees
    • Using the training data set, create a ROC curve for both trees
    • Using the testing data set, create a ROC curve for both trees
    • Write a brief summary of the decision trees discussing whether the trees are optimal, overfit, or underfit.
    • Rerun with different training and testing data at least three times.
    • Determine which of the two models performed better and explain why you believe this.
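    One possible shape for this step (a sketch, not the lecture code; df is the data frame read in Step 1):

    library(rpart)
    library(rpart.plot)
    library(pROC)

    set.seed(123)
    idx   <- sample(nrow(df), 0.7 * nrow(df))   # 70/30 split (proportion is up to you)
    train <- df[idx, ]
    test  <- df[-idx, ]
    f <- as.factor(TARGET_BAD_FLAG) ~ . - TARGET_LOSS_AMT   # exclude the loss amount
    tree_gini    <- rpart(f, data = train, method = "class", parms = list(split = "gini"))
    tree_entropy <- rpart(f, data = train, method = "class", parms = list(split = "information"))
    rpart.plot(tree_gini)                        # plot both trees
    rpart.plot(tree_entropy)
    tree_gini$variable.importance                # important variables (repeat for entropy)
    p_train <- predict(tree_gini, train, type = "prob")[, 2]
    p_test  <- predict(tree_gini, test,  type = "prob")[, 2]
    plot(roc(train$TARGET_BAD_FLAG, p_train))    # ROC on the training data
    plot(roc(test$TARGET_BAD_FLAG,  p_test))     # ROC on the testing data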

    Step 3: Regression Decision Tree

    • Using the code discussed in the lecture, split the data into training and testing data sets.
    • Use the rpart library to predict the variable TARGET_LOSS_AMT
    • Do not use TARGET_BAD_FLAG to predict TARGET_LOSS_AMT.
    • Develop two decision trees, one using anova and the other using poisson
    • All other parameters such as tree depth are up to you.
    • Plot both decision trees
    • List the important variables for both trees
    • Using the training data set, calculate the Root Mean Square Error (RMSE) for both trees
    • Using the testing data set, calculate the Root Mean Square Error (RMSE) for both trees
    • Write a brief summary of the decision trees discussing whether the trees are optimal, overfit, or underfit.
    • Rerun with different training and testing data at least three times.
    • Determine which of the two models performed better and explain why you believe this.
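    A matching sketch for the regression trees, reusing the train/test split from the Step 2 sketch:

    library(rpart)
    f_reg <- TARGET_LOSS_AMT ~ . - TARGET_BAD_FLAG          # exclude the bad flag
    tree_anova   <- rpart(f_reg, data = train, method = "anova")
    tree_poisson <- rpart(f_reg, data = train, method = "poisson")
    rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))   # RMSE helper
    rmse(train$TARGET_LOSS_AMT, predict(tree_anova, train))  # training RMSE
    rmse(test$TARGET_LOSS_AMT,  predict(tree_anova, test))   # testing RMSE (repeat for poisson)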

    Step 4: Probability / Severity Model Decision Tree (Push Yourself!)

    • Using the code discussed in the lecture, split the data into training and testing data sets.
    • Use the rpart library to predict the variable TARGET_BAD_FLAG
    • Use the rpart library to predict the variable TARGET_LOSS_AMT using only records where TARGET_BAD_FLAG is 1.
    • Plot both decision trees
    • List the important variables for both trees
    • Using your models, predict the probability of default and the loss given default.
    • Multiply the two values together for each record.
    • Calculate the RMSE value for the Probability / Severity model.
    • Rerun at least three times to be assured that the model is optimal and not over fit or under fit.
    • Comment on how this model compares to using the model from Step 3. Which one would you recommend using?
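    A sketch of the combined model, again assuming the earlier split and the rmse helper:

    m_prob <- rpart(as.factor(TARGET_BAD_FLAG) ~ . - TARGET_LOSS_AMT,
                    data = train, method = "class")
    m_sev  <- rpart(TARGET_LOSS_AMT ~ . - TARGET_BAD_FLAG,
                    data = train[train$TARGET_BAD_FLAG == 1, ], method = "anova")
    p_default <- predict(m_prob, test, type = "prob")[, 2]   # probability of default
    lgd       <- predict(m_sev, test)                        # loss given default
    rmse(test$TARGET_LOSS_AMT, p_default * lgd)              # expected loss vs. actual
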
      PART TWO
    • For this assignment you will write an R program to complete the tasks given below. You will hand in two files for this assignment.
      • A file with your R program. This file should contain only the code (no output) and must have the typical .R extension. No other file extensions will be accepted. The reason is that the assignment will be graded based on your R code, not the output file; the output file will be used only to verify the code commands. Also, please make sure that all comments, discussion, and conclusions regarding results are annotated as part of your code.
      • A PDF/DOC file with your output. We are giving you more flexibility regarding how you want to present your output (tables, plots, etc.). You can either use RMD files that combine code, narrative text, and plots, or you can use a Word document with copy and paste from the R platform you are using. However, please remember that all output (tables, plots, comments, conclusions, etc.) shown in this file has to be generated by the same R code that you submit. This is important! Output generated using separate code, or output not supported by the submitted code, will not be graded. Screenshots will not be accepted.

      Use the following file

      • R Data Set: HMEQ_Scrubbed.csv (in the zip file attached).
      • The Data Dictionary in the zip file.

      Note: The HMEQ_Scrubbed.csv file is a simple scrubbed file from the previous week's homework. If you did more advanced scrubbing of the data last week, you may use your own data file instead. You might get better accuracy! If you decide to use your own version of HMEQ_Scrubbed.csv, please hand it in along with the other deliverables.

      This assignment is an extension of the Week 4 assignment. The difference is that this assignment will now incorporate Random Forest and Gradient Boosting models.

      Step 1: Read in the Data

      • Read the data into R
      • List the structure of the data (str)
      • Execute a summary of the data
      • Print the first six records

      Step 2: Classification Models

      • Using the code discussed in the lecture, split the data into training and testing data sets.
      • Create a Decision Tree model using the rpart library to predict the variable TARGET_BAD_FLAG
      • Create a Random Forest model using the randomForest library to predict the variable TARGET_BAD_FLAG
      • Create a Gradient Boosting model using the gbm library to predict the variable TARGET_BAD_FLAG
      • All model parameters such as tree depth are up to you.
      • Do not use TARGET_LOSS_AMT to predict TARGET_BAD_FLAG.
      • Plot the Decision Tree and list the important variables for the tree.
      • List the important variables for the Random Forest and include the variable importance plot.
      • List the important variables for the Gradient Boosting model and include the variable importance plot.
      • Using the testing data set, create ROC curves for all models. They must all be on the same plot.
      • Display the Area Under the ROC curve (AUC) for all models.
      • Rerun with different training and testing data at least three times.
      • Determine which model performed best and why you believe this.
      • Write a brief summary of which model you would recommend using. Note that this is your opinion. There is no right answer. You might, for example, select a less accurate model because it is faster or easier to interpret.
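      One way this step might look (a sketch assuming train and test data frames from a split like the one in Part One; parameters such as ntree are arbitrary choices):

      library(randomForest)
      library(gbm)
      library(pROC)

      train_cls <- subset(train, select = -TARGET_LOSS_AMT)   # drop the forbidden predictor
      test_cls  <- subset(test,  select = -TARGET_LOSS_AMT)
      rf <- randomForest(as.factor(TARGET_BAD_FLAG) ~ ., data = train_cls,
                         ntree = 200, importance = TRUE)
      varImpPlot(rf)                                          # RF variable importance plot
      gb <- gbm(TARGET_BAD_FLAG ~ ., data = train_cls,
                distribution = "bernoulli", n.trees = 200)
      summary(gb)                                             # GBM importance plot and table
      roc_rf <- roc(test_cls$TARGET_BAD_FLAG, predict(rf, test_cls, type = "prob")[, 2])
      roc_gb <- roc(test_cls$TARGET_BAD_FLAG,
                    predict(gb, test_cls, n.trees = 200, type = "response"))
      plot(roc_rf, col = "blue")                              # all curves on one plot
      lines(roc_gb, col = "red")
      auc(roc_rf); auc(roc_gb)                                # AUC for each model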

      Step 3: Regression Models

      • Using the code discussed in the lecture, split the data into training and testing data sets.
      • Create a Decision Tree model using the rpart library to predict the variable TARGET_LOSS_AMT
      • Create a Random Forest model using the randomForest library to predict the variable TARGET_LOSS_AMT
      • Create a Gradient Boosting model using the gbm library to predict the variable TARGET_LOSS_AMT
      • All model parameters such as tree depth are up to you.
      • Do not use TARGET_BAD_FLAG to predict TARGET_LOSS_AMT.
      • Plot the Decision Tree and list the important variables for the tree.
      • List the important variables for the Random Forest and include the variable importance plot.
      • List the important variables for the Gradient Boosting model and include the variable importance plot.
      • Using the testing data set, calculate the Root Mean Square Error (RMSE) for all models.
      • Rerun with different training and testing data at least three times.
      • Determine which model performed best and why you believe this.
      • Write a brief summary of which model you would recommend using. Note that this is your opinion. There is no right answer. You might, for example, select a less accurate model because it is faster or easier to interpret.
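      The regression versions follow the same pattern (a sketch under the same assumptions):

      train_reg <- subset(train, select = -TARGET_BAD_FLAG)   # drop the forbidden predictor
      test_reg  <- subset(test,  select = -TARGET_BAD_FLAG)
      rf_reg <- randomForest(TARGET_LOSS_AMT ~ ., data = train_reg,
                             ntree = 200, importance = TRUE)
      gb_reg <- gbm(TARGET_LOSS_AMT ~ ., data = train_reg,
                    distribution = "gaussian", n.trees = 200)
      rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
      rmse(test_reg$TARGET_LOSS_AMT, predict(rf_reg, test_reg))
      rmse(test_reg$TARGET_LOSS_AMT, predict(gb_reg, test_reg, n.trees = 200))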

      Step 4: Probability / Severity Model Decision Tree (Push Yourself!)

      • Using the code discussed in the lecture, split the data into training and testing data sets.
      • Use any model from Step 2 in order to predict the variable TARGET_BAD_FLAG
      • Develop three models: Decision Tree, Random Forest, and Gradient Boosting to predict the variable TARGET_LOSS_AMT using only records where TARGET_BAD_FLAG is 1.
      • Select one of the models to predict damage.
      • List the important variables for both models.
      • Using your models, predict the probability of default and the loss given default.
      • Multiply the two values together for each record.
      • Calculate the RMSE value for the Probability / Severity model.
      • Rerun at least three times to be assured that the model is optimal and not over fit or under fit.
      • Comment on how this model compares to using the model from Step 3. Which one would you recommend using?

      PART 3

    • For this assignment you will write an R program to complete the tasks given below. You will hand in two files for this assignment.
      • A file with your R program. This file should contain only the code (no output) and must have the typical .R extension. No other file extensions will be accepted. The reason is that the assignment will be graded based on your R code, not the output file; the output file will be used only to verify the code commands. Also, please make sure that all comments, discussion, and conclusions regarding results are annotated as part of your code.
      • A PDF/DOC file with your output. We are giving you more flexibility regarding how you want to present your output (tables, plots, etc.). You can either use RMD files that combine code, narrative text, and plots, or you can use a Word document with copy and paste from the R platform you are using. However, please remember that all output (tables, plots, comments, conclusions, etc.) shown in this file has to be generated by the same R code that you submit. This is important! Output generated using separate code, or output not supported by the submitted code, will not be graded. Screenshots will not be accepted.
      Use the following file

      • R Data Set: HMEQ_Scrubbed.csv (in the zip file attached).
      • The Data Dictionary in the zip file.

      Note: The HMEQ_Scrubbed.csv file is a simple scrubbed file from the previous week's homework. If you did more advanced scrubbing of the data last week, you may use your own data file instead. You might get better accuracy! If you decide to use your own version of HMEQ_Scrubbed.csv, please hand it in along with the other deliverables.

      This assignment is an extension of the Week 5 assignment. We will now incorporate Regression Analysis into the problem.

      Step 1: Use the Decision Tree / Random Forest / Gradient Boosting code from Week 5 as a Starting Point

      In this assignment, we will build off the models developed in Week 5. Now we will add Regression to the models.

      Step 2: Classification Models

      • Using the code discussed in the lecture, split the data into training and testing data sets.
      • Do not use TARGET_LOSS_AMT to predict TARGET_BAD_FLAG.
      • Create a LOGISTIC REGRESSION model using ALL the variables to predict the variable TARGET_BAD_FLAG.
      • Create a LOGISTIC REGRESSION model using BACKWARD VARIABLE SELECTION.
      • Create a LOGISTIC REGRESSION model using a DECISION TREE and FORWARD STEPWISE SELECTION.
      • List the important variables from the Logistic Regression variable selections.
      • Compare the variables from the Logistic Regression with those of the Random Forest and the Gradient Boosting.
      • Using the testing data set, create ROC curves for all models. They must all be on the same plot.
      • Display the Area Under the ROC curve (AUC) for all models.
      • Determine which model performed best and why you believe this.
      • Write a brief summary of which model you would recommend using. Note that this is your opinion. There is no right answer. You might, for example, select a less accurate model because it is faster or easier to interpret.
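      A sketch of the three logistic models (train_cls is the split from the earlier sketch; step() here stands in for whatever selection routine the lecture used):

      full <- glm(TARGET_BAD_FLAG ~ ., data = train_cls, family = binomial)
      back <- step(full, direction = "backward", trace = FALSE)   # backward selection
      # Forward stepwise, starting from the variables a decision tree found useful:
      tree_vars <- names(rpart(as.factor(TARGET_BAD_FLAG) ~ ., data = train_cls,
                               method = "class")$variable.importance)
      fwd <- step(glm(TARGET_BAD_FLAG ~ 1, data = train_cls, family = binomial),
                  scope = reformulate(tree_vars, response = "TARGET_BAD_FLAG"),
                  direction = "forward", trace = FALSE)
      names(coef(back))   # variables kept by the selection (repeat for fwd)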

      Step 3: Linear Regression

      • Using the code discussed in the lecture, split the data into training and testing data sets.
      • Do not use TARGET_BAD_FLAG to predict TARGET_LOSS_AMT.
      • Create a LINEAR REGRESSION model using ALL the variables to predict the variable TARGET_LOSS_AMT.
      • Create a LINEAR REGRESSION model using BACKWARD VARIABLE SELECTION.
      • Create a LINEAR REGRESSION model using a DECISION TREE and FORWARD STEPWISE SELECTION.
      • List the important variables from the Linear Regression Variable Selections.
      • Compare the variables from the Linear Regression with those of the Random Forest and the Gradient Boosting.
      • Using the testing data set, calculate the Root Mean Square Error (RMSE) for all models.
      • Determine which model performed best and why you believe this.
      • Write a brief summary of which model you would recommend using. Note that this is your opinion. There is no right answer. You might, for example, select a less accurate model because it is faster or easier to interpret.
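      And the linear-regression counterpart (a sketch under the same assumptions as the Step 2 sketch):

      lm_full <- lm(TARGET_LOSS_AMT ~ ., data = train_reg)
      lm_back <- step(lm_full, direction = "backward", trace = FALSE)
      rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
      rmse(test_reg$TARGET_LOSS_AMT, predict(lm_back, test_reg))   # testing RMSE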

      Step 4: Probability / Severity Model (Push Yourself!)

      • Using the code discussed in the lecture, split the data into training and testing data sets.
      • Use any LOGISTIC model from Step 2 in order to predict the variable TARGET_BAD_FLAG
      • Use a LINEAR REGRESSION model to predict the variable TARGET_LOSS_AMT using only records where TARGET_BAD_FLAG is 1.
      • List the important variables for both models.
      • Using your models, predict the probability of default and the loss given default.
      • Multiply the two values together for each record.
      • Calculate the RMSE value for the Probability / Severity model.
      • Comment on how this model compares to using the model from Step 3. Which one would you recommend using?

      PART 4

    • For this assignment you will write an R program to complete the tasks given below. You will hand in two files for this assignment.
      • A file with your R program. This file should contain only the code (no output) and must have the typical .R extension. No other file extensions will be accepted. The reason is that the assignment will be graded based on your R code, not the output file; the output file will be used only to verify the code commands. Also, please make sure that all comments, discussion, and conclusions regarding results are annotated as part of your code.
      • A PDF/DOC file with your output. We are giving you more flexibility regarding how you want to present your output (tables, plots, etc.). You can either use RMD files that combine code, narrative text, and plots, or you can use a Word document with copy and paste from the R platform you are using. However, please remember that all output (tables, plots, comments, conclusions, etc.) shown in this file has to be generated by the same R code that you submit. This is important! Output generated using separate code, or output not supported by the submitted code, will not be graded. Screenshots will not be accepted.

      Use the following file

      • R Data Set: HMEQ_Scrubbed.csv (in the zip file attached).
      • The Data Dictionary in the zip file.

      Note: The HMEQ_Scrubbed.csv file is a simple scrubbed file from the previous week's homework. If you did more advanced scrubbing of the data last week, you may use your own data file instead. You might get better accuracy! If you decide to use your own version of HMEQ_Scrubbed.csv, please hand it in along with the other deliverables.

      This assignment is an extension of the Week 6 assignment. The difference is that this assignment will now incorporate PCA and tSNE analysis.

      Step 1: Use the Decision Tree / Random Forest / Gradient Boosting / Regression code from Week 6 as a Starting Point

      In this assignment, we will not be doing all the analysis as before, but much of the code from Week 6 can be used as a starting point. For this assignment, do not be concerned with splitting data into training and test sets. In the real world you would do that, but for this exercise it would only be an unnecessary complication.

      Step 2: PCA Analysis

      • Use only the input variables. Do not use either of the target variables.
      • Use only the continuous variables. Do not use any of the flag variables.
      • Do a Principal Component Analysis (PCA) on the continuous variables.
      • Display the Scree Plot of the PCA analysis.
      • Using the Scree Plot, determine how many Principal Components you wish to use. Note, you must use at least two. You may decide to use more. Justify your decision. Note that there is no wrong answer. You will be graded on your reasoning, not your decision.
      • Print the weights of the Principal Components. Use the weights to tell a story on what the Principal Components represent.
      • Perform a scatter plot using the first two Principal Components. Color the scatter plot dots using the Target Flag. One color will represent “defaults” and the other color will represent “non defaults”. Comment on whether you consider the first two Principal Components to be predictive. If you believe the graph is too cluttered, you are free to do a random sample of the data to make it more readable. That is up to you.
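      A hedged sketch of the PCA step (the name pattern used to pick out the continuous input columns is an assumption; check the data dictionary for the real column names):

      cont <- df[, !grepl("TARGET|FLAG|^M_", names(df))]   # continuous inputs only (assumed pattern)
      pca  <- prcomp(cont, center = TRUE, scale. = TRUE)
      screeplot(pca, type = "lines")                       # Scree Plot
      pca$rotation[, 1:2]                                  # weights of the first two components
      plot(pca$x[, 1], pca$x[, 2],
           col = ifelse(df$TARGET_BAD_FLAG == 1, "red", "blue"),
           xlab = "PC1", ylab = "PC2")                     # red = defaults, blue = non-defaults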

      Step 3: tSNE Analysis

      • Use only the input variables. Do not use either of the target variables.
      • Use only the continuous variables. Do not use any of the flag variables.
      • Do a tSNE analysis on the data. Set the dimensions to 2.
      • Run a tSNE analysis with Perplexity = 30. Color the scatter plot dots using the Target Flag. One color will represent “defaults” and the other color will represent “non defaults”. Comment on whether you consider the tSNE values to be predictive.
      • Repeat the previous step with a Perplexity greater than 30 (try to get a value much higher than 30).
      • Repeat the previous step with a Perplexity less than 30 (try to get a value much lower than 30).
      • Decide on which value of Perplexity best predicts the Target Flag.
      • Train two Random Forest Models to predict each of the tSNE values.
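      A sketch of the tSNE step using the Rtsne package (perplexity values other than 30 follow the same pattern; cont is the continuous-input data frame from the PCA sketch):

      library(Rtsne)
      library(randomForest)

      set.seed(1)
      ts30 <- Rtsne(as.matrix(cont), dims = 2, perplexity = 30, check_duplicates = FALSE)
      plot(ts30$Y, col = ifelse(df$TARGET_BAD_FLAG == 1, "red", "blue"),
           xlab = "tSNE-1", ylab = "tSNE-2")   # red = defaults, blue = non-defaults
      # Random Forests that predict each tSNE coordinate from the continuous inputs,
      # so the embedding can be applied again in Step 5:
      rf_t1 <- randomForest(x = cont, y = ts30$Y[, 1], ntree = 100)
      rf_t2 <- randomForest(x = cont, y = ts30$Y[, 2], ntree = 100)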

      Step 4: Tree and Regression Analysis on the Original Data

      • Create a Decision Tree to predict Loan Default (Target Flag=1). Comment on the variables that were included in the model.
      • Create a Logistic Regression model to predict Loan Default (Target Flag=1). Use either Forward, Backward, or Stepwise variable selection. Comment on the variables that were included in the model.
      • Create a ROC curve showing the accuracy of the model.
      • Calculate and display the Area Under the ROC Curve (AUC).

      Step 5: Tree and Regression Analysis on the PCA/tSNE Data

      • Append the Principal Component values from Step 2 to your data set.
      • Using the Random Forest models from Step 3, append the two tSNE values to the data set.
      • Remove all of the continuous variables from the data set (set them to NULL). Keep the flag variables in the data set.
      • Create a Decision Tree to predict Loan Default (Target Flag=1). Comment on the variables that were included in the model. Did any of the Principal Components or tSNE values make it into the model? Discuss why or why not.
      • Create a Logistic Regression model to predict Loan Default (Target Flag=1). Use either Forward, Backward, or Stepwise variable selection. Comment on the variables that were included in the model. Did any of the Principal Components or tSNE values make it into the model? Discuss why or why not.
      • Create a ROC curve showing the accuracy of the model.
      • Calculate and display the Area Under the ROC Curve (AUC).

      Step 6: Comment

      • Discuss how the PCA / tSNE values performed when compared to the original data set.

    Requirements: 1-4

  • Independent Data Analysis

    Overview

    Asking appropriate questions of data is an important part of data analytics, as is interpreting the results of the analysis. In this assignment, you will familiarize yourself with a dataset and begin thinking about key questions you could answer from the data.

    Instructions

    To complete this assignment, you will produce and submit two files: an R script and a slide deck that tells the story of your data. Detailed instructions for both the script and deck are specified in the downloadable assignment linked in the Guidelines tab. To successfully complete this project, carefully follow all instructions. Pay special attention to formatting guidelines.

    Submission Guidelines

    Follow the attached instruction set to produce a slide deck that tells the story of your data in 5-8 slides (not including title and reference list slides) through the use of descriptive statistics and visualizations. Properly cite all sources using APA style. Remember:

    Visualizations are the primary vehicle to convey information in an analytics presentation.

    Include written information in the Notes section on each slide that connects to the visualization's key points in a concise manner.

    Submit two (2) files under the assignment in Canvas with the following filename conventions:

    A slide deck <LastName>_Project4.pptx or <LastName>_Project4.pdf

    An R script LastName_Project4.R

    Instructions

    Preliminary Work: Getting Started

    Locate a dataset of interest. The following sites may help you identify data of interest to you and your work:

  • The R Project for Statistical Computing
  • Kaggle
  • Data.gov

    Your dataset should have at least 700 and no more than 6,000 records, as well as eight (8) attributes. Note: The data should not be “clean.” In this assignment, you will clean the data yourself.

    Before beginning your exploratory data analysis, develop 3-4 data questions. These may or may not change as you explore your data in greater depth, but they will provide you with direction to begin your analysis. The following steps will walk you through a typical exploratory data analysis. Note that your analysis may differ based on the specific dataset you selected.

    Part I: Exploring

    1. Review any written description of your dataset. This is often referenced as the data dictionary.

    2. Clean your data. Cleaning involves any task that prepares the dataset for analysis. This might include the following tasks:

    a. Renaming columns

    b. Managing NAs

    c. Correcting data types

    d. Removing columns or rows

    e. Manipulating strings

    f. Reorganizing the data

    g. Other steps that prepare your data

    3. Determine descriptive statistics for interesting variables.

    4. Produce visualizations from the raw data that identify and highlight interesting aspects. These can include bar charts, histograms, line graphs, scatter plots, etc. Be sure the chosen graph best represents the information.
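    The cleaning and exploration steps above might look like the following sketch (every file and column name here is a placeholder for whatever dataset you choose):

    library(dplyr)

    raw <- read.csv("my_dataset.csv")            # hypothetical dataset
    clean <- raw %>%
      rename(sale_date = Sale.Date) %>%          # a. rename an awkward column (assumed name)
      filter(!is.na(price)) %>%                  # b. manage NAs
      mutate(sale_date = as.Date(sale_date),     # c. correct data types
             price = as.numeric(price)) %>%
      select(-notes)                             # d. remove an unneeded column (assumed)
    summary(clean$price)                         # descriptive statistics
    hist(clean$price, main = "Price distribution", xlab = "Price")   # a first visualization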

    Part II: Expanding

    1. Create new variables that better capture the data you want to report. For example, if the data shows yearly sales by month, you might calculate the month-to-month increase or decrease in sales as a new column.

    2. Group, summarize, rank, arrange, count, or perform any other useful operations to create new data frames that provide access to different views of the data.

    3. Extract the most interesting data results and produce visualizations that best communicate these results.
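    Continuing the placeholder example, the month-to-month sales change described above could be derived like this:

    monthly <- clean %>%
      group_by(month = format(sale_date, "%Y-%m")) %>%
      summarize(total_sales = sum(price), .groups = "drop") %>%
      mutate(change = total_sales - lag(total_sales))   # month-to-month increase/decrease
    arrange(monthly, desc(change))                      # rank the largest swings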

    Part III: Communicating

    1. Report what you have learned about your data. Identify 3-5 observations or follow-up questions that you could explore in the future.

    2. Complete all data management tasks in R.

    Submission Guidelines

  • Follow the above instructions to produce a slide deck that tells the story of your data in 5-8 slides (not including title and reference list slides) through the use of descriptive statistics and visualizations. Properly cite all sources using APA style. Remember:
  • Visualizations are the primary vehicle to convey information in an analytics presentation.

    Include written information in the Notes section on each slide that connects to the visualization’s key points in a concise manner.

  • Submit two (2) files under the assignment in Canvas with the following filename conventions:
  • A slide deck: LastName_Project4.pptx or LastName_Project4.pdf
  • An R script: LastName_Project4.R
  • Requirements: 5-8 slides