Category: R

  • R Question

    Question 1

    The data contained in smokers.csv reports the findings of Spilich, June, and Renner (1992), who asked nonsmokers (NS), smokers who had delayed smoking for three hours (DS), and smokers who were actively smoking (AS) to perform a pattern recognition task in which they had to locate a target on a screen. The dependent variable was latency (in seconds).

    Plot the means and 95% confidence intervals of the smokers.csv data.

    Display a data frame that shows the group means, as well as the lower and upper boundaries of the confidence intervals. Do not display any other statistics in the data frame.
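    A minimal base-R sketch of this task, assuming smokers.csv has columns named Group and Latency (a simulated stand-in data frame replaces the real file here):

```r
# Stand-in for smokers.csv: assumed columns Group (NS/DS/AS) and Latency
set.seed(1)
smokers <- data.frame(
  Group   = factor(rep(c("NS", "DS", "AS"), each = 15)),
  Latency = rnorm(45, mean = rep(c(500, 550, 600), each = 15), sd = 80)
)

# Group means with classic 95% CIs: mean +/- t_crit * SE
stats <- do.call(rbind, lapply(split(smokers$Latency, smokers$Group), function(x) {
  m  <- mean(x)
  se <- sd(x) / sqrt(length(x))
  tc <- qt(0.975, df = length(x) - 1)
  data.frame(mean = m, lower = m - tc * se, upper = m + tc * se)
}))
print(stats)   # only mean, lower, upper, as the question asks

# Base-R plot of means with CI error bars
mids <- barplot(stats$mean, names.arg = rownames(stats),
                ylim = c(0, max(stats$upper) * 1.1))
arrows(mids, stats$lower, mids, stats$upper, angle = 90, code = 3, length = 0.05)
```

With the real file, replace the simulated data frame with `read.csv("smokers.csv")` and the actual column names.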


    Question 2

    Looking at the plot, what is a potential concern for the smokers.csv data?


    Question 3

    Without using the lm() or aov() function, compute an omnibus F-test. Is there support for the hypothesis that smoking has an effect on performance? Report the:

    • F-statistic
    • Degrees of freedom
    • P-value
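    The omnibus F can be computed by hand from the sums-of-squares decomposition; a sketch on simulated stand-in data:

```r
# Hand-rolled one-way ANOVA (no lm()/aov()); simulated stand-in for smokers.csv
set.seed(2)
g <- factor(rep(c("NS", "DS", "AS"), each = 15))
y <- rnorm(45, mean = rep(c(500, 550, 600), each = 15), sd = 80)

k <- nlevels(g); N <- length(y)
fitted_means <- ave(y, g)                  # each score's own group mean
SSM <- sum((fitted_means - mean(y))^2)     # between-groups (model) SS
SSR <- sum((y - fitted_means)^2)           # within-groups (residual) SS
Fstat <- (SSM / (k - 1)) / (SSR / (N - k))
pval  <- pf(Fstat, k - 1, N - k, lower.tail = FALSE)
round(c(F = Fstat, df1 = k - 1, df2 = N - k, p = pval), 4)
```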

    Question 4

    Report an effect size in terms of for the test you ran in the previous question. You should obtain a negative result. What do you think this means in plain English? (You won't lose marks if you are incorrect.)


    Question 5

    Run an ANOVA with planned contrasts that:

    1. compares the combined effect of the active smokers and delayed smokers to the non-smokers, and
    2. compares the active smokers with the delayed smokers.

    Report the test statistic, degrees of freedom, and p-value for those two comparisons. You ARE allowed to use the functions lm() and aov().
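    One way to set these two planned contrasts up is via the `contrasts<-` replacement function before fitting; a sketch on simulated stand-in data:

```r
# Simulated stand-in for smokers.csv; level order is AS, DS, NS
set.seed(3)
dat <- data.frame(
  Group   = factor(rep(c("AS", "DS", "NS"), each = 15)),
  Latency = rnorm(45, mean = rep(c(600, 550, 500), each = 15), sd = 80)
)

# Contrast 1: smokers combined (AS + DS) vs non-smokers; contrast 2: AS vs DS
contrasts(dat$Group) <- cbind(SmokeVsNS = c(1, 1, -2),
                              ASvsDS    = c(1, -1, 0))
fit <- lm(Latency ~ Group, data = dat)
summary(fit)$coefficients   # t-statistic and p-value for each contrast
fit$df.residual             # degrees of freedom for those t-tests
```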


    Question 6

    Report an effect size in terms of a partial correlation for each of the contrasts you conducted in the previous question.


    Question 7

    Recall that there is a version of the t-test, called Welch's t-test, that is robust to unequal variances. The Welch procedure can actually be extended to one-way ANOVAs using the function oneway.test().

    Use your excellent sleuthing skills to figure out how to use this function and display its results (omnibus F-statistic, degrees of freedom, and p-value) for a one-way analysis of variance that doesn't assume equal variances on the smokers.csv data. Does your interpretation of this ANOVA change from the first one you ran?
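    For reference, oneway.test() performs the Welch correction by default; a sketch on simulated unequal-variance data:

```r
# Simulated stand-in with deliberately unequal group SDs
set.seed(4)
g <- factor(rep(c("NS", "DS", "AS"), each = 15))
y <- rnorm(45, mean = rep(c(500, 550, 600), each = 15),
           sd = rep(c(60, 80, 100), each = 15))
res <- oneway.test(y ~ g)   # var.equal = FALSE by default, i.e. Welch
res$statistic; res$parameter; res$p.value
```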


    Question 8

    Write some R code that re-creates the following matrix:

         [,1] [,2] [,3]
    [1,]   78   85   92
    [2,]   81   79   88

    Display the matrix and use the is.matrix() function to prove it is a matrix.

    Note: We never learned how to do this in class, so you need to figure it out on your own.
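    The key here is the matrix() constructor and its byrow argument:

```r
m <- matrix(c(78, 85, 92,
              81, 79, 88),
            nrow = 2, byrow = TRUE)   # byrow = TRUE fills across rows
m
is.matrix(m)   # TRUE
```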


    Question 9

    From the previous question, apply the transpose function t() to the matrix. In plain English, what has this function done specifically?


    Question 10

    Create the following two matrices:

    Matrix A:

         [,1] [,2]
    [1,]    1    3
    [2,]    2    4

    Matrix B:

         [,1] [,2]
    [1,]    5    7
    [2,]    6    8

    Multiply them like this: A * B and then write out how each value in the matrix was calculated.

    Example:

    • [1,1] = … = 5
    • [2,1] = … = 12
    • [1,2] = … = 21
    • [2,2] = … = 32

    (Replace the … with the calculation that was performed.)
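    Note that `*` in R is element-wise, which is what produces the example values above:

```r
A <- matrix(c(1, 2, 3, 4), nrow = 2)   # column-major fill
B <- matrix(c(5, 6, 7, 8), nrow = 2)
A * B   # element-wise: each cell is A[i,j] * B[i,j]
```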


    Question 11

    Repeat the previous question, but this time multiply them like this: A %*% B.
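    In contrast, `%*%` is the true matrix product (row-by-column dot products):

```r
A <- matrix(c(1, 2, 3, 4), nrow = 2)
B <- matrix(c(5, 6, 7, 8), nrow = 2)
A %*% B   # [i,j] = sum over k of A[i,k] * B[k,j]
# e.g. [1,1] = 1*5 + 3*6 = 23
```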


    Question 12

    Write some (efficient) R code to re-create the following multiplication table:

    [,1] [,2] ... [,12]
    [1,] 1 2 ... 12
    [2,] 2 4 ... 24
    ...
    [12,] 12 24 ... 144
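    One efficient approach is the outer product of 1:12 with itself:

```r
tab <- outer(1:12, 1:12)   # or equivalently: (1:12) %o% (1:12)
tab
```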


  • data analytics question

    ECON253: Data Analysis I – Report Assignment: Part 1

    You need to complete the data analysis for the designated city.

    Will share the allocated data set.

  • RStudio Help – R Script

    Files/questions attached. R Script required.

  • Colab- R 1

    Question 1

    Some archaeologists theorize that ancient Egyptians interbred with several different immigrant populations over thousands of years. To see if there is any indication of changes in body structure that might have resulted, they measured skulls of male Egyptians from 5 different epochs.

    Thomson and Randall-Maciver, Ancient Races of the Thebaid, Oxford: Oxford University Press, 1905.

    The data can be found in SkullsComplete.csv. The column mb measures the maximum breadth of the skull in millimetres.

    For the remaining questions, we will not be using the columns bh, bl or nh. Remove them from the dataframe and display the first 10 rows.


    Question 2

    Create a barplot of the mean maximal breadth measured for each epoch in the SkullsComplete.csv data. Give the plot errorbars with classic 95% confidence intervals.

    Order the epochs so that, left to right, they go from earliest ( ) to latest ( ). Note that years classified as B.C. count backwards. E.g., is more recent than .

    Adjust the y-axis scale so that it goes from 120 at the bottom to 140 at the top.

    Make the bars interesting colours (don't use ggplot's default colours).

    Display a data frame that shows the group means, as well as the lower and upper boundaries of the confidence intervals. Do not display any other statistics inside the data frame.
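    A base-R sketch of the ordered, colour-customised barplot with CI error bars, using a simulated stand-in for SkullsComplete.csv (columns epoch and mb assumed):

```r
# Simulated stand-in; epoch names taken from the contrast example below
set.seed(5)
epochs <- c("4000BC", "3300BC", "1850BC", "200BC", "150AD")   # earliest to latest
dat <- data.frame(
  epoch = factor(rep(epochs, each = 30), levels = epochs),    # fixes bar order
  mb    = rnorm(150, mean = rep(131:135, each = 30), sd = 5)
)
means <- tapply(dat$mb, dat$epoch, mean)
ci    <- qt(0.975, df = 29) * tapply(dat$mb, dat$epoch, sd) / sqrt(30)

mids <- barplot(means, ylim = c(120, 140), xpd = FALSE,   # y-axis 120..140
                col = hcl.colors(5, "Viridis"))           # non-default colours
arrows(mids, means - ci, mids, means + ci, angle = 90, code = 3, length = 0.05)
data.frame(mean = means, lower = means - ci, upper = means + ci)
```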


    Question 3

    Using the SkullsComplete.csv data, create an ordinary least-squares linear model predicting maximal breadth (mb) with coefficients that make the following comparisons:

    Report the model's formula using the obtained coefficients.

    e.g.

    b0 = 4000BC (baseline)
    b1 = 3300BC - 4000BC
    b2 = 1850BC - 4000BC
    b3 = 200BC - 4000BC
    b4 = 150AD - 4000BC

    ŷ = (value) + (value)x1 + (value)x2 + ... + (value)x4


    Question 4

    Is the omnibus F-test from the previous question's linear model statistically significant at α = 0.05? Report its value, degrees of freedom, and p-value.

    Ensure that the F-statistic and p-value are displayed to 6 decimal places.

    To get accurate results, you will need to extract the F-stat values from the summary output.
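    The F-statistic lives in the `fstatistic` component of the summary object; a sketch on the built-in cars data as a stand-in:

```r
# Pulling the F-statistic out of summary(); cars stands in for the skulls model
fit <- lm(dist ~ speed, data = cars)
fs  <- summary(fit)$fstatistic            # named: value, numdf, dendf
pv  <- pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)
sprintf("F = %.6f, p = %.6f", fs["value"], pv)
```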


    Question 5

    What do the results you displayed in the previous question tell you?


    Question 6

    Based on the planned contrasts/comparisons you used to evaluate the SkullsComplete.csv data, between which epochs was there a significant difference?

    Ensure your table of coefficients is displayed.


    Question 7

    For the SkullsComplete.csv data, assume all the classic assumptions of an OLS model are satisfied. Calculate an omnibus F-test manually without using lm() or aov(). Report the following:

    Grand Mean
    Total Sum of Squares
    Model Sum of Squares
    Residual Sum of Squares
    Model Mean Squares
    Residual Mean Squares
    Multiple R
    F statistic
    Degrees of Freedom
    p-value

    Round displayed outputs to 6 decimal places for everything except the degrees of freedom.

    These results should be identical to the earlier F-statistic and p-value you obtained. If you get a different result, you have done something wrong somewhere somehow.


    Question 8

    Does the model you created for the SkullsComplete.csv data violate the normality assumption?


    Question 9

    The data salary.csv shows the salary of different high-level job positions.

    Use polynomial contrasts to determine which types of trend are most appropriate to describe this data, i.e., specify which trend components (linear, quadratic, etc.) are significant.
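    Ordered factors in R get polynomial contrasts (contr.poly) automatically; a sketch with a simulated stand-in for salary.csv (the column names and levels here are hypothetical):

```r
# Hypothetical structure: ordered 'position', numeric 'salary'
set.seed(6)
lev <- c("Junior", "Senior", "Manager", "Director", "VP")
dat <- data.frame(
  position = factor(rep(lev, each = 10), levels = lev, ordered = TRUE),
  salary   = rnorm(50, mean = rep(c(50, 60, 75, 95, 120), each = 10), sd = 8)
)
# Ordered factors use contr.poly by default: .L (linear), .Q (quadratic), ...
fit <- lm(salary ~ position, data = dat)
summary(fit)$coefficients   # significant rows indicate significant trend components
```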


    Question 10

    Write out the simplest polynomial equation/model (with the obtained values) that BEST describes the trend seen in the salary.csv data.


    Question 11

    Plot the equation from the previous question as a smooth line on top of the observed values.

  • R COLAB 10

    Question 1

    Some archaeologists theorize that ancient Egyptians interbred with several different immigrant populations over thousands of years. To see if there is any indication of changes in body structure that might have resulted, they measured skulls of male Egyptians from 5 different epochs.

    Thomson and Randall-Maciver, Ancient Races of the Thebaid, Oxford: Oxford University Press, 1905.

    The data can be found in SkullsComplete.csv. The column mb measures the maximum breadth of the skull in millimetres.

    For the remaining questions, we will not be using the columns bh, bl or nh. Remove them from the dataframe and display the first 10 rows.


    Question 2

    Create a barplot of the mean maximal breadth measured for each epoch in the SkullsComplete.csv data. Give the plot errorbars with classic 95% confidence intervals.

    Order the epochs so that, left to right, they go from earliest ( ) to latest ( ). Note that years classified as B.C. count backwards. E.g., is more recent than .

    Adjust the y-axis scale so that it goes from 120 at the bottom to 140 at the top.

    Make the bars interesting colours (don't use ggplot's default colours).

    Display a data frame that shows the group means, as well as the lower and upper boundaries of the confidence intervals. Do not display any other statistics inside the data frame.


    Question 3

    Using the SkullsComplete.csv data, create an ordinary least-squares linear model predicting maximal breadth (mb) with coefficients that make the following comparisons:

    Report the model's formula using the obtained coefficients.

    e.g.

    b0 = 4000BC (baseline)
    b1 = 3300BC - 4000BC
    b2 = 1850BC - 4000BC
    b3 = 200BC - 4000BC
    b4 = 150AD - 4000BC

    ŷ = (value) + (value)x1 + (value)x2 + ... + (value)x4


    Question 4

    Is the omnibus F-test from the previous question's linear model statistically significant at α = 0.05? Report its value, degrees of freedom, and p-value.

    Ensure that the F-statistic and p-value are displayed to 6 decimal places.

    To get accurate results, you will need to extract the F-stat values from the summary output.


    Question 5

    What do the results you displayed in the previous question tell you?


    Question 6

    Based on the planned contrasts/comparisons you used to evaluate the SkullsComplete.csv data, between which epochs was there a significant difference?

    Ensure your table of coefficients is displayed.


    Question 7

    For the SkullsComplete.csv data, assume all the classic assumptions of an OLS model are satisfied. Calculate an omnibus F-test manually without using lm() or aov(). Report the following:

    Grand Mean
    Total Sum of Squares
    Model Sum of Squares
    Residual Sum of Squares
    Model Mean Squares
    Residual Mean Squares
    Multiple R
    F statistic
    Degrees of Freedom
    p-value

    Round displayed outputs to 6 decimal places for everything except the degrees of freedom.

    These results should be identical to the earlier F-statistic and p-value you obtained. If you get a different result, you have done something wrong somewhere somehow.


    Question 8

    Does the model you created for the SkullsComplete.csv data violate the normality assumption?


    Question 9

    The data salary.csv shows the salary of different high-level job positions.

    Use polynomial contrasts to determine which types of trend are most appropriate to describe this data, i.e., specify which trend components (linear, quadratic, etc.) are significant.


    Question 10

    Write out the simplest polynomial equation/model (with the obtained values) that BEST describes the trend seen in the salary.csv data.


    Question 11

    Plot the equation from the previous question as a smooth line on top of the observed values.

  • R Colab –

    I HAVE SOLVED THIS ASSIGNMENT, I JUST NEED SOMEONE TO CHECK IT OUT.

    Question 1

    The data contained in lake.csv will now be used to create an OLS model with two predictors. Recall that the outcome the research was interested in was TN.

    Remove outliers from the data. List the outliers (i.e., data rows) that have been removed for the predictors and use the outlier-free data for the remaining questions.
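    One common outlier rule is 1.5 × IQR beyond the quartiles, which is what boxplot.stats() flags; a sketch on a stand-in vector (the real task would apply this to each lake.csv predictor):

```r
set.seed(7)
x <- c(rnorm(50, mean = 10, sd = 2), 40)   # 40 planted as an obvious outlier
out_vals <- boxplot.stats(x)$out           # values beyond the whiskers
which(x %in% out_vals)                     # row indices to report
x_clean <- x[!(x %in% out_vals)]           # outlier-free data for later questions
```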


    Question 2

    Fit two linear regression models to predict TN with the new outlier-free data set.

    • One model should use NIN as a predictor.
    • The other model should use both NIN and TW as predictors.

    Report the formula (i.e., with calculated b values) for both models you create.

    Please ensure clarity when referencing predictors within the formulas (i.e., don't just write x; provide clear labels).


    Question 3

    As we learned in class, the summary() function can be used to extract various information about a linear model, such as whether the coefficients are significantly different than 0.

    Find a way to extract only the Std. Error statistic that the summary() function displays for the TW predictor.


    Question 4

    Does the model with only a single predictor have a slope significantly different than 0?

    • Write R code to extract the p-value for the slope.
    • Report the 95% confidence interval around the slope.
    • Use of the function confint() is prohibited.
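    The CI can be built by hand as b ± t_crit × SE, with both pieces pulled from summary(); built-in cars data stands in for the lake model:

```r
fit <- lm(dist ~ speed, data = cars)
cf  <- summary(fit)$coefficients
b   <- cf["speed", "Estimate"]
se  <- cf["speed", "Std. Error"]
p   <- cf["speed", "Pr(>|t|)"]                 # p-value for the slope
tc  <- qt(0.975, df = fit$df.residual)
c(lower = b - tc * se, upper = b + tc * se)    # matches confint(fit)
```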

    Question 5

    Does the model with two predictors contain slopes significantly different than 0?

    • Write R code to extract the p-value for each predictor's slope.
    • Report the 95% confidence interval around each predictor's slope.
    • Use of the function confint() is prohibited.

    Question 6

    Conduct a test to determine whether the additional predictor in the second model significantly improves the fit relative to the single-predictor model.

    • Does the second predictor significantly improve the model?
    • Report the test-statistic, degrees of freedom, and p-value.
    • Use of anova() is prohibited.
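    The nested-model F-change test can be written out directly from the two residual sums of squares; built-in mtcars stands in for the lake data:

```r
fit1 <- lm(mpg ~ wt, data = mtcars)          # reduced model
fit2 <- lm(mpg ~ wt + hp, data = mtcars)     # full model
rss1 <- sum(resid(fit1)^2)
rss2 <- sum(resid(fit2)^2)
df_extra <- fit1$df.residual - fit2$df.residual   # parameters added (here 1)
Fstat <- ((rss1 - rss2) / df_extra) / (rss2 / fit2$df.residual)
pval  <- pf(Fstat, df_extra, fit2$df.residual, lower.tail = FALSE)
c(F = Fstat, df1 = df_extra, df2 = fit2$df.residual, p = pval)
```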

    Question 7

    Assume that a lake has an average influent nitrogen concentration of 5.7 and a water retention time of 0.98.

    Use the preferred model, as determined from the previous question, to predict the annual nitrogen concentration of that lake.


    Question 8

    A psychologist studying perceived quality of life in a large number of cities came up with the following equation using mean temperature (F) and median income in $1,000 as predictors:

    ŷ = 5.37 - 0.01(Temp) + 0.05(Income)

    Interpret the regression equation in terms of the coefficients.
    (i.e., state what each predictor of the model means in plain English)


    Question 9

    Using the model from the previous question, assume a city has a mean temperature of 55 degrees and a median income of $12,000.

    What is its predicted Quality of Life score?
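    Assuming the equation's Temp coefficient is negative (ŷ = 5.37 - 0.01·Temp + 0.05·Income) and Income is entered in $1,000s (so 12), the prediction is direct arithmetic:

```r
5.37 - 0.01 * 55 + 0.05 * 12   # 5.37 - 0.55 + 0.60
```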


    Question 10

    You are a highfalutin marketing guru who wants to predict the sales of your brand using the data set DataDrivenMarketing.csv.

    However, there are a number of missing (NA) values in this data set.

    Using what you know about R, report how many NA values are in each of the dataset's columns.

    (Note that there are many different ways to achieve this.)
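    One of those ways is colSums over is.na(); a tiny stand-in frame replaces DataDrivenMarketing.csv here (its column names are hypothetical):

```r
df <- data.frame(TV = c(1, NA, 3), Radio = c(NA, NA, 6), Sales = 1:3)
colSums(is.na(df))                          # NA count per column
sapply(df, function(col) sum(is.na(col)))   # equivalent alternative
```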


    Question 11

    Recall that base-R has a function read.csv() that can be used to load a CSV file. The tidyverse has the function read_csv().

    Compare how both functions load the DataDrivenMarketing.csv data.

    There is a subtle difference; explain what it is.

    Tip: Look at the amount of NAs.


    Question 12

    Using the DataDrivenMarketing.csv data, create a dataframe that removes any row which has a NA value.

    Report how many rows this new data frame has.

    Use this cleaned up data set for all subsequent questions.
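    na.omit() does exactly this row-wise removal; sketched on a toy stand-in frame:

```r
df <- data.frame(TV = c(1, NA, 3), Sales = c(10, 20, NA))   # stand-in data
clean <- na.omit(df)   # drops every row containing at least one NA
nrow(clean)            # 1 row survives in this toy example
```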


    Question 13

    Plot a correlation matrix of the possible (quantitative) predictors you could include in your model that predicts sales.

    • Do not use default colours
    • Make the category labels black

    Question 14

    Because the predictors are somewhat correlated, you suspect that multicollinearity may be affecting the regression estimates.

    Investigate this possibility by comparing two models:

    Sales = b0 + b1(TV) + b2(Radio) + b3(Social Media)

    to a model with just TV and Social Media:

    Sales = b0 + b1(TV) + b2(Social Media)

    Use an appropriate diagnostic to assess the extent of collinearity among the predictors in each model.

    Summarize your findings and explain which model is more affected.
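    The usual diagnostic here is the variance inflation factor, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the others; a dependency-free sketch with mtcars standing in for the marketing data:

```r
vif_by_hand <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]   # predictor columns, intercept dropped
  sapply(seq_len(ncol(X)), function(j) {
    r2 <- summary(lm(X[, j] ~ X[, -j, drop = FALSE]))$r.squared
    1 / (1 - r2)                               # large values flag collinearity
  })
}
fit3 <- lm(mpg ~ wt + hp + disp, data = mtcars)   # three-predictor model
fit2 <- lm(mpg ~ wt + hp, data = mtcars)          # reduced model
round(vif_by_hand(fit3), 2)
round(vif_by_hand(fit2), 2)
```

The same numbers come out of car::vif() if that package is available.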


    Question 15

    Search for outliers in the $TV and $Social.Media columns.

    Report how many you find in each and remove them from the data set for subsequent questions.


    Question 16

    Using the data set with outliers removed, begin a hierarchical regression by creating a model with just TV as a predictor.

    Report the model's formula (with coefficients) and R statistic.


    Question 17

    Repeat the previous question, but include Social Media as a predictor.


    Question 18

    Conduct a test to evaluate whether social media significantly improves the fit of the model.

    • Use of anova() is prohibited
    • Report the F-statistic, degrees of freedom, p-value, and conclusion

    Question 19

    Build an ordinary least-squares regression model with both TV and Influencer as predictors of Sales.

    • Are each of its coefficients significantly different than 0?
    • What is the multiple R of this model?

    Question 20

    Conduct an F-test to evaluate whether Influencer significantly improves the fit of the model over one with just TV as a predictor.

    • Which is the preferred model?
    • Use of anova() is prohibited
    • Report the F-statistic, degrees of freedom, and p-value

    Question 21

    Using the preferred model from the previous question, create a plot of the residuals to evaluate homogeneity of variance.

    Is the assumption reasonable?


    Question 22

    Using the preferred model, evaluate whether the residuals are normally distributed.

    Is the assumption reasonable?
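    Both assumption checks (this question and the previous one) can be sketched with base-R diagnostics on a stand-in model:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)     # stand-in for the preferred model
plot(fitted(fit), resid(fit))               # homogeneity: look for a flat band
abline(h = 0, lty = 2)
qqnorm(resid(fit)); qqline(resid(fit))      # normality: points near the line
shapiro.test(resid(fit))                    # formal normality test
```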

  • REGRESSION ANALYSIS

    The table below displays catalog-spending data for the first few of 200 randomly selected individuals from a very large (over 20,000 households) database.¹ The variable of particular interest is catalog spending as measured by the Spending Ratio (SpendRat). All of the catalog variables are represented by indicator variables: either the consumer bought and the variable is coded as 1, or the consumer didn't buy and the variable is coded as 0. The other variables

  • R Question

    Apply linear regression models, including Multiple Linear Regression, Stepwise Variable Selection, and LASSO, to analyze a dataset of Major League Baseball player statistics and predict their salaries.

    View the three files to get started:

    1. Assignment and dataset
    2. The report template where you will paste your screenshots and written answers
    3. The starter R script to help you organize your code

    Deliverables: Two files

    1. Assignment Report (Word Doc or PDF): Please refer to the template.
    2. R script: Please also submit your R script along with the report.

    No advanced answers please.

  • RStudio Questions Help

    Data Mining / RStudio:

    Topic: Linear Regression, Shrinkage Regression

    • Shrinkage Regression (Ridge Regression and LASSO Regression).
    • Variable Selection using LASSO.
    • Obtain the optimal tuning parameter through Cross Validation.

    Aimed outcomes:

    • Explain the difference in objective function between Ridge Regression and LASSO Regression.
    • Understand why LASSO can be used for variable selection.
    • Conduct cross-validation to obtain the optimal tuning parameter.

    Required: Questions in the assignment prompt – PDF provided along with notes.

    Requirements: As required

  • Perform some exploration of the data and understand how thes…

    library(tidyverse)

    library(tidyr)

    library(dplyr)

    ###PRACTICE MERGING###

    #1. Read in each file from the folder NH-2020-M5

    #2. Perform some exploration of the data and understand how these datasets join

    #3. Join these datasets to create a master NIBRS dataset

    #4. Use this dataset to answer questions for your R Lab Quiz

    #5. Use this dataset to create a relational diagram in the lucidchart app (DATABASE ER DIAGRAM: CROWS FOOT)

    #6. Use this dataset to explore the central limit theorem

    ###########Perform your analysis and data manipulation below##########

    #1. NH-2020-M5 DATASET FILE READ IN

    ref_race <- read.csv("/cloud/project/NH-2020-M5/REF_RACE.csv")

    NIBRS_OFFENSE <- read.csv("/cloud/project/NH-2020-M5/NIBRS_OFFENSE.csv")

    NIBRS_OFFENSE_TYPE <- read.csv("/cloud/project/NH-2020-M5/NIBRS_OFFENSE_TYPE.csv")

    NIBRS_OFFENDER <- read.csv("/cloud/project/NH-2020-M5/NIBRS_OFFENDER.csv")

    NIBRS_LOCATION_TYPE <- read.csv("/cloud/project/NH-2020-M5/NIBRS_LOCATION_TYPE.csv")

    NIBRS_incident <- read.csv("/cloud/project/NH-2020-M5/NIBRS_incident.csv")

    agencies <- read.csv("/cloud/project/NH-2020-M5/agencies.csv")
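    As a toy illustration of step 3 (how these tables join on shared keys), with hypothetical key and column names, check the real ones with names():

```r
# Toy frames mimicking the incident/offense structure (column names assumed)
incidents <- data.frame(incident_id = 1:3, agency = c("A", "A", "B"))
offenses  <- data.frame(incident_id = c(1, 1, 3),
                        offense     = c("theft", "fraud", "assault"))
merge(incidents, offenses, by = "incident_id")   # base-R inner join
# dplyr::inner_join(incidents, offenses, by = "incident_id") is the tidyverse form
```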

    Requirements: concise (simplified)