Rewrite to Lower Similarity
Fix the References
Clean Up the Code
Proofread for Typos and Clarify Sampling
The code is 850 words long and the report is 1,375 words.
The changes are critical. The files will be uploaded once the task is assigned.
Task
Download the Assignment 1.ipynb file, along with MLData2026.csv from the Assignment
1 folder on Canvas. To help you begin your assignment, the Assignment 1 Google Colab
file contains some starter code to:
1. Mount your drive
2. Import the relevant packages. You may need to add more packages as your
assignment progresses.
3. Upload the MLData2026.csv to your Google Colab storage folder, then read and
convert the data into a DataFrame. Consider this as the Master data set.
4. Randomly select a sub-sample of 600 observations from the Master data set.
Make sure to use your Student ID to set the random seed. This means that every
student should have their own unique sub-sample, i.e. mydata, to work on.
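
The sampling step can be sketched as follows. This is a minimal illustration, not the starter code itself: it assumes pandas is used, that the Student ID is passed as the seed through `random_state`, and that sampling is done without replacement. `STUDENT_ID`, `draw_subsample` and the toy `master` frame are hypothetical names introduced for the example.

```python
import pandas as pd

# Hypothetical placeholder: replace with your own Student ID.
STUDENT_ID = 12345678


def draw_subsample(master: pd.DataFrame, seed: int, n: int = 600) -> pd.DataFrame:
    """Draw n rows without replacement; the same seed always gives the same rows."""
    return master.sample(n=n, random_state=seed)


# Toy stand-in for the Master data set. In the notebook this would instead be
# read from the uploaded MLData2026.csv file.
master = pd.DataFrame({"id": range(1000), "value": [i % 7 for i in range(1000)]})

mydata = draw_subsample(master, seed=STUDENT_ID)
print(mydata.shape)
```

Because the seed is fixed, rerunning the notebook reproduces the same `mydata`, which is what allows the reported results to be confirmed from the code.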
You are required to perform basic descriptive analysis on the relevant features in mydata
in Python on Google Colab and report your findings.
Exploratory Data Analysis and Data Cleaning
(i) For each categorical variable, determine the frequency N and percentage (%)
of instances in each category and summarise the results in a table as follows.
You do not need to recreate the table in Python; your code only needs to
generate the statistics required to populate it. You may export or copy the
values to Microsoft Excel and format the table as shown below. State
all percentages to 1 decimal place.
ECU Internal Information
Categorical Feature   Category      N (%)
Feature 1             Category 1    10 (10.0%)
                      Category 2    30 (30.0%)
                      Category 3    50 (50.0%)
                      Missing       10 (10.0%)
Feature 2             Yes           75 (75.0%)
                      No            25 (25.0%)
                      Missing       0 (0.0%)
Feature k             Category 1    25 (25.0%)
                      Category 2    25 (25.0%)
                      Category 3    15 (15.0%)
                      Category 4    30 (30.0%)
                      Missing       5 (5.0%)
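
One possible way to generate the counts and percentages for part (i) is sketched below. It assumes pandas, that missing values are stored as NaN, and that percentages are taken over all rows of the sub-sample. The function name `categorical_summary` and the toy data are hypothetical and only illustrate the N (%) layout of the template.

```python
import numpy as np
import pandas as pd


def categorical_summary(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return one row per (feature, category) with an 'N (%)' string."""
    rows = []
    total = len(df)
    for col in columns:
        # dropna=False keeps NaN as its own category so missing values are counted.
        counts = df[col].value_counts(dropna=False)
        for category, n in counts.items():
            label = "Missing" if pd.isna(category) else str(category)
            rows.append({"Feature": col, "Category": label,
                         "N (%)": f"{n} ({100 * n / total:.1f}%)"})
        # The template lists a Missing row even when nothing is missing.
        if not df[col].isna().any():
            rows.append({"Feature": col, "Category": "Missing",
                         "N (%)": "0 (0.0%)"})
    return pd.DataFrame(rows)


# Toy data standing in for mydata.
toy = pd.DataFrame({
    "Feature 1": ["A", "B", "B", np.nan],
    "Feature 2": ["Yes", "No", "Yes", "Yes"],
})

summary = categorical_summary(toy, ["Feature 1", "Feature 2"])
print(summary.to_string(index=False))
```

The resulting frame can be copied or exported to Excel for formatting. Listing every category this way also makes unexpected labels visible, which helps with part (iii).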
(ii) Summarise each of your numeric variables in a table as follows. State all decimal
values to 1 decimal place.
Continuous Feature   N (%) missing   Min   Max   Mean   Median   Skewness
Feature 1
Feature 2
.
.
.
Feature k
N (%) missing = Number and percentage of missing values
Note: The tables for parts (i) and (ii) should be based on the original sub-sample
of 600 uncleaned observations.
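
The statistics for part (ii) could be produced along these lines. This is a sketch under stated assumptions: pandas is used, missing values are NaN, and skewness is taken from `Series.skew()`, which returns the bias-corrected sample skewness (other definitions give slightly different values). `numeric_summary` and the toy column are hypothetical.

```python
import numpy as np
import pandas as pd


def numeric_summary(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Summarise numeric columns; pandas skips NaN when computing each statistic."""
    rows = []
    total = len(df)
    for col in columns:
        s = df[col]
        n_missing = int(s.isna().sum())
        rows.append({
            "Feature": col,
            "N (%) missing": f"{n_missing} ({100 * n_missing / total:.1f}%)",
            "Min": s.min(),
            "Max": s.max(),
            "Mean": s.mean(),
            "Median": s.median(),
            "Skewness": s.skew(),
        })
    # round(1) affects only the numeric columns; the N (%) strings are left as-is.
    return pd.DataFrame(rows).round(1)


# Toy data standing in for mydata.
toy = pd.DataFrame({"Feature 1": [1.0, 2.0, 3.0, 4.0, np.nan]})

stats = numeric_summary(toy, ["Feature 1"])
print(stats.to_string(index=False))
```

Running this on the uncleaned sub-sample, before any rows are removed or altered, keeps the table consistent with the note above.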
(iii) Examine the values in the tables in parts (i) and (ii). Are there any invalid
categories/values for the categorical variables? If so, how will you deal with them
and why? Is there any evidence of outliers for any of the numeric variables? If so,
how many are there, what percentage of observations do they represent, and how will
you deal with them? Justify your decision in the treatment of outliers (if any).
Note: You may use plots/graphs to further support your observations/decisions.
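
To count outliers and express them as a percentage, one common convention is the 1.5 × IQR rule, sketched below. The brief does not prescribe a method, so the rule, the multiplier `k` and the helper name `iqr_outliers` are assumptions made for illustration; the approach taught in the unit may differ.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR below Q1 or above Q3."""
    values = series.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons with NaN are False, so missing values are never flagged.
    return (series < lower) | (series > upper)


# Toy data standing in for one numeric column of mydata.
toy = pd.Series([10, 12, 11, 13, 12, 11, 10, 500], name="Feature 1")

flags = iqr_outliers(toy)
n_out = int(flags.sum())
pct_out = 100 * n_out / len(toy)
print(f"{toy.name}: {n_out} outlier(s), {pct_out:.1f}% of observations")
```

A flagged value is only a candidate. Whether it is a data-entry error or a genuine extreme still needs to be judged against the Min and Max in the summary table and, where helpful, a boxplot or histogram.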
What to Submit
1. A single report (standard margins, minimum font size 11, not exceeding 4 pages,
excluding the cover page, contents page and reference page, if any) containing:
a. Two summary tables of all the features in the dataset
b. A list of data issues (if any) with appropriate actions
2. A copy of your Python code as a Google Colab notebook AND in PDF format.
The report must be submitted through Turnitin and checked for originality. The Google
Colab file is to be submitted via a separate Canvas submission link.
Note that no marks will be given if the results you have provided cannot be confirmed by
your code. Any use of generative AI must be acknowledged and used responsibly and
ethically.
Marking Criteria
Criterion: contribution to assignment mark
1. Correct implementation of descriptive analysis in Python (Google Colab): 5%
a. Working code
b. Good documentation/commentary
c. External sources referenced in APA 7 referencing style (if applicable)
d. Acknowledgement of use of Gen AI (if applicable)
Note: At least 80% of the code must align with unit content.
Otherwise, a mark of zero will be awarded for this component.
2. Tabulation of descriptive statistics: 3%
a. Properly formatted tables (NO direct screenshots from the output in Google Colab)
b. Features are correctly placed in the appropriate table
c. Tables are populated with the correct statistics
d. Tables are appropriately captioned and referenced in-text
e. Relevant decimal values are rounded to the correct number of decimal places
3. Correct explanation and justification in the identification and treatment of
missing and/or invalid observations in the data: 7%
a. Justifications should be initially based on the values in the tables. You may
use plots/graphs to further support your observations and/or decisions.
Screenshots of graphs are acceptable.
b. Provide appropriate actions to treat problematic values
c. Spelling and grammatical errors should be kept to a minimum.
Relevant sources referenced in APA 7 referencing style (