Python Question

Rewrite to Lower Similarity
Fix the References
Clean Up the Code
Proofread for Typos and Clarify Sampling

The code is 850 words long and the report is 1375 words.
The changes are very critical; I will upload the files after the task is assigned.

Task

Download the Assignment 1.ipynb file, along with MLData2026.csv, from the Assignment 1 folder on Canvas. To help you begin your assignment, the Assignment 1 Google Colab file contains some starter code to:

1. Mount your drive.

2. Import the relevant packages. You may need to add more packages as your assignment progresses.

3. Upload MLData2026.csv to your Google Colab storage folder, then read and convert the data into a DataFrame. Consider this the Master data set.

4. Randomly select 600 sub-samples from the Master data set. Make sure to use your Student ID to set the random seed. This means that every student should have their own unique set of sub-samples, i.e. mydata, to work on.
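
Steps 3 and 4 can be sketched roughly as follows. This is a minimal illustration and not the starter code itself: the helper name draw_subsample, the placeholder Student ID 10012345 and the Drive path in the comment are all assumptions, and a synthetic frame stands in for the Master data set so the block runs without the CSV.

```python
import numpy as np
import pandas as pd


def draw_subsample(master: pd.DataFrame, student_id: int, n: int = 600) -> pd.DataFrame:
    """Return n rows drawn without replacement, seeded by the Student ID."""
    # random_state makes the draw reproducible: same ID, same rows.
    return master.sample(n=n, random_state=student_id)


# In Colab the Master data set would be read from the uploaded file, e.g.
# master = pd.read_csv("/content/drive/MyDrive/MLData2026.csv")  # assumed path
# A synthetic stand-in is used here so the sketch is self-contained.
master = pd.DataFrame({"row_id": np.arange(5000)})

STUDENT_ID = 10012345  # hypothetical; replace with your own Student ID
mydata = draw_subsample(master, STUDENT_ID)
print(mydata.shape)
```

Because the seed is fixed, rerunning the notebook reproduces the same mydata, which matters given that results must be confirmable from the code.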

You are required to perform basic descriptive analysis on the relevant features in mydata in Python on Google Colab and report your findings.

Exploratory Data Analysis and Data Cleaning

(i) For each categorical variable, determine the frequency N and percentage (%) of instances in each category and summarise the results in a table as follows. You do not need to recreate the table in Python; your code only needs to generate the statistics required to populate it. You may export or copy the values to Microsoft Excel and format the table as shown below. State all percentages to 1 decimal place.

ECU Internal Information

Categorical Feature    Category      N (%)
Feature 1              Category 1    10 (10.0%)
                       Category 2    30 (30.0%)
                       Category 3    50 (50.0%)
                       Missing       10 (10.0%)
Feature 2              Yes           75 (75.0%)
                       No            25 (25.0%)
                       Missing       0 (0.0%)
Feature k              Category 1    25 (25.0%)
                       Category 2    25 (25.0%)
                       Category 3    15 (15.0%)
                       Category 4    30 (30.0%)
                       Missing       5 (5.0%)
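
One possible way to generate the N (%) values for a single categorical feature is sketched below. The helper name category_summary and the demo values are assumptions for illustration only; on the real data the function would be applied to each categorical column of mydata in turn.

```python
import numpy as np
import pandas as pd


def category_summary(series: pd.Series) -> pd.DataFrame:
    """Frequency and percentage per category, with a 'Missing' row."""
    # value_counts() skips NaN, so missing values are counted separately.
    counts = series.value_counts().to_dict()
    counts["Missing"] = int(series.isna().sum())

    out = pd.DataFrame({"N": pd.Series(counts)})
    # Percentages are relative to all rows, including missing ones.
    out["%"] = (out["N"] / len(series) * 100).round(1)
    # Combined "N (%)" string in the layout of the report table.
    out["N (%)"] = out["N"].astype(str) + " (" + out["%"].map("{:.1f}".format) + "%)"
    return out


# Hypothetical feature with one missing value.
demo = pd.Series(["Yes", "No", "Yes", np.nan, "Yes"])
table = category_summary(demo)
print(table)
```

Keeping the Missing row even when its count is zero mirrors the example table, where Feature 2 reports 0 (0.0%).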

(ii) Summarise each of your numeric variables in a table as follows. State all decimal values to 1 decimal place.

Continuous Feature    N (%) missing    Min    Max    Mean    Median    Skewness
Feature 1
Feature 2
...
Feature k

N (%) missing = Number and percentage of missing values

Note: The tables for parts (i) and (ii) should be based on the original sub-sample of 600 uncleaned observations.
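
The statistics for the numeric table could be gathered along these lines. The helper name numeric_summary and the demo columns are hypothetical. Note that pandas' skew() returns a bias-adjusted sample skewness, which is an assumption about the intended definition; the unit content may specify a different one.

```python
import numpy as np
import pandas as pd


def numeric_summary(df: pd.DataFrame) -> pd.DataFrame:
    """One row per numeric column with the statistics the table asks for."""
    num = df.select_dtypes(include="number")
    n_missing = num.isna().sum()
    summary = pd.DataFrame({
        "N missing": n_missing,
        "% missing": n_missing / len(df) * 100,
        "Min": num.min(),
        "Max": num.max(),
        "Mean": num.mean(),
        "Median": num.median(),
        "Skewness": num.skew(),  # bias-adjusted estimator used by pandas
    })
    # The brief asks for decimal values to 1 decimal place.
    return summary.round(1)


# Hypothetical uncleaned data: one skewed column, one with a missing value.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 100],
    "b": [1.0, np.nan, 3.0, 5.0, 7.0],
    "label": ["x", "y", "x", "y", "x"],
})
summary = numeric_summary(demo)
print(summary)
```

Running this on the uncleaned mydata, before any rows are dropped or imputed, keeps the table consistent with the note above.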

(iii) Examine the values in the tables in parts (i) and (ii). Are there any invalid categories/values for the categorical variables? If so, how will you deal with them and why? Is there any evidence of outliers for any of the numeric variables? If so, how many are there, what percentage do they represent, and how will you deal with them? Justify your decision in the treatment of outliers (if any).

Note: You may use plots/graphs to further support your observations/decisions.
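
The brief does not prescribe an outlier rule, so the sketch below uses the 1.5 × IQR fence purely as one common convention. That choice, the helper name iqr_outliers and the demo values are assumptions, and the method taught in the unit should take precedence.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    # Comparisons against NaN are False, so missing values are never flagged.
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical numeric feature with one extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 95, 10, 12, 13])
mask = iqr_outliers(values)

n_out = int(mask.sum())
pct_out = round(n_out / values.notna().sum() * 100, 1)
print(f"{n_out} outlier(s), {pct_out}% of non-missing values")
```

The count and percentage answer the "how many and what percentage" part of the question; a box plot of the same feature would be a natural supporting graph.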


What to Submit

1. A single report (standard margins, minimum font size 11, not exceeding 4 pages, excluding the cover page, contents page and reference page, if any) containing:

a. Two summary tables of all the features in the dataset

b. A list of data issues (if any) with appropriate actions

2. A copy of your Python code as a Google Colab notebook AND in PDF format.

The report must be submitted through TURNITIN and checked for originality. The Google Colab file is to be submitted via a separate Canvas submission link.

Note that no marks will be given if the results you have provided cannot be confirmed by your code. Any use of generative AI must be acknowledged, and it must be used responsibly and ethically.

Marking Criteria

Criterion (contribution to assignment mark)

Correct implementation of descriptive analysis in Python (Google Colab): 5%

  • Working code
  • Good documentation/commentary
  • External sources referenced in APA 7 referencing style (if applicable)
  • Acknowledgement of use of Gen AI (if applicable)

Note: At least 80% of the code must be aligned with unit content. Otherwise, a mark of zero will be awarded for this component.

Tabulation of descriptive statistics: 3%

  • Properly formatted tables (NO direct screenshots from the output in Google Colab)
  • Features are correctly placed in the appropriate table
  • Tables are populated with the correct statistics
  • Tables are appropriately captioned and referenced in-text
  • Relevant decimal values are rounded to the correct number of decimal places

Correct explanation and justification in the identification and treatment of missing and/or invalid observations in the data: 7%

  • Justifications should be initially based on the values in the tables. You may use plots/graphs to further support your observations and/or decisions. Screenshots of graphs are acceptable.
  • Provide appropriate actions to treat problematic values
  • Spelling and grammatical errors should be kept to a minimum.
  • Relevant sources referenced in APA 7 referencing style (