COSC 3337 – Data Science I
Clustering
The goal of this assignment is to:
- Learn to use popular clustering algorithms, K-means and DBSCAN.
- Learn how to interpret and summarize results of clustering.
- Learn to write evaluation functions to better understand clustering results.
- Learn to use cross-validation techniques to assess model performance.
Dataset – Patient Clinical Records
You are given a dataset of patient clinical records. The dataset contains 300 records, each with 13 attributes. The attributes are as follows:
Age: Age of the patient
Anaemia: Whether the patient has anaemia (decrease in hemoglobin) (0 = no, 1 = yes)
Creatinine Phosphokinase: Level of the CPK enzyme in the blood (mcg/L)
Diabetes: Whether the patient has diabetes (0 = no, 1 = yes)
Ejection Fraction: Percentage of blood leaving the heart at each contraction (percentage)
High Blood Pressure: Whether the patient has hypertension (0 = no, 1 = yes)
Platelets: Platelets in the blood (kiloplatelets/mL)
Serum Creatinine: Level of serum creatinine in the blood (mg/dL)
Serum Sodium: Level of serum sodium in the blood (mEq/L)
Sex: Whether the patient is male or female (0 = female, 1 = male)
Smoking: Whether the patient smokes or not (0 = no, 1 = yes)
Time: Follow-up period (days)
Death Event: Whether the patient died during the follow-up period (0 = no, 1 = yes)
Assignment Tasks
The last attribute, Death Event, is the class label. The goal of this assignment is to cluster the patients into two groups: those who died during the follow-up period and those who did not. Ignore this attribute when clustering; the class label is used only in the post-analysis of the clusters generated by running K-means and DBSCAN.
Ignore the Time attribute as well when clustering, so the clustering uses the remaining 11 attributes.
Task 1 [10 points]
Write a function purity(y_true, y_pred) that computes the purity of a clustering result based on the class labels of the data points. The function takes two arguments: a list of class labels, y_true, and a list of cluster labels, y_pred. It returns a single number, the purity of the clustering result. The purity is defined as the number of data points that were assigned to the correct class label divided by the total number of data points, where the correct class label for a cluster is its most frequent class. For example, if there are 1000 data points and 800 of them were assigned to the correct class label, then the purity is 0.8. The purity is a number between 0 and 1, with 1 being the best possible purity score. Use the starting code given in the provided clustering.py. You should only need to write 1 to 2 more lines of code.
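The definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not the starting code from clustering.py (which is not reproduced here), and it assumes the usual reading of purity: each cluster is labeled with its most frequent class.

```python
from collections import Counter


def purity(y_true, y_pred):
    """Fraction of points whose class matches the majority class of their cluster."""
    # Group the true class labels by the cluster each point was assigned to.
    clusters = {}
    for true_label, cluster_label in zip(y_true, y_pred):
        clusters.setdefault(cluster_label, []).append(true_label)
    # In each cluster, count the points that belong to its most common class.
    correct = sum(Counter(labels).most_common(1)[0][1] for labels in clusters.values())
    return correct / len(y_true)


# Toy example: cluster 0 holds classes [0, 0, 1], cluster 1 holds [1, 1].
print(purity([0, 0, 1, 1, 1], [0, 0, 0, 1, 1]))  # 4 of 5 points match -> 0.8
```

Note that purity does not require the cluster numbers to equal the class numbers; only the grouping matters.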
Task 2 [10 points]
Run K-means on the dataset with k=2. Use the default parameters for the algorithm. Compute the overall purity of the clustering result and the purity of each of the two clusters. Which cluster has the higher purity? What percentage of the data points were assigned to this cluster? What percentage of the data points were assigned to the other cluster?
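The per-cluster questions in this task can be answered with a small helper like the one below. It is a sketch under assumptions: the name cluster_summary is hypothetical, the toy labels are made up for illustration, and a cluster's purity is taken to be the share of its points that belong to its most frequent class.

```python
from collections import Counter, defaultdict


def cluster_summary(y_true, y_pred):
    """Return {cluster: (purity, share_of_points)} for every cluster label."""
    # Collect the true class labels of the points in each cluster.
    members = defaultdict(list)
    for true_label, cluster_label in zip(y_true, y_pred):
        members[cluster_label].append(true_label)
    total = len(y_true)
    summary = {}
    for cluster_label, labels in members.items():
        majority_count = Counter(labels).most_common(1)[0][1]
        # Purity within the cluster, and the cluster's share of all points.
        summary[cluster_label] = (majority_count / len(labels), len(labels) / total)
    return summary


# Toy labels (illustrative only): cluster 0 = [0, 0, 0, 0, 1], cluster 1 = [1, 1, 1].
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1]
for cluster, (cluster_purity, share) in sorted(cluster_summary(y_true, y_pred).items()):
    print(f"cluster {cluster}: purity={cluster_purity:.2f}, share={share:.1%}")
```

If scikit-learn is used, the cluster labels passed as y_pred could come from KMeans(n_clusters=2).fit_predict(X), where X holds the attributes used for clustering.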
Task 3 [10 points]
Run K-means on the dataset with k=3 and k=5. Compute the overall purity of the clustering and the purity of each cluster for each value of k. Which value of k gives the best clustering result? Explain why.
Task 4 [10 points]
Run DBSCAN on the dataset with minPts=5 and eps=0.5. Compute the overall purity of the clustering result and the purity of each resulting cluster (DBSCAN determines the number of clusters itself, so there may not be exactly two). Which cluster has the highest purity? What percentage of the data points were assigned to this cluster? What percentage of the data points were assigned to the other cluster(s)?
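Unlike K-means, DBSCAN can leave points unassigned. The sketch below summarizes a DBSCAN labeling, assuming the scikit-learn convention in which noise points receive the label -1 (there, minPts corresponds to the min_samples parameter). The helper name dbscan_summary and the toy labels are hypothetical.

```python
def dbscan_summary(labels, noise_label=-1):
    """Return (number of clusters, fraction of points labeled as noise)."""
    labels = list(labels)
    # Every distinct label other than the noise label is a cluster.
    n_clusters = len(set(labels) - {noise_label})
    noise_fraction = labels.count(noise_label) / len(labels)
    return n_clusters, noise_fraction


# Toy labeling (illustrative only): three clusters and two noise points out of ten.
labels = [0, 0, 1, -1, 1, 0, -1, 2, 2, 2]
print(dbscan_summary(labels))  # (3, 0.2)
```

How noise points enter the purity calculation is a choice worth stating in the report, for example excluding them or treating them as their own group.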
Task 5 [10 points]
Develop a search procedure to find the best parameters for DBSCAN. The parameters to search over are minPts and eps. The procedure should maximize the purity of the clustering result, subject to the following constraints:
- There should be between 2 and 18 clusters.
- The percentage of outliers should be less than 10%.
Which parameters give the best clustering result? What is the purity of the clustering result? What is the purity of the clustering result for each of the clusters? Which cluster has the highest purity?
Hint: Consider if you want to do a grid search or search random combinations of minPts and eps.
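One way to structure the search is an exhaustive grid over candidate values, skipping any combination that violates the constraints. The sketch below is one possible layout, not a required design. It makes several assumptions: noise points carry the label -1 (the scikit-learn convention), purity is computed over clustered points only, and cluster_fn is a hypothetical callable that takes (X, eps, min_pts) and returns one label per point.

```python
from collections import Counter
from itertools import product

NOISE = -1  # assumption: noise points carry the label -1, as in scikit-learn


def purity_without_noise(y_true, labels):
    """Purity over clustered points only (assumption: noise points are excluded)."""
    clusters = {}
    for true_label, cluster_label in zip(y_true, labels):
        if cluster_label != NOISE:
            clusters.setdefault(cluster_label, []).append(true_label)
    clustered = sum(len(members) for members in clusters.values())
    correct = sum(Counter(m).most_common(1)[0][1] for m in clusters.values())
    return correct / clustered if clustered else 0.0


def grid_search(X, y_true, eps_values, min_pts_values, cluster_fn):
    """Try every (eps, minPts) pair; return (purity, eps, minPts) of the best valid one, or None."""
    best = None
    for eps, min_pts in product(eps_values, min_pts_values):
        labels = list(cluster_fn(X, eps, min_pts))
        n_clusters = len(set(labels) - {NOISE})
        noise_fraction = labels.count(NOISE) / len(labels)
        # Constraints from the task: 2 to 18 clusters, fewer than 10% outliers.
        if not (2 <= n_clusters <= 18) or noise_fraction >= 0.10:
            continue
        score = purity_without_noise(y_true, labels)
        if best is None or score > best[0]:
            best = (score, eps, min_pts)
    return best
```

With scikit-learn, cluster_fn could be lambda X, eps, m: DBSCAN(eps=eps, min_samples=m).fit_predict(X). A random search would replace the product(...) loop with randomly sampled (eps, minPts) pairs and keep the same constraint check.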
Deliverables
- A report that contains the results of your analysis for each of the tasks. The report should be a PDF file or a Jupyter notebook.
- The code for all of the tasks within the provided shell file clustering.py or a Jupyter notebook. If submitting a Jupyter notebook, you can use the same notebook for both code and analysis. Submit all the code in a single file for this assignment.
- The code should be well documented and easy to follow. The code should generate all the plots required for the report.
Submission Instructions
Please submit on Canvas. Please ensure that your GitHub username, your full name, and PSID are filled in at the start of the file.
Data:
Code stub: