Big Data analytics work

Important note: I need original work, not work copied from others. Please follow the instructions below step by step. Several deliverables are required, so do not skip any step.

  1. Create a sample document representing a hotel reservation record in a hotel reservation system. This document should be stored in a NoSQL database such as MongoDB. It should include the following fields: reservationID, guestName, guestContact (containing address, phoneNumber, and email), roomDetails (containing roomNumber, roomType, and pricePerNight), checkInDate (YYYY-MM-DD format), checkOutDate (YYYY-MM-DD format), numberOfGuests, servicesRequested (an array such as breakfast, airport pickup, spa, etc.), paymentDetails (containing paymentMethod, totalAmount, and paymentStatus), feedbackRatings (an array containing rating and comment), loyaltyProgram (containing membershipID and membershipLevel), stayHistory (an array containing the previous hotel branch visited and visitDate, such as Dubai Marina Branch, Abu Dhabi Downtown Branch, etc.), and reservationStatus (confirmed, cancelled, or completed).
  2. Write MongoDB commands to:

a. Create a database named hotel_management_system.

b. Create a collection named reservations.

c. Insert the reservation document you created into the reservations collection.

d. Insert 9 other reservation documents into the reservations collection. (When creating the reservation documents, ensure that the data values are chosen carefully so that all the queries in Question 2 can be demonstrated and tested successfully.)

e. Retrieve the information of all reservations.

f. Retrieve all reservations where the room type is “Suite” and the paymentStatus is “Completed”.

g. Retrieve all reservations where the servicesRequested array contains “spa”.

h. Retrieve all guests who have given a rating greater than 4 in feedbackRatings.

i. Update the membershipLevel of a guest with a specific reservationID to “Gold”.

j. Use an aggregation pipeline to calculate the total revenue generated by each roomType.

k. Use an aggregation pipeline to find the average feedback rating for each roomType.

l. Retrieve reservations where the stayHistory array contains visits to “Dubai Marina Branch”.

m. Use aggregation to find the number of reservations made for each membershipLevel.

n. Delete all reservations where the reservationStatus is “cancelled” and checkOutDate is before 2023-01-01.

o. Using an aggregation pipeline find the total number of services requested for each service type (e.g., breakfast, spa, airport pickup) across all reservations.
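As a rough illustration of Tasks 1 and 2, the sketch below writes one reservation as a Python dictionary and two aggregation pipelines as plain Python lists. Every value in the document is invented, and the `branch` key inside stayHistory is an assumed name, since the brief does not name that field. The helper `revenue_per_room_type` is hypothetical and only mirrors the `$group` stage in plain Python so the expected result can be checked without a running server.

```python
import json
from collections import defaultdict

# Hypothetical sample reservation; all values are invented for illustration.
reservation = {
    "reservationID": "RES1001",
    "guestName": "Sample Guest",
    "guestContact": {
        "address": "123 Example Street, Dubai",
        "phoneNumber": "+971-50-000-0000",
        "email": "guest@example.com",
    },
    "roomDetails": {"roomNumber": 101, "roomType": "Suite", "pricePerNight": 500},
    "checkInDate": "2024-03-10",
    "checkOutDate": "2024-03-13",
    "numberOfGuests": 2,
    "servicesRequested": ["breakfast", "spa"],
    "paymentDetails": {
        "paymentMethod": "Credit Card",
        "totalAmount": 1500,
        "paymentStatus": "Completed",
    },
    "feedbackRatings": [{"rating": 5, "comment": "Excellent stay"}],
    "loyaltyProgram": {"membershipID": "M001", "membershipLevel": "Silver"},
    # "branch" is an assumed field name; the brief only describes its content.
    "stayHistory": [{"branch": "Dubai Marina Branch", "visitDate": "2023-06-01"}],
    "reservationStatus": "confirmed",
}

# Pipeline for item j: total revenue per roomType.
revenue_by_room_type = [
    {"$group": {
        "_id": "$roomDetails.roomType",
        "totalRevenue": {"$sum": "$paymentDetails.totalAmount"},
    }}
]

# Pipeline for item o: count of each requested service across reservations.
services_count = [
    {"$unwind": "$servicesRequested"},
    {"$group": {"_id": "$servicesRequested", "count": {"$sum": 1}}},
]

def revenue_per_room_type(docs):
    """Plain-Python mirror of the $group stage above, for checking results."""
    totals = defaultdict(int)
    for doc in docs:
        totals[doc["roomDetails"]["roomType"]] += doc["paymentDetails"]["totalAmount"]
    return dict(totals)

# The document is JSON-serialisable, so it can be saved for the secondary submission.
print(json.dumps(reservation, indent=2))
print(revenue_per_room_type([reservation]))
```

With a driver such as pymongo, the same lists could be passed to the `aggregate` method of the reservations collection. If dates are stored as YYYY-MM-DD strings, as assumed here, string comparison sorts them chronologically, so a `$lt` filter against "2023-01-01" in item n would behave as intended.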

  3. Using the pandas library in Python, write code to:

a. Load the heart.csv dataset from ().

b. Explore the dataset using head(), info(), and describe().

c. Identify all the features that have missing values. (Print the missing data before handling).

d. Handle missing values in at least one column. (Print the missing data after handling)

e. Create an age_group feature based on existing features in the dataset.

f. Visualize the distribution of numerical features using histograms or box plots. Explain your interpretation of the data distribution from the histogram/box plot. Are there any outliers? If so, what might be the reason for their presence?

g. Visualize the relationship between two numerical variables (e.g. age and chol) using a scatter plot. What type of relationship and correlation do you observe between the variables?

h. Create violin plots for cp (chest pain type) vs thalach (max heart rate achieved). Determine the shape of the distributions and compare them across chest pain types.

i. Create a heatmap showing correlations between numerical features. Identify the strongest positive, the strongest negative, and the weak correlations. What can you conclude about the relationships between these features?

j. Select appropriate numerical features to compare multiple variables using a pair plot. Visualize the relationships between these features and analyze the patterns. Summarize key insights gained from the pair plot regarding the dataset.
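Items c to e of this task could look roughly like the sketch below. It uses a tiny hand-made DataFrame rather than heart.csv, so the values are invented; only the column names age, chol, and thalach come from the task. Median imputation and the age bin edges are assumptions chosen for illustration, not requirements of the brief.

```python
import numpy as np
import pandas as pd

# Toy rows with invented values; the real data would come from heart.csv.
df = pd.DataFrame({
    "age": [29, 45, 54, 63, 71],
    "chol": [204.0, np.nan, 239.0, 233.0, np.nan],
    "thalach": [202, 150, 160, 150, 125],
})

# Item c: count missing values per column before handling them.
missing_before = df.isnull().sum()
print("Missing before:\n", missing_before)

# Item d: fill missing chol values with the column median (one possible choice).
df["chol"] = df["chol"].fillna(df["chol"].median())
missing_after = df.isnull().sum()
print("Missing after:\n", missing_after)

# Item e: derive age_group from age; bin edges and labels are assumptions.
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 40, 60, 120],
    labels=["young", "middle", "senior"],
)
print(df)
```

The median is less sensitive to extreme values than the mean, which is one reason it is a common choice for skewed clinical measurements. Whether it suits a given column is a judgment the report should justify.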

  4. Explore the UCI dataset on student performance (student-mat.csv) and examine the features in the dataset. The dataset can be obtained from the following link. (). Analyze how studytime and absences impact the final grade G3.

a. Based on your understanding of regression, which variable would be the predictor (independent variable) and which would be the response (dependent variable)? Why?

b. Create a scatter plot with studytime on the x-axis and G3 on the y-axis. Describe the direction, form, and strength of the relationship you observe. Evaluate whether a simple linear regression model is appropriate for this data. Justify your reasoning.

c. Discuss how failures and age might influence G3. Should these features be included in a multiple regression model? Justify your reasoning.

  5. Load the California Housing dataset from sklearn. Explore the features in the dataset and analyze how MedInc, HouseAge, and AveRooms influence Median House Value (target).

a. Based on your understanding of regression, which variable would be the predictor (independent variable) and which would be the response (dependent variable)? Why?

b. Create a scatter plot with MedInc (Median Income) on the x-axis and Median House Value on the y-axis. Analyze the direction, form and strength of the relationship. Assess whether a simple linear regression model would be a suitable fit for this data. If suitable, analyze its fit and evaluate how well the model predicts Median House Value.

c. Discuss how HouseAge and AveRooms might influence Median House Value. Should these features be included in the regression model to enhance prediction accuracy? Justify your reasoning based on the relationships between the variables and the target variable.
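One possible way to support the decision in item c is to compare models with adjusted R², which penalises extra predictors. The brief does not ask for this measure; it is offered as an optional sketch. The helper name `adjusted_r2` is hypothetical, and the R² inputs below are placeholders rather than results from the California Housing data.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Placeholder values only; replace with R^2 scores from the fitted models.
r2_simple = 0.50    # hypothetical model using MedInc alone
r2_multiple = 0.52  # hypothetical model using MedInc, HouseAge, AveRooms
n = 1000            # hypothetical sample size

print(adjusted_r2(r2_simple, n, 1))
print(adjusted_r2(r2_multiple, n, 3))
```

If adjusted R² rises when HouseAge and AveRooms are added, that is some evidence the extra features improve the fit by more than chance alone would explain. Scores on a held-out test set would be a stronger check.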

Submission Guidelines

Your primary submission must be a PDF report that includes answers to all the questions. The report should present your analysis and results in a well-organized manner.

Tasks 1 and 2 must be completed using MongoDB, and the corresponding JSON file should be submitted as part of your secondary submission.

Tasks 3-5 should be completed in a single Google Colab notebook. Include the link to the Google Colab notebook in your PDF report.

Your report must include:

Screenshots showing the queries and the results of the execution of every MongoDB operation performed in Tasks 1 and 2. Each screenshot should demonstrate the command executed and the corresponding results to validate your work. (Ensure that the submitted screenshots are clearly visible.)

Appropriate code snippets, tables, charts, and graphs (where applicable) to clearly illustrate and support your analysis and findings. You are required to interpret all the findings and gain meaningful insights using the data analysis techniques. Ensure that each visual or snippet is properly labelled and referenced in your explanations.

Explanations of the results obtained from the queries and data analysis.
