Category: Data science

  • midterm

    For this project, you will conduct a comprehensive descriptive analysis of a real-world dataset using Microsoft Excel. You will organize, summarize, visualize, and interpret data to tell a compelling story about what the data reveals. This project allows you to apply the skills learned in Weeks 1-6 and demonstrate your ability to extract meaningful insights from data.

    Weight: 30% of final grade (300 points)

    Submission: Excel workbook (.xlsx) + Written Report (PDF or Word document)

    This midterm project assesses your mastery of the following course learning outcomes:

    1. Organize, summarize, and visualize data using Excel to extract meaningful patterns and insights
    2. Apply fundamental descriptive statistical techniques (measures of central tendency, variation, frequency distributions) to analyze real-world datasets
    3. Ethically interpret and communicate data insights using appropriate visual and written forms, including consideration of data limitations and responsible use

    By completing this project, you will demonstrate your ability to conduct a complete data analysis workflow, from data cleaning and exploration through statistical analysis to professional communication of findings.

    STEP 1: Select Your Dataset (Week 5)

    Choose ONE dataset from the approved options below. All datasets are available for free download. You may need to create a free Kaggle account to access some datasets.

    STEP 2: Create Your Excel Workbook (Weeks 5-6)

    Your Excel file must include five sheets with the following content:

    Sheet 1: Raw Data

    • Import your original dataset exactly as downloaded
    • Do NOT modify this sheet
    • Label tab clearly as “Raw Data”

    Sheet 2: Data Cleaning & Organization

    • Copy the raw data and document all cleaning steps:
    • Handle missing values (delete, fill with the average, or mark as “Unknown”)
    • Remove duplicates
    • Create new calculated fields if needed
    • Rename variables for clarity
    • Filter to relevant subset if dataset is very large
    • Add text boxes or comment cells explaining what you did and why
    • Example: “Removed 15 rows with missing graduation rate data because this variable is essential to my analysis”

    Sheet 3: Summary Statistics

    Create at least THREE summary tables including:

    • Numerical variables: Mean, median, standard deviation, min, max, range, count
    • Categorical variables: Frequency counts, percentages
    • Grouped summaries: Use pivot tables or formulas to summarize by categories (e.g., average temperature by city, median tuition by public/private)
    • Use Excel formulas (AVERAGE, MEDIAN, STDEV.S, COUNT, COUNTIF, SUMIF, etc.) – no manual calculations
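    As a cross-check of the Excel formulas listed above, the same measures can be computed outside Excel. The sketch below uses Python's standard library only; the graduation-rate values and institution types are invented for illustration, not from any approved dataset.

```python
import statistics
from collections import Counter

# Hypothetical sample: graduation rates (numeric) and institution
# types (categorical). Values are illustrative only.
grad_rates = [52.0, 61.5, 68.0, 74.5, 81.0, 88.5, 93.0]
inst_types = ["Public", "Private", "Public", "Public", "Private", "Private", "Public"]

# Numerical summary: the same measures as AVERAGE, MEDIAN,
# STDEV.S, MIN, MAX, and COUNT
numeric_summary = {
    "mean": statistics.mean(grad_rates),
    "median": statistics.median(grad_rates),
    "stdev": statistics.stdev(grad_rates),  # sample st. dev., like STDEV.S
    "min": min(grad_rates),
    "max": max(grad_rates),
    "range": max(grad_rates) - min(grad_rates),
    "count": len(grad_rates),
}

# Categorical summary: frequency counts and percentages, like COUNTIF
freq = Counter(inst_types)
pct = {k: 100 * v / len(inst_types) for k, v in freq.items()}

# Grouped summary: average graduation rate by institution type,
# the same idea as a pivot table
by_type = {}
for t, r in zip(inst_types, grad_rates):
    by_type.setdefault(t, []).append(r)
group_means = {t: statistics.mean(rs) for t, rs in by_type.items()}
```

    If your Excel results disagree with an independent recomputation like this, that usually signals a formula-range or cleaning error worth documenting on Sheet 2.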

    Sheet 4: Visualizations

    Create at least FIVE different charts including:

    • At least one histogram or bar chart (distribution/comparison)
    • At least one scatterplot (relationship between variables)
    • At least three different chart types total
    • Every chart must have: descriptive title, axis labels with units, legend (if needed), appropriate scale, professional colors, data source note

    Examples: histogram of graduation rates, bar chart of average sales by product, scatterplot of tuition vs. graduation rate with trendline, line chart of temperature trends, pie chart of public vs. private proportions

    Sheet 5: Analysis Notes (Optional but Recommended)

    • Document interesting findings, surprises, patterns, or questions as you work
    • Helps you write your report later

    STEP 3: Write Your Report (Week 6-7)

    Submit a written report of 3-5 pages, double-spaced, 12-point font (Times New Roman or Arial) with these sections:

    1. Introduction (0.5-1 page)

    • What dataset did you choose and why?
    • What is the data source? Is it credible?
    • What research question(s) are you exploring? Be specific.
    • What do you hope to learn?

    2. Data Description (0.5-1 page)

    • How many observations (rows) and variables (columns)?
    • What are your key variables? Define them clearly.
    • What time period or population does the data represent?
    • What data cleaning steps did you perform and why?

    3. Descriptive Analysis & Findings (2-3 pages; MOST IMPORTANT SECTION)

    • Organize by findings, NOT by methods
    • Good: “Graduation Rates Vary Widely Across Institutions”
    • Poor: “First I Made a Histogram”
    • For each finding: state it clearly, present specific statistics as evidence, reference visualizations, interpret what it means
    • Required: Discuss central tendency for 2+ variables, variation/spread, group comparisons, relationships between variables
    • Reference all five visualizations from your Excel file
    • Use specific numbers but explain them in context

    4. Limitations & Ethical Considerations (0.5 page)

    • Data limitations: What’s missing? How representative? How current?
    • Interpretation cautions: What can you NOT conclude?
    • Ethical issues: Privacy concerns? Potential biases? Who’s represented/missing? How might findings be misused?

    5. Conclusion (0.5 page)

    • Summarize 2-3 most important findings
    • What “story” does your data tell?
    • What questions remain unanswered?
    • How might this analysis be useful in real-world decision-making?

    6. References

    • Cite your dataset source in APA format (with URL and access date)
    • Include any additional sources consulted
    Submission Checklist

    • Excel Workbook filename: LastName_FirstName_Midterm.xlsx
    • All 5 sheets properly labeled
    • Formulas intact (NOT pasted as values)
    • Charts professional and formatted
    • Written Report filename: LastName_FirstName_MidtermReport.pdf or .docx
    • 3-5 pages, double-spaced, 12-point font
    • Proofread for grammar and clarity
    • Charts embedded OR clearly referenced by figure number
  • Business Research Assessment

    Instructions attached

    Attached Files (PDF/DOCX): Assignment_ResearchMethods.pdf

    Note: Content extraction from these files is restricted; please review them manually.

  • Applied analytic methods on a policy case: Urban Air Quality…

    *Please read the attached instruction files, as they are crucial for accomplishing the assignment.

    *The CSV file is optional to open; use it if you want to analyse the data yourself or if any of my data seems off to you.

    *The writing tone should not be too academic, as this is a policy document intended for non-experts to read in a short period of time. Making sentences simple doesn’t mean dumbing down here.

    Aim

    Apply, appraise and recommend a range of analytic methods in informing decisions about a complex science, technology and public policy issue.

    Objectives

    • Design and undertake exploratory data analysis using quantitative and/or qualitative techniques.
    • Generate evidence describing the behaviour of policy systems influencing air quality.
    • Evaluate how uncertainty affects your findings and propose ways to communicate this in policy contexts.
    • Use data visualisation to enhance understanding and communication of data insights.
    • Judge the suitability of analytic methods used for analysis.

    Marking

    This assignment is 50% of the module. The marking scheme is available on the school website and covers the following relevant marking criteria:

    • Conceptual Understanding (40%)
    • Reasoning & Critical Analysis (40%)
    • Communication, Structure & Clarity (20%)

    Required Length

    The word limit is 2,000 words. The penalties that apply for going more than 10% above or below this limit are outlined in the MPA Handbook. References, and footnotes solely containing references, are not included in the word count.

    1. Project context

    Imagine that London has recently joined a newly established ‘Attractive Cities’ network – a global network of the mayors of capital cities taking action to become attractive places to both live and work. A small group of core member cities have been asked to take the lead in developing shared research and policy action on public policy themes for the network’s members. Yesterday the news came through that London has been assigned as the thematic city lead on “Air Quality and Citizen Health”.

    This means that London’s Mayor now has responsibility for an entirely new policy portfolio, as well as, in essence, overnight becoming an influential and expert global voice on science, technology and innovation around air quality and health policy issues and their interrelationship. The Mayor is very conscious that while London has its own challenges and experiences with managing air quality, the real challenge at hand will be the development of evidence-based insights into air quality improvement as a common policy agenda, yet one with many different local realities and considerations.

    As it turns out, air quality is a topic the current Mayor knows little about, nor has she previously had much policy exposure to it. By background, she is a lawyer. She is, however, a keen advocate for evidence-based approaches to policy analysis and has often been frustrated when there has been insufficient rigour in analyses used to design and inform policy decisions. She has informally been warned that her Analysis team has recently had a few issues with analysis credibility. She has therefore asked that a special analysis adviser be appointed immediately to support her over the next weeks in preparing for this new role. Your CV impressed and this is now your role.

    Part 1. Exploratory data analysis [50%]

    The Mayor would like to have insight into any potential interrelationships between air quality and citizen health to best understand the agenda of the network she has joined. She also wants to understand whether and how context matters and how any interrelationships vary across cities. As a first step, she has asked you to undertake some exploratory data analysis to understand, where possible:

    a. What are key trends, patterns and/or anomalies in the air quality and citizen health of cities?
    b. How do the phenomena of air quality and citizen health interact or interrelate? Are there potential causal relationships?
    c. How do air quality and health differ between cities with different characteristics (e.g., by population size, exercise levels and/or modal split)?
    d. How does London compare to other cities around the world?
    e. Are there any immediate issues with data uncertainties, outliers, and/or results confidence?
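    As a hedged illustration of how question (b) might begin, the sketch below computes a Pearson correlation between two hypothetical city-level series. The variable names (pm25, life_expectancy) and values are assumptions, not the actual codebook fields in aqhealthcities.csv, and a correlation alone does not establish causation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical city-level values; a real analysis would read them
# from aqhealthcities.csv instead.
pm25 = [8.0, 12.5, 20.0, 35.0, 55.0]              # annual mean PM2.5 (µg/m³)
life_expectancy = [82.1, 81.4, 79.8, 77.0, 74.5]  # years

r = pearson(pm25, life_expectancy)
# A strongly negative r would suggest (not prove) that higher PM2.5
# goes with lower life expectancy across these cities.
```

    Plotting the same pairs as a scatterplot, and checking whether single outlier cities drive the coefficient, would speak directly to questions (a) and (e).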

    2. The Data

    Your predecessor has left you a data file containing the latest data that was about to be analysed to brief the Mayor before they switched roles. This file ‘aqhealthcities.csv’ is available for download under the Assignment 2 Moodle page. Your predecessor also forwarded an annotated data dictionary/codebook as well as a short annotated set of references. These two resources are included in Appendix A as Tables 1 and 2 respectively.

    Given the incredibly tight timeframes, the Mayor has asked you to focus your analysis in the next two weeks on the contents of the inherited ‘aqhealthcities.csv’ data. She is happy for you to seek out other theories, knowledges, data sets, etc. beyond this data set if they help you with your analysis, but she has stressed she wants you to do this proportionately – i.e., her priority is analysis of the .csv file data and she is not expecting you to spend much time looking at other sources. Further analysis of other data sources is welcome, but depending on your capacity may have to wait until the start of the new year.

    Q1. Produce for the Mayor [~50%]:

    1. Exploratory data analysis informing her interests (a.-e.) outlined above. Explore the patterns, trends and possible observations and inferences to make from the data. Choose a set of final summative information points and messages you want to communicate to the Mayor. Include at least two visualisations, though feel free to use more than this if appropriate.

    2. A short opening or closing summary of key recommendations derived from the exploratory data analysis. You may, for example, want to highlight some of the similarities and differences between and within cities she should be aware of whilst steering this network. Or you may have ideas for policy action based on your analysis. Or you may want to raise key issues about data, analysis and confidence you think the Mayor should be aware of at this point. For any recommendation you make, make clear your assessment of the quality and reliability of the data.

    Communicate your work as a short analysis report in a way appropriate for quick reference and understanding. This means you can use a mixture of prose, bulleted text, tables, figures, headers, labels, etc. Use data visualisation effectively to support your analytical narrative.

    Part 2: Informing policy decisions [50%]

    Following your initial exploratory analysis, the Mayor has asked for your advice on some analytical and methodological questions that have emerged as she prepares to lead the Attractive Cities network.

    Q2. Interpreting Probabilistic Analysis [5%]

    The Mayor is evaluating two policy interventions to improve air quality and health outcomes in London. As she now leads the Attractive Cities network’s air quality theme, she knows other network cities will be watching London’s choices with interest, but her primary responsibility is to make the best decision for London. An external consultancy has undertaken a Monte Carlo simulation comparing the two options:

    Option A: Comprehensive Air Quality Monitoring Network

    o Deploy high-density monitoring infrastructure across all boroughs
    o Estimated 5-year cost: £50-90M
    o Provides real-time data for enforcement and public information

    Option B: Clean Air Zones Expansion with Technology Fund

    o Expand the Ultra Low Emission Zone (ULEZ) to all London boroughs
    o Create an innovation fund for clean transport solutions
    o Estimated 5-year cost: £30-120M
    o Includes enforcement tech and electric vehicle charging infrastructure

    The simulation results show probability distributions for total costs over 5 years:

    Fig 1. Monte Carlo simulation of the total cost profile for the 2 policy options, from the consultants’ report

    Based on these simulation results:

    a. What do the distributions tell you about the relative uncertainty of each option?
    b. Which option would you advise the Mayor to pursue for London, and why?
    c. What caveats should the Mayor be aware of when interpreting these probabilistic estimates?
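    To make the comparison in questions a.-c. concrete, here is a minimal sketch of the kind of simulation behind Fig 1. The cost distributions (uniform for Option A, triangular for Option B) and all parameters are illustrative assumptions, not the consultancy's actual model.

```python
import random
import statistics

random.seed(42)  # reproducible draws
N = 10_000

# Assumed cost models in £M over 5 years -- illustrative only:
# Option A: any cost in the quoted 50-90 range equally likely
# Option B: 30-120 range with an assumed most-likely cost of 60
costs_a = [random.uniform(50, 90) for _ in range(N)]
costs_b = [random.triangular(30, 120, 60) for _ in range(N)]

for name, costs in (("A", costs_a), ("B", costs_b)):
    mean = statistics.mean(costs)
    spread = statistics.stdev(costs)
    p90 = sorted(costs)[int(0.9 * N)]  # rough 90th-percentile cost
    print(f"Option {name}: mean ~£{mean:.0f}M, st. dev. ~£{spread:.0f}M, "
          f"90th pct ~£{p90:.0f}M")

# A wider spread (Option B here) means more cost uncertainty even when
# the means are similar -- the comparison questions a.-c. ask about.
```

    The key caveat the Mayor should hear: results like these are only as good as the assumed input distributions, which is exactly what question c. probes.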

    Q3. Additional Analytical Advice [5%]

    Before the Mayor makes her final decision between Option A and Option B for London, what one additional piece of analysis or evidence would you recommend she commission, and why?

    Q4. Deliberative Approaches to Policy Evaluation [10%]

    Beyond the immediate monitoring vs expansion decision, the Mayor faces ongoing choices about how to allocate London’s air quality budget across multiple competing priorities. She has £5M available for the next financial year and must decide between investments such as:

    • School street expansions (car-free zones around schools during drop-off)
    • Green infrastructure (green walls, urban greening for pollution absorption)
    • Low-emission zone enforcement technology upgrades
    • Public transport fare subsidies to encourage modal shift
    • Community air quality monitoring and engagement programmes
    • Support for small businesses to transition to low-emission vehicles

    These priorities have different beneficiaries, different evidence bases, different time horizons, and involve different values and trade-offs. They cannot all be funded fully.

    Her team has suggested using Multi-Criteria Analysis (MCA) with an Expert Advisory Board to evaluate these options. The Mayor has read briefings on MCA and understands the methodology. She is also conscious that her approach to this decision may inform how other Attractive Cities network members handle similar allocation challenges.

    Provide advice to the Mayor addressing:

    a) Should she use MCA for this London budget allocation decision? Consider: Is MCA appropriate for this type of decision? Why or why not? What are the key strengths of MCA that make it suitable (or not) for managing this budget allocation for London?

    b) If she proceeds with MCA, identify and discuss THREE critical design choices from the following areas:

    • Expert Board composition (who participates?)
    • Deliberative process structure (how is it organised?)
    • Criteria selection and weighting (how are priorities determined?)
    • Bias management (how are biases mitigated?)
    • Stakeholder inclusion (which London communities are represented?)
    • Transparency and documentation (how is the process communicated?)

    For each of your three chosen areas, explain the specific choice she must make and why it matters for decision quality.

    Note: The Mayor does not need you to explain how MCA works procedurally. She needs practical, context-specific advice on whether and how to use it effectively for this London budget allocation challenge.
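    To show why criteria weighting is such a critical design choice, here is a minimal weighted-sum MCA sketch. The criteria, weights, option names, and 0-10 scores are all invented for illustration; they are not a recommendation for the actual allocation.

```python
# Hypothetical criteria weights (sum to 1) and 0-10 option scores.
weights = {"health_impact": 0.4, "equity": 0.3, "cost_effectiveness": 0.3}

scores = {
    "School streets": {"health_impact": 7, "equity": 8, "cost_effectiveness": 6},
    "Green infrastructure": {"health_impact": 5, "equity": 6, "cost_effectiveness": 4},
    "Fare subsidies": {"health_impact": 6, "equity": 9, "cost_effectiveness": 5},
}

def weighted_total(option_scores, weights):
    """Weighted-sum aggregation: the core arithmetic of a simple MCA."""
    return sum(weights[c] * s for c, s in option_scores.items())

totals = {opt: weighted_total(s, weights) for opt, s in scores.items()}
ranking = sorted(totals, key=totals.get, reverse=True)
# The ranking can flip entirely under different weights -- which is why
# who sets the weights, and how, deserves deliberate design.
```

    Running the same scores under several weight sets (a simple sensitivity analysis) is one practical way the Expert Advisory Board could stress-test its own judgements.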

    Q5. Critical Reflexivity on Your Analytical Choices [15%]

    Throughout this STEP0020 module, we have explored how analytical methods are not neutral tools but embody particular worldviews, values, and assumptions about what counts as knowledge and how it should be produced.

    Reflecting specifically on your exploratory data analysis in Part 1, reflect on your key methodological choices: which methods did you use for your analysis (e.g., visualisation types, statistical approaches, ways of grouping or comparing data); which variables did you focus on and which did you set aside; and how you defined and measured relationships between air quality and health.

    a. Share a reflection on what your choices privileged and what they obscured. You can consider:

    • What did your analytical approach emphasise or make visible? What might it have obscured or marginalised?
    • What assumptions were embedded in your choices (e.g., about causality, about what’s measurable, about relationships between variables)?
    • Whose perspectives were centred in your analysis? What voices or experiences were excluded?

    b. Consider whose knowledge matters.

    • Your analysis relied on aggregate city-level data. What might be missed or obscured when relying primarily on quantitative, aggregate data?
    • Given that the Attractive Cities network represents diverse cities with different contexts, capacities, and knowledge traditions, what additional approaches (drawing on methods from this course) might complement your quantitative analysis to incorporate broader perspectives?

    c. Reflect on your own positionality.

    • How might your own background, training, and assumptions have shaped what you chose to analyse and how you interpreted it?
    • If someone with a different background or perspective (e.g., a community health worker, an environmental justice advocate, a city official from a Global South city) were to analyse this data, what might they emphasise differently?

    Use specific examples from your Part 1 analysis to ground your reflections. Move beyond generic critiques to engage substantively with the actual analytical choices you made and their implications for a global network on air quality & health.

    Q6. Managing Deep Uncertainty in Policy Analysis [15%]

    In her role as thematic lead for the Attractive Cities network, the Mayor recognises that significant deep uncertainties affect the network’s collective approach to air quality and health policy:

    • Future air quality trends are uncertain (climate change impacts on pollution, changing mobility patterns post-pandemic, technological disruptions in transport and energy systems)
    • Health impacts are uncertain (emerging evidence on pollution exposure pathways, changing population health trends, new epidemiological understanding)
    • Political and economic contexts vary enormously across network cities and are themselves changing (policy commitment, resource availability, governance capacity)
    • Policy effectiveness is uncertain (what works in one city may not work in others, unexpected implementation challenges, behaviour change uncertainties)

    These uncertainties cannot easily be quantified probabilistically. The Mayor cannot assign reliable probabilities to different futures, nor can she rely on historical data to predict unprecedented changes.

    Advise the Mayor on using scenarios to manage these deep uncertainties:

    a. What are scenarios and why are they useful for this type of uncertainty?
    b. How could scenarios help her make more robust decisions for the network?
    c. What are key limitations or challenges of scenario-based approaches?

    Ground your advice in the Mayor’s specific context – leading a diverse global network of cities facing climate, health, and technological uncertainties. Explain how scenarios would work as a practical analytical tool, not just a conceptual framework.

    Attached Files (PDF/DOCX): Sample essay for ass2.pdf, Instructions Data analysis.docx, AMP class notes.docx, Instructions Writing Guide.docx

    Note: Content extraction from these files is restricted; please review them manually.

  • Problem Set 1

    Answer the problem set with R. Upload your code as a .R file. This file should include every line of code you wrote for the assignment. For full credit, you must leave thorough comments in the code explaining what you are doing. Thorough documentation skills are important for data scientists to have, and are something employers will look for! Write your code so that it can be easily understood by someone who reads it later.

    Attached Files (PDF/DOCX): ECON_4970_Problem_Set_1_S26.pdf

    Note: Content extraction from these files is restricted; please review them manually.

  • DATA ANALYTICS Discussion Responses (Classmate Feedback)

    **Must understand coding/RStudio and the data preparation phase of analytics. I am attaching my paper for reference if needed, as well as the data sets related to each student’s paper for review. I only need a response to each student’s paper. This is a peer review discussion: give positive feedback and offer suggestions. 2-3 full paragraphs per student should be plenty.

    Assignment:

    After completing the Data Understanding and Data Preparation phases for your analytic plan, post your milestone two draft to the Analytic Plan Peer Review topic in a new thread by Thursday of Module Three. (This part has already been completed).

    Then, select two of your peers’ drafts to review as follow-up discussion topic posts that you should submit by Sunday of Module Three in order to give yourself time to reflect upon the peer review experience in this module’s discussion. Select drafts that have not been reviewed or drafts with the fewest reviews.

    Attached Files (PDF/DOCX): Student 2.docx, Student 1.docx, DAT 690 Milestone Two.docx

    Note: Content extraction from these files is restricted; please review them manually.

  • Data visualization and analysis using Tableau

    Instructions

    Proof of Data Selection and Upload into Tableau (40 Points Total)

    In this assignment, you will demonstrate successful upload of your selected dataset into Tableau and begin to analyze the dataset’s structure and potential for visualization.

    Part 1: Upload and Screenshot (10 Points)

    Select a dataset that contains clear and measurable Key Performance Indicators (KPIs) suitable for analysis. Recommended sources include Data.gov and Kaggle, though you are welcome to use other reputable sources. Upload your chosen dataset into Tableau Desktop. Submit a screenshot clearly showing the data loaded into Tableau; this serves as proof of successful upload. If you encounter any issues during the upload (e.g., formatting problems, missing values), provide a brief written summary describing the problem and how you resolved it. Note: References are not required for this portion.

    Part 2: Data Visualization Planning (30 Points)

    In a well-written summary, respond to the following:

    • Visualization Techniques (15 Points): What types of visualizations will you use to explore and present your data (e.g., bar charts, scatter plots, line graphs, heat maps)? Justify your choices based on the type of data and intended audience. Include at least one scholarly or professional reference to support your use of visualization techniques.
    • Key Data Items (15 Points): Identify and describe the key data fields or variables in your dataset. Explain how these variables relate to your KPIs or business research question. Indicate any planned calculated fields, groupings, or filters you anticipate using in Tableau.

    Deliverables:

    • Screenshot of data uploaded into Tableau
    • Summary of upload issues (if applicable)
    • Written discussion of visualization techniques (with references) and key data items for analysis

    Rubric:

    • 10/10 Upload your selected data to Tableau (show screenshot)
    • 15/15 Discuss the visualization techniques you will use on this data
    • 15/15 Discuss the key data items in the data that you will use

    Attached Files (PDF/DOCX): Assignment Instructions data visualisation.docx

    Note: Content extraction from these files is restricted; please review them manually.

  • Configuring and securing a Linux Server

    Using VirtualBox, download and use the most updated version of Ubuntu to configure essential server services and implement basic security measures. The attached document states that either an Apache or Nginx server can be configured; I do not have a personal preference.
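    A minimal sketch of the kind of setup the project asks for, assuming the Nginx route is chosen (the Apache route would be analogous); exact package choices and security policies should follow the attached project document, not this fragment.

```shell
#!/usr/bin/env bash
# Sketch: basic web server setup and hardening on a fresh Ubuntu VM.
set -euo pipefail

# 1. Update the system and install the web server (Nginx chosen here)
sudo apt update && sudo apt upgrade -y
sudo apt install -y nginx

# 2. Enable the service now and on boot, then verify it is running
sudo systemctl enable --now nginx
systemctl status nginx --no-pager

# 3. Basic firewall: allow SSH and HTTP/HTTPS, deny other inbound traffic
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow OpenSSH
sudo ufw allow 'Nginx Full'
sudo ufw --force enable

# 4. Harden SSH: disable direct root login, then reload the daemon
sudo sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin no/' /etc/ssh/sshd_config
sudo systemctl reload ssh

# 5. Automatic security updates
sudo apt install -y unattended-upgrades
```

    Screenshots of `systemctl status nginx` and `sudo ufw status verbose` are useful evidence for the submission template.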

    Attached Files (PDF/DOCX): Project 2 Submission Template.docx, Project 2 Configuring and Securing a Linux Server.docx

    Note: Content extraction from these files is restricted; please review them manually.

  • CAPSTONE Milestone 1 part 2 FLOWCHART Creation!

    **ABSOLUTELY must understand and know how to use RStudio. I did 2 pages and graphic add-ons just to make the pricing more fair, since there was not an option for creating a flowchart. I will provide as many helpful resources as I can, including my final project from a previous class that aligns with this capstone.

    Resources (if needed):

    Draw.io tutorial – https://drawio-app.com/tutorials/interactive-tutorials/

    Draw.io step by step guide – https://drawio-app.com/tutorials/step-by-step-guides/

    Draw.io – https://app.diagrams.net/

    Flowcharts on Draw.io – https://drawio-app.com/blog/flowcharts-in-draw-io-how-to-go-with-the-flow/

    Overview

    Now that the analytic plan is complete, it is time to map out the initial steps needed to create a clean, well-described analytic data set from which you will eventually build your final model.

    Prompt

    Using PowerPoint, Draw.io, Word, or a similar flowcharting tool, create a visual diagram of the steps to be taken for examining the source data. Indicate the source(s) of the data, quality checks, and data cleaning to be performed. Additionally, include written notes explaining the flowchart and capturing your data approach.

    If you have any questions after reading through the feedback on this assignment, reach out to your instructor. Remember that your instructor is a resource you should utilize throughout the course.

    Make sure to include the following critical elements in your flowchart:

    • Identify source for each data set
    • Identify any necessary steps for importing or converting data
    • Indicate steps for checking data quality, including missing or invalid data
    • Indicate steps for exploring distributions of numeric variables
    • Indicate steps for exploring levels of categorical variables
    • Indicate any steps in which alterations to the data may be performed
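    The quality-check steps in the list above will eventually become code (this capstone expects RStudio). As a hedged sketch of the same logic, here is a stdlib-Python version; the records and field names are invented stand-ins, not the actual source data.

```python
# Sketch of the quality checks the flowchart should capture.
# Records and field names are hypothetical stand-ins for the source data.
records = [
    {"id": 1, "income": 52000, "region": "North"},
    {"id": 2, "income": None,  "region": "South"},
    {"id": 3, "income": 61000, "region": "North"},
    {"id": 4, "income": -500,  "region": ""},  # invalid income, missing region
]

# Step: check for missing or invalid data
missing_income = [r["id"] for r in records if r["income"] is None]
invalid_income = [r["id"] for r in records
                  if r["income"] is not None and r["income"] < 0]
missing_region = [r["id"] for r in records if not r["region"]]

# Step: explore the distribution of a numeric variable
valid = [r["income"] for r in records
         if r["income"] is not None and r["income"] >= 0]
lo, hi = min(valid), max(valid)

# Step: explore the levels of a categorical variable
levels = sorted({r["region"] for r in records if r["region"]})

# Step: alteration -- drop rows that failed any check (one possible policy)
clean = [r for r in records
         if r["income"] is not None and r["income"] >= 0 and r["region"]]
```

    Each block above corresponds to one box in the flowchart, which makes the written notes easy to align with the diagram.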

    What to Submit

    Submit your flowchart in whichever flowcharting tool you prefer (e.g., PowerPoint, Draw.io, Word). Include the image of the flowchart along with the written notes as either a Microsoft Word document or a PDF.

    Attached Files (PDF/DOCX): DAT650 Final Project.docx, CREDIT RISK CASE.docx, GE Culture and Analytics.pdf, INSTRUCTOR NOTES FLOWCHART.docx

    Note: Content extraction from these files is restricted; please review them manually.

  • CAPSTONE – Data Analytics Milestone One

    **ABSOLUTELY must know how to use RStudio and understand data analytics. I have provided several resources along with my previous final project which relates to this Capstone project.

    Resources:

    Data Cleaning Techniques – https://www.upgrad.com/blog/data-cleaning-techniques/

    Types of Data – https://builtin.com/data-science/data-types-statistics

    Overview

    You are a data analyst, and your manager has assigned you a project to develop a predictive model that will support a business problem and be implemented into production. You are responsible for taking the project through the phases of the CRISP-DM methodology.

    Prompt

    In this milestone, you will write your project summary and analytic plan. The project summary will identify the business problem, state the research question being modeled, and discuss how the solution will help the business. The analytic plan will describe each CRISP-DM phase and the activities that will be performed for each step in the project. Note that your audience is your data analytic team and data analytic manager. Refer to the CRISP-DM graphic in this week’s Module Overview for clarification of the phases.

    If you have any questions after reading through the feedback on this milestone, reach out to your instructor. Remember that your instructor is a resource you should utilize throughout the course.

    While you may reflect on your prior coursework, your submission must consist only of DAT 690 coursework to avoid self-plagiarism. Make sure to include the following critical elements in your paper:

    • Describe the CRISP-DM Business Understanding Phase: Identify the business problem
    • Describe the CRISP-DM Business Understanding Phase: State the research question
    • Describe the CRISP-DM Business Understanding Phase: Discuss how the solution will help the business
    • Describe the CRISP-DM Data Understanding Phase: describe, explore, and verify the data
    • Describe the CRISP-DM Data Preparation Phase: select, clean, construct, and integrate the data
    • Describe the CRISP-DM Modeling Phase: select, generate, build, and assess the model
    • Describe the CRISP-DM Evaluation Phase: evaluate the results, review the process, and determine next steps
    • Describe the CRISP-DM Deployment Phase: how the model will work in production
    • Clear Communication: Submission has no major errors related to citations, grammar, spelling, syntax, or organization

    What to Submit

    Your paper must be submitted as a two- to three-page Microsoft Word document with double spacing, 12-point Times New Roman font, and one-inch margins. Be sure to cite any sources in APA format.

  • Touchstone 6

    A. Analysis of TechGear Inc.

    Step 1: Read the Scenario

    SCENARIO: As a data analyst at TechGear Inc., a company specializing in electronic gadgets and accessories, your task is to analyze historical sales data, build predictive models, and use prescriptive analytical methods to provide actionable insights for improving decision-making. The company has been experiencing fluctuating sales and aims to optimize its marketing strategies and production processes to maximize profits and enhance customer satisfaction. Your analysis will help TechGear Inc. understand the factors influencing its sales, forecast future sales trends, assess financial risks associated with different business scenarios, and determine the optimal allocation of its marketing budget and production resources. Ultimately, your work will enable the company to make data-driven decisions, enhancing its sales and marketing strategies, and leading to improved profitability and customer satisfaction.

    Step 2: Look Over the Data

    • Questions 1-5 (Linear Regression) and 7 (Machine Learning): Use the data in the techgear_sales_data.xlsx Excel file, which can be found at the following GitHub link:
    • Question 6 (Forecasting): Use the data in the techgear_sales_data_monthly.xlsx Excel file, which is available at this GitHub link:
    • This file contains the same data as techgear_sales_data.xlsx, but the last row only includes a date with missing values for all other columns. These missing values are intended for you to apply forecasting methods for the upcoming time period.
    • Questions 8 and 9: Since Question 8 focuses on Monte Carlo simulations and Question 9 focuses on linear programming, all necessary data is provided in the problem statement.

    This dataset contains monthly sales and advertising spend data for TechGear from January 2020 to December 2024. It includes the following columns:

    • Date: The month and year for each data entry (MM/DD/YYYY)
    • Sales: The total sales generated in that month (number of sales)
    • Ad_Spend_Facebook: The amount of money spent on Facebook advertising in that month (dollars)
    • Ad_Spend_Instagram: The amount of money spent on Instagram advertising in that month (dollars)
    • Discount_Rate: The discount rate applied to sales in that month (percentage)

    A snapshot of the first few rows of the dataset is provided below:

    Step 3: Read TechGear Inc. Questions

    Question 1: Exploring Data Structures and Averages in Advertising Spend and Discounts

    Before conducting an analysis, use Python to create a pandas DataFrame named sales from the dataset.

    • What key features of the dataset can you summarize, such as the number of rows and columns?
    • What is the average amount spent on advertising for each social media platform (Facebook and Instagram)?
    • What is the average discount provided to customers?
    • What insights can you draw from this summary regarding advertising spend and discount trends?
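    The exploration above can be sketched with pandas. This is a minimal illustration, not the graded solution: the sample rows below are hypothetical stand-ins so the snippet runs on its own; in practice you would load the provided workbook with pd.read_excel("techgear_sales_data.xlsx").

```python
import pandas as pd

# In practice: sales = pd.read_excel("techgear_sales_data.xlsx")
# A tiny hypothetical sample stands in here so the sketch is self-contained.
sales = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
    "Sales": [5200, 4800, 6100],
    "Ad_Spend_Facebook": [1500.0, 1200.0, 1800.0],
    "Ad_Spend_Instagram": [900.0, 1100.0, 1000.0],
    "Discount_Rate": [0.10, 0.05, 0.15],
})

n_rows, n_cols = sales.shape                  # dataset dimensions
avg_fb = sales["Ad_Spend_Facebook"].mean()    # average Facebook spend
avg_ig = sales["Ad_Spend_Instagram"].mean()   # average Instagram spend
avg_discount = sales["Discount_Rate"].mean()  # average discount rate

sales.info()               # column types and non-null counts
print(sales.describe())    # summary statistics for numeric columns
print(f"Rows: {n_rows}, Columns: {n_cols}")
print(f"Avg Facebook spend: ${avg_fb:,.2f}, Avg Instagram spend: ${avg_ig:,.2f}")
print(f"Avg discount rate: {avg_discount:.1%}")
```

Comparing the two platform averages side by side is usually the quickest first insight: it shows where the budget currently leans before any modeling is done.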

    Question 2: Visualizing Relationships

    • How can you visualize the relationships between sales and each advertising spend variable (Facebook and Instagram) as well as discount rates?
    • What types of plots (e.g., scatter plots, line plots, or histograms) would be most effective in identifying patterns or correlations between these variables?
    • What do these visualizations reveal about the impact of advertising spend and discount rates on sales?
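    One reasonable approach, sketched with matplotlib on hypothetical sample values: scatter plots put one month per point, so linear trends between each driver and sales are visible directly, and a correlation table gives them a numeric companion.

```python
import matplotlib
matplotlib.use("Agg")   # render off-screen; call plt.show() when working interactively
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample standing in for the loaded `sales` DataFrame.
sales = pd.DataFrame({
    "Sales": [5200, 4800, 6100, 5500],
    "Ad_Spend_Facebook": [1500, 1200, 1800, 1600],
    "Ad_Spend_Instagram": [900, 1100, 1000, 1200],
    "Discount_Rate": [0.10, 0.05, 0.15, 0.08],
})

# One scatter plot per candidate driver of sales.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ["Ad_Spend_Facebook", "Ad_Spend_Instagram", "Discount_Rate"]):
    ax.scatter(sales[col], sales["Sales"])
    ax.set_xlabel(col)
    ax.set_ylabel("Sales")
    ax.set_title(f"Sales vs. {col}")
fig.tight_layout()
fig.savefig("sales_relationships.png")

# Pairwise correlations with Sales summarize the plots numerically.
print(sales.corr()["Sales"])
```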

    Question 3: Simple Linear Regression

    TechGear wants to optimize its marketing strategy.

    • How can you develop a simple linear regression model in Python to predict sales based on Facebook ad spend?
    • What do the coefficients of the model indicate?
    • Specifically, how does the slope describe the relationship between Facebook ad spend and sales?
    • What does the R² value tell you about how well the model explains the variability in sales?
    • How does the regression output from Python support your interpretation of the model's performance?
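    A minimal fit with scikit-learn on hypothetical numbers shows where the slope, intercept, and R² come from; the values here are invented purely to make the sketch runnable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical monthly observations standing in for the real columns.
fb_spend = np.array([1200, 1400, 1500, 1700, 1800, 2000]).reshape(-1, 1)
sales = np.array([4700, 5100, 5300, 5800, 5900, 6400])

model = LinearRegression().fit(fb_spend, sales)

slope = model.coef_[0]        # extra sales expected per extra dollar of FB spend
intercept = model.intercept_  # baseline sales at zero FB spend (an extrapolation)
r2 = model.score(fb_spend, sales)  # share of sales variance the line explains

print(f"Sales ≈ {intercept:.1f} + {slope:.3f} × Ad_Spend_Facebook, R² = {r2:.3f}")
```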

    Question 4: Assessing the Fit of the Simple Linear Regression Model

    • How can you evaluate the performance of your simple linear regression model by analyzing residuals?
    • What insights do residual plots provide about the model's accuracy?
    • Do they suggest any patterns, heteroscedasticity, or violations of linear regression assumptions?
    • How might these findings impact the reliability of the model's predictions?
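    A residual plot can be produced in a few lines; the data below are hypothetical stand-ins. The shape of the cloud is the diagnostic: a patternless band around zero is healthy, a funnel suggests heteroscedasticity, and a curve suggests a missed nonlinearity.

```python
import matplotlib
matplotlib.use("Agg")   # off-screen rendering
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data standing in for Facebook spend and sales.
fb_spend = np.array([1200, 1400, 1500, 1700, 1800, 2000]).reshape(-1, 1)
sales = np.array([4700, 5100, 5300, 5800, 5900, 6400])

model = LinearRegression().fit(fb_spend, sales)
fitted = model.predict(fb_spend)
residuals = sales - fitted   # what the fitted line fails to explain

fig, ax = plt.subplots()
ax.scatter(fitted, residuals)
ax.axhline(0, linestyle="--")
ax.set_xlabel("Fitted sales")
ax.set_ylabel("Residual")
ax.set_title("Residuals vs. fitted values")
fig.savefig("residual_plot.png")

print(f"Mean residual: {residuals.mean():.2e} (≈ 0 by construction for OLS)")
```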

    Question 5: Multiple Linear Regression Model

    The simple linear regression model provides insights into Facebook ad spend.

    • How can you develop a multiple linear regression model to predict monthly sales using Facebook ad spend, Instagram ad spend, and discount rates?
    • How do the coefficients of this model compare to the simple linear regression model? What do they reveal about the combined influence of these factors on sales?
    • Which model performs better in predicting sales?
    • How can you compare the effectiveness using statistical metrics (such as R² and RMSE)?
    • Based on this comparison, what recommendations can you provide to TechGear for optimizing its advertising strategy?
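    The model comparison can be sketched as follows, again on hypothetical months. For nested OLS models, in-sample R² can only rise when predictors are added, which is why out-of-sample checks (Question 7) matter for the final recommendation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical months; columns: FB spend, IG spend, discount rate.
X = np.array([
    [1200,  900, 0.10],
    [1400, 1100, 0.05],
    [1500, 1000, 0.15],
    [1700, 1200, 0.08],
    [1800,  950, 0.12],
    [2000, 1300, 0.06],
])
y = np.array([4700, 5200, 5300, 5900, 5800, 6500])

simple = LinearRegression().fit(X[:, [0]], y)   # FB spend only
multiple = LinearRegression().fit(X, y)         # all three predictors

for name, model, feats in [("simple", simple, X[:, [0]]),
                           ("multiple", multiple, X)]:
    pred = model.predict(feats)
    rmse = mean_squared_error(y, pred) ** 0.5
    print(f"{name}: R² = {model.score(feats, y):.3f}, in-sample RMSE = {rmse:.1f}")

print("Multiple-model coefficients (FB, IG, discount):", multiple.coef_)
```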

    Question 6: Forecasting

    Using historical sales data, how can you construct:

    • A 3-month moving average forecast for January 2025?
    • An exponential smoothing forecast with a smoothing parameter of 0.80 for January 2025?

    Given TechGear's preference for emphasizing recent sales trends:

    • Which forecasting method provides the most reliable prediction for January 2025?
    • What key differences exist between the two forecasting methods, and what do they imply for forecasting accuracy?

    Based on your analysis, consider:

    • What actionable recommendations can you provide to TechGear to improve its marketing strategies and production planning?
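    Both forecasts reduce to a few lines of arithmetic; the monthly values below are hypothetical stand-ins for the tail of the real series. Note how α = 0.80 makes exponential smoothing weight the latest month heavily, which is the property that matches TechGear's stated preference for recent trends.

```python
import numpy as np

# Hypothetical last six months of 2024 sales; the real file ends December 2024.
sales = np.array([5600, 5900, 6200, 5800, 6100, 6400])

# 3-month moving average: the mean of the last three observations.
ma_forecast = sales[-3:].mean()

# Exponential smoothing with alpha = 0.80: each new observation gets 80% of the
# weight, so the smoothed level tracks recent months closely.
alpha = 0.80
level = sales[0]            # initialize the level at the first observation
for y in sales[1:]:
    level = alpha * y + (1 - alpha) * level
es_forecast = level         # one-step-ahead forecast for January 2025

print(f"3-month moving average forecast: {ma_forecast:.1f}")
print(f"Exponential smoothing (α=0.80) forecast: {es_forecast:.1f}")
```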

    Question 7: Machine Learning

    TechGear needs a reliable model to predict future sales.

    • How can you build and compare different predictive models to achieve this?
    • How can you develop a multiple linear regression model using 5-fold cross-validation to predict future sales?
    • How can you develop a decision tree model using 5-fold cross-validation to predict future sales?
    • How do the two models compare in terms of RMSE, and which model should TechGear choose?

    TechGear requires a minimum of $6,500 in sales each month to remain profitable.

    • If the best model predicts sales of $4,200, how can the RMSE value be used to determine the range within which actual sales may fall?
    • What are the implications of this for decision-making and risk assessment?
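    The cross-validated comparison can be sketched as below. The data are synthetic stand-ins generated for the example; only the workflow (5-fold CV, RMSE comparison, prediction ± RMSE band) carries over to the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# 60 hypothetical months standing in for the real dataset.
n = 60
X = np.column_stack([
    rng.uniform(1000, 2000, n),   # FB spend
    rng.uniform(800, 1400, n),    # IG spend
    rng.uniform(0.0, 0.2, n),     # discount rate
])
y = 1000 + 2.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(0, 300, n)

models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=0),
}
rmse = {}
for name, model in models.items():
    # cross_val_score maximizes, so sklearn exposes RMSE as a negated score.
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    rmse[name] = -scores.mean()
    print(f"{name}: 5-fold CV RMSE = {rmse[name]:.1f}")

best = min(rmse, key=rmse.get)
print(f"Lower RMSE → choose the {best} model.")

# A point prediction ± RMSE gives a rough band for where actual sales may fall,
# which can then be compared with the $6,500 profitability threshold.
prediction = 4200
print(f"Predicted {prediction}, plausible range ≈ "
      f"[{prediction - rmse[best]:.0f}, {prediction + rmse[best]:.0f}]")
```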

    Question 8: Monte Carlo Simulations

    TechGear has experienced significant fluctuations in sales, making accurate predictions challenging.

    • How can you use Monte Carlo simulations to estimate future sales?
    • How can you estimate the average and median monthly sales by running 1,000 simulations?
    • What visuals (e.g., histograms or box plots) can you generate to summarize the results?
    • If daily sales are assumed to follow a uniform distribution between the minimum and maximum observed sales over the past 60 months, how does this impact the simulation results? You can assume that the value for minimum sales observed over 60 months is 2,299 and the maximum value is 7,702.
    • How can you interpret the standard deviation of simulated sales, and what does it reveal about TechGear's sales variability?
    • How can TechGear use these insights to improve budgeting, sales forecasting, and operational decision-making?
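    Using the bounds given in the question (minimum 2,299, maximum 7,702), a uniform Monte Carlo simulation is a short NumPy exercise. For Uniform(a, b) the theoretical mean is (a+b)/2 ≈ 5,000.5 and the standard deviation is (b−a)/√12 ≈ 1,560, so the simulated spread directly mirrors TechGear's month-to-month volatility.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Bounds from the problem: min and max monthly sales over the past 60 months.
low, high = 2299, 7702
n_sims = 1000

# Each run draws one month of sales from the assumed uniform distribution.
simulated = rng.uniform(low, high, n_sims)

print(f"Average simulated sales:  {simulated.mean():.0f}")
print(f"Median simulated sales:   {np.median(simulated):.0f}")
print(f"Std. dev. of simulations: {simulated.std(ddof=1):.0f}")

# A histogram summarizes the distribution of simulated outcomes.
plt.hist(simulated, bins=30)
plt.xlabel("Simulated monthly sales")
plt.ylabel("Frequency")
plt.title("Monte Carlo simulation of monthly sales (1,000 runs)")
plt.savefig("mc_sales_hist.png")
```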

    Question 9: Linear Programming

    TechGear wants to optimize its advertising spend across Facebook and Instagram to maximize its monthly sales. They have a fixed advertising budget and need to determine the optimal allocation of this budget to achieve the highest possible sales. The sales generated from advertising on each platform are influenced by the amount spent on that platform.

    TechGear has a monthly advertising budget of $10,000. The estimated sales generated from advertising on Facebook and Instagram are given by the following linear equations:

    • Sales from Facebook advertising: where F is the amount spent on Facebook advertising
    • Sales from Instagram advertising: where I is the amount spent on Instagram advertising

    TechGear must spend at least $2,000 on Facebook advertising to maintain its presence on the platform. Additionally, it must spend a minimum of $1,000 and no more than $7,000 on Instagram advertising due to platform-specific constraints. The amount spent on Instagram advertising must be at least 50% of the amount spent on Facebook advertising to ensure balanced marketing efforts.

    • What is the optimal budget allocation for Facebook and Instagram, and what is the maximum sales revenue TechGear can achieve under these conditions?
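    The constraint set above translates directly into scipy.optimize.linprog. The per-dollar sales rates come from the assignment's equations, which are not reproduced here, so c_f and c_i below are placeholder values chosen only to make the sketch runnable; substitute the coefficients from the problem statement.

```python
from scipy.optimize import linprog

# Placeholder per-dollar sales rates (hypothetical; use the assignment's values).
c_f, c_i = 2.5, 3.0

# linprog minimizes, so negate the objective to maximize c_f*F + c_i*I.
c = [-c_f, -c_i]

# Inequalities in A_ub @ x <= b_ub form, with x = [F, I]:
A_ub = [
    [1, 1],      # F + I <= 10000        (total budget)
    [0.5, -1],   # I >= 0.5*F  rewritten as  0.5*F - I <= 0
]
b_ub = [10000, 0]
bounds = [(2000, None),   # F >= 2000
          (1000, 7000)]   # 1000 <= I <= 7000

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
F_opt, I_opt = res.x
print(f"Optimal: Facebook = ${F_opt:,.0f}, Instagram = ${I_opt:,.0f}")
print(f"Maximum sales = {-res.fun:,.0f}")
```

With these placeholder rates the solver spends the full budget, caps Instagram at its $7,000 ceiling, and puts the remaining $3,000 on Facebook; the real optimum depends on the actual coefficients.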

    Step 4: Using the PowerPoint Template, Analyze Data for TechGear Inc.

    • Your task is to analyze historical sales data for TechGear Inc. using various analytical techniques.
    • You'll apply concepts from linear regression, forecasting, machine learning, and prescriptive analytics.
    • The goal is to provide actionable insights to help TechGear make data-driven decisions.
    • Include Python code snippets in your slides for data exploration, regression models, forecasting, machine learning, Monte Carlo simulation, and linear programming tasks.
    • Your Python code should be accurate and well-documented to demonstrate how each analysis step was performed.
    • Your findings will be presented in a PowerPoint presentation, with speaker notes explaining your approach and insights.

    Review each question and then follow the directions outlined on each slide to summarize and present your findings for each question.

    Step 5: Review the Grading Rubric to Ensure All Criteria are Met

    Review the rubric to ensure that you understand how you will be evaluated. Also review the requirements to ensure that your Touchstone is complete.

    Step 6: Submit Your Touchstone

    Submit your completed Touchstone (as a .pptx file) using the blue button at the top of this page.

    B. Rubric

    Each criterion is scored at five levels: Advanced (100%), Proficient (85%), Acceptable (75%), Needs Improvement (50%), and Non-Performance (0%).

    Python Analysis (Shown at Key Steps) (5%)

    The inclusion of well-documented, accurate Python code for data exploration, regression models, forecasting, machine learning, Monte Carlo simulation, and linear programming.

    • Advanced (100%): Python code is shown for all major steps, including data exploration, visualization, regression models, forecasting, machine learning, Monte Carlo simulation, and linear programming. Code is well-documented and accurate.
    • Proficient (85%): Python code is shown for most key steps. Minor issues with code documentation or accuracy.
    • Acceptable (75%): Python code is shown for some steps, but critical components are missing or incomplete.
    • Needs Improvement (50%): Python code is partially shown but lacks key analyses or is significantly incorrect.
    • Non-Performance (0%): No Python code is provided.

    Data Exploration and Summary (Slide 2) (10%)

    Clear summary of data structure, accurate calculation of averages, and key insights from data exploration. Python analysis is included and well-integrated.

    • Advanced (100%): There is a comprehensive summary of data structure with accurate calculation of averages and clear insights from the exploration. Python analysis is included and well-integrated.
    • Proficient (85%): Data summary is mostly accurate, with minor errors or missing insights. Python analysis is included.
    • Acceptable (75%): Basic summary provided, but some key features are missing or inaccurate. Python analysis is incomplete.
    • Needs Improvement (50%): Minimal data exploration with several inaccuracies and no significant insights. Python analysis is missing or incorrect.
    • Non-Performance (0%): No data exploration is provided.

    Visualizing Relationships (Slide 3) (10%)

    Accurate and clear visualizations showing relationships between sales, ad spend, and discount rate. Proper interpretation of patterns and correlations.

    • Advanced (100%): Clear and accurate visualizations for all specified variables with detailed insights into patterns and correlations. Python-generated plots are used.
    • Proficient (85%): Visualizations are mostly accurate and provide useful insights. Minor errors in interpretation or plot generation.
    • Acceptable (75%): Basic visualizations are provided, but significant patterns or correlations are overlooked. Python plots are incomplete.
    • Needs Improvement (50%): Visualizations are unclear or inaccurate with limited analysis. Missing Python plots.
    • Non-Performance (0%): No visualizations are provided.

    Simple Linear Regression & Model Fit (Slides 4 & 5) (10%)

    Well-implemented regression model with correct interpretation of coefficients and R² value. Assessment of model fit through residual analysis.

    • Advanced (100%): Accurate regression model with clear interpretation of coefficients and R² value. Residual plots are well-explained, and the fit is thoroughly assessed. Python output included.
    • Proficient (85%): Regression model and assessment are mostly accurate, with minor errors or incomplete explanations.
    • Acceptable (75%): Basic model output provided, but interpretations and model fit assessments are incomplete or contain errors.
    • Needs Improvement (50%): Model is poorly developed, with incorrect interpretations and no reliable assessment of fit.
    • Non-Performance (0%): No regression model or assessment is provided.

    Multiple Linear Regression (Slide 6) (10%)

    Complete multiple regression analysis, including variable interpretation and comparison to simple regression. Python output included.

    • Advanced (100%): Complete and accurate multiple linear regression analysis, with well-explained coefficients and comparison to the simple linear regression model. Python output included.
    • Proficient (85%): Multiple regression analysis is mostly accurate, with minor errors or incomplete comparisons.
    • Acceptable (75%): Basic multiple regression is provided, but interpretations and comparisons are incomplete or partially inaccurate.
    • Needs Improvement (50%): Incomplete or incorrect multiple regression model with minimal explanation.
    • Non-Performance (0%): No multiple regression model is provided.

    Forecasting (Slide 7) (10%)

    Implementation of both forecasting methods, clear comparison, and justified selection of the best method based on business needs.

    • Advanced (100%): Both forecasting methods are accurately implemented and compared. The recommendation is well-justified and aligned with TechGear's preferences. Python output included.
    • Proficient (85%): Forecasting analysis is mostly accurate, with minor errors or incomplete justification of the chosen method.
    • Acceptable (75%): Basic forecasting analysis is provided, but one method may be missing, or justification is unclear.
    • Needs Improvement (50%): Minimal forecasting analysis with significant errors and no clear recommendation.
    • Non-Performance (0%): No forecasting analysis is provided.

    Machine Learning Models (Slide 8) (10%)

    Accurate implementation of multiple regression and decision tree models with RMSE comparison and well-supported model selection.

    • Advanced (100%): Both models are accurately built and compared using RMSE. Clear model recommendation with actionable insights. Python output included.
    • Proficient (85%): Machine learning analysis is mostly accurate, with minor errors in the comparison or recommendation.
    • Acceptable (75%): Basic models are provided, but the comparison and recommendation are incomplete or unclear.
    • Needs Improvement (50%): Models are incomplete or contain major errors. Limited or no comparison is provided.
    • Non-Performance (0%): No machine learning analysis is provided.

    Monte Carlo Simulations (Slide 9) (10%)

    Simulation correctly executed with proper assumptions, visualizations, and interpretation of results. Actionable insights are provided.

    • Advanced (100%): Simulation is well-executed with clear visualizations and interpretation of results. Actionable insights are provided. Python output included.
    • Proficient (85%): Simulation is mostly accurate, with minor errors or incomplete insights.
    • Acceptable (75%): Basic simulation is provided, but interpretation is incomplete or unclear.
    • Needs Improvement (50%): Simulation is incomplete or incorrect with minimal explanation.
    • Non-Performance (0%): No simulation is provided.

    Linear Programming (Slide 10) (10%)

    Accurate optimization model that meets constraints and clearly explains the best budget allocation for maximum sales.

    • Advanced (100%): Linear programming solution is accurate and fully meets all constraints. Clear explanation of the optimal budget allocation and maximum achievable sales. Python output included.
    • Proficient (85%): Solution is mostly accurate, with minor errors in constraints or explanation.
    • Acceptable (75%): Basic linear programming solution is provided but contains errors or incomplete explanations.
    • Needs Improvement (50%): Incomplete or incorrect solution with minimal explanation.
    • Non-Performance (0%): No linear programming solution is provided.

    Presentation Quality & Speaker Notes (15%)

    Well-organized slides with readable formatting and professional layout. Speaker notes effectively explain analysis and insights.

    • Advanced (100%): Slides are visually appealing and well-organized, with clear speaker notes that thoroughly explain the analysis and findings.
    • Proficient (85%): Slides are mostly clear and organized. Speaker notes are informative but may lack detail.
    • Acceptable (75%): Basic slides with limited visual appeal. Speaker notes are incomplete or too brief.
    • Needs Improvement (50%): Poorly organized slides with missing or unclear speaker notes.
    • Non-Performance (0%): No presentation or speaker notes provided.

    C. Requirements

    The following requirements must be met for your submission:

    • Hand in a .pptx file with slides listed above.
    • Use a readable 11- or 12-point font.
    • All writing must be appropriate for an academic context. Follow academic writing conventions (correct grammar, spelling, punctuation, and formatting).
    • Plagiarism of any kind is strictly prohibited.
    • Submission must include your name and the date (included in the template).

    This assignment provides practical experience in business analytics, honing skills essential for data-driven decision-making in business environments. Your analysis and recommendations will help TechGear optimize its operations.

    Good luck, and enjoy uncovering insights for TechGear!