Exercise Notebook

This notebook contains a take-home exercise that covers material from the introductory workshop. You will also need to do some self-study to complete it.

Let’s get started with a new dataset. The dataset contains information about patients from a healthcare system.

Load the dataset and inspect its structure.

import pandas as pd

url = 'https://gist.githubusercontent.com/atomashevic/5b7a2224895b9954aa2acaf69b9b6849/raw/9c3c1aa66c95d86632bee1e5e232c7b2cb24612b/healthcare.csv'

data = pd.read_csv(url)
data.head()
   Name           Age  Gender  Blood Type  Medical Condition  Date of Admission  Doctor            Hospital                    Insurance Provider  Billing Amount  Room Number  Admission Type  Discharge Date  Medication   Test Results
0  Bobby JacksOn  30   Male    B-          Cancer             2024-01-31         Matthew Smith     Sons and Miller             Blue Cross          18856.281306    328          Urgent          2024-02-02      Paracetamol  Normal
1  LesLie TErRy   62   Male    A+          Obesity            2019-08-20         Samantha Davies   Kim Inc                     Medicare            33643.327287    265          Emergency       2019-08-26      Ibuprofen    Inconclusive
2  DaNnY sMitH    76   Female  A-          Obesity            2022-09-22         Tiffany Mitchell  Cook PLC                    Aetna               27955.096079    205          Emergency       2022-10-07      Aspirin      Normal
3  andrEw waTtS   28   Female  O+          Diabetes           2020-11-18         Kevin Wells       Hernandez Rogers and Vang,  Medicare            37909.782410    450          Elective        2020-12-18      Ibuprofen    Abnormal
4  adrIENNE bEll  43   Female  AB+         Cancer             2022-09-19         Kathleen Hanna    White-White                 Aetna               14238.317814    458          Urgent          2022-10-09      Penicillin   Abnormal

As we can see, the dataset contains both quantitative and qualitative variables. There are a few interesting numerical variables, such as Age and Billing Amount.
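To check which columns pandas treats as numerical, one option is to look at the column dtypes. The sketch below uses a small hand-copied excerpt of the first three rows shown above (the name `sample` is only for illustration), so it runs without downloading anything. The same calls should work on the full `data` frame.

```python
import pandas as pd

# Hand-copied excerpt of the head() output above, for illustration only
sample = pd.DataFrame({
    "Age": [30, 62, 76],
    "Gender": ["Male", "Male", "Female"],
    "Medical Condition": ["Cancer", "Obesity", "Obesity"],
    "Billing Amount": [18856.281306, 33643.327287, 27955.096079],
})

# dtypes shows the storage type pandas inferred for each column
print(sample.dtypes)

# select_dtypes keeps only columns of the requested kind
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```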

Using everything you have learned so far, answer the following questions:

  1. How many patients are diagnosed with each medical condition?
  2. What is the average age of patients with each medical condition? Which medical condition has the oldest patients, and which one has the youngest patients?
  3. Are medical conditions evenly distributed by gender? Illustrate the answer with a pie chart.
  4. Filter the dataset to include only patients from the top 5 most frequent hospitals. Then, calculate the median billing amount for each hospital.
  5. This information is probably not precise, so let’s break down the median billing amount by hospital and medical condition. Which hospital has the highest median billing amount for each medical condition?
  6. This is an important piece of information. Visualize this with a bar chart.
  7. There is something weird in hospital names. Examine the data and try to find out what is wrong. Hint: think of hospitals as part of larger healthcare systems or corporations. Reanalyze the data with this in mind.

Exercise 1

How many patients are diagnosed with each medical condition?

Remember, you can use the value_counts() method to count the number of occurrences of each unique value in a column, or you can use the groupby() method to group the data by the unique values in a column.
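As a reminder of how the two approaches compare, here is a minimal sketch on a made-up table. The `pets` data and its column names are invented for illustration and are not part of the healthcare dataset.

```python
import pandas as pd

# Toy data, invented for illustration only
pets = pd.DataFrame({
    "species": ["cat", "dog", "cat", "bird", "dog", "cat"],
    "weight_kg": [4.1, 12.5, 3.8, 0.3, 20.0, 5.2],
})

# Option 1: count occurrences of each unique value, most frequent first
counts_vc = pets["species"].value_counts()

# Option 2: group rows by the column, then count the rows in each group
counts_gb = pets.groupby("species").size()

print(counts_vc)
print(counts_gb)
```

Both give the same counts. value_counts() sorts by frequency, while groupby() sorts by the group label.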

### your solution here

Exercise 2

What is the average age of patients with each medical condition? Which medical condition has the oldest patients, and which one has the youngest patients?

You can get the average age of patients in each group by using the groupby() method and then chaining the mean() method.
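The pattern looks roughly like this on a made-up table. The `pets` data is invented for illustration. Using idxmax() and idxmin() is one possible way to pick out the extreme groups, not the only one.

```python
import pandas as pd

# Toy data, invented for illustration only
pets = pd.DataFrame({
    "species": ["cat", "dog", "cat", "bird", "dog", "cat"],
    "weight_kg": [4.1, 12.5, 3.8, 0.3, 20.0, 5.2],
})

# Group by one column, select another, and chain mean()
mean_weight = pets.groupby("species")["weight_kg"].mean()
print(mean_weight)

# idxmax()/idxmin() return the group label with the largest/smallest mean
heaviest = mean_weight.idxmax()
lightest = mean_weight.idxmin()
print(heaviest, lightest)
```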

### your solution here

Exercise 3

Are medical conditions evenly distributed by gender?

The solution will contain multiple pie charts (one for each condition). To create them, set the subplots parameter to True in the plot() method.
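A minimal sketch of the subplots idea, again on invented data. It assumes matplotlib is installed, because pandas uses it for plotting. The table is arranged so that each column becomes one pie, which is one possible layout rather than the required one.

```python
import pandas as pd

# Toy data, invented for illustration only
pets = pd.DataFrame({
    "species": ["cat", "dog", "cat", "bird", "dog", "cat"],
    "sex": ["F", "M", "M", "F", "F", "F"],
})

# Count rows per (sex, species) pair, then pivot species into columns;
# fill_value=0 replaces missing combinations, which a pie chart cannot draw
table = pets.groupby(["sex", "species"]).size().unstack(fill_value=0)
print(table)

# subplots=True draws one pie per column of the table
axes = table.plot(kind="pie", subplots=True, figsize=(12, 4), legend=False)
```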

### your solution here

Exercise 4

Filter the dataset to include only patients from the top 5 most frequent hospitals. Then, calculate the median billing amount for each hospital.
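One common way to keep only the most frequent categories is to combine value_counts() with isin(). The sketch below shows the idea on an invented `shops` table with a top 2 instead of a top 5. All names here are hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration only
shops = pd.DataFrame({
    "shop": ["A", "B", "A", "C", "B", "A", "D", "B"],
    "sale": [10.0, 20.0, 30.0, 5.0, 40.0, 50.0, 7.0, 60.0],
})

# Labels of the 2 most frequent shops
top_shops = shops["shop"].value_counts().head(2).index

# Keep only rows whose shop is one of those labels
filtered = shops[shops["shop"].isin(top_shops)]

# Median sale per remaining shop
median_sale = filtered.groupby("shop")["sale"].median()
print(median_sale)
```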

### your solution here

Exercise 5

This information is probably not precise, so let’s break down the median billing amount by hospital and medical condition. Which hospital has the highest median billing amount for each medical condition?

You can use the groupby() method to group the data by multiple columns. Also, you will need the median() method to calculate the median billing amount.
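Grouping by two columns can be sketched as follows on an invented `sales` table. Reshaping with unstack() and then calling idxmax(axis=1) is one possible way to find the top group per row. It is an assumption about how you might approach the question, not the expected solution.

```python
import pandas as pd

# Toy data, invented for illustration only
sales = pd.DataFrame({
    "shop": ["A", "A", "B", "B", "A", "B", "A", "B"],
    "product": ["tea", "tea", "tea", "tea",
                "coffee", "coffee", "coffee", "coffee"],
    "price": [2.0, 4.0, 5.0, 7.0, 10.0, 6.0, 12.0, 8.0],
})

# Pass a list of columns to group by more than one variable
medians = sales.groupby(["product", "shop"])["price"].median()

# Move the shop level into columns: rows are products, columns are shops
table = medians.unstack("shop")
print(table)

# For each product (row), the shop (column) with the highest median
best_shop = table.idxmax(axis=1)
print(best_shop)
```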

### your solution here

Exercise 6

This is an important piece of information. Visualize this with a bar chart.
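A table with one row per category and one column per group can be plotted directly as a grouped bar chart. Here is a minimal sketch on invented numbers. It assumes matplotlib is installed, and the labels are placeholders.

```python
import pandas as pd

# Toy table, invented for illustration: rows are products, columns are shops
table = pd.DataFrame(
    {"A": [11.0, 3.0], "B": [7.0, 6.0]},
    index=pd.Index(["coffee", "tea"], name="product"),
)

# Each row becomes a cluster of bars, one bar per column
ax = table.plot(kind="bar", figsize=(6, 4))
ax.set_ylabel("Median price")
ax.set_title("Median price by product and shop")
```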

### your solution here

Exercise 7

There is something weird in hospital names. Examine the data and try to find out what is wrong. Hint: think of hospitals as part of larger healthcare systems or corporations. Reanalyze the data with this in mind.

To keep things simple, assume there are three hospital corporations in the data: “Smith”, “Johnson”, and “Williams”. You can use the str.contains() method to create a new variable data['Hospital Corporation'] on the fly, or you can recode the Hospital variable using a list comprehension.

List comprehension is harder, but it’s a nice exercise which combines everything we’ve learned so far about working with text and transforming Pandas dataframes.

Hint 2: You can check whether ‘Smith’ is in the hospital name by using the in operator. You can use it in a list comprehension to avoid loops.
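Both approaches can be sketched on an invented `stores` table. The brand names "Acme" and "Globex" and the "Other" fallback label are hypothetical choices for this illustration. Note that both checks are case-sensitive.

```python
import pandas as pd

# Toy data, invented for illustration only
stores = pd.DataFrame({
    "store": ["Acme North", "Globex East", "West Acme", "Initech"],
})

# Approach 1: str.contains() returns a boolean mask, used here with .loc
stores["brand"] = "Other"
stores.loc[stores["store"].str.contains("Acme"), "brand"] = "Acme"
stores.loc[stores["store"].str.contains("Globex"), "brand"] = "Globex"

# Approach 2: a list comprehension with the in operator
stores["brand_lc"] = [
    "Acme" if "Acme" in name
    else "Globex" if "Globex" in name
    else "Other"
    for name in stores["store"]
]

print(stores)
```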

### your solution here