import pandas as pd

url = 'https://gist.githubusercontent.com/atomashevic/5b7a2224895b9954aa2acaf69b9b6849/raw/9c3c1aa66c95d86632bee1e5e232c7b2cb24612b/healthcare.csv'
data = pd.read_csv(url)
Exercise Notebook
This notebook contains a take-home exercise that covers material from the introductory workshop. You will also need to do some self-study to complete it.
Let’s get started with a new dataset. The dataset contains information about patients from a healthcare system.
Load the dataset and inspect its structure.
data.head()
| | Name | Age | Gender | Blood Type | Medical Condition | Date of Admission | Doctor | Hospital | Insurance Provider | Billing Amount | Room Number | Admission Type | Discharge Date | Medication | Test Results |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Bobby JacksOn | 30 | Male | B- | Cancer | 2024-01-31 | Matthew Smith | Sons and Miller | Blue Cross | 18856.281306 | 328 | Urgent | 2024-02-02 | Paracetamol | Normal |
1 | LesLie TErRy | 62 | Male | A+ | Obesity | 2019-08-20 | Samantha Davies | Kim Inc | Medicare | 33643.327287 | 265 | Emergency | 2019-08-26 | Ibuprofen | Inconclusive |
2 | DaNnY sMitH | 76 | Female | A- | Obesity | 2022-09-22 | Tiffany Mitchell | Cook PLC | Aetna | 27955.096079 | 205 | Emergency | 2022-10-07 | Aspirin | Normal |
3 | andrEw waTtS | 28 | Female | O+ | Diabetes | 2020-11-18 | Kevin Wells | Hernandez Rogers and Vang, | Medicare | 37909.782410 | 450 | Elective | 2020-12-18 | Ibuprofen | Abnormal |
4 | adrIENNE bEll | 43 | Female | AB+ | Cancer | 2022-09-19 | Kathleen Hanna | White-White | Aetna | 14238.317814 | 458 | Urgent | 2022-10-09 | Penicillin | Abnormal |
As we can see, the dataset contains both quantitative and qualitative variables. There are a few interesting numerical variables, such as `Age` and `Billing Amount`.
Using everything you have learned so far, answer the following questions:
- How many patients are diagnosed with each medical condition?
- What is the average age of patients with each medical condition? Which medical condition has the oldest patients, and which one has the youngest patients?
- Are medical conditions evenly distributed by gender? Illustrate the answer with a pie chart.
- Filter the dataset to include only patients from the 5 most frequent hospitals. Then, calculate the median billing amount for each hospital.
- A single median per hospital is probably too coarse, so let’s break down the median billing amount by hospital and medical condition. Which hospital has the highest median billing amount for each medical condition?
- This is an important piece of information. Visualize it with a bar chart.
- There is something weird in hospital names. Examine the data and try to find out what is wrong. Hint: think of hospitals as part of larger healthcare systems or corporations. Reanalyze the data with this in mind.
Exercise 1
How many patients are diagnosed with each medical condition?
Remember, you can use the `value_counts()` method to count the number of occurrences of each unique value in a column, or you can use the `groupby()` method to group the data by the unique values in a column.
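As a reminder of how the two approaches differ, here is a minimal sketch on made-up data. The `toy` dataframe and its `fruit` column are illustrative assumptions, not part of the healthcare dataset.

```python
import pandas as pd

# Toy data for illustration only; "fruit" is a made-up column name.
toy = pd.DataFrame({"fruit": ["apple", "pear", "apple", "plum", "apple", "pear"]})

# Option 1: value_counts() counts each unique value,
# sorted from most frequent to least frequent.
counts_vc = toy["fruit"].value_counts()

# Option 2: groupby() + size() gives the same counts,
# sorted by group label instead of by frequency.
counts_gb = toy.groupby("fruit").size()

print(counts_vc)
print(counts_gb)
```

Both results are Series indexed by the unique values, so either one can be used for the exercise.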
# your solution here
Exercise 2
What is the average age of patients with each medical condition? Which medical condition has the oldest patients, and which one has the youngest patients?
You can get the average age of patients in each group by using the `groupby()` method and then chaining the `mean()` method.
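The chaining pattern can be sketched on made-up data as follows. The `team` and `score` columns are assumptions chosen for illustration. `idxmax()` and `idxmin()` are one possible way to read off the groups with the highest and lowest mean.

```python
import pandas as pd

# Toy data for illustration only; column names are made up.
toy = pd.DataFrame({
    "team": ["red", "red", "blue", "blue", "green"],
    "score": [10, 20, 40, 50, 5],
})

# Group by one column, select the numeric column, then chain mean().
mean_score = toy.groupby("team")["score"].mean()

# idxmax()/idxmin() return the group label of the largest/smallest mean.
highest = mean_score.idxmax()
lowest = mean_score.idxmin()

print(mean_score)
print(highest, lowest)
```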
### your solution here
Exercise 3
Are medical conditions evenly distributed by gender?
The solution will contain multiple pie charts (one for each condition). You will need to set the `subplots` parameter to `True` in the `plot()` method to create multiple pie charts.
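With `subplots=True`, pandas draws one pie per column of a dataframe, so the data first has to be arranged as a table of counts. A minimal sketch on made-up data, assuming matplotlib is installed; the `pet` and `owner` columns are illustrative, and `pd.crosstab()` is just one way to build the table:

```python
import pandas as pd

# Toy data for illustration only; column names are made up.
toy = pd.DataFrame({
    "pet": ["cat", "cat", "dog", "dog", "dog", "cat"],
    "owner": ["adult", "child", "adult", "adult", "child", "adult"],
})

# Rows become the pie slices, columns become the separate pies.
table = pd.crosstab(toy["owner"], toy["pet"])

# subplots=True draws one pie chart per column ("cat" and "dog" here).
axes = table.plot(kind="pie", subplots=True, figsize=(8, 4), legend=False)
```

If the slices of all pies have similar proportions, the categories are distributed roughly evenly across the groups.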
### your solution here
Exercise 4
Filter the dataset to include only patients from the 5 most frequent hospitals. Then, calculate the median billing amount for each hospital.
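One possible way to keep only the most frequent categories is to combine `value_counts()`, `head()` and `isin()`. The sketch below uses made-up data with hypothetical `shop` and `price` columns and keeps the top 2 categories rather than the top 5.

```python
import pandas as pd

# Toy data for illustration only; column names are made up.
toy = pd.DataFrame({
    "shop": ["A", "A", "A", "B", "B", "C"],
    "price": [10.0, 30.0, 20.0, 5.0, 15.0, 100.0],
})

# Labels of the 2 most frequent shops (value_counts() sorts by frequency).
top_shops = toy["shop"].value_counts().head(2).index

# Keep only the rows whose shop is among those labels.
subset = toy[toy["shop"].isin(top_shops)]

# Median price per remaining shop.
median_price = subset.groupby("shop")["price"].median()
print(median_price)
```

Note that when several categories are tied in frequency, which of them `head()` keeps depends on the sort order, so it is worth looking at the counts before filtering.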
### your solution here
Exercise 5
A single median per hospital is probably too coarse, so let’s break down the median billing amount by hospital and medical condition. Which hospital has the highest median billing amount for each medical condition?
You can use the `groupby()` method to group the data by multiple columns. You will also need the `median()` method to calculate the median billing amount.
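Grouping by several columns works by passing a list to `groupby()`. Here is a minimal sketch on made-up data; the `shop`, `item` and `price` columns are assumptions, and reshaping with `unstack()` followed by `idxmax()` is only one of several ways to find the top group per category.

```python
import pandas as pd

# Toy data for illustration only; column names are made up.
toy = pd.DataFrame({
    "shop": ["A", "A", "B", "B", "A", "B"],
    "item": ["tea", "tea", "tea", "jam", "jam", "jam"],
    "price": [2.0, 4.0, 5.0, 1.0, 6.0, 3.0],
})

# Group by two columns at once; the result has a two-level index.
medians = toy.groupby(["shop", "item"])["price"].median()

# Move "item" to the columns: one row per shop, one column per item.
wide = medians.unstack("item")

# For each item (column), the shop (row label) with the highest median.
best_shop = wide.idxmax()

print(wide)
print(best_shop)
```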
### your solution here
Exercise 6
This is an important piece of information. Visualize it with a bar chart.
### your solution here
Exercise 7
There is something weird in hospital names. Examine the data and try to find out what is wrong. Hint: think of hospitals as part of larger healthcare systems or corporations. Reanalyze the data with this in mind.
To keep things simple, assume there are three hospital corporations in the data: “Smith”, “Johnson”, “Williams”. You can use the `str.contains()` method to create a new variable `data['Hospital Corporation']` on the fly, or you can recode the `Hospital` variable using a list comprehension.
List comprehension is harder, but it’s a nice exercise which combines everything we’ve learned so far about working with text and transforming Pandas dataframes.
Hint 2: You can check whether ‘Smith’ is in the hospital name by using the `in` operator. You can use it in a list comprehension to avoid loops.
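Both approaches can be sketched on made-up data. The `company` column, the "Acme" and "Globex" names, and the "Other" fallback label are all illustrative assumptions, not values from the healthcare dataset.

```python
import pandas as pd

# Toy data for illustration only; names and columns are made up.
toy = pd.DataFrame({
    "company": ["Acme and Sons", "Globex Ltd", "Sons Acme Group", "Initech"],
})

# Option 1: str.contains() builds a boolean mask, which is then used
# to fill a new column for the matching rows.
has_acme = toy["company"].str.contains("Acme", regex=False)
toy["parent_mask"] = "Other"
toy.loc[has_acme, "parent_mask"] = "Acme"

# Option 2: a list comprehension with the `in` operator and
# chained conditional expressions, one output value per row.
toy["parent"] = [
    "Acme" if "Acme" in name
    else "Globex" if "Globex" in name
    else "Other"
    for name in toy["company"]
]

print(toy)
```

The list comprehension handles several names in a single pass, while the mask approach needs one `str.contains()` call per name.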
### your solution here