Exercise Notebook

This notebook contains a take-home exercise covering material from the introductory workshop. You will also need to do some self-study to complete it.

Let’s get started with a new dataset. It contains information about patients from a healthcare system.

Load the dataset and inspect its structure.

import pandas as pd

url = 'https://gist.githubusercontent.com/atomashevic/5b7a2224895b9954aa2acaf69b9b6849/raw/9c3c1aa66c95d86632bee1e5e232c7b2cb24612b/healthcare.csv'

data = pd.read_csv(url)

data.head()

	Name	Age	Gender	Blood Type	Medical Condition	Date of Admission	Doctor	Hospital	Insurance Provider	Billing Amount	Room Number	Admission Type	Discharge Date	Medication	Test Results
0	Bobby JacksOn	30	Male	B-	Cancer	2024-01-31	Matthew Smith	Sons and Miller	Blue Cross	18856.281306	328	Urgent	2024-02-02	Paracetamol	Normal
1	LesLie TErRy	62	Male	A+	Obesity	2019-08-20	Samantha Davies	Kim Inc	Medicare	33643.327287	265	Emergency	2019-08-26	Ibuprofen	Inconclusive
2	DaNnY sMitH	76	Female	A-	Obesity	2022-09-22	Tiffany Mitchell	Cook PLC	Aetna	27955.096079	205	Emergency	2022-10-07	Aspirin	Normal
3	andrEw waTtS	28	Female	O+	Diabetes	2020-11-18	Kevin Wells	Hernandez Rogers and Vang,	Medicare	37909.782410	450	Elective	2020-12-18	Ibuprofen	Abnormal
4	adrIENNE bEll	43	Female	AB+	Cancer	2022-09-19	Kathleen Hanna	White-White	Aetna	14238.317814	458	Urgent	2022-10-09	Penicillin	Abnormal

As we can see, the dataset contains both quantitative and qualitative variables. There are a few interesting numerical variables, such as Age and Billing Amount.

Using everything you have learned so far, answer the following questions:

How many patients are diagnosed with each medical condition?
What is the average age of patients with each medical condition? Which medical condition has the oldest patients, and which one has the youngest patients?
Are medical conditions evenly distributed by gender? Illustrate the answer with a pie chart.
Filter the dataset to include only patients from the five most frequent hospitals. Then, calculate the median billing amount for each hospital.
This information is probably not precise, so let’s break down the median billing amount by hospital and medical condition. Which hospital has the highest median billing amount for each medical condition?
This is an important piece of information. Visualize it with a bar chart.
There is something weird in hospital names. Examine the data and try to find out what is wrong. Hint: think of hospitals as part of larger healthcare systems or corporations. Reanalyze the data with this in mind.

Exercise 1

How many patients are diagnosed with each medical condition?

Remember, you can use the value_counts() method to count the number of occurrences of each unique value in a column or you can use the groupby() method to group the data by unique values in a column.
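As a reminder of how the two approaches compare, here is a minimal sketch on a small made-up table. The `toy` DataFrame and its `fruit` column are hypothetical and are not part of the healthcare dataset.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the healthcare dataset
toy = pd.DataFrame({"fruit": ["apple", "pear", "apple", "plum", "apple", "pear"]})

# value_counts() counts each unique value and sorts by frequency (descending)
counts_vc = toy["fruit"].value_counts()

# groupby() + size() gives the same counts, sorted by group label instead
counts_gb = toy.groupby("fruit").size()

print(counts_vc)
print(counts_gb)
```

Both results are Series indexed by the unique values, so they differ only in their default ordering.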

# your solution here

Exercise 2

What is the average age of patients with each medical condition? Which medical condition has the oldest patients, and which one has the youngest patients?

You can get the average age of patients in each group by using the groupby() method and then chaining the mean() method.
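The pattern can be sketched on made-up data as follows. The `toy` DataFrame, `team`, and `score` are hypothetical names chosen for illustration. Using `idxmax()` and `idxmin()` is one possible way to find the groups with the highest and lowest mean, and sorting the result would work just as well.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the healthcare dataset
toy = pd.DataFrame({
    "team": ["red", "red", "blue", "blue", "green"],
    "score": [10, 20, 30, 50, 5],
})

# Group by one column, select the numeric column, then chain mean()
mean_by_team = toy.groupby("team")["score"].mean()

# idxmax()/idxmin() return the group label with the largest/smallest mean
highest = mean_by_team.idxmax()
lowest = mean_by_team.idxmin()

print(mean_by_team)
print(highest, lowest)
```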

### your solution here

Exercise 3

Are medical conditions evenly distributed by gender?

The solution will contain multiple pie charts (one for each condition). To create them, set the subplots parameter to True in the plot() method.
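A minimal sketch of the `subplots=True` mechanism on made-up data is shown below. The `toy` DataFrame and the `pet` and `owner` columns are hypothetical. Building the count table with `pd.crosstab()` is an assumption made for this sketch, and a `groupby()` followed by `unstack()` would produce an equivalent table.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the healthcare dataset
toy = pd.DataFrame({
    "pet": ["cat", "cat", "dog", "dog", "dog", "cat"],
    "owner": ["adult", "child", "adult", "adult", "child", "adult"],
})

# Rows become pie slices, columns become separate pies
table = pd.crosstab(toy["owner"], toy["pet"])

# subplots=True draws one pie chart per column of the table
axes = table.plot(
    kind="pie",
    subplots=True,
    figsize=(8, 4),
    autopct="%1.0f%%",  # print percentages on the slices
    legend=False,
)
```

Each column of the table gets its own pie, so the variable you want to compare across charts should end up in the columns.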

### your solution here

Exercise 4

Filter the dataset to include only patients from the five most frequent hospitals. Then, calculate the median billing amount for each hospital.
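One possible way to keep only the most frequent categories is to combine `value_counts()` with `isin()`. The sketch below uses a hypothetical `toy` DataFrame with `shop` and `sale` columns and keeps the top 2 categories instead of 5. Note that `head()` breaks ties at the cutoff by order, so check for ties in real data.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the healthcare dataset
toy = pd.DataFrame({
    "shop": ["A", "A", "A", "B", "B", "C", "D"],
    "sale": [10.0, 20.0, 30.0, 5.0, 15.0, 100.0, 1.0],
})

# Labels of the 2 most frequent shops (value_counts sorts by frequency)
top_shops = toy["shop"].value_counts().head(2).index

# Keep only the rows whose shop is one of those labels
subset = toy[toy["shop"].isin(top_shops)]

# Median of the numeric column within each remaining group
median_sale = subset.groupby("shop")["sale"].median()
print(median_sale)
```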

### your solution here

Exercise 5

This information is probably not precise, so let’s break down the median billing amount by hospital and medical condition. Which hospital has the highest median billing amount for each medical condition?

You can use the groupby() method to group the data by multiple columns. Also, you will need the median() method to calculate the median billing amount.
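Grouping by two columns can be sketched on made-up data like this. The `toy` DataFrame and the `shop`, `item`, and `price` names are hypothetical. The second step, using `idxmax()` within each first-level group, is one possible way to pick the winner per group and is an assumption of this sketch, not the only approach.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the healthcare dataset
toy = pd.DataFrame({
    "shop": ["A", "A", "B", "B", "A", "B"],
    "item": ["tea", "tea", "tea", "tea", "jam", "jam"],
    "price": [2.0, 4.0, 5.0, 7.0, 9.0, 3.0],
})

# Passing a list of columns gives one median per (item, shop) pair
medians = toy.groupby(["item", "shop"])["price"].median()

# For each item, find the (item, shop) label with the highest median
best = medians.groupby(level="item").idxmax()

print(medians)
print(best)
```

The result of the first step is a Series with a two-level index. Calling `unstack()` on it would turn one level into columns, which is often easier to read.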

### your solution here

Exercise 6

This is an important piece of information. Visualize it with a bar chart.

### your solution here

Exercise 7

There is something weird in hospital names. Examine the data and try to find out what is wrong. Hint: think of hospitals as part of larger healthcare systems or corporations. Reanalyze the data with this in mind.

To keep things simple, assume there are three hospital corporations in the data: “Smith”, “Johnson”, and “Williams”. You can use the str.contains() method to create a new variable data['Hospital Corporation'] on the fly, or you can recode the Hospital variable using a list comprehension.

List comprehension is harder, but it’s a nice exercise which combines everything we’ve learned so far about working with text and transforming Pandas dataframes.

Hint 2: You can check whether ‘Smith’ is in the hospital name by using the in operator. You can use it in a list comprehension to avoid loops.
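Both techniques can be sketched on made-up store names. The `toy` DataFrame, the `store` column, and the chain names "Acme" and "Globex" are hypothetical and chosen only for illustration. The "Other" fallback label is also an assumption of this sketch.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the healthcare dataset
toy = pd.DataFrame({
    "store": ["Acme North", "Globex and Sons", "East Acme Ltd",
              "Globex-West", "Corner Shop"],
})

# Option 1: str.contains() builds a boolean mask for each chain name
toy["chain"] = "Other"
for chain in ["Acme", "Globex"]:
    mask = toy["store"].str.contains(chain, regex=False)
    toy.loc[mask, "chain"] = chain

# Option 2: a list comprehension using the `in` operator on each name
toy["chain_lc"] = [
    "Acme" if "Acme" in name else "Globex" if "Globex" in name else "Other"
    for name in toy["store"]
]

print(toy)
```

Both options produce the same labels here. The list comprehension checks the names in a fixed order, so a name containing two chain names would get the first match.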

### your solution here