
How I used Python and R to analyze and predict Medical Appointment show ups!

source link: https://www.tuicool.com/articles/eeAfYfj


A world where R and Python live together

Oct 27 · 12 min read


Photo by NEW DATA SERVICES on Unsplash

Over the past years, I’ve gotten acquainted with Python and I really appreciate the breadth of data science processes I can do with it. I find Python to be really simple to use and with the many libraries available today, I can do almost anything from web scraping to developing deep learning models. I started with Python as everyone I knew was working with it and they said it’s the way to go.

However, I recently started working with R as one of my projects requires the use of ggplot and leaflet to develop interactive visualizations. My approach was simple — practical rather than theoretical. So, I started pursuing a course on R, which I really liked to understand what R is and also started fiddling with the already existing code of that project. I loved it! There is so much we can do with R as well and it’s great for statistics because of how easy it is and the variety of inbuilt functions it has.

So I started thinking: what if I could use both Python and R together to create workable solutions? In this article, I’ll first discuss R and Python and what the trend between them looks like these days, followed by how I used the two languages together to predict whether people would show up for their medical appointments, with an accuracy of 88%. After working on my own code, I also referred to a few Kaggle kernels online and found some really useful insights.

The repository is available below:


R vs Python

The age-old debate continues: some prefer R while others prefer Python, and that is completely a matter of choice. However, having worked with both languages now, I can say with some confidence that both are great.

R brings built-in statistics functions to the table, and there is no match for the amazing ggplot library for drawing plots. Python has machine learning and deep learning libraries that are far easier to use.


Source: Google Trends

Taking a look at the number of searches tracked by Google Trends over the past 12 months, both R and Python are searched for heavily across the whole world. While the trend shows more searches for the R language, the TIOBE Index puts Python well ahead of R, as can be seen in the image below:


Source: TIOBE Index

The figures reveal that while Python is more popular today, Google searches are often more biased towards R than Python. Thus, a good mix of skills in both languages would not only make you ready for today’s challenges but also make you future-proof.

Now, let’s see how we can work with both languages!

Exploratory Data Analysis using R

Import packages

You can install packages in R using install.packages() and then load them into the notebook .Rmd file using library(). I am using ggplot2 to work with plots, gridExtra to work with grids of plots, and lubridate to work with dates.
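
A minimal setup sketch (the install lines are commented out since installation is a one-time step):

```r
# One-time installation (uncomment on first run):
# install.packages(c("ggplot2", "gridExtra", "lubridate"))

library(ggplot2)    # plots
library(gridExtra)  # arranging several plots in a grid
library(lubridate)  # parsing and extracting parts of dates
```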

Import the dataset

I took the dataset from Kaggle; it covers various medical appointments and whether patients showed up or not. It has 13 features and 1 target variable. The function read.csv() lets us import the dataset.


Dataset (First half)


Dataset (Second half)

Dataset exploration

Let’s first start by seeing the number of rows and columns and the names of those columns. The function nrow() returns the number of rows, ncol() returns the number of columns, and names() returns the list of column names. The paste() function combines several strings into a single string.

The dataset has 110,527 rows and 14 columns. The target variable is No.show and is phrased as a negation, so I’d like to change it to a more readable form. When I rename No.show to Showed_up, I’ll also have to reverse the column values: No becomes TRUE and Yes becomes FALSE.
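
The steps above can be sketched as follows; the CSV filename here is an assumption, so substitute whatever name your downloaded copy has:

```r
df <- read.csv("KaggleV2-May-2016.csv", stringsAsFactors = FALSE)

print(paste("The dataset has", nrow(df), "rows and", ncol(df), "columns"))
print(names(df))

# No.show == "No" means the patient DID show up, so flip the
# negated column into a positively named logical one.
df$Showed_up <- df$No.show == "No"
df$No.show <- NULL
```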

We have 13 features, let’s explore them further:

  1. “PatientId”: A unique identifier for each patient; not useful for any predictions.
  2. “AppointmentID”: A unique identifier for each appointment.
  3. “Gender”: Whether the person is female, denoted by F, or male, denoted by M.
  4. “ScheduledDay”: The day on which the appointment was scheduled.
  5. “AppointmentDay”: The day of the appointment.
  6. “Age”: Age of the person.
  7. “Neighbourhood”: The neighbourhood to which the person belongs.
  8. “Scholarship”: Whether the person receives welfare support (defined on Wikipedia).
  9. “Hipertension”: Whether the person has hypertension.
  10. “Diabetes”: Whether the person has diabetes.
  11. “Alcoholism”: Whether the person is alcoholic.
  12. “Handcap”: Whether the person is physically challenged.
  13. “SMS_received”: Whether the person received a text message about the appointment.

The target variable is:

  1. “Showed_up”: Depicts whether the person showed up for the appointment.

Next, let’s see the summary of the dataset.


Dataset summary

A closer look at the dataset reveals that features like Alcoholism and Handcap, which look like continuous values, are actually categorical variables. The date columns are not parsed as dates, and the minimum of Age is -1, which is erroneous data, so I’ll drop those rows.
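
A sketch of this cleanup, assuming `df` is the data frame loaded earlier (as.Date keeps just the date part of the raw timestamp strings):

```r
# Parse the date columns so they behave as dates, not strings.
df$ScheduledDay   <- as.Date(df$ScheduledDay)
df$AppointmentDay <- as.Date(df$AppointmentDay)

# Drop the row(s) with an impossible negative age.
df <- df[df$Age >= 0, ]
```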

Visualizations

Target class

Let’s see the distribution between how many people showed up for the appointment and how many did not. I’ll use ggplot(), which takes the dataset as an argument and then the x value inside aes(). geom_bar() specifies that we want a bar plot, with the fill color of each bar as white, the outline color as orange, and the width of each bar as 0.4. The title is defined using ggtitle(), and with labs() I define the x and y axis labels.
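
Putting those pieces together, the plot could look like this, assuming `df` with the logical `Showed_up` column from earlier:

```r
ggplot(df, aes(x = Showed_up)) +
  geom_bar(fill = "white", color = "orange", width = 0.4) +
  ggtitle("Target class distribution") +
  labs(x = "Showed up", y = "Count")
```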


Target class distribution

There are more people who showed up for appointments than people who didn’t. We will definitely need to make sure the model does not get biased towards the majority class.

Gender distribution

Let’s see whether any imbalance between males and females exists in the dataset. I’ll create a bar plot and rename the column values to Female for F and Male for M. Just like the bar plot above, I used all the same functions and relabeled the x axis using scale_x_discrete().


Gender distribution

There are more females who have set up appointments as compared to males.

Binary classes distribution

As there are many binary columns, defined using TRUE and FALSE, I’ll plot all of them with the fill color based on the target class. The aes() function takes a fill argument, which lets us color the given data by another variable, in our case the target. The function grid.arrange() helps us lay out a series of plots as rows and columns. The argument ncol = 2 states that I want 2 columns, with the number of rows determined by how many plots there are.
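
One way to sketch this, assuming `df` from earlier and the column names as they appear in the dataset:

```r
# Helper that plots one binary column, filled by the target class.
plot_binary <- function(col) {
  ggplot(df, aes(x = .data[[col]], fill = Showed_up)) + geom_bar()
}

grid.arrange(plot_binary("Scholarship"), plot_binary("Hipertension"),
             plot_binary("Diabetes"),    plot_binary("Alcoholism"),
             plot_binary("Handcap"),     plot_binary("SMS_received"),
             ncol = 2)
```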


Binary class distribution

All plots reveal that the proportion of shows to no-shows stays roughly the same for either boolean value. The data is quite evenly spread.

Difference between appointment and schedule day

I noticed that the gap between when someone schedules an appointment and the date of the actual appointment might also be useful. Thus, I decided to create a new column, Date.diff, in the dataset, which is simply the difference between the two days. As the difference is in days, I converted it into numerical values after unlisting, using as.numeric(unlist(date.diff)).
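
A sketch of the new column and its plot, assuming `df` with the parsed date columns from earlier:

```r
# Days between scheduling and the actual appointment
# (subtracting Date objects yields a difftime in days).
date.diff <- df$AppointmentDay - df$ScheduledDay
df$Date.diff <- as.numeric(unlist(date.diff))

ggplot(df, aes(x = Date.diff)) +
  geom_bar() +
  labs(x = "Days between scheduling and appointment", y = "Count")
```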


Date diff between schedule and appointment day

The plot shows that a lot of appointments take place on the same day they were scheduled. Let’s remove that huge spike and see whether we can find anything apart from it. So, in the plot above, I’ll add xlim(1, 200) to start the axis from 1.


Date diff between schedule and appointment day (≥ 1 day)

Notice that as the difference in days increases, the count of appointments decreases. Appointments booked 50 or more days in advance are rare. There seems to be no clear pattern, as the counts before 50 days rise and fall irregularly.

Time of appointment set up

The hour of the day or the month of the appointment might also affect whether someone shows up. So, let’s create that data and see if there is such an effect. While the hour data would have been useful, I noticed that the hour information of each appointment is identical, so we can’t use it. Let’s just work with the month data.

Using the month() function, I was able to retrieve the month from AppointmentDay. Then I used it to plot a bar plot.
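
A sketch of this step, assuming `df` from earlier and lubridate loaded:

```r
df$Month <- month(df$AppointmentDay)  # lubridate's month(), 1 through 12

ggplot(df, aes(x = factor(Month))) +
  geom_bar() +
  labs(x = "Month of appointment", y = "Count")
```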


Monthly distribution of appointments

There are very few appointments in April while May has the maximum number of appointments.

prop.table() converts a contingency table into proportions along the given margin. As we can see from the data, the month has almost no effect on show-ups, since the ratio of TRUE to FALSE is nearly the same each month. We can thus drop the column.

We can take a subset of the dataset using the subset() function; by setting the select argument to -Month, we keep everything except the month.
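
Both steps in one sketch, assuming `df` with the `Month` column created above:

```r
# Proportion of no-shows vs. shows within each month
# (margin = 1 normalizes across rows, i.e. within each month).
prop.table(table(df$Month, df$Showed_up), margin = 1)

# The ratios barely move from month to month, so drop the column.
df <- subset(df, select = -Month)
```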

Neighborhoods

As we take a broader look, there are many different neighborhoods as well. Let’s explore about them using bar charts.


Neighborhoods in the dataset

The neighborhood data is quite varied. Jabour has a very high number of appointments, while some neighborhoods have fewer than 10. We should keep the data, but we will create dummy variables to accommodate each value in this column during model training.

Age

Finally, let’s see how age varies across the dataset. We already removed the outliers. I’ll now use two scatter plots, one overlaid on the other, one for each class of the target variable.

I first select the records with Showed_up as TRUE and then use table() and as.data.frame() to create a frequency table called age.show. I similarly create the data frame age.no_show for Showed_up as FALSE. Calling geom_point() twice, I create two overlapping scatter plots.
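
A sketch of those steps, assuming `df` from earlier (table() yields columns named Var1 and Freq; the colors are my own choice):

```r
age.show    <- as.data.frame(table(df$Age[df$Showed_up == TRUE]))
age.no_show <- as.data.frame(table(df$Age[df$Showed_up == FALSE]))

ggplot() +
  geom_point(data = age.show,    aes(x = Var1, y = Freq), color = "blue") +
  geom_point(data = age.no_show, aes(x = Var1, y = Freq), color = "red") +
  labs(x = "Age", y = "Number of appointments")
```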


Age count and showed up status

The plot reveals that the number of appointments varies greatly with age. The maximum number of appointments is for infants. There is a drop and then a spike at around the age of 50. Finally, as age progresses, fewer and fewer people set up appointments. Thus, age might have an effect on the target variable too.

Now that we have a fairly good idea about the dataset, let’s save it in its modified form and then use Python libraries to make predictions. I’ll use write.csv() to save the dataset to the file dataset_modified.csv. Setting row.names to FALSE ensures that the row indices are not saved.
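
The export itself is a one-liner, assuming `df` holds the modified data frame:

```r
write.csv(df, "dataset_modified.csv", row.names = FALSE)
```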

Classification using Python

I’ll develop an Artificial Neural Network and, after the appropriate data engineering, train it on the given data to make the classification.

Import libraries

I’ll import the necessary libraries, including pandas to work with CSV data, sklearn for data processing, and keras with tensorflow to create the Artificial Neural Network.

Import dataset

I’ll use the function read_csv() to import the dataset file, dataset_modified.csv . I’ll then use the head() method to take a look at the first 5 rows.
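
A sketch of the load-and-peek step; a tiny inline CSV stands in for dataset_modified.csv here so the snippet runs on its own:

```python
import io

import pandas as pd

# Stand-in for dataset_modified.csv; with the real file you would
# call pd.read_csv("dataset_modified.csv") instead.
csv_text = """Gender,Age,Scholarship,Showed_up
F,62,0,True
M,56,0,True
F,8,0,False
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())  # first rows of the frame
```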


Dataset (Modified) — Part 1


Dataset (Modified) — Part 2

Data Engineering

There are a number of data engineering steps I’ll perform before the dataset is ready for actual use.

Missed appointment

Kaggle always teaches me a lot. Even after all the steps I had planned, I found another one that was really helpful. Greatly inspired by this Kaggle kernel, I decided to check the effect on the target of whether someone had missed an appointment before.

The correlation value is really high (~0.61), so this would be an important factor for classifying the target variable.
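
One way to build such a feature can be sketched on a toy frame; the exact grouping and ordering used in the kernel is an assumption here:

```python
import pandas as pd

# Toy stand-in: three patients, appointments in chronological order.
df = pd.DataFrame({
    "PatientId": [1, 1, 1, 2, 2, 3],
    "Showed_up": [False, True, True, True, True, False],
})

# For each appointment, flag whether that patient had already missed
# an earlier one (cumulative count of prior no-shows > 0).
df["Missed_before"] = (
    df.groupby("PatientId")["Showed_up"]
      .transform(lambda s: (~s).cumsum().shift(fill_value=0) > 0)
)

print(df)
```

On the real dataset the correlation between this flag and the target can then be read off with `df["Missed_before"].astype(int).corr(df["Showed_up"].astype(int))`.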

Remove extra columns

As the columns PatientId, AppointmentID, ScheduledDay and AppointmentDay have no direct effect on the target variable, I’ll remove them using drop().

Dummy columns

I’ll convert the column Neighbourhood into a set of dummy columns. I’ll drop the original column using drop() and then use get_dummies() to create the columns which are added to the original dataset using concat() .
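
A sketch on a toy frame (the real dataset has far more rows and neighbourhoods; the `Nbhd` prefix is my own choice):

```python
import pandas as pd

df = pd.DataFrame({
    "Neighbourhood": ["JABOUR", "CENTRO", "JABOUR"],
    "Age": [30, 45, 8],
})

# One-hot encode Neighbourhood, then stitch the dummies back on.
dummies = pd.get_dummies(df["Neighbourhood"], prefix="Nbhd")
df = pd.concat([df.drop(columns=["Neighbourhood"]), dummies], axis=1)
print(df.columns.tolist())  # ['Age', 'Nbhd_CENTRO', 'Nbhd_JABOUR']
```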

Mapping columns

As the Gender column is categorized as either F or M, I’ll use the map() function to convert these values to numbers that the ANN can understand.
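
For example (the particular 0/1 encoding is an assumption; any consistent mapping works):

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["F", "M", "F"]})

# Replace the category labels with numeric codes.
df["Gender"] = df["Gender"].map({"F": 0, "M": 1})
print(df["Gender"].tolist())  # [0, 1, 0]
```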

Split test train data

The next step is to split the data into features and target and then create 33% test data and 67% train data using train_test_split().

The final training data has 91 columns in total.
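
The split itself, sketched with a stand-in feature matrix and target (the random_state value is my own choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)  # stand-in feature matrix, 100 rows
y = np.arange(100) % 2              # stand-in binary target

# Hold out 33% for testing; random_state pins the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
print(X_train.shape, X_test.shape)  # (67, 2) (33, 2)
```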

Scale data

Neural networks work best with scaled data, so I’ll use StandardScaler() to scale the data before training the neural network. For X_train, we use fit_transform, which fits the scaler on those values and also transforms them. For X_test, we just use transform, which ensures the scaler applies its knowledge from X_train to transform the test data.
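
The fit-on-train, transform-on-test pattern, sketched with tiny stand-in arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0], [10.0], [20.0]])
X_test = np.array([[10.0], [30.0]])

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # learns mean/std from train data only
X_test = sc.transform(X_test)        # reuses the train mean/std, no refitting

print(X_train.ravel())
print(X_test.ravel())
```

Because the test set is scaled with the training statistics, no information from the test data leaks into the preprocessing.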

Model generation

The dataset is ready to be used, so I’ll now create the Artificial Neural Network using the Sequential() class. I add four Dense layers with 512, 1024, 2048 and 1 neurons respectively. I also include Dropout() after each hidden layer so that the neural network does not overfit the data. For the first Dense layer we need to specify input_dim as well, which is equal to the number of columns in our data (91).

Our network has 2,673,665 trainable parameters.
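
A sketch of that architecture; the activations and the 0.5 dropout rate are assumptions, as the article only specifies the layer sizes:

```python
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.models import Sequential

# 512 -> 1024 -> 2048 -> 1 with Dropout after each hidden layer,
# on 91 engineered input features.
model = Sequential([
    Input(shape=(91,)),
    Dense(512, activation="relu"),
    Dropout(0.5),
    Dense(1024, activation="relu"),
    Dropout(0.5),
    Dense(2048, activation="relu"),
    Dropout(0.5),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
print(model.count_params())  # 2673665
```

The parameter count checks out by hand: (91·512+512) + (512·1024+1024) + (1024·2048+2048) + (2048+1) = 2,673,665.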

I’ll now train the model with a validation split of 0.1, meaning the model will train on 90% of the training data and validate its learning on the remaining 10%.

Model prediction

I’ll now test the model on the test data and output both the confusion matrix and the accuracy.
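
The evaluation step, sketched with stand-in predictions; in the real workflow y_pred comes from thresholding model.predict(X_test) at 0.5:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_test = np.array([1, 1, 1, 0, 0, 1, 0, 1])  # stand-in ground truth
y_pred = np.array([1, 1, 1, 1, 0, 1, 1, 1])  # stand-in predictions

print(confusion_matrix(y_test, y_pred))  # rows: true 0/1, cols: predicted 0/1
print(accuracy_score(y_test, y_pred))    # 0.75
```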

While the model achieved an accuracy of around 88%, the confusion matrix shows that it correctly predicts when someone will show up for an appointment but gets only about 50% of the no-shows right.

We can further improve the model by exploring more data features and doing data engineering to identify other factors. But for now, 88% is a good enough accuracy for this model.

Conclusion

In this article, we used R for data exploration and Python for an Artificial Neural Network, developing a data workflow that involves both languages.

