Detecting Heart Failure using Machine Learning (Part 1)

Mar 22, 2021

Machine Learning in the medical field has come a long way, solving one complicated problem after another. As a result, medical professionals have started relying on Machine Learning tools to detect various diseases. Today, I will demonstrate one such use case: detecting heart failure.

Heart failure is a condition in which the heart can’t pump enough blood to meet the body’s needs. In some cases, the heart can’t fill with enough blood. In other cases, the heart can’t pump blood to the rest of the body with enough force. It is a serious condition and requires immediate attention.

There are multiple ways to approach this problem. One can use scans to detect visual anomalies, which makes it a Computer Vision problem, or one can process structured medical data, collected and published by medical researchers, to detect heart failure. I will be moving forward with the latter.

The notebook and the dataset I am referring to are in the GitHub repository, in case you want to dive into them (Link: https://github.com/preeyonuj/Heart-Failure-Detection). This article is part of a series on detecting heart failure and focuses on the introduction and basic EDA.

Dataset

I got this dataset from Kaggle (Link: https://www.kaggle.com/andrewmvd/heart-failure-clinical-data). With just 12 features, the feature set isn’t exactly rich, but it works fine for this demonstration. It contains general information about the subject, such as their age, sex, and whether they smoke, and more specific medical features such as the level of the CPK enzyme, the platelet count, and whether they have diabetes. The predicted variable here is the ‘DEATH_EVENT’ feature, which states whether the subject lived or died due to heart failure.

As for the descriptive information about the dataset, it has 299 data points. All the features are either integers or floats.

Data types of all the columns in the dataset

I changed the data type of the ‘age’ feature to integer because every value had a fractional part of 0, which defeats the purpose of a float. A look at the descriptive statistics of the numerical features shows very dissimilar ranges of values, which will have to be adjusted later.
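As a rough illustration of the steps above, here is a minimal sketch of inspecting the data types, casting ‘age’, and comparing value ranges with pandas. The rows below are made-up toy values that only borrow the dataset's column names (an assumption on my part, so the snippet runs without the CSV file); the guard before the cast is also my own addition.

```python
import pandas as pd

# Toy rows that only mimic the dataset's column names; NOT the real data.
df = pd.DataFrame({
    "age": [75.0, 55.0, 65.0, 50.0, 90.0, 60.0],
    "ejection_fraction": [20, 38, 20, 20, 40, 60],
    "serum_creatinine": [1.9, 1.1, 1.3, 1.9, 2.1, 1.0],
    "platelets": [265000.0, 263358.03, 162000.0, 210000.0, 204000.0, 454000.0],
    "time": [4, 6, 7, 7, 8, 200],
    "DEATH_EVENT": [1, 1, 1, 1, 1, 0],
})

# One dtype per column.
print(df.dtypes)

# Cast 'age' to integer only if no value carries a fractional part.
if (df["age"] % 1 == 0).all():
    df["age"] = df["age"].astype(int)

# Compare the value ranges of the numerical features side by side.
summary = df.describe().T[["min", "max", "mean", "std"]]
print(summary)
```

Even in this toy frame, ‘platelets’ lives in the hundreds of thousands while ‘serum_creatinine’ stays in single digits, which is the kind of mismatch that usually calls for scaling before modelling.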

Exploratory Data Analysis (EDA)

For EDA, the first thing I checked was whether the dataset has any NaN values, and to my surprise, it had none. That is an ideal scenario, because with so few rows, losing any of them to missing values would hurt. Next, I checked the correlation among features.

The feature with the highest correlation with the predicted variable seems to be ‘time’, which is the follow-up period for the subject. Other highly correlated features include ‘age’, ‘ejection_fraction’, and ‘serum_creatinine’. I will look deeper into these features. For all my graphs, I have used Plotly, as it is quite interactive.
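The missing-value check can be sketched in a couple of lines of pandas. The toy frame below is hypothetical and has one NaN planted on purpose so that the check has something to find; the actual dataset reported zero for every column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the NaN is planted deliberately for illustration.
toy = pd.DataFrame({
    "age": [75, 55, 65, 50],
    "serum_creatinine": [1.9, 1.1, np.nan, 1.9],
    "DEATH_EVENT": [1, 0, 1, 0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame.
has_missing = bool(toy.isna().any().any())
print("Any missing values:", has_missing)
```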
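One way to rank features by their correlation with the target is sketched below. It assumes Pearson correlation (the pandas default) and uses made-up toy rows with the dataset's column names, so the printed numbers say nothing about the real data.

```python
import pandas as pd

# Made-up toy rows; only the column names follow the dataset.
toy = pd.DataFrame({
    "age": [75, 55, 65, 50, 90, 60, 45, 70],
    "ejection_fraction": [20, 38, 20, 45, 25, 60, 50, 30],
    "serum_creatinine": [1.9, 1.1, 1.3, 0.9, 2.1, 1.0, 0.8, 1.8],
    "time": [4, 120, 7, 200, 8, 250, 180, 30],
    "DEATH_EVENT": [1, 0, 1, 0, 1, 0, 0, 1],
})

# Pairwise Pearson correlation between all columns.
corr = toy.corr()

# Keep only the target column, drop its self-correlation,
# and sort by absolute strength so the sign does not affect the ranking.
target_corr = (
    corr["DEATH_EVENT"]
    .drop("DEATH_EVENT")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

Sorting by absolute value matters here because a strongly negative correlation is just as informative as a strongly positive one.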

Age

Age is a discrete integer feature which, as the name suggests, describes the age of the subject.

The distribution of the feature looks roughly normal but right-skewed, with a mean around 60. I binned the values to get a better look, and in the table below you can see the death count increase for ‘age’ above 70.
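The binning step could look roughly like this. The bin edges and the toy rows are my own hypothetical choices for illustration, not necessarily the ones used in the notebook.

```python
import pandas as pd

# Made-up toy rows; only the column names follow the dataset.
toy = pd.DataFrame({
    "age": [42, 55, 58, 61, 65, 68, 72, 75, 80, 90],
    "DEATH_EVENT": [0, 0, 1, 0, 0, 1, 1, 0, 1, 1],
})

# Hypothetical bin edges; right=False makes each bin [low, high).
bins = [40, 50, 60, 70, 80, 100]
toy["age_bin"] = pd.cut(toy["age"], bins=bins, right=False)

# Subjects and deaths per age bin, plus the share of deaths in each bin.
by_bin = toy.groupby("age_bin", observed=False)["DEATH_EVENT"].agg(
    subjects="count", deaths="sum"
)
by_bin["death_rate"] = by_bin["deaths"] / by_bin["subjects"]
print(by_bin)
```

Looking at the death rate next to the raw count helps when the bins hold very different numbers of subjects.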

Time

Time is a discrete integer feature that describes the follow-up period (in days) of the subject. Since this feature is highly correlated with the predicted variable, I decided to generate box plots of it for each predicted class.

As clearly seen in the box plot, ‘time’ separates the two classes fairly well. There is some overlap at the higher end, but the non-overlapping part outweighs it by a lot.

Ejection Fraction

Ejection Fraction is another highly correlated feature that describes the percentage of blood leaving the heart at each contraction. Plotting the distribution for this feature indicates a fairly normal distribution with an unusual peak after 60. The modified box plots show the same trend as ‘time’, with the classes being partially separable with respect to ‘ejection_fraction’.

Serum Creatinine

Serum Creatinine is a continuous float feature that describes the level of creatinine in the blood. I checked the distribution of the feature using a distribution plot.

The graph depicts a roughly normal distribution concentrated around 1, with a long right tail of outliers reaching up to 9. The box plot again shows a partial difference in the range of values for the binary classes.

Next, we will look into harnessing these features in the Feature Engineering section and then building models based on them. All the features seem to show some differentiation between the classes of the predicted variable and could potentially be exploited. Stay tuned for the next part!
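The shape described above can also be checked numerically. The sketch below is my own addition, not something from the notebook: on made-up toy values, it computes the skewness, flags values beyond the common 1.5 × IQR box-plot fence, and compares the class medians.

```python
import pandas as pd

# Made-up toy rows; only the column names follow the dataset.
toy = pd.DataFrame({
    "serum_creatinine": [0.8, 0.9, 1.0, 1.0, 1.1, 1.2, 1.3, 1.8, 2.7, 9.0],
    "DEATH_EVENT": [0, 0, 0, 0, 0, 1, 0, 1, 1, 1],
})
col = toy["serum_creatinine"]

# Positive skewness and mean > median both point to a longer right tail.
skewness = col.skew()
mean, median = col.mean(), col.median()

# Values above Q3 + 1.5 * IQR, the usual box-plot convention for outliers.
q1, q3 = col.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = col[col > upper_fence]

# Median level per class, to see whether the classes differ.
per_class = toy.groupby("DEATH_EVENT")["serum_creatinine"].median()

print(f"skewness={skewness:.2f}, mean={mean:.2f}, median={median:.2f}")
print(outliers)
print(per_class)
```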

In the meantime, if you want to know more about me and my work, here are my links:

GitHub Profile: https://github.com/preeyonuj
Previous Medium Article: https://medium.com/analytics-vidhya/aptos-blindness-challenge-part-1-baseline-efficientnet-c7a256daa6e5?sk=d0e445f99daa71d79f0452665f1a59db
LinkedIn Profile: www.linkedin.com/in/pb1807

References:

1) https://www.nhlbi.nih.gov/health-topics/heart-failure#:~:text=Heart%20failure%20is%20a%20condition,the%20body%20with%20enough%20force.
2) https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5