1 Introduction

Since COVID-19 began to spread in December 2019, the healthcare sector has played a key role in combating the disease. On March 11, 2020, the World Health Organization (WHO) declared COVID-19 a pandemic [1]. Hospitals, as one of the main players, have allocated a large part of their resources to dealing with this disease. However, the growing number of patients caused by successive variants of the disease led to shortages of hospital resources such as ICU beds, medicine, and oxygen. Oxygen is a critical element in the treatment of COVID-19 patients; according to the WHO, about 15% of cases require medical oxygen [2]. Reducing respiratory failure caused by COVID-19 depends on the availability of oxygen and ventilation [3]. India was one of the countries in which a lack of medical oxygen affected hospital services for COVID-19 patients [4].

To avoid shortages of supplies in hospitals during this pandemic, accurate and timely prediction of the required equipment, such as oxygen and ventilators, is necessary. Artificial intelligence (AI), which has been widely used in medicine, can detect and learn non-linear relationships among variables and support diagnosis, treatment, and outcome prediction [5]. In healthcare, AI helps to uncover unknown patterns in data and to make effective decisions accordingly [6]. It has been applied to the COVID-19 pandemic for screening, analyzing, and tracking patients, and for making medical predictions [7]. Furthermore, machine learning (ML), a subset of AI, is used for computational epidemiology, early detection, diagnosis, and disease progression of COVID-19, as well as for clinical management issues such as ICU admission, mechanical ventilation, multi-organ failure, and death [8, 9]. To make valid predictions in medicine, a supervised ML model requires a dataset containing a number of features and a relevant outcome [10]. For COVID-19, these features can be demographic information, symptoms, lab results, and the background of the patient.

Among the variables related to a COVID-19 patient’s background, such as diabetes, cancer, smoking, and kidney and liver diseases [11], the effect of opium consumption on COVID-19 patients needs more research [12]. Opium is one of the most common drugs among Iranian people and has been used for more than five centuries. Iran has one of the highest rates of opium use in the world, at about 2.7% of its population [13]. In 2013, a national-scale study was designed to evaluate the spread of substance abuse and opium addiction in Iran and to divide the country into low- to high-risk zones. The initial results of the study revealed that Kerman ranks highest and is the province with the most opium users [14]. A survey on drug abuse in Kerman [15] indicated that the prevalence of substance abuse in the rural areas of Kerman was 22.5% and the rate of addiction was 6%.

As mentioned above, since the spread of the novel coronavirus, problems related to the capacity of the health system to serve the increasing number of patients have grown [16]. One of the basic requirements during a pandemic is to accurately predict the required resources and likely outcomes. Among the many resources required in the treatment of COVID-19 patients, medical oxygen is an essential one [17]. Several studies have applied ML models to predict the requirement for medical oxygen. Lee et al. [18] developed a prediction model that identifies COVID-19 patients at risk of requiring medical oxygen. They used data from 221 patients, with C-reactive protein, hypertension, age, and neutrophil and lymphocyte counts as parameters. Their model achieved a high AUC. To predict the need for mechanical ventilation, the authors of [19] used a cohort of 1980 COVID-19 patients. Their data included demographics, patient’s background, vital signs at the emergency room, and laboratory data. Their results demonstrated that age and fever were associated with the risk of ventilator requirement. In another article [20], a machine learning approach was used to predict mechanical ventilation for COVID-19 patients. The input data included 12 clinical features of 197 COVID-19 patients collected from US hospitals. Their model predicted the mechanical ventilation requirement using blood factors and other variables such as blood pressure and heart rate.

In this research, a machine learning approach is proposed to predict the requirement for oxygen-based treatment based on patient characteristics and clinical data. One of the main contributions of this research compared with previous studies is that it predicts the outcome (oxygen requirement) at the time of patient admission to the hospital, using only the symptoms and the patient’s background, without requiring lab results or further information. This can accelerate resource planning, especially at the peak of the disease, and help avoid shortages. The second novelty of this article, which is significant from a medical viewpoint, is its assessment of the impact of opium use on the requirement for oxygen-based treatment and on the fatality rate of COVID-19 cases, using data collected from Kerman, which has a high prevalence of opium users in Iran. The results can assist hospitals in forecasting the need for oxygen and managing this resource effectively.

2 Data and methods

In this section, the characteristics of the dataset and the preprocessing operations applied to the raw data are described. Figure 1 gives an overview of the steps taken to build the prediction model. In the first phase, the required data of hospitalized COVID-19 patients were collected from hospitals. Next, the raw data were cleaned and preprocessed. In the third phase, the relevant features were selected. In the fourth step, the prepared data were split into train and test sets. Then, the train set was used to train a number of machine learning models. After model training, the test set was used to predict the outcomes. The prediction models were compared based on their accuracy and capability.

Fig. 1 Steps of building machine learning approach for predicting oxygen-requirement treatment in hospitalized COVID-19 patients

2.1 Dataset population

Data for this study were collected from two local hospitals in Kerman province in the south-east of Iran. The data were acquired from the hospital database and written records of 398 hospitalized patients with positive COVID-19 tests (PCR) over a period of 6 months, from February to July 2020. The admitted patients’ information contained demographic data, patient’s background, and symptoms of the disease. The mean age of patients was 41.11 years, with a median of 39 and a mode of 33 years. Male patients outnumbered female patients and comprised 54% of the total records. The numbers of discharged cases and deaths were 377 (94.72%) and 21 (5.27%), respectively.

The severity of disease, based on the patient’s condition and symptoms, was divided into three categories: mild, moderate, and severe. Of the patients, 6% experienced severe disease, while 94% experienced mild to moderate disease. Patients with a mild condition received only medication, the moderate group received medication and mask oxygen, and the severe cases used ventilators in addition to medication. Figure 2 shows the flowchart of case selection for this study. From a total of 400 cases, two records with missing values were excluded. Of the 398 hospitalized COVID-19 patients, 28.18% received oxygen-based treatment, of whom 13.39% were opioid-addicted. Non-oxygen-required treatments were applied to 286 patients.

Fig. 2 Flow chart of selecting COVID-19 patients for predicting the necessity of oxygen-based treatment considering opioid and non-opioid addicted cases

2.2 Data preparation and feature selection

The original dataset included 57 features. In the data cleaning phase, non-required data such as patient ID were removed; the job variable was also omitted because of its variation and the difficulty of classifying jobs. After the study’s objectives were specified, a consultation with medical specialists was conducted to determine the most relevant characteristics and features. The final features included demographic characteristics (two variables), patient’s background (nine variables), disease symptoms (eight variables), and a target variable (type of treatment). Demographic information comprises gender and age. History of other diseases such as diabetes, blood pressure, and lung disease belongs to the patient’s background class, and the last group includes initial symptoms of COVID-19 such as cough, fever, and shortness of breath.

The type of treatment was divided into two classes: oxygen-required treatment and non-oxygen treatment. The first group included patients who used oxygen masks or ventilators in addition to medication, while the second group received only medication. Opium, its extracts, and heroin made up the four addiction-related variables; for each of them, the age at first use, the daily amount consumed, the number of daily uses, and the route of use (taken orally or smoked) were collected. To add the opioid addiction variable to the database, these drugs, which are common among addicted people in Kerman, were combined into one binary variable. Since the data were collected exclusively for scientific purposes, and in order to be more accurate, the missing values in the electronic dataset were filled in from the patients’ written records, and only two incomplete medical records were omitted. Except for age, all variables are binary, and no outliers were observed.
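The merging of the addiction-related variables into a single binary feature can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study; the column names in `DRUG_COLUMNS` and the sample records are hypothetical, since the paper does not list the variable names.

```python
# Hypothetical column names: the paper does not name the four variables.
DRUG_COLUMNS = ("opium", "opium_extract_a", "opium_extract_b", "heroin")


def opioid_addiction_flag(record):
    """Return 1 if the patient record reports use of any listed drug, else 0."""
    return int(any(record.get(col, 0) for col in DRUG_COLUMNS))


# Two made-up patient records, for illustration only.
patients = [
    {"age": 62, "opium": 1, "heroin": 0},
    {"age": 35},
]
flags = [opioid_addiction_flag(p) for p in patients]
```

A missing drug field is treated here as "not used", which is an assumption of the sketch and not something stated in the paper.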

2.3 Machine learning models

Five machine learning algorithms, namely logistic regression, neural networks, the C5.0 decision tree, random forest, and XGBoost, were applied to predict the requirement for oxygen-based treatment in COVID-19 patients.

Logistic regression (LR) is a well-suited model for binary outcomes in fields such as medical science, especially for exploring the relationship between risk factors and the incidence of disease [21, 22]. In this paper, a multivariable LR with 19 predictors was used to predict oxygen and non-oxygen treatment for hospitalized patients with positive COVID-19 tests. To fit the LR to the dataset, the iteratively reweighted least squares method was applied [23].
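To make the fitting method concrete, the sketch below runs iteratively reweighted least squares for a logistic regression with an intercept and a single binary predictor. It is an illustration in Python under simplifying assumptions (one predictor instead of 19, toy data, a hand-written 2x2 solve); the function name `fit_logistic_irls` and the data are hypothetical and are not taken from the study.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_logistic_irls(x, y, n_iter=25, tol=1e-10):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by IRLS (Newton's method)."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Accumulate X^T W X (2x2, symmetric) and the gradient X^T (y - p).
        s00 = s01 = s11 = g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)          # IRLS weight of this observation
            s00 += w
            s01 += w * xi
            s11 += w * xi * xi
            g0 += yi - p
            g1 += (yi - p) * xi
        # Solve the 2x2 weighted least squares system for the update step.
        det = s00 * s11 - s01 * s01
        d0 = (s11 * g0 - s01 * g1) / det
        d1 = (s00 * g1 - s01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Toy data: symptom absent (x=0) -> 1 of 4 positive; present (x=1) -> 3 of 4.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [0, 0, 0, 1, 0, 1, 1, 1]
intercept, slope = fit_logistic_irls(x, y)
odds_ratio = math.exp(slope)
```

With a single binary predictor the fitted probabilities equal the observed group proportions, so the exponentiated slope is the odds ratio between the two groups (here 3 / (1/3) = 9), which is a convenient check on the iteration.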

In a neural network (NN) algorithm, which is modeled on biological nervous systems, neurons are represented by nodes that learn from the input data to optimize the final output [24]. The NNs in this study used one input layer with 19 factor variables, one hidden layer, and an output layer. The entropy fitting method was used to fit the NNs to the dataset. The maximum number of iterations and the maximum number of weights were set to 100 and 1000, respectively.

C5.0 is the improved version of the C4.5 decision tree algorithm developed by Quinlan [25]. It builds on the ID3 algorithm and reduces the misclassification errors caused by noise in the training data set [26]. For this algorithm in our prediction model, the number of boosting iterations was set to ten and the trees were decomposed into a rule-based model.

Random forest (RF) [27] is a machine learning method commonly used for classification and regression. Its applicability to a wide range of prediction problems and the simplicity of its parameter tuning are the two main reasons for the popularity of the RF algorithm [28]. For the proposed RF prediction model, the number of trees and the minimum size of terminal nodes were set to 200 and 1, respectively. The number of variables randomly sampled as candidates at each split was set to ten.

XGBoost implements the gradient boosting concept and provides parallel tree boosting to solve a wide range of regression and classification problems quickly and accurately. This algorithm applies a more regularized model formalization to control over-fitting [29]. In this study, the maximum depth, number of rounds, and subsample ratio of columns for XGBoost were set to 1, 150, and 0.8, respectively. All parameters related to the five ML algorithms are displayed in Table 1.
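For quick reference, the parameter values stated in this section can be gathered into one configuration structure. The sketch below uses Python; the dictionary name and all key names are hypothetical labels chosen for readability and are not the argument names of the R packages used in the study. Only the values come from the text.

```python
# Values are those stated in Section 2.3; key names are illustrative only.
MODEL_PARAMS = {
    "logistic_regression": {"n_predictors": 19, "fitting": "IRLS"},
    "neural_network": {
        "hidden_layers": 1,
        "fitting": "entropy",
        "max_iterations": 100,
        "max_weights": 1000,
    },
    "c5_0": {"boosting_iterations": 10, "rule_based": True},
    "random_forest": {
        "n_trees": 200,
        "min_terminal_node_size": 1,
        "variables_per_split": 10,
    },
    "xgboost": {
        "max_depth": 1,
        "n_rounds": 150,
        "column_subsample_ratio": 0.8,
    },
}

for model, params in MODEL_PARAMS.items():
    print(model, params)
```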

Table 1 Parameters’ values of five applied ML algorithms

3 Results

The 19 predictor features were categorized into three classes (as shown in Table 2). Patients aged 19–60 years constituted 74.37% of all cases and 61.60% of oxygen-required patients. Patients under 18 years old made up 8.54% of all cases and 2.67% of those who received oxygen-based treatment. Among the patient’s background features, blood pressure, lung disease, and diabetes, with ratios of 32.14%, 28.57%, and 27.67%, respectively, were the most frequent in the oxygen-based treatment group. In the symptom category, shortness of breath, at 71.42%, had the highest frequency in patients with oxygen requirements. Fever (62.5%) and cough (59.82%) were in second and third places.

Table 2 Statistics of oxygen and non-oxygen required patients based on model variables

3.1 Opioid addiction

In this study, information about the use of opium and its consequences for hospitalized patients with positive COVID-19 tests was collected. Reflecting the prevalence of opium consumption in this province, opium-addicted cases accounted for 6.53% of the total sample. As illustrated in Fig. 3, the frequency of opioid addiction is higher in males than in females. Of the hospitalized patients, 26 individuals were addicted: 4 females and 22 males. The average age of female drug users was 65.25 years and that of male drug users was 60.5 years. The minimum age of initiation of consumption was 15 years, and the maximum was 79 years, with a mean of 44.11 years.

Fig. 3 (a) Percentage of opium-addicted and non-addicted hospitalized COVID-19 patients among females and males in two local hospitals in Iran. (b) Normalized percentage of opium-addicted and non-addicted hospitalized COVID-19 patients among females and males

A point to consider is the high mortality rate among opioid-addicted cases. The fatality ratio among non-addicted patients is 4.03%, while for opioid-addicted cases it is 23.07%. The survival and death of both groups (addicted and non-addicted) are illustrated in Fig. 4.

Fig. 4 (a) Fatality percentage for opium-addicted and non-addicted hospitalized COVID-19 patients in two local hospitals in Iran. (b) Normalized fatality percentage for opium-addicted and non-addicted hospitalized COVID-19 patients

To determine the relationship between opioid addiction and oxygen-required treatment, the Chi-square test was conducted. The Chi-square test of independence is used to determine whether there is a relationship between two categorical variables. In this case, the null hypothesis assumes that there is no relationship between addiction to opioids and the requirement for oxygen, and the alternative hypothesis assumes that there is an association between these two nominal variables. After computing the test statistic, it was found that \(p<0.001\). Hence, the null hypothesis was rejected at the 95% confidence level. It can be concluded that the relationship between these two variables is statistically significant.
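The test can be illustrated with a short Python sketch that computes the Pearson Chi-square statistic for a 2x2 table. The cell counts below are an approximate reconstruction from the percentages reported in this paper (26 addicted patients, 286 non-oxygen patients, and 13.39% of the oxygen group being addicted) and are therefore an assumption, not the authors' table. The sketch also assumes no continuity correction, which the paper does not specify.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson Chi-square test of independence for the table [[a, b], [c, d]].

    Returns the statistic and the p-value for 1 degree of freedom.
    No continuity correction is applied (an assumption of this sketch).
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Reconstructed (approximate) counts: rows are addicted / non-addicted,
# columns are oxygen-based / non-oxygen treatment.
chi2, p_value = chi_square_2x2(15, 11, 97, 275)
print(f"chi2 = {chi2:.2f}, p = {p_value:.5f}")
```

Under these assumed counts the p-value falls below 0.001, in line with the result reported above; with a continuity correction the p-value would be somewhat larger.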

3.2 Prediction models

The dataset was randomly split into training and test sets in a ratio of 80:20 while keeping the class distribution balanced. Ten-fold cross-validation was applied to the training set to validate and evaluate the reliability of the developed models. ML algorithms including LR, NNs, C5.0, XGBoost, and RF were applied to build five different prediction models. To compare and evaluate the performance of the models, accuracy, receiver operating characteristic (ROC) curves, Cohen’s Kappa, balanced accuracy, the confusion matrix, and the area under the curve (AUC) were calculated. The ROC curve displays the performance of a classifier as the discrimination cut-off value changes over the range of the predictor variable. Points higher above the diagonal line indicate better predictive value of the test [30].
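The splitting scheme can be sketched in Python as a class-stratified 80:20 split followed by ten folds on the training indices. This is an illustration only: the study used R, the helper names are hypothetical, the class sizes (112 oxygen, 286 non-oxygen) are taken from the counts reported in Section 2.1, and the resulting set sizes need not match those of the original analysis.

```python
import random


def stratified_split(labels, test_ratio=0.2, seed=0):
    """Split indices into train/test so each class keeps the same ratio."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_ratio)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def k_fold_indices(indices, k=10, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, validation


# 1 = oxygen-based treatment, 0 = non-oxygen treatment.
labels = [1] * 112 + [0] * 286
train_idx, test_idx = stratified_split(labels)
folds = list(k_fold_indices(train_idx, k=10))
```

The folds in this sketch are drawn at random without stratification; a stratified variant would apply the same per-class logic inside `k_fold_indices`.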

The accuracy and kappa of the applied ML models on the train set are displayed in Fig. 5. For the accuracy metric, LR and XGBoost obtained the maximum (90.90%), and RF and NNs, with a value of 90.62%, were in second place. The mean accuracy of all models was just above 80%, except for NNs (78.28%). The kappa measurement was also calculated for all models. Kappa is a measure of inter-rater agreement [31]. In machine learning, it measures the level of agreement between the true values and the predicted values. The LR algorithm achieved the highest kappa (0.7924), and XGBoost was in second place with 0.7785. RF and NNs obtained the minimum kappa values, 0.0350 and 0.1428, respectively.

Fig. 5 Comparison of accuracy and kappa of five applied ML algorithms in train set using box plot

Figure 6 illustrates the ROC curves of the five ML models on the test set. In comparing the performance of the algorithms, XGBoost has the highest AUC, followed by LR. NNs, C5.0, and RF were in third, fourth, and fifth positions, respectively. The confidence intervals of all five models are desirable, ranging from 74.1 to 96.5%. The proposed approach was implemented in R software using libraries such as Caret [32], ggplot2 [33], and Liver [34]. The Caret environment includes various machine learning models such as NNs, RF, and LR. The overall runtime of the proposed model was 3.35 min, which makes it an appropriate tool for decision-making at peak times. XGBoost had the longest and LR the shortest runtime among all algorithms. Compared with other schemes, such as making decisions based on historical data and previous experience, this approach helps decision-makers decide more accurately in a shorter time.

Fig. 6 Receiver operating characteristic (ROC) curves of five applied ML models to predict the oxygen-based treatment for COVID-19 patients. Area under the curve (AUC) and confidence interval are specified for each model
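As an illustration of what the AUC measures, the following Python sketch computes it as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case (ties count as one half). The scores and labels are made-up toy values, not model outputs from this study, and the function name is hypothetical.

```python
def auc_from_scores(scores, labels):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0      # positive ranked above negative
            elif p == q:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy predicted probabilities and true classes (1 = oxygen required).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
auc = auc_from_scores(scores, labels)
```

An AUC of 0.5 corresponds to the diagonal of the ROC plot, and 1.0 corresponds to perfect separation of the two classes.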

Sensitivity and specificity are two statistical performance metrics for classification models and diagnostic tests. Sensitivity is the ability of the model to correctly identify the true positives, while specificity evaluates how well the model identifies the true negatives [35]. To calculate them, the models’ confusion matrices (Fig. 7) were used. Table 3 presents the performance of the models in terms of accuracy, kappa, sensitivity, specificity, and balanced accuracy. LR and NNs achieved the highest, and identical, performance on all five metrics. RF and XGBoost performed better than C5.0.
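The five reported metrics can all be derived from a 2x2 confusion matrix, as the Python sketch below shows. The counts used in the example are hypothetical: they were chosen so that sensitivity and specificity round to the values reported for LR and NNs (0.9273 and 0.7308), and they are not taken from Fig. 7. Which class is treated as "positive" is likewise an assumption.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute common metrics from the cells of a binary confusion matrix."""
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / n
    balanced_accuracy = (sensitivity + specificity) / 2
    # Cohen's kappa: observed agreement corrected for chance agreement.
    chance_pos = ((tp + fp) / n) * ((tp + fn) / n)
    chance_neg = ((fn + tn) / n) * ((fp + tn) / n)
    expected = chance_pos + chance_neg
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "kappa": kappa,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
    }


# Hypothetical counts that reproduce the reported sensitivity/specificity.
metrics = classification_metrics(tp=51, fn=4, fp=7, tn=19)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Balanced accuracy is useful here because the two treatment classes are of unequal size, so plain accuracy alone can overstate performance on the smaller class.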

Fig. 7 Confusion matrices of five ML models (a-e). Each plot shows the true positives (sensitivity) and true negatives (specificity) for predicting oxygen-based treatment for hospitalized COVID-19 patients

Table 3 Performance measures of the five ML models in test set

In a classification model, each variable has a specific impact on the predictions. Variable importance is a technique that indicates the relative importance of each input variable in a model’s prediction. The more important a variable, the more the model depends on it to make an accurate prediction [36]. Variable importance can be used to determine the most and least important variables in the model and to improve the model’s performance by dropping ineffective features. In NNs, XGBoost, and RF, age had the highest score (100%), while in LR and C5.0, shortness of breath and cough, with 100% relative importance, were the most effective variables in prediction. In LR and C5.0, the importance of age was 59.97% and 91.48%, respectively. Shortness of breath and cough are among the five most important features in four of the classification models (RF, LR, C5.0, NNs). Four variables, namely age, shortness of breath, cough, and fever, appear in the top five features of NNs, LR, and RF. For shortness of breath, the relative importance in LR, C5.0, NNs, RF, and XGBoost was 100%, 100%, 82.06%, 42.46%, and 31.88%, respectively. For cough, the relative importance in C5.0, LR, NNs, RF, and XGBoost was 100%, 75.75%, 70.95%, 70.95%, and 14.65%, respectively. The opioid addiction variable, with scores of 5.694% in NNs, 32.578% in LR, 1.903% in RF, 92.11% in C5.0, and 0.8475% in XGBoost, behaved differently in each model.
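Relative importance scores such as those above are typically obtained by rescaling each model's raw importance values so that the top variable scores 100%. The Python sketch below shows one common convention (division by the maximum); the raw scores are made-up numbers, and the exact scaling used by the R packages in this study may differ.

```python
def relative_importance(raw_scores):
    """Rescale raw importance scores so the largest becomes 100%."""
    top = max(raw_scores.values())
    return {name: 100.0 * score / top for name, score in raw_scores.items()}


# Made-up raw importance values for illustration only.
raw = {
    "age": 0.40,
    "shortness_of_breath": 0.20,
    "cough": 0.10,
    "opioid_addiction": 0.02,
}
scaled = relative_importance(raw)
ranked = sorted(scaled, key=scaled.get, reverse=True)
```

Because each model is rescaled separately, a score of 100% identifies the top variable within a model but does not make scores directly comparable across models.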

4 Discussion

Oxygen therapy is one of the main treatment choices for COVID-19 patients, and it reduces the fatality rate among critical cases [37, 38]. In our proposed approach, four features, namely shortness of breath, cough, fever, and age, were identified as the most important variables in predicting the requirement for oxygen-based treatment in the early stages of admission.

The association of shortness of breath and cough with receiving oxygen-based treatment has been addressed in various studies, whose findings are consistent with our results. Long et al. [38] analyzed the clinical information of 1362 COVID-19 patients at a local hospital in Wuhan. They found that most of the patients who experienced breathlessness, such as shortness of breath, dyspnea, and chest tightness, received oxygen therapy. In another study, conducted in Ethiopia [39], a longer duration of supplemental oxygen requirement was associated with shortness of breath. The authors also found that, compared with patients without this symptom on admission, the rate of ending oxygen therapy was 29.5% lower in cases with shortness of breath. Ni et al. [40] concluded that dyspnea is among the factors related to oxygen therapy for COVID-19 patients under 65 years and can increase their need for oxygen. They found that 59.5% of patients with dry cough received oxygen therapy.

Fever is one of the most common symptoms in patients with COVID-19 [41]. According to [42], 43.8% of COVID-19 patients experienced fever on admission and 88.7% during hospitalization. In our ML models, fever was an important feature in the prediction of oxygen requirement and can be considered an early sign on admission. The authors of [40] also found a relationship between fever and oxygen therapy: among COVID-19 patients, 70.9% of those with fever received oxygen therapy.

It has been shown that as the patient’s age increases, the severity of COVID-19 increases [43], as does the risk of in-hospital death [44]. In our algorithms, age was an effective factor in the prediction of oxygen-based treatment. The authors of [39] also recognized age as an important factor in the starting time of oxygen therapy, which is related to the longer duration of oxygen requirement among COVID-19 patients.

The effect of opium consumption on the requirement for oxygen-based treatment was analyzed in this study. The data were gathered from Kerman province in the south-east of Iran, which has a high number of opium consumers [45]. This research therefore had the opportunity to evaluate the impact of opioid addiction in the prediction model. About 58% of patients addicted to opioids received oxygen-based treatment, including ventilators and oxygen masks. This rate was much lower for non-addicted individuals (26.34%). Additionally, the fatality rate among opium-addicted cases was high compared with non-addicted patients (23.07% versus 4.03%, as reported in Section 3.1), and opium-addicted cases accounted for 28.57% of all deaths. This supports the previous claim [46] that the death rate is higher among COVID-19 patients who use opium. This is probably due to the negative impact of opium on the immune system and respiratory cells.

There were several limitations to this study. First, the sample size of COVID-19 patients was small, especially for oxygen therapy and mechanical ventilation. Second, the data were collected from two hospitals in one province, which may affect the reliability of the model because symptoms and other disease factors vary between populations. Third, because the data were limited, the prediction model does not define the exact time at which each patient needed oxygen-based treatment. Fourth, the available features were limited to the patient’s background and symptoms, with no information from lab results.

Future research can consider more features, such as lab results, vital signs of the patient, and CT images. In addition, a larger dataset is needed to build more reliable models. Apart from oxygen, the need for other COVID-19-related supplies, such as medications and beds, can be predicted. Further studies may also collect data at specific time intervals to determine when supplies and equipment will be needed.

5 Conclusion

In this study, data on hospitalized COVID-19 patients from two local hospitals in Iran were used to predict the requirement for oxygen-based treatment. First, relevant attributes were selected based on experts’ opinions, and then five ML classifiers were applied to predict the oxygen requirement. The proposed approach found that the most important variables in predicting the need for oxygen therapy were age, shortness of breath, cough, and fever. One of the main objectives of this research was to predict the need for oxygen-based treatment in the early stages of patient admission; according to the results, the models showed high accuracy and sensitivity in predicting this outcome. Among the five ML algorithms, NNs and LR achieved high sensitivity (0.9273) and specificity (0.7308), which demonstrates their capability in predicting the need for oxygen-based treatment for COVID-19 patients. XGBoost showed the highest AUC (0.887). Another aim was to analyze the effect of opium consumption on the requirement for oxygen in COVID-19 cases. The results revealed a high rate of oxygen requirement and a high fatality ratio in this group of patients compared with other cases.

In conclusion, the availability of medical resources, especially during a pandemic and when the number of infected people peaks, is an essential issue in managing hospital resources. Artificial intelligence tools such as ML can help to accurately predict the need for medical supplies such as oxygen and to avoid shortages.