Detecting COVID-19 from Raman Spectroscopy with Machine Learning
One of the most challenging aspects of the COVID-19 pandemic has been the lack of testing needed to detect and trace infections. Many tests use biochemicals that can be expensive and difficult to produce. These tests can also have long turnaround times and can produce a high number of false-negative results.
The principal diagnostic method for SARS-CoV-2 is PCR, a technique that detects the genetic material of a pathogen or microorganism. It has high specificity and sensitivity and can help diagnose infection even in its first stages. However, it is time-consuming, which makes it a poor fit when results are needed quickly.
Raman spectroscopy could be used as a cheap and quick method to diagnose infection by SARS-CoV-2.
This article presents a Lasso-regularized logistic regression model intended to detect COVID-19 from Raman spectroscopy data.
The data is available at the following DOI: https://doi.org/10.6084/m9.figshare.12159924.v1
The rest of this article describes how the machine learning model was developed.
First, import all the necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
Load the data (a CSV file) into a pandas DataFrame. Use df.head() to peek at the first five rows of the data.
df=pd.read_csv('covid_and_healthy_spectra.csv')
df.head()
Here we can see that the data has 901 columns. The first 900 are features, and the last one is the target column (a binary variable).
Let us separate the features and the target from the initial dataframe df.
features = df.iloc[:, :-1]  # all rows, every column except the last
target = df.iloc[:, -1]  # all rows, last column only
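The same slicing can be checked on a small frame before running it on the real data. The sketch below uses a hypothetical four-row DataFrame (the column names and values are made up for illustration, not taken from the dataset) to show what iloc returns and how to count the classes in the target.

```python
import pandas as pd

# Hypothetical toy frame: three feature columns and a binary label in the last column.
toy = pd.DataFrame({
    "f0": [0.1, 0.4, 0.3, 0.9],
    "f1": [1.2, 1.1, 0.8, 0.7],
    "f2": [5.0, 4.0, 6.0, 3.0],
    "label": [0, 1, 0, 1],
})

toy_features = toy.iloc[:, :-1]  # every column except the last
toy_target = toy.iloc[:, -1]     # the last column, returned as a Series

print(toy_features.shape)         # (rows, number of feature columns)
print(toy_target.value_counts())  # number of observations per class
```

Running value_counts() on the real target is a quick way to see whether the two classes are balanced.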
Split the data into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(features,target, test_size=0.12, random_state=0)
Here, test_size=0.12 means that 12% of the observations should be in the test set and the rest would be in the train set.
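With a small dataset and a 12% test split, the class proportions in the test set can drift away from those of the full data. One common option, which the code above does not use, is the stratify argument of train_test_split. The following is a minimal sketch on synthetic data; the array sizes and variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))      # 100 synthetic observations, 5 features
y_demo = np.array([0] * 50 + [1] * 50)  # balanced binary labels

# stratify=y_demo asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.12, random_state=0, stratify=y_demo
)

print(len(X_tr), len(X_te))  # sizes of the train and test sets
print(y_te.mean())           # fraction of class 1 in the test set
```

Whether stratification changes the results for the spectra data would need to be checked by rerunning the model with it.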
Model Fitting Part:
The problem is a binary classification problem. In this article, we will use logistic regression to solve this problem.
As the number of features (columns) is much larger than the number of observations (rows), we need to get rid of the less important features; otherwise the model may have a hard time finding patterns in the data, or it may overfit.
In logistic regression, we can filter out less important features with L1 regularization (Lasso: Least Absolute Shrinkage and Selection Operator), which can shrink the coefficients of uninformative features to exactly zero.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
clf=LogisticRegression(penalty='l1',random_state=0,solver='liblinear')
#Fitting the model
clf.fit(X_train,y_train)
We have fitted the L1 regularized logistic regression model on the train set.
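To see how many features an L1 penalty keeps, one can count the nonzero entries of the fitted model's coef_ attribute. The sketch below does this on synthetic data from make_classification, used as a stand-in for the spectra; the sample and feature counts are assumptions, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: few rows, many columns (sizes are assumed for illustration).
X_syn, y_syn = make_classification(
    n_samples=60, n_features=300, n_informative=10, n_redundant=0, random_state=0
)

clf_demo = LogisticRegression(penalty="l1", solver="liblinear", random_state=0)
clf_demo.fit(X_syn, y_syn)

# For a binary problem, coef_ has shape (1, n_features).
n_kept = int(np.count_nonzero(clf_demo.coef_))
n_total = clf_demo.coef_.shape[1]
print(f"{n_kept} of {n_total} coefficients are nonzero")
```

The same two lines applied to clf.coef_ would report how many of the 900 spectral features the model above retained. The strength of the penalty is controlled by the C parameter of LogisticRegression (smaller C means stronger regularization), which is left at its default here.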
Now, let us test our model’s performance on the test set.
# classification_report expects the true labels first, then the predictions
print(f"Classification report for test-dataset:\n{classification_report(y_test, clf.predict(X_test))}")
The output is as follows:
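The accuracy_score and confusion_matrix functions imported earlier can summarize the same predictions in a more compact form. The sketch below shows how they are called, again on synthetic data rather than the spectra, so its printed values are not the article's results; all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (sizes are assumed for illustration).
X_syn, y_syn = make_classification(
    n_samples=200, n_features=50, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.12, random_state=0
)

model = LogisticRegression(penalty="l1", solver="liblinear", random_state=0)
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Both metrics take the true labels first, then the predictions.
acc = accuracy_score(y_te, y_pred)
cm = confusion_matrix(y_te, y_pred, labels=[0, 1])  # rows: true class, columns: predicted class

print(acc)
print(cm)
```

For a diagnostic test, the off-diagonal cell in the positive-class row of the confusion matrix is the count of false negatives, which is the error type raised at the start of this article.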
Summary: In this article, we applied a Lasso-regularized logistic regression model to detect COVID-19 positive cases from Raman spectroscopy data. In the next article of this series, we will use PCA and autoencoders for feature selection and compare the results with those of this article.