Removing Spikes from Raman Spectra

Raman spectroscopy is a widely used analytical technique which provides structural and electronic information from molecules and solids. It is applicable at both laboratory and mass-production scales, and has applications in many different fields such as physics, chemistry, biology, medicine or industry.

A typical issue known in Raman spectroscopy is that Raman spectra are sometimes ‘contaminated’ by spikes. Spikes are positive, narrow bandwidth peaks present at random position on the spectrum. They originate when a high-energy cosmic ray impacts in the charge-couple device detector used to measure Raman spectra. These spikes are problematic as they might hinder subsequent analysis, particularly if multivariate data analysis is required. Therefore, one of the first steps in the treatment of Raman spectral data is the cleaning of spikes.

First, the Python packages that will be needed are loaded:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Figure 1 shows the Raman spectrum of graphene. In the previous years, graphene has become a very popular material due to its remarkable physical properties, including superior electronic, thermal, optical and mechanical properties. Its characteristic Raman spectrum consists of several peaks as shown in the figure. From their shape and related intensity, a large amount of information such as doping, strain or grain boundaries can be learned. This spectrum is a clear example of a spectrum contaminated with a spike.

# load the data as data frame
df = pd.read_csv(‘folder_name.../spectrum.txt’, delimiter = ‘\t’)# Transform the data to a numpy array
wavelength = np.array(df.Wavelength)
intensity = np.array(df.Intensity)# Plot the spectrum:
plt.plot(wavelength, intensity)
plt.title(‘Spectrum’, fontsize = 20)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel(‘Wavelength (nm)’, fontsize = 20)
plt.ylabel(‘Intensity (a.u.)’, fontsize = 20)
plt.show()

ui6vumY.jpg!web

Figure 1. A spike was recorded while measuring the typical Raman spectrum of graphene (characterized by the G and 2D bands).

Z-score based approach for spike detection in Raman spectra

Spikes intensities are normally above the intensities of other Raman peaks in the spectrum, therefore using a z-scores based method could be a good starting point. The z-scores tell how far a value is from the average in units of standard deviation. So, if the population mean and population standard deviation are known, the standard score of a raw score x(i) is calculated as:

z(i) = (x(i)-μ) / σ

where μ is the mean and σ is the standard deviation of the population x (x(i) represent the values of a single Raman spectrum). More in detail information on how to detect outliers using Z-score approaches can be found in reference [3].

Let’s calculate the z-scores for the points in our spectrum:

def z_score(intensity):
 mean_int = np.mean(intensity)
 std_int = np.std(intensity)
 z_scores = (intensity — mean_int) / std_int
 return z_scoresintensity_z_score = np.array(z_score(intensity))
plt.plot(wavelength, intensity_z_score)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel( ‘Wavelength’ ,fontsize = 20)
plt.ylabel( ‘Z-Score’ ,fontsize = 20)
plt.show()

yymqaun.png!web

Figure 2. Z-scores of the spectrum plotted in figure 1.

A threshold is then needed in order to tell whether a value is an outlier or not. A typical value for this threshold is 3.5, as proposed as guideline by the American Society of Quality control as the basis of an outlier-labeling rule², while the authors of the publication used 6. In order to apply a threshold to exclude spikes, the absolute value of the Z-score must be taken:

|z(i)| = |(x(i)-μ) / σ|

calculated as

def z_score(intensity):
 mean_int = np.mean(intensity)
 std_int = np.std(intensity)
 z_scores = (intensity — mean_int) / std_int
 return z_scoresthreshold = 3.5intensity_z_score = np.array(abs(z_score(intensity)))
plt.plot(wavelength, intensity_z_score)
plt.plot(wavelength, threshold*np.ones(len(wavelength)), label = ‘threshold’)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel( ‘Wavelength’ ,fontsize = 20)
plt.ylabel( ‘|Z-Score|’ ,fontsize = 20)
plt.show()

UFZJV3y.png!web

Figure 2b. Z-scores of the spectrum plotted in figure 1. The threshold still cuts the 2D Raman peak.

However, the z-score pretty much resembles the original spectrum and the threshold still cut off the main Raman peak. Let’s plot what was detected as spikes using a 3.5 threshold:

threshold = 3.5# 1 is assigned to spikes, 0 to non-spikes:
spikes = abs(np.array(z_score(intensity))) > thresholdplt.plot(wavelength, spikes, color = ‘red’)
plt.title(‘Spikes: ‘ + str(np.sum(spikes)), fontsize = 20)
plt.grid()
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel( ‘Wavelength’ ,fontsize = 20)
plt.ylabel( 'Z-scores > ' + str(threshold) ,fontsize = 20)
plt.show()

ia2iAjU.png!web

Figure 3. Using modified Z-Scores, 18 spectral point are above the threshold.

Modified Z-score based approach for spike detection in Raman spectra

A second option would be making use of robust statistics and calculate the modified z-scores of the spectrum. This modified Z-score method uses the median (M) and median absolute deviation (MAD) rather than the mean and standard deviation:

z(i) = 0.6745 (x(i)-M) / MAD

where the MAD = median(|x-M|) and |…| represents the absolute value. Both the median and the MAD are robust measures of the central tendency and dispersion, respectively. The multiplier 0.6745 is the 0.75th quartile of the standard normal distribution, to which the MAD converges to⁴.

def modified_z_score(intensity):
 median_int = np.median(intensity)
 mad_int = np.median([np.abs(intensity — median_int)])
 modified_z_scores = 0.6745 * (intensity — median_int) / mad_int
 return modified_z_scoresthreshold = 3.5intensity_modified_z_score = np.array(abs(modified_z_score(intensity)))
plt.plot(wavelength, intensity_modified_z_score)
plt.plot(wavelength, threshold*np.ones(len(wavelength)), label = 'threshold')
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel('Wavelength (nm)' ,fontsize = 20)
plt.ylabel('|Modified Z-Score|' ,fontsize = 20)
plt.show()

BZFJRrv.png!web

Figure 4. Modified Z-scores of the spectrum plotted in figure 1. The threshold still cuts the 2D Raman peak.

The same issue can still be observed:

# 1 is assigned to spikes, 0 to non-spikes:
spikes = abs(np.array(modified_z_score(intensity))) > thresholdplt.plot(wavelength, spikes, color = ‘red’)
plt.title(‘Spikes: ‘ + str(np.sum(spikes)), fontsize = 20)
plt.grid()
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel( ‘Wavelength’ ,fontsize = 20)
plt.ylabel( 'Modified Z-scores > ' + str(threshold) ,fontsize = 20)
plt.show()

y6vqimB.png!web

Figure 5. Using modified Z-Scores, 24 spectral points are above the threshold.

The 2D Raman peak is still detected as spike, so a more sensitive approach is required.

Whitaker and Hayes’ modified Z-score based approach for spike detection in Raman spectra

Whitaker and Hayes propose to make advantage of the high intensity and small width of spikes and therefore use the difference between consecutive spectrum points ∇x(i) = x(i)-x(i-1) to calculate the z-scores, where x(1), …, x(n) are the values of a single Raman spectrum recorded at equally spaced wavenumbers and i = 2, …, n. This step annihilates linear and slow moving curve linear trends, while sharp thin spikes will be preserved. Now,

z(i) = 0.6745 (∇x(i)-M) / MAD

So there is only an extra step included in which the difference between consecutive values are included.

# First we calculated ∇x(i):
dist = 0
delta_intensity = [] 
for i in np.arange(len(intensity)-1):
 dist = intensity[i+1] — intensity[i]
 delta_intensity.append(dist)delta_int = np.array(delta_intensity)
# Alternatively to the for loop one can use: 
# delta_int = np.diff(intensity)
intensity_modified_z_score = np.array(modified_z_score(delta_int))
plt.plot(wavelength[1:], intensity_modified_z_score)
plt.title('Modified Z-Score using ∇x(i)')
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel('Wavelength (nm)', fontsize = 20)
plt.ylabel('Score', fontsize = 20)
plt.show()

rEfqquE.png!web

Figure 6. Modified Z-scores using ∇x(i) of the spectrum plotted in figure 1.

Again, in order to apply a threshold to exclude spikes, the absolute value of the modified Z-score must be taken:

|z(i)| =|0.6745 (∇x(i)-M) / MAD|

Resulting in

threshold = 3.5intensity_modified_z_score = np.array(np.abs(modified_z_score(delta_int)))
plt.plot(wavelength[1:], intensity_modified_z_score)
plt.plot(wavelength[1:], threshold*np.ones(len(wavelength[1:])), label = 'threshold')
plt.title('Modified Z-Score of ∇x(i)', fontsize = 20)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel('Wavelength (nm)', fontsize = 20)
plt.ylabel('Score', fontsize = 20)
plt.show()

QvyUBr2.png!web

Figure 7. Abosolute value of the modified Z-scores of ∇x(i) of the spectrum plotted in figure 1.

And once more, the number of detected spikes can be calculated as

# 1 is assigned to spikes, 0 to non-spikes:
spikes = abs(np.array(modified_z_score(intensity))) > thresholdplt.plot(wavelength, spikes, color = ‘red’)
plt.title(‘Spikes: ‘ + str(np.sum(spikes)), fontsize = 20)
plt.grid()
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel( ‘Wavelength’ ,fontsize = 20)
#plt.ylabel( ‘Spike(1) or not(0)?’ ,fontsize = 20)
plt.show()

bIRVf27.png!web

Figure 8. Using modified Z-Scores with ∇x(i), 17 spectral points are above the threshold = 3.5.

For the 3.5 recommended threshold many false spikes are assigned. However, the value scored by this approach is much higher in comparison with the Raman peaks than before.

plt.plot(wavelength[1:],
np.array(abs(modified_z_score(delta_int))), color='black', label = '|Modified Z-Score using ∇x(i)|')
plt.plot(wavelength, np.array(abs(modified_z_score(intensity))), label = '|Modified Z-Score|', color = 'red')
plt.plot(wavelength, np.array(abs(z_score(intensity))), label = '|Z-Score|', color = 'blue')
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel( 'Wavelength' ,fontsize = 20)
plt.ylabel( 'Score' ,fontsize = 20)
plt.legend()
plt.show()

vENzaqa.png!web

Figure 9. Comparison between the three different approaches.

In general, the right threshold must be chosen depending on the data set. For this case, a threshold = 7 is already enough to obtain a clear selection.

V3qQJz2.png!web

Figure 10. Absolute values of the modified Z-scores with ∇x(i) of the spectrum plotted in figure 1.

AFJRfaA.png!web

Figure 11. Using modified Z-Scores with ∇x(i), 3 spectral points are above the threshold = 7.

Fixing the Raman spectrum

Once the spikes are detected, the next step is to remove them and fix the spectra. For this, interpolated values are obtained at each candidate wavenumber by calculating the mean of its immediate neighbors (within a 2m+1 values window).

def fixer(y,m):
 threshold = 7 # binarization threshold. 
 spikes = abs(np.array(modified_z_score(np.diff(y)))) > threshold
 y_out = y.copy() # So we don’t overwrite y for i in np.arange(len(spikes)):
  if spikes[i] != 0: # If we have an spike in position i
   w = np.arange(i-m,i+1+m) # we select 2 m + 1 points around our spike
   w2 = w[spikes[w] == 0] # From such interval, we choose the ones which are not spikes
   y_out[i] = np.mean(y[w2]) # and we average their values
 
return y_out# Does it work?
plt.plot(wavelength, intensity, label = 'original data')
plt.plot(wavelength, fixer(intensity,m=3), label = 'fixed spectrum')
plt.title('Despiked spectrum',fontsize = 20)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel('Wavelength (nm)' ,fontsize = 20)
plt.ylabel('Intensity (a.u.)' ,fontsize = 20)
plt.legend()
plt.show()

yQ3EBze.png!web

Figure 12. Original and despiked Raman spectrum using modified Z-Scores with ∇x(i) with the threshold = 7.

So after calculation of the modified Z scores of ∇x, and thresholding by setting an appropiate threshold value, the spike is removed and smoothed by applying a moving average filter.

Other examples

To wrap up, two examples of spikes with different intensity signal are presented, showing the strenght of this approach.

Ifquui6.png!web

Figure 13. Examples of two more spectra in the first row of graphs, their Z-scores (blue lines) , modified Z-scores (red lines) and modified Z-scores using ∇x(i) (black lines) in the second row of graphs and thei despiked spectra in the third row, are presented.

Aknowledgements.I would like to thank Jorge Luis Hita and Ana Solaguren-Beascoa for taking the time to proofread this notes.

Z-score based approach for spike detection in Raman spectra

Modified Z-score based approach for spike detection in Raman spectra

Fixing the Raman spectrum

Other examples

Recommend

The rise of SQL-based data modeling and DataOps

Running Kafka on Windows 10

Qt配置文件之Xml

Build a WhatsApp Chatbot With Python, Flask and Twilio

TPU的起源，Jeff Dean综述后摩尔定律时代的ML硬件与算法

SIGIR 2019 | 为什么你的毛衣显胖又显黑？这个算法给你答案

微软研究院开源DialoGPT：「你有什么梦想？」「让世界充满机器人」

java property 配置文件管理工具框架，避免写入 property 乱序

191120-SpringBoot系列教程Web篇之开启GZIP数据压缩

一个简单的Vue按钮级权限方案

About Joyk