Decoding Your Genes

 3 years ago
source link: https://towardsdatascience.com/decoding-your-genes-4a23e89aba98?gi=d5eb80f72cff

Can Neural Networks Unravel The Secrets Of Our DNA?

I explore the impact of ML on the traditional sciences by summarising exciting new research papers. In this article I discuss another cool paper: “Detection of DNA base modifications by deep recurrent neural network on Oxford Nanopore sequencing data” (Nature Communications, 10, 2449, 2019).

1| The Code Behind You

Every part of your body is a product of your DNA (deoxyribonucleic acid), a complex genetic code which describes exactly what your cells should be doing. We all know the famous double helix shape of a DNA molecule: it’s made of different chemical units called ‘bases’ (Cytosine [C], Guanine [G], Adenine [A] and Thymine [T]) which are bonded together into beautiful coiling chains. This is a bit like binary computer code, but instead of a sequence of 1s and 0s, it’s Cs, Gs, As and Ts. The precise sequence of CGAT bases in these chains encodes the genetic instructions for you, as well as for every other living thing on the planet.

The double helix structure of DNA. Image from pixabay.com

A really important biological process is DNA methylation, in which a simple methyl group (CH3) is added to one of the normal DNA bases; the image below shows how small this change is. Although the change looks tiny, it can have huge consequences for gene regulation, aging and even cancer. It’s even thought that methylated sites could act as therapeutic targets for cancer treatment!

Comparison of normal Cytosine (C base in DNA) and a methylated version. Copied with permission under the Creative Commons Licence.

2| Mapping DNA Changes With Nanotech

It’s obviously really important that we are able to map out modifications in our genetic code… but it’s actually very difficult to do. Current techniques are noisy and give such poor resolution that there is a real drive for improvement. One recent idea is based on a really cool Nanopore technology. You can take a look at this great video showing exactly how it works, but basically an ionic current is passed through tiny nanopores in a polymer membrane. As molecules of interest move through these pores, from one side of the membrane to the other, the current is disturbed in a characteristic way. So you put in DNA molecules and you get out a trace of current against time, which can be used to identify the genetic code. This technique was recently used to sequence the entire genome of the virus that causes COVID-19 in just 7 hours!

Although this system is really good at getting the overall DNA sequence, it is a bit harder to locate where subtle methylations exist because understanding the signal is context dependent. Methylations can be located by comparing the Nanopore electronic signals of methylated and un-methylated DNA sequences.
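To make the comparison idea concrete, here is a toy sketch in Python. It is only an illustration and not how any real tool works: the function name `flag_shifted_positions`, the threshold and all of the current values are made up for the example.

```python
from statistics import mean

def flag_shifted_positions(control, sample, threshold=2.0):
    """Return base positions whose mean current differs by more than threshold.

    control and sample each hold one list of repeated current readings
    per base position (hypothetical values, arbitrary units).
    """
    flagged = []
    for pos, (ctrl_reads, samp_reads) in enumerate(zip(control, sample)):
        shift = abs(mean(samp_reads) - mean(ctrl_reads))
        if shift > threshold:
            flagged.append(pos)
    return flagged

# Made-up readings for four base positions; only position 2 is shifted.
unmethylated = [[80.1, 79.8, 80.3], [95.2, 94.9, 95.0], [70.4, 70.1, 70.2], [88.0, 87.7, 88.2]]
methylated = [[80.0, 80.2, 79.9], [95.1, 95.3, 94.8], [74.9, 75.3, 75.1], [88.1, 87.9, 88.0]]

shifted = flag_shifted_positions(unmethylated, methylated)
print(shifted)
```

A fixed threshold like this ignores the context dependence mentioned above, which is exactly the gap a learned model is meant to fill.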

Scheme showing DNA passing through a membrane to generate disturbances in the electronic signal.

3| How Can ML Help?

If you’re looking for patterns in sequence data, RNNs (Recurrent Neural Networks) are a natural architecture to use. In case you’re not familiar, let’s take a quick look at how they work.

A typical ‘feed-forward’ neural network follows the process of applying randomly initialised weights and biases to an input to predict an output. When the generated output is compared to the target output an error (or ‘Loss’) is calculated. This loss is then propagated back through the network to update the weights and biases with the aim of improving the output. This process is repeated over and over again with different inputs during training until the network hopefully learns to generate an accurate output. This is just a very simple overview, but sums up the main principles.
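The loop described above (predict, measure the loss, push the error back, update) can be sketched with a single neuron. This is a deliberately tiny stand-in for a real network: the toy data, learning rate and number of epochs are all my own assumptions.

```python
import random

random.seed(0)

# Toy data: the neuron should learn y = 2x + 1.
inputs = [0.0, 1.0, 2.0, 3.0]
targets = [1.0, 3.0, 5.0, 7.0]

# Randomly initialised weight and bias.
weight = random.uniform(-1, 1)
bias = random.uniform(-1, 1)
learning_rate = 0.05

def mean_squared_error(w, b):
    """Average squared difference between predictions and targets."""
    return sum((w * x + b - y) ** 2 for x, y in zip(inputs, targets)) / len(inputs)

initial_loss = mean_squared_error(weight, bias)

for epoch in range(2000):
    # Forward pass: prediction error for every example.
    errors = [weight * x + bias - y for x, y in zip(inputs, targets)]
    # Backward pass: gradients of the mean squared error.
    grad_w = 2 * sum(e * x for e, x in zip(errors, inputs)) / len(inputs)
    grad_b = 2 * sum(errors) / len(inputs)
    # Update: nudge the parameters against the gradient.
    weight -= learning_rate * grad_w
    bias -= learning_rate * grad_b

final_loss = mean_squared_error(weight, bias)
print(round(weight, 2), round(bias, 2))
```

After training, the weight and bias should sit very close to 2 and 1. A real network stacks many such units with non-linear activations, but the training loop has the same shape.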

What if you are not predicting a single output but rather a sequence of outputs? Conceptually, an RNN can be thought of as a connected sequence of feed-forward networks with information passed between them. The information being passed is the hidden state, which represents all the previous inputs to the network. At each step of the RNN, the hidden state generated from the previous step is passed in, as well as the next sequence input. This then returns an output as well as the new hidden state to be passed on again. This allows the RNN to retain a ‘memory’ of the sequence information it has seen so far, which makes RNNs great for understanding sequential data. You can check out a more mathematical description here.
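Here is a minimal sketch of that hidden-state hand-off, assuming a single scalar hidden state and hand-picked weights (real RNNs use vectors and learn their weights during training). In this simplified version the hidden state also doubles as the output at each step.

```python
import math

def rnn_step(x, hidden, w_in, w_hidden, bias):
    """One recurrent step: mix the new input with the previous hidden state."""
    return math.tanh(w_in * x + w_hidden * hidden + bias)

def run_rnn(sequence, w_in=0.5, w_hidden=0.8, bias=0.0):
    """Feed a sequence through the recurrence and collect every hidden state."""
    hidden = 0.0  # empty memory before the first input
    states = []
    for x in sequence:
        hidden = rnn_step(x, hidden, w_in, w_hidden, bias)
        states.append(hidden)
    return states

# The only difference between the two sequences is the very first input.
states_a = run_rnn([1.0, 0.0, 0.0])
states_b = run_rnn([0.0, 0.0, 0.0])
print(states_a)
print(states_b)
```

The last hidden state of the first sequence is still non-zero even though its last two inputs are zero: the network 'remembers' the spike it saw at the start.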

Let’s take a look at exactly how RNNs can decipher DNA modifications.

Representation of the movement of information through an RNN

4| Unravelling DNA With RNNs

In the current paper, a new tool called ‘DeepMod’ was developed. This is a bidirectional RNN (it passes sequence information both forwards and backwards) with long short-term memory (LSTM) units; check out a great summary of LSTMs here.
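The 'bidirectional' part can be sketched as two passes over the same sequence, one in each direction. This is not DeepMod's code: it uses a plain tanh recurrence instead of LSTM units, and it reuses the same hand-picked weights in both directions for brevity, whereas a real bidirectional network learns separate weights for each direction.

```python
import math

def recurrent_pass(sequence, w_in=0.5, w_hidden=0.8):
    """Run a scalar tanh recurrence and return the hidden state at each step."""
    hidden, states = 0.0, []
    for x in sequence:
        hidden = math.tanh(w_in * x + w_hidden * hidden)
        states.append(hidden)
    return states

def bidirectional_pass(sequence):
    """Pair each position with context from the past and from the future."""
    forward = recurrent_pass(sequence)
    # Process the reversed sequence, then flip the states back into original order.
    backward = recurrent_pass(sequence[::-1])[::-1]
    return list(zip(forward, backward))

signal = [0.0, 0.0, 1.0, 0.0, 0.0]
pairs = bidirectional_pass(signal)
for position, (fwd, bwd) in enumerate(pairs):
    print(position, round(fwd, 3), round(bwd, 3))
```

At position 0 the forward state knows nothing about the spike in the middle, but the backward state does. That matters for a task like this one, where a base is judged in the context of the signal on both sides of it.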

DeepMod takes a reference genetic code and a Nanopore electric signal as input. The ‘events’ in the electric signal (series of signal points generated by the Nanopore sequencer) are aligned with the DNA code in the reference. This was achieved using BWA-MEM, an alignment algorithm for matching DNA sequences with reference genomes. This algorithm is capable of matching DNA sequences up to megabases long!

The authors used a 7-feature vector to describe each input event: the signal mean, standard deviation and number of signal points associated with the event, combined with a four-feature description of the DNA base (A, C, G or T). This acts as the input to the network (see its architecture below), which predicts whether the signal event is the result of a modified base.
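As a rough sketch of what such a 7-feature vector could look like, here is a small Python function. The function name `event_features`, the feature order, the A/C/G/T ordering of the one-hot part, the use of the population standard deviation and the example signal values are all my assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev

BASES = "ACGT"  # assumed ordering for the one-hot encoding

def event_features(signal_points, base):
    """Build a 7-value description of one event.

    Three signal statistics (mean, standard deviation, number of points)
    followed by a four-value one-hot encoding of the base.
    """
    one_hot = [1.0 if base == b else 0.0 for b in BASES]
    stats = [mean(signal_points), pstdev(signal_points), float(len(signal_points))]
    return stats + one_hot

# Hypothetical signal points for one event aligned to a C base.
features = event_features([80.0, 82.0, 81.0, 81.0], "C")
print(features)
```

A sequence of these vectors, one per event, is the kind of input an RNN can step through.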

The algorithm was first trained and optimised using DNA data from E. coli bacteria. A 21-unit LSTM with 3 hidden layers was found to achieve high accuracy while keeping computational costs reasonable. Analysis of several different E. coli datasets showed strong results, with methylation mapped at single-base resolution. In addition, the network showed great precision (up to 0.99) at identifying which bases were methylated.
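In case the term is unfamiliar, precision is the fraction of bases flagged as methylated that really are methylated. A small sketch with made-up labels (these are not results from the paper):

```python
def precision(predicted, actual):
    """Fraction of positive predictions that are correct: TP / (TP + FP)."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    false_pos = sum(1 for p, a in zip(predicted, actual) if p and not a)
    if true_pos + false_pos == 0:
        return 0.0  # no positive predictions; a convention chosen for this sketch
    return true_pos / (true_pos + false_pos)

# 1 = methylated, 0 = unmethylated (hypothetical labels for six bases).
predicted = [1, 1, 0, 1, 0, 1]
actual = [1, 1, 0, 0, 0, 1]

score = precision(predicted, actual)
print(score)
```

Here three of the four flagged bases are truly methylated, so the precision is 0.75. A precision of 0.99 means almost every base the model flags is a genuine methylation.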

Chart showing the structure of the DeepMod RNN with LSTM, indicating how an electronic input is passed through the network to output a prediction of methylation on the site of interest. Copied with permission from the original paper under the Creative Commons License.

5| Could This Map Human Genetic Code?

Yes! Even though DeepMod was trained only on bacterial DNA data, it was used to make accurate predictions of methylation in human DNA. This cross-species testing is really exciting because it shows that a model trained on one species can be used to accurately map DNA modifications in a species the model has never seen before. Perhaps the model could also be applied to many other species, and it would be really exciting to see those results too!

This is a great example of the power of neural networks to generalise to new tasks and speed up the rate of scientific learning. It’s also a great example of machine learning being applied to current scientific problems to generate an immediate and practical solution. The DeepMod code is now available online and the authors plan to maintain and update it for future users. This tool’s ability to analyse human DNA modifications quickly and accurately may help in the understanding and treatment of diseases like cancer, so this is a really cool result!

Image from pixabay.com

6| Final Thoughts

Despite these great results, as always there are a few things to bear in mind:

  1. DeepMod was trained and tested on only 2 types of DNA methylation, but there are actually loads of different types. More testing is needed to know if this model can be used to locate the wide range of modifications that exist in real DNA.
  2. The model did not examine RNA (ribonucleic acid), DNA’s single stranded cousin. This is an essential biological molecule for coding, decoding and gene expression so it would be really interesting to see how the model fares with this task as well.
  3. Finally, the model relies heavily on the alignment of the input signal with the reference DNA using BWA-MEM. If the signal is poorly aligned, the model’s performance will suffer heavily, and this dependency needs to be kept in mind during training.

Overall this is a really promising physical application of neural networks and if you enjoyed this brief summary I would encourage you to read the original paper to get more in depth details about the DeepMod framework, training and validation process.

