Deep Learning on Neanderthal Genes

Detecting regions of Neanderthal ancestry with Deep Learning

This is the seventh post of my column Deep Learning for Life Sciences where I give concrete examples of how Deep Learning can already now be applied in Computational Biology, Genetics and Bioinformatics. In the previous posts I demonstrated how to use Deep Learning for Ancient DNA , Single Cell Biology , OMICs Data Integration , Clinical Diagnostics and Microscopy Imaging . Today we are going to dive into the exciting History of Human Evolution and learn that it is straightforward to borrow methodology from the Natural Language Processing (NLP) and apply it to Human Population Genetics in order to infer regions of Neanderthal introgression in modern human genomes.

Brief History: Out of Africa

When ancestors of Modern Humans migrated out of Africa ~ 50 000 – 70 000 years ago , they encountered Neanderthals and Denisovans , two groups of ancient hominins that populated Europe and Asia at that time. We know that Modern Humans interbred with both Neanderthals and Denisovans since there is an evidence of presence of their DNA in genomes of Modern Humans of non-African origin. These great genetic advances in the History of Human Evolution became possible due to sequencing of Neanderthal and Denisovan genomes in 2010 and 2012, respectively, by the group of Svante Pääbo .

There were a few attempts made to infer the exact positions of DNA segments inherited from Neanderthals and Denisovans. They utilized various genomics resources such as 1000 Genomes project and computed different measures of local similarity of a non-African genome toward the high-quality Neanderthal genome and divergence with respect to African genomes (right figure below).

S* / S’ statistics and Conditional Random Field (CRF) were applied to detect Neanderthal introgression

Those similarity measures ( summary statistics ) despite being interpretable and efficient result in the loss of information because they attempt to capture a stretch of DNA in a single number without considering the ordered sequence of nucleotides itself. Other attempts to scan Neanderthal introgression across a genome use Hidden Markov Model (HMM) which is a memoryless model that again does not account for long-range correlations between nucleotides along the DNA sequence. This is where the power of Deep Learning to process raw genomic sequences and utilize the long memory of nucleotide linkage across the genome with RNNs / LSTMs and CNNs can be incredibly beneficial.

Potential of Deep Learning for Ancient Genomics

Currently, Deep Learning is basically ignored in Evolutionary Biology despite its great potential for working with Genomics data that represent one of major data sources in the present Evolutionary Science and Ancient DNA research areas. Recent attempts utilize simulated genomic data despite availability of real Genomics data that often have annotated regions with well understood biological function. Deep Learning is ideally suited for working with Genomics and Ancient Genomics simply because one genome is actually a Big Data . To realize this, just think about chopping the 3*10⁹ nucleotides long genome into stretches of 1000 nucleotides which brings 3*10⁶ training examples to be used for Deep Learning providing that an annotation (labels) can be figured out for each of the DNA stretches. That is a massive amount of training data!

Deep Learning for Genomics from Zou et al. Nature Genetics 51 , pages12–18 (2019)

I demonstrated how to use this idea for discriminating ancient from modern DNA sequences in my previous post . There I used a 1D Convolutional Neural Network (CNN) and a one-hot-encoded representation of the sequences . Here I am going to demonstrate another way to represent DNA sequences for input into Deep Learning. Please look at the left figure below, this is what we usually mean by text. However, this is not what people in Genomics mean by text. What they mean by text is shown to the right. DNA sequence is a text! It looks boring but with time you start seeing things in strings. What if I told you that I see a gene there? What if I told you that I see a gene strongly linked to Type 2 Diabetes (T2D) and we likely inherited this gene from Neanderthals?

What people usually mean by text (left), vs. what bioinformaticians mean by text (right), SLC16A11 gene

Now, if DNA sequence is a text we can apply all the apparatus from the Natural Language Processing (NLP) to such texts. However, where are sentences in the DNA texts and where are words? If we were to predict whether a bunch of DNA sequences (the bunch can be considered as a text) was inherited from Neanderthals or not, one DNA sequence from the bunch can be considered as a sentence of the text, and a k-mer (sub-sequence) can be treated as a word .

DNA sequence is a sentence that can be split into k-mers, space-delimited k-mers can be seen as words

Once we converted DNA sequences into space-delimited k-mers / words, we are done. Now we can turn to advanced NLP techniques and use e.g. a simple Bag Of Words model for comparing frequencies of words / k-mers between the sequences inherited from Neandertals and sequences of depleted archaic ancestry.

Prepare Sequences for Sentiment Analysis

Now let us demonstrate how to practically use Machine / Deep Learning and Natural Language Processing (NLP) for identifying the regions of Neanderthal introgression in modern human genomes. Here I will formulate the following problem to solve:

Show me a stretch of your DNA and I will predict how likely it was inherited from Neanderthals

As a training data set we are going to use coordinates of the candidate regions for Neanderthal introgression from Vernot and Akey, Science 2016 identified using S*-statistic on Europeans and Asians from the 1000 Genomes Project , the data can be downloaded from here https://drive.google.com/drive/folders/0B9Pc7_zItMCVWUp6bWtXc2xJVkk . We used the introgressed haplotypes file LL.callsetEUR.mr_0.99.neand_calls_by_hap.bed.merged.by_chr.bed and selected only unique coordinates, so we ended up with 83 601 regions of Neanderthal ancestry in modern Europeans. Let us read the coordinates and have a look at the length distribution of the Neanderthal introgressed regions.

We can see that the length of Neanderthal introgression segments varies from 10 kbp to 1.2 Mbp with the mean of ~100 kbp. Now we are going to use those coordinates and extract the actual sequences from the hg19 version of human reference genome which has a fasta-file format and can be downloaded from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ . We can of course use Python for sequence extraction but it is much faster to do it with the C++ based samtools .

Now we are going to build a Pandas DataFrame with coordinates of segments outside of the Neanderthal introgression coordinates. For this purpose, we are going to randomly draw DNA stretches of the same length as the introgression regions on the same chromosome and check whether these stretches intersect with any of the introgressed regions. In this way, we construct two sets of non-overlapping coordinates: introgressed and depleted regions.

Once the DataFrame containing regions of depleted Neanderthal ancestry has been built, we can again extract the actual DNA sequences from the hg19 fasta corresponding to the depleted segments with samtools , here I skip for brevity this step but check the complete Jupyter notebook on github to see the details. Taking a closer look at the extracted introgressed and depleted sequences we can notice quite a few N-nucleotide containing sequences. This is the missing data in Genomics, i.e. positions not resolved by the sequencing technologies. In order not to let the missing data affect our analysis I omitted the sequences containing at least one N-nucleotide, this can be efficiently done in BioPython:

Now the data is prepared for inputting into NLP analysis. Let us proceed with the simple Bag Of Words model, i.e. looking at the difference in frequency of k-mers between Neanderthal introgressed vs. depleted sequences.

Sentiment Analysis: Introgressed vs. Depleted

We are going to start with formatting the sequences from the two fasta-files as texts with space-delimited k-mers as words. Because of memory limitations of my laptop I read only first 10 000 nucleotides from each sequence, then I used the function getKmers that splits each sequence into k-mers and merged the k-mers in space-delimited manner, so at the end I had a list of sentences, each of them represents a list of words / k-mers.

Now we can easily visualize k-mer frequencies in Neanderthal introgressed and depleted sequences using the Counter class in Python for efficient word counting.

Although A- and T-rich k-mers seem to be most common for both Neanderthal introgressed and depleted regions, we observe small differences in the k-mer counts between the two cases which can be indicative for non-identical k-mer composition in the introgressed and depleted sequences. Next we encode the words / k-mers as integers with CountVectorizer class which simply builds a vocabulary of the text (number of unique k-mers) and counts occurrences of each word / k-mer in each sentence / sequence.

After we have split the data set into train and test sub-sets, we define a feed-forward neural network with Stochastic Gradient Descent (SGD) optimizer + Momentum and L1 weight regularizer.

MLP Accuracy training curve (left) and confusion matrix of evaluation on test data set (right)

The model reached the accuracy of 82.2% on the test data set classifying the sequences of Neanderthal introgressed vs. depleted origin. This was a clearly better result compared to linear models such as Logistic Regression, Support Vector Machines (SVM) or Naive Bayes Classifier, however, unexpectedly, the Random Forest Classifier reached even higher accuracy of 84.5% on the same test data set. Displaying the feature importances of the Random Forest model we observe such k-mers as AAAAA , CAAAA , CATTT and TTTTT to be the most predictive. We immediately get an intuition that the prediction of Neanderthal introgressed / depleted regions has something to do with the GC/AT content because the most predictive k-mers are extremely AT-rich .

Random Forest classifier confusion matrix (left) and feature importances dominated by AT-rich k-mers (right)

From the Confusion Matrix for Random Forest Classifier (above) it becomes clear that the model had a very high accuracy when predicting segments of depleted Neanderthal ancestry, however performed very poorly (worse than the neural network) on classifying introgressed regions.

Predict Genes Inherited from Neanderthals

Now let us come back to the question of predicting from a given DNA sequence whether it was inherited from Neanderthals or not. Here I will generate such predictions for human protein-coding genes. Let us download the RefSeq gene annotation file for hg19 here http://genome.ucsc.edu/cgi-bin/hgTables , clean it and use the coordinates of the genes in order to extract their sequences from the reference genome fasta-files similarly to how we did it for the introgressed and depleted regions, see details in the Jupyter notebook on github . Again we construct the gene texts via space-delimited concatenation of the k-mers.

Next, we are going to use the trained Random Forest classifier to generate predictions for the sequences of the genes that were previously converted to texts. Hence, the gene texts are converted to integers with CountVectorizer.

Genes predicted to have high (left) and low (right) probability to be inherited from Neanderthals

In this way, for each gene we generate a probability (Prob_1) to be inherited from Neanderthals and to be a native human gene (Prob_0). We can easily display the genes with highest Prob_1 (lowest Prob_0) and highest Prob_0 (lowest Prob_1), please see the gene lists above. Remarkably, out of ~22000 protein-coding genes, only 477 were predicted to contain Neanderthal DNA. That is a very interesting observation, we will revisit it later.

Visualization of K-mer / Word Embeddings

The vocabulary of the genomic texts produced by concatenating k-mers can be visualized via a Word Embedding model Word2Vec which makes intelligent guesses about similarity between words (k-mers in our case). Once fit on our data Word2Vec represents each k-mer as a 100-dimensional latent vector, the vectors can be viewed as another numeric representation of our data that can be input into any dimension reduction technique for further exploration. Here we are going to embed and visualize with UMAP and tSNE the k-mers used for the Neanderthal introgression vs. depletion Sentiment Analysis. We can detect two obvious k-mer clusters . A closer look shows that one of them seems to be enriched for AT-rich k-mers and the other one (smaller one) is predominantly composed of GC-rich k-mers.

I specifically highlighted by green the most predictive k-mers according to the Random Forest classifier that discriminate between Neanderthal introgressed vs. depleted sequences. As expected they all fall into the bigger AT-rich cluster. This Word Embedding visualization is yet another evidence of some structure present in the k-mer sentences, which is capable of predicting the Neanderthal ancestry and has something to do with the balance between GC- and AT-rich segments of DNA .

Evolution Eliminates Neanderthal DNA from Genes

Summarizing everything discussed in the previous sections one can conclude that the presence of Neanderthal ancestry in our DNA suspiciously correlates with the GC / AT-content across the genome. It is well known that our genes are GC-rich , their GC-content is around 47% compared to 41% genome-wide, which is a dramatic difference in GC-content.

Human genes are typically GC-rich which is indicative for absence of Neanderthal ancestry

Therefore we can ask how much actually the Neanderthal introgression and depletion regions overlap with the genes? We can quickly compute the number of intersects between the genes and introgression coordinates with bedtools intersect . However, since the coordinates of the Neanderthal depleted regions were previously selected randomly (we only demanded their non-overlapping with the introgression coordinates), we should repeat the selection procedure a number of times in order to estimate the significance of the intersects.

We can see that the number of intersects between the genes and Neanderthal introgression segments, 140 821 , is significantly lower than the number of intersects between genes and Neanderthal depleted regions. In fact, none of the 17 randomly drawn regions of Neanderthal depletion had the number of intersects with the genes below or equal 140 821, so we can compute p-value as p-value < 1 / 17. What we see is that the Neanderthal introgressed regions fall predominantly outside of the genes, which implies that Evolution did not prioritize the Neanderthal ancestry and tried to push it away from the most functional elements of our genome, which are protein-coding genes . Therefore for some mysterious reasons interbreeding with Neanderthals did not improve our fitness and Evolution tried to correct for this. They talk a lot in newspapers and magazines about a very cool fact of interbreeding between Modern Humans and Neanderthals, but they never mention that this was not beneficial for Modern Humans, do they?

Image source

Summary

In this post we have learnt about a huge potential of Machine / Deep Learning and Natural Language Processing (NLP) for Evolutionary Biology and Ancient DNA research areas, which is still not fully utilized . Genomics data provides one of the main sources of information in Evolutionary Biology and comprises millions and billions of short DNA fragments that can and should be analyzed with Deep Learning. Here we demonstrated how to prepare Genomics data for NLP and analyze it with Bag Of Words and Word Embedding . We trained a model capable of predicting whether a genomic sequence was inherited from Neanderthals. Using the model we constructed lists of genes that were likely inherited from Neanderthals and discovered a lack of archaic ancestry in our genes which suggests that interbreeding with Neanderthals was evolutionary not beneficial for modern humans .

Important to mention that I am currently working on various types of neural network architectures with long memories across the sequence such as RNNs / LSTMs, CNNs / multichannel CNNs with application to Ancient Genomics and am going to come back to this topic in my later posts.

As usually, let me know in the comments your favorite area in Life Sciences which you would like to address within the Deep Learning framework. Follow me at Medium Nikolay Oskolkov , in Twitter @NikolayOskolkov and connect in Linkedin . The complete Jupyter notebook can be found on my github . I plan to write the next post about How to predict population size back in time with Deep Learning , stay tuned.

Deep Learning on Neanderthal Genes