## Summary
Thanks to the increasing availability of genomics and other biomedical data, many machine learning algorithms have been proposed for a wide range of therapeutic discovery and development tasks. In this survey, we review the literature on machine learning applications for genomics through the lens of therapeutic development. We investigate the interplay among genomics, compounds, proteins, electronic health records, cellular images, and clinical texts…

## The bigger picture
The genome contains instructions for building the function and structure of organisms. Recent high-throughput techniques have made it possible to generate massive amounts of genomics data. However, there are numerous roadblocks on the way to turning genomic data into tangible therapeutics. We observe that genomics data alone are insufficient for therapeutic development. We need to investigate how genomics data interact with other types of data such as compounds, proteins, electronic health records, images, and texts… Machine learning techniques can be used to identify patterns and extract insights from these complex data. In this review, we survey a wide range of genomics applications of machine learning that can enable faster and more efficacious therapeutic development. Challenges remain, including technical problems such as learning under different contexts given low-resource constraints, and practical issues such as mistrust of models, privacy, and fairness…

Recent advances in high-throughput technologies have led to an outpouring of large-scale genomics data.^6,7^ However, the bottlenecks along the path of transforming genomics data into tangible therapeutics are innumerable… ^12^ We argue that genomics data alone are insufficient to ensure clinical implementation; rather, it requires the integration of a diverse set of data types, from compounds, proteins, cellular images, and EHRs to scientific literature.
This heterogeneity and scale of data enable the application of sophisticated computational methods such as machine learning (ML)… We also formulate these tasks and data modalities in the language of ML, which can help ML researchers with limited domain background understand these tasks. In summary, this survey presents a unique perspective on the intersection of ML, genomics, and therapeutic development… In the penultimate section, we identify seven open challenges that present numerous opportunities for ML model development and also novel applications. We provide a GitHub repository (https://github.com/kexinhuang12345/ml-genomics-resources) that curates a list of resources discussed in this survey…

#### Texts
One common categorization of texts is structured versus unstructured data. Structured data follow a rigid form and are easily searchable, whereas unstructured data come in a free-form format such as raw text. Although unstructured data are more difficult to process, they contain crucial information that usually does not exist in structured data. The first important example of text encountered in therapeutics development is the clinical trial design protocol, where text describes inclusion and exclusion criteria for trial participation, often as a function of genomic markers…

##### Machine learning representations
Clinical texts are similar to texts in common natural language processing. The standard way to represent them is a matrix of size M×N, where *M* is the vocabulary size and *N* is the number of tokens in the text. Each column is a one-hot encoding of the corresponding token. An example is depicted in Figure 2H…

### Machine learning methods for biomedical data
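To make this matrix representation concrete, here is a minimal sketch; the vocabulary, tokenizer (whitespace splitting), and example phrase are illustrative assumptions, not taken from any specific dataset.

```python
import numpy as np

def one_hot_matrix(tokens, vocab):
    """Return an M x N matrix: M = vocabulary size, N = number of tokens.
    Column j is the one-hot encoding of tokens[j]."""
    index = {tok: i for i, tok in enumerate(vocab)}
    mat = np.zeros((len(vocab), len(tokens)), dtype=int)
    for j, tok in enumerate(tokens):
        mat[index[tok], j] = 1
    return mat

# Hypothetical vocabulary and clinical-text snippet.
vocab = ["patient", "egfr", "mutation", "positive", "excluded"]
tokens = "patient egfr mutation positive".split()
X = one_hot_matrix(tokens, vocab)
print(X.shape)  # (5, 4): M = 5 vocabulary entries, N = 4 tokens
```

Each column sums to one, since every token maps to exactly one vocabulary entry.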
ML models learn patterns from data and leverage these patterns to make accurate predictions. Numerous ML models have been proposed to tackle different challenges. This section briefly introduces the main mechanisms of popular ML models used to analyze genomic data. Figure 3 describes a typical ML workflow for genomics data. We also provide a list of public benchmarks and competitions that compare the discussed ML methods in Table 2…

#### Preliminary
A typical ML model for genomics works as follows. Given a set of data points, where each data point consists of input features and a ground-truth biological label, an ML model aims to learn a mapping from input to label based on the observed data points, which are also called training data. This setting of predicting by leveraging known labels is called supervised learning. The size of the training data is called the sample size. ML models are data-hungry and usually need a large sample size to perform well…

The input features can be DNA sequences, compound graphs, or clinical texts, depending on the task at hand. The ground-truth label is usually obtained via biological experiments. The ground truth also defines the goal for an ML model to achieve. Thus, if the ground-truth label contains errors (e.g., human labeling error or wet-lab experimental error), the ML model could optimize over the wrong signals, highlighting the necessity of high-quality data curation and quality control…

It is also worth mentioning that the input can present quality issues as well, such as shifts in cell images, batch effects in gene expression, and measurement errors. Ground-truth labels come in various forms. If the labels are continuous (e.g., binding scores), the learning problem is a regression problem; if the labels are discrete (e.g., the occurrence of an interaction), it is a classification problem…

When labels are not available, an ML model can still identify the underlying patterns within the unlabeled data points. This problem setting is called unsupervised learning, whereby models discover patterns or clusters (e.g., cell types) by modeling the relations among data points. Self-supervised learning uses supervised learning methods to handle unlabeled data.
It creatively produces labels from the unlabeled data (e.g., masking out a motif and using the surrounding context to predict the motif).^14,23^… Classic ML models are simple to implement and highly scalable, and they can serve as strong baselines. However, they only accept real-valued vectors as inputs and do not naturally accommodate diverse biomedical entity types such as sequences and graphs. Moreover, these input vectors are usually features engineered by humans, which further limits their predictive power. Examples are shown in Figures 4A and 4B.
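As a concrete illustration of such a baseline, here is a minimal sketch of a classic model (a nearest-centroid classifier) operating on hand-engineered binary fingerprint-style vectors; the data are synthetic and the 8-bit features are purely hypothetical.

```python
import numpy as np

# Synthetic "fingerprint" vectors: 8 binary features per compound (toy data).
X_train = np.array([
    [1, 1, 0, 0, 1, 0, 0, 0],  # class 0 examples
    [1, 0, 1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 1, 0],  # class 1 examples
    [0, 1, 0, 1, 0, 1, 0, 1],
])
y_train = np.array([0, 0, 1, 1])

# Classic baseline: summarize each class by its mean feature vector (centroid)
# and assign a new point to the nearest centroid in Euclidean distance.
centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])

def predict(x):
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

print(predict(np.array([1, 1, 1, 0, 1, 0, 0, 0])))  # → 0 (resembles class 0)
```

Note that the model sees only the fixed-length real-valued vector: any structure of the underlying molecule not captured by the hand-crafted features is invisible to it, which is exactly the limitation discussed above.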
##### Suitable biomedical data
Any real-valued feature vectors built upon biomedical entities, such as SNP profiles and chemical fingerprints… They have also been successfully adapted for state-of-the-art performance on proteins^36^ and compounds.^37^ Transformers are powerful, but they are not scalable due to the expensive self-attention computation, whose cost grows quadratically with sequence length. Despite several recent advances that increase the maximum input length to the order of tens of thousands of tokens,^38^ this limitation still prevents their use on extremely long sequences such as genome sequences, which usually require partitioning and aggregation strategies. An example is depicted in Figure 4E.
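The partitioning step mentioned above can be sketched as overlapping sliding windows; the window and overlap sizes here are illustrative, and per-window model outputs would then be aggregated (e.g., averaged) downstream.

```python
def partition_sequence(seq, window=512, overlap=64):
    """Split a long sequence into overlapping fixed-length windows that fit a
    length-limited model; window/overlap sizes here are illustrative."""
    step = window - overlap
    chunks = []
    for start in range(0, max(len(seq) - overlap, 1), step):
        chunks.append(seq[start:start + window])
    return chunks

seq = "ACGT" * 300          # a 1,200-base toy sequence
chunks = partition_sequence(seq)
print(len(chunks), len(chunks[0]))  # → 3 512
```

The overlap preserves context around window boundaries, so a motif straddling a cut point still appears whole in at least one window.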
##### Suitable biomedical data
DNA sequences, protein sequences, texts, and images…

One disadvantage of AEs is that they model the training data, whereas in single-cell analysis the test data can come from settings different from the training data. It is thus challenging to obtain accurate latent embeddings with AEs on novel test data. An example is depicted in Figure 4G.
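The encode–compress–reconstruct idea behind AEs can be illustrated with a linear autoencoder, whose squared-error-optimal solution coincides with PCA and can therefore be computed in closed form via the SVD; the toy expression matrix below is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "gene expression" matrix: 100 cells x 20 genes with 2D latent structure.
latent = rng.normal(size=(100, 2))
weights = rng.normal(size=(2, 20))
X = latent @ weights + 0.01 * rng.normal(size=(100, 20))

# Linear autoencoder via SVD: the encoder projects onto the top-k right
# singular vectors, and the decoder maps the latent code back to gene space.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
Z = Xc @ Vt[:k].T          # latent embeddings (100 x 2)
X_hat = Z @ Vt[:k]         # reconstruction in the original gene space

err = np.mean((Xc - X_hat) ** 2)
print(Z.shape, float(err))  # near-zero error: 2 latent dims suffice here
```

A nonlinear AE replaces the two linear maps with neural networks trained by gradient descent, but the train/test distribution-shift caveat above applies either way: the projection is fit only to the training cells.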
##### Suitable biomedical data
Unlabeled data…

##### Suitable biomedical data
Data in which new variants can have more desirable properties (e.g., molecule generation for drug discovery).^47,48^ Depending on the data modality, different encoders can be chosen for the generative models…

In this section, we review ML for genomics tasks in target discovery. First, we review six tasks that use ML to facilitate understanding of human biology; second, we describe four tasks in which ML helps identify druggable biomarkers more accurately and more quickly…

Various methods, including linear regression^77^ and support vector machines,^78^ are used to predict a cell-composition vector that, combined with the signature matrix, approximates the gene expression. In these works the signature matrix is pre-defined, which may not be optimal; a learnable signature matrix could lead to improved accuracy…

Andersson et al.^80^ model various cell-type-specific parameters using a customized probabilistic model. As spots in a slide have spatial dependencies, modeling them as a graph can further improve performance. Notably, Su and Song^81^ initiated the use of graph convolutional networks to leverage information from similar spots in spatial transcriptomics…

There are two major challenges for this task. The first is the quality of the gold-standard annotations, as the cell-proportion estimates are usually noisy; this calls for ML methods that can model label noise.^82^ The second is that the proportions depend heavily on phenotypes such as age, gender, and disease status, and incorporating this information into ML models would also be valuable for more accurate deconvolution…

Despite the promises, gene network construction is difficult due to the sparsity, heterogeneity, and noise of gene expression data, particularly across the diverse datasets arising from the integration of scRNA-seq experiments.
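Returning to the deconvolution formulation above, fitting a cell-composition vector against a pre-defined signature matrix can be sketched with ordinary least squares; the signature matrix and mixture below are synthetic, and the clip-and-renormalize step is a crude stand-in for the constrained solvers used in practice.

```python
import numpy as np

# Synthetic signature matrix S: expression of 6 genes across 3 cell types.
S = np.array([
    [9.0, 1.0, 0.5],
    [8.0, 0.5, 1.0],
    [0.5, 7.0, 1.0],
    [1.0, 9.0, 0.5],
    [0.5, 1.0, 8.0],
    [1.0, 0.5, 9.0],
])
p_true = np.array([0.5, 0.3, 0.2])   # ground-truth cell proportions
bulk = S @ p_true                     # observed bulk expression profile

# Least-squares estimate of the proportions, then enforce non-negativity
# and a sum-to-one constraint by clipping and renormalizing.
p_hat, *_ = np.linalg.lstsq(S, bulk, rcond=None)
p_hat = np.clip(p_hat, 0, None)
p_hat = p_hat / p_hat.sum()
print(np.round(p_hat, 3))  # → [0.5 0.3 0.2]
```

With a noisy bulk profile or a misspecified signature matrix the recovery degrades, which is exactly why a learnable signature matrix and noise-aware models are attractive.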
The clinical validation of the predicted gene associations also poses challenges, since it is difficult to screen such a large set of predictions…

Raw sequencing outputs usually consist of billions of short reads, which are aligned to a reference genome. In other words, for each locus we have a set of short reads that cover that locus. Since sequencing techniques have errors, the challenge is to predict the variant status of each locus accurately from its set of reads. Manually processing such a large number of reads to identify each variant is infeasible, so efficient computational approaches are needed for this task…

Many other deep learning-based methods have been proposed to tackle more specific challenges, such as handling long sequencing reads with LSTMs.^90^ Benchmarking efforts have also been conducted.^91^ Although most methods achieve greater than 99% accuracy, thousands of variants are still called incorrectly, because the genome sequence is extremely long.
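To make the per-locus setup concrete, here is a deliberately naive variant caller that takes the bases observed at one locus across a read pileup and calls a variant by allele fraction; the threshold and data are illustrative, and real callers additionally model base quality, mapping quality, and diploid genotypes.

```python
from collections import Counter

def call_locus(ref_base, pileup_bases, min_alt_fraction=0.25):
    """Call a variant at one locus from the bases observed in the reads
    covering it. Returns (alt_base, fraction) if the most frequent
    alternate allele exceeds the threshold, else None."""
    counts = Counter(pileup_bases)
    alt, n = max(
        ((b, c) for b, c in counts.items() if b != ref_base),
        key=lambda bc: bc[1],
        default=(None, 0),
    )
    frac = n / len(pileup_bases)
    return (alt, frac) if alt is not None and frac >= min_alt_fraction else None

# 10 reads cover this locus: 4 support an A>G substitution, 1 'T' is a
# likely sequencing error that falls below the threshold.
print(call_locus("A", list("AAGAGAGGAT")))  # → ('G', 0.4)
```

Even this toy example shows why a fixed threshold misclassifies loci at the margins; across billions of loci, a tiny per-locus error rate still yields thousands of incorrect calls, consistent with the benchmarking results cited above.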

