Machine Learning and Drugs

In January 2020, #WorldWar3 started to trend on social media platforms. Well, they say you should be careful what you wish for. Misery showed up at the world’s doorstep in the form of a deadly virus called SARS-CoV-2, the causative agent of COVID-19. As patients flood one hospital after another and the death toll surges everywhere from Madrid to Mumbai, a sprint to find medications has ensued. While no research has yielded a viable vaccine or treatment so far, pharmaceutical companies have adopted a different approach this time around. Rather than taking years to produce and test compounds from scratch, they are repurposing drugs that have already been approved for similar diseases and have acceptable safety profiles.



A growing appreciation for Machine Learning (ML) in the pharmaceutical industry has yielded a powerful ally in this new approach to drug discovery. ML can promote data-driven decision making and has the potential to speed up the process and reduce failure rates in drug discovery and development. It utilizes both simple linear equations and more complex non-linear models to enhance discovery and decision making for well-specified problems with abundant and high-quality data. 



One of the more well-established uses of ML in healthcare is rapid pathological diagnosis, especially with the trend of digitizing microscopic images. It allows for the early detection of cancers, which reduces the mortality due to this disease since most treatment methods fail against the late stages of cancer. Let us understand the various ML algorithms used in pathological diagnosis using the example of leukemia, which is commonly known as blood cancer. Leukemia can be treated effectively if detected early. However, it can cause death across all ages if allowed to progress without treatment, which makes it an apt target for testing new diagnostic techniques.



The Support Vector Machine (SVM), a binary classification algorithm, is used to classify blood sample images as showing lymphoid or myeloid stem cells and to label them accordingly as Acute Lymphoblastic Leukemia (ALL) or Acute Myeloid Leukemia (AML). For each nucleus in a sample, shape- and texture-based features are extracted and recorded, and the classifier works on these features. Using SVM, scientists achieved an accuracy of 92 percent.
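To make the idea of a binary SVM concrete, here is a minimal sketch of a linear SVM trained by sub-gradient descent on the hinge loss. It is not the method used in the study cited above, whose kernel and training procedure are not given here. The two features per nucleus, their values, the label encoding (+1 for ALL, -1 for AML), and the hyperparameters are all assumptions made up for illustration.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Minimise hinge loss plus an L2 penalty by sub-gradient descent.

    Labels must be +1 or -1. All hyperparameters are illustrative.
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            # The L2 penalty always shrinks the weights slightly
            weights = [w * (1 - lr * lam) for w in weights]
            if margin < 1:
                # Sample is misclassified or inside the margin:
                # move the boundary away from it
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
    return weights, bias


def svm_predict(weights, bias, x):
    """Return +1 or -1 depending on the side of the separating line."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1


# Hypothetical (shape, texture) features per nucleus, invented for this sketch
samples = [(0.90, 0.30), (0.85, 0.35), (0.80, 0.25),   # labelled AML
           (0.30, 0.80), (0.35, 0.90), (0.25, 0.85)]   # labelled ALL
labels = [-1, -1, -1, 1, 1, 1]                         # -1 = AML, +1 = ALL

weights, bias = train_linear_svm(samples, labels)
names = {1: "ALL", -1: "AML"}
print(names[svm_predict(weights, bias, (0.82, 0.30))])
print(names[svm_predict(weights, bias, (0.30, 0.88))])
```

A real pipeline would extract many more features per nucleus and would normally rely on a tested library implementation rather than a hand-written training loop.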



The K-Nearest Neighbour (KNN) algorithm is a ubiquitous classification tool with good scalability and is used to distinguish leukemic cells from healthy blood cells. It classifies a new object by its similarity to already labelled examples, under the assumption that similar things lie close to one another. It is used to group blasts in leukemic cells and classifies the cells into AML and ALL with an accuracy of 80 percent.
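The proximity idea can be sketched in a few lines: measure the distance from a new sample to every labelled sample, keep the k closest, and take a majority vote. The two features, their values, and the choice of k = 3 below are assumptions for illustration only, not taken from any study.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training sample
    distances = [
        (math.dist(features, query), label)
        for features, label in zip(train_features, train_labels)
    ]
    # Keep the k closest samples and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical per-cell features, invented for this sketch
train_features = [(0.90, 0.30), (0.85, 0.35), (0.80, 0.25),
                  (0.30, 0.80), (0.35, 0.90), (0.25, 0.85)]
train_labels = ["AML", "AML", "AML", "ALL", "ALL", "ALL"]

print(knn_predict(train_features, train_labels, (0.82, 0.30)))
print(knn_predict(train_features, train_labels, (0.30, 0.88)))
```

Note that every prediction scans the whole training set, so the cost of a query grows with the number of stored samples.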



Another popular algorithm is the Naive Bayes classifier, which requires only a small set of data for training (a considerable advantage when tackling new, emergent diseases). It is used to estimate the parameters needed for leukocyte classification. In Bayesian statistics, the posterior probability is the revised probability of an event after new information has been taken into consideration. Each white blood cell is assigned to the class with the maximum posterior probability; this approach reaches an accuracy of 80.88 percent.
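The maximum-posterior rule can be illustrated with a small Gaussian Naive Bayes sketch. Treating each feature as normally distributed within a class is an assumption of this example, as are the two features, their values, and the class names; the source does not say which variant was used.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(samples, labels):
    """Estimate a prior plus per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for features, label in zip(samples, labels):
        grouped[label].append(features)
    model = {}
    for label, rows in grouped.items():
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        # A small constant keeps every variance strictly positive
        variances = [
            sum((x - m) ** 2 for x in col) / len(col) + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (len(rows) / len(samples), means, variances)
    return model


def log_posteriors(model, features):
    """Unnormalised log posterior: log prior plus summed log likelihoods."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for x, m, v in zip(features, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        scores[label] = score
    return scores


def predict_class(model, features):
    """Pick the class with the maximum posterior probability."""
    scores = log_posteriors(model, features)
    return max(scores, key=scores.get)


# Hypothetical white blood cell features, invented for this sketch
cells = [(0.85, 0.40), (0.80, 0.45), (0.90, 0.42),
         (0.45, 0.70), (0.50, 0.65), (0.40, 0.72)]
cell_types = ["lymphocyte"] * 3 + ["neutrophil"] * 3

nb_model = fit_gaussian_nb(cells, cell_types)
print(predict_class(nb_model, (0.82, 0.43)))
print(predict_class(nb_model, (0.47, 0.68)))
```

Working with log probabilities avoids numerical underflow when many features are multiplied together, and it does not change which class has the maximum posterior.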



Neural network-based methods may also be used to classify blood smear images into healthy blood cells and leukemic blood cells. The preparatory steps include image preprocessing, clustering, and segmentation. Each hidden node receives the input values, multiplies them by its weights, and adds its bias to obtain its own input. The output of the final node is then calculated from the hidden-node outputs and classifies the cancerous cells into ALL and AML. Extending the capacity of an ordinary neural network, a Convolutional Neural Network (CNN) is a deep learning algorithm applied to biological image processing, here the classification of blood images. It removes the need for manual feature extraction and manual determination of differentiating parameters. Instead, features are learned directly from the training images, and the algorithm subsequently applies what it has learned to the test data for classification. The data is partitioned into training and testing sets in some predetermined ratio (typically 80:20), yielding accuracy figures of up to 97.78 percent.
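The hidden-node computation mentioned above can be sketched as a weighted sum of the inputs plus a bias, passed through an activation function. The sigmoid activation, the layer sizes, the weight values, and the 0.5 decision threshold are assumptions chosen for illustration; a trained network would learn its weights from data, and a CNN would add convolutional layers in front of this.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """For each node: weighted sum of the inputs plus its bias, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


# Hypothetical input features and hand-picked weights, not learned values
features = [0.82, 0.30]
hidden_weights = [[1.5, -2.0], [-1.0, 2.5]]   # two hidden nodes
hidden_biases = [0.1, -0.2]
output_weights = [[2.0, -2.0]]                # one output node
output_biases = [0.0]

hidden = layer_forward(features, hidden_weights, hidden_biases)
output = layer_forward(hidden, output_weights, output_biases)[0]

# Arbitrary convention for this sketch: outputs of 0.5 or more mean AML
label = "AML" if output >= 0.5 else "ALL"
print(hidden, output, label)
```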



On the surface, CNN seems to be the best bet for leukemia diagnosis, and potentially for diagnosing other diseases, with an accuracy of nearly 98 percent. However, it is not the answer to everything. It behaves like a black box: it can approximate any function, but studying its structure does not give us any insight into the structure of the function being approximated. Although KNN and Naive Bayes give less accurate results than CNN, they remain popular because of their simplicity, which lets each attribute contribute equally towards the final decision. This simplicity also translates into computational efficiency. According to Occam’s razor, the most straightforward and direct solution should be preferred.


Fig. 1: Neural Network


Pharmaceutical companies have scored some other notable successes in applying these algorithms to solve day-to-day tasks:

  1. APLD (Anonymised Patient-Level Data) and Truven Markets are patient claims databases used to identify patients whose characteristics are similar to those of other patients with the same diagnosis codes. This approach can also be used to find patients with rare diseases.
  2. Patient Journey and Treatment Pathways refer to the process of finding how a patient progresses from one disease state to another through multiple lines of therapies. Using clustering and scoring models, ML can help assess which treatments or drugs should be recommended for patients based on historical outcomes and success rates of treatments.
  3. Based on data collected from satellites, real-time social media updates, historical information on the web, and other sources, ML technologies are being applied to monitor and predict epidemic outbreaks around the world. 
  4. Advanced predictive analytics can identify candidates for clinical trials accurately, rapidly, and at a low cost since they draw inferences from a much wider range of data. They can also find the best sample size to increase efficiency and reduce data errors.
  5. ML can also be used for remote monitoring and real-time data access for increased safety; for example, monitoring biological and other signals for any sign of deterioration in a patient’s condition.
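As a rough illustration of the scoring idea in item 2, the sketch below ranks treatments by their observed success rate in historical records. The treatment names and outcomes are invented, and a real scoring model would also condition on patient characteristics and on the line of therapy rather than using a single overall rate.

```python
from collections import defaultdict


def rank_treatments(history):
    """Rank treatments by observed success rate, best first."""
    counts = defaultdict(lambda: [0, 0])   # treatment -> [successes, total]
    for treatment, succeeded in history:
        counts[treatment][0] += int(succeeded)
        counts[treatment][1] += 1
    rates = {t: s / n for t, (s, n) in counts.items()}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Hypothetical (treatment, outcome) records, invented for this sketch
history = [("drug_a", True), ("drug_a", False), ("drug_a", True), ("drug_a", True),
           ("drug_b", True), ("drug_b", False), ("drug_b", False), ("drug_b", False)]

ranking = rank_treatments(history)
print(ranking)
```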



With all the potential ML has, it is easy to lose sight of the fact that realizing this dream will be far from easy. Most healthcare companies’ existing IT infrastructures are based on legacy systems that were not designed with ML in mind. The data is often kept in unwieldy free-form formats, and the systems lack interoperability. Furthermore, because patient confidentiality must be maintained, companies often do not have access to the data they need. The availability of clean data at scale is therefore a fundamental prerequisite for the growth of AI-based solutions. Existing genetics and clinical trial databases predominantly contain data from Caucasian populations, which means that entire geographic regions and ethnic groups are at risk of being left out of this technology.



Keeping in mind the aforementioned challenges, pharmaceutical companies should continue efforts to digitize pathological slides since more extensive libraries for multiple diseases and pathologies are needed. These libraries can serve as robust databases that can be used to train and validate future models. In addition to the quantitative increase in sample numbers, libraries can allow for sets to be more diverse and not limited to a specific population.



Diagnostic accuracy is not the only advantage of ML models; they also increase clinical care efficiency in terms of cost and workflow. The methods discussed above are only a peek into this emerging discipline of data-driven healthcare. Several new algorithms and approaches are under development. ML approaches applied to data collected from an amalgamation of Internet-enabled technologies, coupled with biological data, have the potential to dramatically improve the predictive power of such algorithms and aid medical decision making. Although seemingly large enough medical data sets and adequate learning algorithms, described in thousands of research papers, have been available for many decades, only very few have contributed meaningfully to clinical care. This lack of impact stands in stark contrast to the enormous relevance of machine learning to almost every other industry. If we can achieve this dream of data-driven healthcare, we might take a big step towards universal healthcare and might prevent acute diseases from taking hold.
