Bioinformatics is an interdisciplinary research area concerned with developing new algorithms, methodologies, and software tools for the analysis of biological data. It combines computer science, statistics, mathematics and engineering to study biological data and extract new knowledge from it. With the introduction of high-throughput technologies, it has become a vital part of many areas of biology. Bioinformatics provides the methods and software tools for sequencing and annotating genomes and their observed mutations, such as single nucleotide polymorphisms (SNPs).
Analysis of whole-genome and transcriptome sequencing results is useful for simulating and modeling DNA, RNA and protein structures, as well as molecular interactions. The study of gene expression data by means of machine learning and data mining algorithms can lead to the discovery of regulatory mechanisms related to particular diseases such as cancer.
Translational bioinformatics applies the results of genomic and biomedical data analysis to the clinical field in order to improve human health. In experimental molecular biology, bioinformatics techniques such as image and signal processing allow useful results to be extracted from the large amounts of raw data produced by microscopes and spectrometers.
Bioinformatics also includes text mining techniques for extracting knowledge from biological scientific articles, and for developing ontologies to organize and query biological data and the extracted information. At a more integrative level, it helps to analyze, catalog and functionally annotate biological entities such as genes, proteins, metabolic pathways and gene regulatory networks.
In a certain sense, the idea of informatics applied to biology was born even before the use of bits: in the 1950s Frederick Sanger introduced protein sequencing, work for which he was awarded the Nobel Prize in Chemistry. The technique was initially manual and was later refined up to the automation proposed by Pehr Edman in 1967. Sequencing and computer science were destined to go hand in hand.
The need to carry out complex calculations on enormous amounts of data made it clear from the beginning that simulations supported by computing disciplines and devices would be required. In the 1950s IBM introduced the FORTRAN language, which made it possible to program the first computers used in research centers, while in the following decade researchers at MIT pioneered 3D molecular modeling by reconstructing a model of the structure of cytochrome c, opening a new era in molecular biology.
Since then, the synergy between biology and computer science has been a constant. A series of milestones has opened new avenues of research that continue today, with a proliferation of new programs able to perform increasingly complex calculations thanks to increasingly innovative methodological approaches.
In 1970, Saul Needleman and Christian Wunsch published a new algorithm able to compare two sequences effectively for similarity analysis. In 1977, Allan Maxam and Walter Gilbert introduced a method for sequencing DNA, inaugurating the era of gene sequencing, in which the contribution of bioinformatics went from useful to indispensable, all the more so because DNA and protein sequences are naturally represented as strings of characters.
Thanks to a growing variety of computer programs, researchers were able to use digital files to store, print, and identify DNA sequences, as well as to translate them into amino acid sequences.
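To make the idea of sequences as strings of characters concrete, here is a minimal Python sketch of the translation of a DNA string into an amino acid string. It assumes the standard genetic code, reads the sequence in a single frame from its first base, and stops at the first stop codon; the names `CODON_TABLE` and `translate` are illustrative and do not come from any particular program of the period.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (third base varies fastest). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA string into one-letter amino acid codes.

    Assumption: a single reading frame starting at position 0.
    Translation stops at the first stop codon; trailing bases that
    do not form a complete codon are ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]  # KeyError on non-ACGT symbols
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTAA"))
```

Real tools also handle ambiguous bases, alternative genetic codes and all six reading frames, which this sketch deliberately leaves out.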
Structural bioinformatics is the branch of bioinformatics that deals with the analysis and prediction of the 3D structure of proteins, RNA and DNA. Like structural biology, its computational counterpart aims to create new methods for processing data about macromolecules in order to solve concrete problems, for example in medicine and pharmacology.
To grasp the practical sense of structural bioinformatics it helps to focus on the difference between “sequence” and “structure”. What is surprising is how far apart a sequence and the actual shape of the protein it encodes, as captured by the 3D model of its molecular structure, can appear. Seemingly similar amino acid sequences can turn out to correspond to protein structures that look nothing alike, since they fold into very different three-dimensional shapes.
The calculation of protein folding is one of the areas of study in which bioinformatics has provided a fundamental contribution, moving from laboratory techniques to simulations based on distributed computing, such as the collaborative projects Rosetta@Home and Folding@Home, which employ the unused resources of computers to support complex simulations of protein structures.
Another initiative open to everyone in support of protein folding research is Foldit, a real game in which each user can use his or her skill to fold three-dimensional structures and share them with the project’s global platform.
While the three projects mentioned above are accessible to everyone, other initiatives are reserved for research groups, such as the well-known CASP (Critical Assessment of protein Structure Prediction), which has been held every two years since 1994.
University and private teams (supported by multinational AI companies) compete to identify computational methods capable of predicting the 3D structures of proteins as accurately as possible. The goal of CASP is to find solutions that are more efficient, faster, and cheaper than laboratory technologies such as X-ray crystallography or NMR spectroscopy.
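To give a sense of what "as accurately as possible" can mean in practice, the sketch below computes the root-mean-square deviation (RMSD) between a predicted and a reference set of atomic coordinates. This is only an illustration: it assumes the two structures have already been superimposed and list the same atoms in the same order, and it is not CASP's own scoring procedure, which relies on more elaborate measures such as GDT_TS. The function name `rmsd` and the `Point` alias are illustrative.

```python
import math

# A 3D coordinate (x, y, z), for example in angstroms.
Point = tuple[float, float, float]


def rmsd(predicted: list[Point], reference: list[Point]) -> float:
    """Root-mean-square deviation between two equally long coordinate lists.

    Assumption: both structures are already superimposed and the i-th
    point of each list refers to the same atom.
    """
    if not predicted or len(predicted) != len(reference):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared_distance_sum = sum(
        (px - rx) ** 2 + (py - ry) ** 2 + (pz - rz) ** 2
        for (px, py, pz), (rx, ry, rz) in zip(predicted, reference)
    )
    return math.sqrt(squared_distance_sum / len(predicted))


# Toy coordinates, invented purely for illustration.
model = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
experiment = [(0.0, 0.0, 0.0), (1.5, 0.5, 0.0), (3.0, 0.0, 0.5)]
print(round(rmsd(model, experiment), 3))
```

A lower value means the predicted atoms sit closer to the experimentally determined ones.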
Systems biology is a biological discipline that studies living organisms as systems capable of evolving over time, using dynamic models built on data from high-throughput technologies, from which its link with bioinformatics derives.
In other words, systems biology combines the procedures of general systems theory, bioinformatics, mathematics and statistics to create dynamic models capable of describing and explaining the functioning of biological systems.
Unlike molecular biology in its broadest sense, systems biology does not focus specifically on sequence and structure, but goes beyond the single mechanism to investigate the dynamic interactions that molecules establish with each other as a function of time.
Genes define every living organism, but they are not the only source of biological information. The networks of interactions between genes, as well as structural protein networks and metabolic networks, provide an extremely broad body of data that helps to understand how organisms function and to identify both the causes of a disease and its possible therapies.
Thanks to the use of mathematics, computer science, parallel computing and data mining techniques, it is possible to analyze the complexity of genetic interaction networks and other major biological networks to discover new information useful for research.
The challenge is open and is the focus of many research projects around the world, driven by the awareness that no single source of biological data can explain the processes at work in an organism. It is therefore necessary to reason in terms of “biological networks”, isolating their component parts and then relating them to one another to obtain a unified picture of their complexity.
In other words, to explain the functioning of a cell it is necessary to consider many thousands of proteins and all the possible interactions among them. The sheer size of the problem requires a computational system based on algorithms capable of processing an enormous dataset and performing all the analyses the research requires.
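The scale of the problem can be illustrated with a small Python sketch. It counts the possible pairwise interactions among n proteins, n(n-1)/2, and represents a toy interaction network as an adjacency map from which highly connected proteins ("hubs") can be read off. The protein names, the interaction list and the function names are all hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def possible_pairs(n: int) -> int:
    """Number of distinct unordered pairs among n proteins: n(n-1)/2."""
    return n * (n - 1) // 2


def build_network(interactions):
    """Build an undirected network as a mapping protein -> set of partners."""
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)
    return network


def hubs(network, min_degree: int):
    """Return, in sorted order, the proteins with at least min_degree partners."""
    return sorted(p for p, partners in network.items() if len(partners) >= min_degree)


# Hypothetical interaction list, invented for illustration.
toy_interactions = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P5", "P6")]
toy_network = build_network(toy_interactions)

print(possible_pairs(10_000))   # candidate pairs among ten thousand proteins
print(hubs(toy_network, 3))     # proteins with three or more partners
```

Even before any biology is modeled, ten thousand proteins already give tens of millions of candidate pairs, which is why dedicated algorithms and parallel computing are needed.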
The applied disciplines of bioinformatics
Alongside the many areas of study that involve bioinformatics, this introductory review focuses on three reference disciplines: computational biology, computational genomics and biorobotics. They correspond to the three main strands along which research and application are evolving, thanks to the support of the most advanced computational technologies.
Computational biology is in some ways a synonym of bioinformatics itself, since it shares its general premise: the use of algorithms to solve problems of a biological nature.
As we have seen, one of the first areas in which biology and computer science benefited from their collaboration is sequence alignment, used for example to understand whether the similarity between two sequences is the result of a common ancestor.
The concept behind alignment algorithms is relatively simple compared with the complexity of the operations they carry out: given a scoring system, the algorithm must align two sequences with the highest possible score. Various types of algorithm exist.
Exact algorithms, such as the already mentioned Needleman-Wunsch (global gapped alignments) and Smith-Waterman (local gapped alignments), are very precise but require very high computational resources. Heuristic algorithms, of more recent conception, do not guarantee the best alignment but stand out for their remarkable speed. This is the case of the BLAST algorithm (ungapped local alignments).
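The difference between global and local exact alignment can be sketched in a few lines of Python. The two functions below fill the dynamic programming table in the style of Needleman-Wunsch and Smith-Waterman and return only the best score, not the alignment itself. The scoring system (match +1, mismatch -1, linear gap penalty -1) is an assumption chosen for simplicity; real analyses typically use substitution matrices and more refined gap penalties.

```python
def needleman_wunsch_score(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Best global alignment score of a and b (whole sequences, end to end)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]


def smith_waterman_score(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Best local alignment score: the best-scoring pair of substrings."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            score[i][j] = max(0, diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            best = max(best, score[i][j])
    return best


print(needleman_wunsch_score("ACGT", "AGT"))
print(smith_waterman_score("TTTACGTTT", "GGACGGG"))
```

Both functions visit every cell of a table whose size is the product of the two sequence lengths, which is why exact methods become expensive on large databases and why heuristic tools trade guaranteed optimality for speed.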
The scenario varies significantly with the type of sequences analyzed, so for each case it is necessary to choose the algorithm and the substitution matrix best suited to producing an optimal alignment.
This brief look at the underlying mathematics does not convey the complexity of the whole subject, but it does highlight the considerable variety of computational aspects involved in molecular biology, and hence the need for multidisciplinary research teams.
Today’s computational resources make it possible, for example, to run very complex simulations to resolve molecular sequences and structures, thanks to the most advanced artificial intelligence techniques, including deep learning.
An extraordinary example is AlphaFold, an open source project developed by DeepMind to tackle the protein folding problem, that is, the prediction of the 3D structure of proteins. Triumphant in the last edition of the already mentioned CASP, the well-known London-based AI lab, owned by Google’s parent company Alphabet, has produced a result that is opening a new era in molecular biology: it promises to progressively reduce the costs of research and to make it much faster and more responsive to the growing needs of the medical and pharmacological fields.
Computational genomics is the science that applies statistical and computational analysis to genome sequencing data in order to study the functions of regions of the genome and to guide research in biology, medicine and pharmacology.
The reconstruction of the human genome sequence is in some ways the event that established computational genomics as a key discipline in the study of biological processes.
It is a discipline in continuous evolution, which in the space of a few years has seen scholars questioning various approaches, from the traditionally reductionist one, focused in a systemic way on single genes, to the most recent experimental methods, which exploit the progressive evolution of computational systems to obtain an increasing amount of genomic data with increasingly reduced time and costs.
To give an idea, the oft-mentioned reconstruction of the human genome sequence originally required billions of dollars of research funding and over ten years of hard work; today the same task could be completed in a few days of computation, at a cost that is negligible in comparison.
One of the most fascinating aspects of computational genomics is the variety of possible approaches to research. Data-driven genomics is a perfect example: it leads to the constant creation of new models and languages for analyzing data in an increasingly interoperable way, so that multiple research efforts can be coordinated in parallel, in the direction of a true Internet of Genomics, to paraphrase the better-known and overused IoT (Internet of Things).
The GeCo research project of the Politecnico di Milano, for example, is based on these premises: it uses large public databases, and it can be used at the Cineca Consortium or freely downloaded from the university’s servers.
One of GeCo’s secondary objectives is to enable healthcare facilities to use their datasets for precision medicine, in order to identify personalized care pathways based on the clinical data of each patient.
Artificial intelligence and genomics
The contribution of artificial intelligence systems to the analysis and cross-referencing of complex biomedical datasets is accelerating the pace of human genome research. According to a study by the PHG Foundation, within the University of Cambridge, the clinical application of genomics will in future advance in parallel not only with the emergence of new scientific knowledge, but also with progress in the techniques that belong to the field of artificial intelligence. This holds, however, only if research laboratories fully exploit all the potential that AI offers today: only in this way can the challenge of bringing artificial intelligence increasingly into clinical practice be won.
Biorobotics: robotics and biology meet
Among the most fascinating disciplines to which computer science makes a solid contribution is undoubtedly biorobotics, a science that draws inspiration from nature to create robots capable of adapting to increasingly extreme situations.
Biorobotics is based on the disciplinary contributions of robotics, bioengineering and artificial intelligence, with specific reference to forms of intelligence different from the human one. The observation of the behavior of plants and animals inspires robotic creations both in terms of form and functional aspects.
A biorobot can, for example, be inspired by the adaptive capacity of plants, which modify themselves as they grow, in order to detect the improper release of toxic substances and intervene to reduce the threat without putting the safety of human operators at risk.
It is not uncommon to meet biorobots whose shape is inspired by that of a particularly well-known animal, as in the case of the Silver 2 and Octopus prototypes, respectively crab and octopus robots developed by the Institute of Biorobotics of the Scuola Superiore Sant’Anna in Pisa, with the aim of exploring the seabed, as well as cleaning it of pollutants.
Robots whose appearance and behavior are inspired by animals also draw on other species, including snakes, widely used as a model for endoscopes and inspection devices whose fundamental requirement is the ability to move effectively even in narrow and tortuous spaces.
Also remarkable is the contribution of robotic bees, designed both to support insects in the pollination of flowers and, in other applications, to enter the human body to treat cancerous diseases of the intestine or to clean blood vessels of dangerous accumulations.
Other examples of biorobotics are related to prostheses and exoskeletons, inspired by the external coatings of arthropods, capable of functioning as real limbs, both enhancing the movement of legs and arms, as well as giving support to the spine, and even being governed by the brain thanks to the connections of electrodes with the nerve terminals of the human body.
Bioinformatics successfully supports the disciplines related to biological studies, and its applications can also trigger profound innovation, broadening the horizons set by traditional approaches or radically increasing the efficiency of processes already in place. Let’s look at two of the main applications of bioinformatics, in medicine and pharmacology.
Precision medicine and predictive medicine
Precision medicine, otherwise known as personalized medicine, is a discipline that uses clinical knowledge of each patient to try to provide them with the best possible therapeutic support, with a level of effectiveness generally well above that of generalist approaches.
Of course, this is not a new concept, but advances in genomics, due in large part to the contribution of bioinformatics, are greatly accelerating studies and research in the biomedical field.
The discoveries and studies carried out on the human genome are beginning to guarantee results on the front of diagnosis, treatment and, consequently, the rationalization of public spending, resulting from a more efficient use of available drugs.
The analysis of how an individual organism functions, together with an understanding of its metabolic capacities, can be decisive in designing a personalized therapeutic path and identifying the drugs most likely to work for that individual. The areas in which precision medicine can offer significant advantages include oncology, immunology, cardiology, psychiatry and neurology. The benefits of personalization are felt throughout therapy, right from the stage at which it is defined.
In the case of cancer patients, for example, it is possible to assess with greater confidence the dosages of invasive therapies such as radiotherapy and chemotherapy, avoiding excessive toxicity. Also significant are the possible benefits in the treatment of chronic inflammatory diseases, which require drugs to be taken over a very long period. In these cases, it becomes essential to avoid interactions with drugs that may be needed to treat other diseases.
Personalized medicine also raises ethical questions, given the market value of the genetic data collected by personalized genomics companies.
Another fundamental application is predictive medicine, the branch of medicine that uses genetic knowledge of individuals to estimate their risk of developing a given disease in the short, medium and long term. This facilitates periodic diagnostics, with particular attention to the therapies to be used both in the preventive phase and at the first symptoms of the disease.
In the case of respiratory diseases, especially because of their large and increasing diffusion, the predictive action translates into a particularly positive impact on the socio-economic front of the population.
General knowledge of the problem now leads us to take the main factors that cause lung diseases almost for granted. They range from environmental risk (smoking, exposure to pollutants, diet, infections, etc.) to individual risk (general predisposition to the disease). Acting on risk factors can contribute significantly to preventing the disease, or to intercepting it before it reaches its acute phase.
Predictive medicine is especially effective with respect to individual risk factors. By working with healthy people, it is possible to look for the vulnerabilities that could give rise to the disease, especially where there are signs of possible heredity, such as existing cases in the family.
Predictive medicine therefore has a probabilistic and individual character and requires a considerable computational contribution, which makes it a privileged field of action for bioinformatics.
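The probabilistic character of this reasoning can be illustrated with Bayes' rule. The sketch below updates a baseline disease risk after a positive result for some risk marker, given the marker's sensitivity and specificity. It is a didactic example only: the function name `posterior_risk` and every number used are hypothetical and are not taken from any clinical study.

```python
def posterior_risk(prior: float, sensitivity: float, specificity: float) -> float:
    """Probability of disease given a positive marker, by Bayes' rule.

    prior        -- baseline probability of the disease in the population
    sensitivity  -- P(marker positive | disease)
    specificity  -- P(marker negative | no disease)
    """
    true_positive = sensitivity * prior
    false_positive = (1.0 - specificity) * (1.0 - prior)
    return true_positive / (true_positive + false_positive)


# Hypothetical values, chosen only to show the mechanics of the update.
baseline = 0.01
updated = posterior_risk(baseline, sensitivity=0.9, specificity=0.9)
print(f"risk before: {baseline:.3f}, after a positive marker: {updated:.3f}")
```

With these invented figures the risk rises from 1% to roughly 8%: a clear increase, yet still far from certainty, which is why such estimates guide periodic diagnostics rather than replace them.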
There are a very large number of drugs on the market today, yet laboratories are constantly searching for new solutions. With a greater understanding of protein structures, however, many of the currently available drugs could prove effective against diseases for which they were not designed. The practical answer to this observation comes from the application of molecular biology, particularly the part that studies protein folding.
By comparing the three-dimensional structure of a drug with the altered protein structures in the organism, it would be possible to verify the fit that would allow the drug to counteract the adverse effects of the disease.
To design effective drugs, the protein designer uses software to visualize the 3D structure and to move towards an increasing level of personalization of the treatment. This complements the precision medicine considerations in the previous paragraph.
Both for the development of new drugs, and to verify the efficacy of existing ones, pharmacological research therefore needs important contributions from computational genomics, whose potential is constantly growing thanks to the progress of bioinformatics.