Apart from the use of microbes in metal ore extraction, only in one area has ‘mining’ had much effect on modern cellular biology, and that is the area of data mining. Data mining describes a suite of methods combining the intelligent storage, analysis and recognition of patterns in large data sets, for the purpose of turning data into information and then into knowledge. Data mining typically makes use of the methods of multivariate statistics and of machine learning to find these hidden patterns, which might then be exploited for intellectual or commercial benefit. It is worth noting the difference between statistics and machine learning. As summarised by Breiman, statistics starts with a hypothesis and assesses the goodness of fit of available data to that hypothesis; by contrast, machine learning starts with the data and finds the hypothesis that best fits the data, using methods of cross-validation to avoid over-fitting, in which a model describes the noise in one particular data set rather than the underlying pattern. This general distinction is similar to the inductive-deductive distinction in the philosophy of scientific reasoning.
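
To make the role of cross-validation concrete, here is a minimal sketch in Python. Everything in it is an assumption chosen for illustration rather than anything taken from Breiman: the synthetic data (a straight line plus Gaussian noise), the choice of k-nearest-neighbour regression as the family of hypotheses, and the function names. It compares the error measured on the training data with the error measured on held-out folds.

```python
import random

def knn_predict(train, x, k):
    """Predict y at x as the mean y of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mean_squared_error(train, test, k):
    """Mean squared error of a k-NN model built on `train`, scored on `test`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in test) / len(test)

def cross_validated_error(data, k, n_folds=5):
    """Average held-out error over n_folds disjoint folds."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    errors = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the one being held out.
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        errors.append(mean_squared_error(train, held_out, k))
    return sum(errors) / n_folds

# Synthetic data (an assumption for illustration): y = 2x + Gaussian noise.
random.seed(0)
data = [(x / 10, 2 * (x / 10) + random.gauss(0, 1)) for x in range(60)]
random.shuffle(data)

candidates = [1, 3, 5, 9, 15]
# Scoring a model on its own training data rewards memorising the noise.
training_error = {k: mean_squared_error(data, data, k) for k in candidates}
# Scoring on held-out folds estimates how well each hypothesis generalises.
cv_error = {k: cross_validated_error(data, k) for k in candidates}
best_k = min(cv_error, key=cv_error.get)

for k in candidates:
    print(f"k={k:2d}  training MSE={training_error[k]:.3f}  CV MSE={cv_error[k]:.3f}")
print("k chosen by cross-validation:", best_k)
```

With k = 1 each training point is its own nearest neighbour, so the training error is zero by construction, yet the held-out error is not. That gap is the over-fitting that cross-validation is there to expose.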

Data mining in biology has come into its own following the ability to generate highly multivariate data from relatively small sample numbers using ’omics technologies, and is by now well established (although the necessary skills are not yet in every biologist’s toolkit, something we at BBSRC hope to remedy). Much less established, however, though of equal if not greater importance, is the subject of text mining. Text mining describes a variety of technologies for the high-quality extraction of information from text. Given that the number of peer-reviewed papers in PubMed/Medline alone (they differ) is increasing at the rate of ca. 2 per minute, we are clearly going to need computerised methods to help us assimilate this material, so as to avoid the ‘balkanisation’ of thinking (and of literature citations) that is the typical response of individual scientists to the tsunami of papers they are not going to be able to read.

The processes of text mining, summarised in a short review I co-authored in 2006 with colleagues from the National Centre for Text Mining, involve (in order) information retrieval, information extraction and data/text mining itself. Some of these high-level steps can themselves involve sophisticated techniques such as natural language processing. (How does a computer deal sensibly with the clauses ‘Time flies like an arrow; fruit flies like a banana’…?!)
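
The three stages named above can be sketched as a toy pipeline in Python. This is only an illustration under stated assumptions: the four 'abstracts' are made-up sentences, the gene lexicon is a hand-picked set, and the function names are hypothetical. Real systems replace the keyword match and the dictionary lookup with far more capable retrieval and natural language processing.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus and lexicon, invented purely for illustration.
ABSTRACTS = [
    "HXK2 and MIG1 regulate glucose repression in yeast.",
    "Glucose repression in yeast requires MIG1 and SNF1.",
    "Arrow flight dynamics under time-varying wind.",
    "SNF1 phosphorylates MIG1 when glucose is scarce in yeast.",
]
GENE_LEXICON = {"HXK2", "MIG1", "SNF1"}

def retrieve(docs, query_terms):
    """Information retrieval: keep documents that mention every query term."""
    return [d for d in docs if all(t.lower() in d.lower() for t in query_terms)]

def extract_genes(doc):
    """Information extraction: pull out tokens found in the gene lexicon."""
    tokens = re.findall(r"[A-Za-z0-9]+", doc)
    return sorted({t for t in tokens if t in GENE_LEXICON})

def mine_cooccurrence(entity_lists):
    """Mining: count how often each pair of entities shares a document."""
    counts = Counter()
    for entities in entity_lists:
        counts.update(combinations(entities, 2))
    return counts

relevant = retrieve(ABSTRACTS, ["yeast"])          # step 1: retrieval
entities = [extract_genes(d) for d in relevant]    # step 2: extraction
pair_counts = mine_cooccurrence(entities)          # step 3: mining

for pair, n in pair_counts.most_common():
    print(pair, n)
```

Each stage consumes the output of the one before it, which is why the order matters: extraction is only as good as what retrieval hands it, and the mined patterns are only as good as the extracted entities.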

A number of other useful and pertinent reviews of text mining exist, including a book on text mining in biology and medicine, and papers by Krallinger & Valencia (2005), by Jensen et al. (2006), by Cohen & Hunter (2008), by Krallinger et al. (2008) and by Rzhetsky et al. (2008). The move towards Open Access publishing (the Gold Road) and structured digital abstracts will make text mining considerably easier, and a number of publishers are beginning to make their papers available marked up in sophisticated ways that permit advanced text mining – Project Prospect at the Royal Society of Chemistry being a particularly nicely done programme. The process works both ways, as text mining techniques can, for instance, be used to produce controlled vocabularies.

As illustrated in the recent community-based yeast metabolic network reconstruction, it is possible to annotate Systems Biology models (marked up in SBML – see www.sbml.org) with literature references that provide the necessary evidence for particular reaction properties. It is possible to effect this semi-automatically, and as a community effort. Together with the necessary data visualisation tools (an example of literature clustering is given in a previous paper, as are means of validating such clusters), I consider that text mining is going to be a highly important part of the biologist’s armoury. (I have been known to comment that ‘three months in the lab can save a whole afternoon on the computer’…)
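
As a rough indication of what such an annotation looks like, the sketch below uses Python's standard `xml.etree.ElementTree` module to attach a literature reference to a reaction element, in the RDF style that SBML annotations commonly use. It is not the procedure used in the yeast reconstruction: the reaction id, the metaid, the function name and the PubMed identifier `00000000` are all placeholders, the SBML namespace and the rest of the model are omitted for brevity, and the `identifiers.org` URI form is assumed.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BQBIOL = "http://biomodels.net/biology-qualifiers/"
ET.register_namespace("rdf", RDF)
ET.register_namespace("bqbiol", BQBIOL)

def annotate_with_pubmed(reaction, meta_id, pubmed_ids):
    """Attach 'is described by' literature references to a reaction element.

    Hypothetical helper: a real model would also carry the SBML namespace.
    """
    reaction.set("metaid", meta_id)
    annotation = ET.SubElement(reaction, "annotation")
    rdf = ET.SubElement(annotation, f"{{{RDF}}}RDF")
    description = ET.SubElement(
        rdf, f"{{{RDF}}}Description", {f"{{{RDF}}}about": f"#{meta_id}"}
    )
    described_by = ET.SubElement(description, f"{{{BQBIOL}}}isDescribedBy")
    bag = ET.SubElement(described_by, f"{{{RDF}}}Bag")
    for pmid in pubmed_ids:
        # One rdf:li per supporting paper; the URI form is an assumption.
        ET.SubElement(
            bag,
            f"{{{RDF}}}li",
            {f"{{{RDF}}}resource": f"http://identifiers.org/pubmed/{pmid}"},
        )
    return reaction

# Placeholder reaction and placeholder PubMed identifier (not a real citation).
reaction = ET.Element("reaction", {"id": "r_example"})
annotate_with_pubmed(reaction, "meta_r_example", ["00000000"])
xml_text = ET.tostring(reaction, encoding="unicode")
print(xml_text)
```

Because the evidence is stored as a machine-readable link rather than as free text, a text mining system can propose candidate references and a curator can accept or reject them, which is what makes the semi-automatic, community-based route practical.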

Overall, I am quite sure that these and related techniques are going to change the way that science is done and scientific knowledge acquired.
