The miners strike again – but these are text miners…
Apart from the use of microbes in metal ore extraction, only in one area has ‘mining’ had much effect on modern cellular biology, and that is the area of data mining. Data mining describes a suite of methods combining the intelligent storage, analysis and recognition of patterns in large data sets, for the purposes of turning data into information and then into knowledge. Data mining typically makes use of the methods of multivariate statistics and of machine learning to find these hidden patterns, which might then be exploited for intellectual or commercial benefit. It is worth noting the difference between statistics and machine learning. As summarised by Breiman, statistics starts with a hypothesis and assesses the goodness of fit of available data to that hypothesis; by contrast, machine learning starts with the data and finds the hypothesis that best fits the data, using methods of cross-validation to avoid over-fitting (in which a model describes the noise in the data it was trained on and so generalises poorly to new data). This general distinction is similar to the inductive-deductive distinction in the philosophy of scientific reasoning.
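To make the role of cross-validation concrete, here is a minimal Python sketch. It is an illustration only, not a method taken from Breiman or from any of the papers cited here: the data are synthetic, the model (a k-nearest-neighbour average) is chosen purely because it is short to write, and the names `knn_predict`, `mse` and `cross_validated_mse` are invented for the example.

```python
import random

def knn_predict(train, x, k):
    """Predict y at x as the mean y of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, test, k):
    """Mean squared error of a k-NN model built on `train`, scored on `test`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in test) / len(test)

def cross_validated_mse(data, k, folds=5):
    """Average held-out error over `folds` disjoint test sets."""
    errors = []
    for i in range(folds):
        test = data[i::folds]                                  # every folds-th point
        train = [p for j, p in enumerate(data) if j % folds != i]
        errors.append(mse(train, test, k))
    return sum(errors) / folds

# Synthetic data (an assumption for illustration): a straight line plus noise.
rng = random.Random(0)
data = [(x / 10, 2.0 * (x / 10) + rng.gauss(0, 1.0)) for x in range(60)]
rng.shuffle(data)

candidates = [1, 3, 5, 9, 15]
# Error measured on the same data the model was built from.
training_error = {k: mse(data, data, k) for k in candidates}
# Error measured on data the model has not seen.
held_out_error = {k: cross_validated_mse(data, k) for k in candidates}
best_k = min(candidates, key=held_out_error.get)
```

With k = 1 the model reproduces its training data exactly, so its training error is zero, yet its held-out error is not. Choosing the model by `held_out_error` rather than `training_error` is the guard against over-fitting that the paragraph above refers to.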
Data mining in biology has come into its own following the ability to generate highly multivariate data from relatively small sample numbers using ’omics technologies, and is by now well established (albeit the necessary skills are not yet in every biologist’s toolkit, something we at BBSRC hope to remedy). Much less established, however, though of equal if not greater importance, is the subject of text mining. Text mining describes a variety of technologies for the high-quality extraction of information from text. Given that the number of peer-reviewed papers in PubMed/Medline alone (they differ) is increasing at the rate of ca. 2 per minute, we are absolutely going to need computerised methods to help us assimilate this material, so as to avoid the ‘balkanisation’ of thinking (and of literature citations) that is the typical response of individual scientists to the tsunami of papers they are not going to be able to read.
The processes of text mining, summarised in a short review I co-authored in 2006 with colleagues from the National Centre for Text Mining, involve (in order) information retrieval, information extraction and data/text mining itself. Some of these high-level steps can themselves involve sophisticated techniques such as natural language processing. (How does a computer deal sensibly with the clauses ‘Time flies like an arrow; fruit flies like a banana’…?!)
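The three stages can be sketched in a few lines of Python. This is a toy under stated assumptions, not the pipeline described in the review: the ‘abstracts’ are invented sentences (they make no factual claims about the genes named), the regular expression for gene-like names is a deliberately crude stand-in for real named-entity recognition, and the function names `retrieve`, `extract` and `mine` are made up for the example.

```python
import re
from collections import Counter
from itertools import combinations

# Toy corpus: invented sentences standing in for abstracts (an assumption).
corpus = [
    "HXK1 and PFK1 are both induced during growth on glucose.",
    "Deletion of PFK1 alters flux through glycolysis.",
    "The weather in Manchester was wet.",
    "HXK1 interacts with PFK1 in yeast extracts.",
]

def retrieve(docs, query):
    """Information retrieval: keep documents mentioning any query term."""
    terms = [t.lower() for t in query]
    return [d for d in docs if any(t in d.lower() for t in terms)]

# Hypothetical pattern for yeast-style gene names: three capitals plus digits.
GENE = re.compile(r"\b[A-Z]{3}\d+\b")

def extract(doc):
    """Information extraction: the set of gene-like names found in a document."""
    return set(GENE.findall(doc))

def mine(entity_sets):
    """Mining: count how often each pair of entities is mentioned together."""
    pairs = Counter()
    for entities in entity_sets:
        pairs.update(combinations(sorted(entities), 2))
    return pairs

relevant = retrieve(corpus, ["glucose", "glycolysis", "yeast"])
cooccurrence = mine(extract(d) for d in relevant)
```

Each stage narrows the material for the next: retrieval discards the irrelevant document, extraction reduces each remaining document to the entities it mentions, and mining looks for patterns (here, simple co-mention counts) across the whole set. Real systems replace each of these steps with something far more sophisticated, which is where natural language processing comes in.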
A number of other useful and pertinent reviews of text mining exist, including a book on text mining in biology and medicine, and papers by Krallinger & Valencia (2005), by Jensen et al. (2006), by Cohen & Hunter (2008), by Krallinger et al. (2008) and by Rzhetsky et al. (2008). The move towards Open Access publishing (the Gold Road) and structured digital abstracts will make text mining considerably easier, and a number of publishers are beginning to make their papers available marked up in sophisticated ways that permit advanced text mining – Project Prospect at the Royal Society of Chemistry being a particularly nicely done programme. The process works both ways, as text mining techniques can, for instance, be used to produce controlled vocabularies.
As illustrated in the recent community-based yeast metabolic network reconstruction, it is possible to annotate Systems Biology models (marked up in SBML – see www.sbml.org) with literature references that provide the necessary evidence for particular reaction properties. It is possible to effect this semi-automatically, and as a community effort. Together with the necessary data visualisation tools (an example of literature clustering is given in a previous paper, as are means of validating such clusters), I consider that text mining is going to be a highly important part of the biologist’s armoury. (I have been known to comment that ‘three months in the lab can save a whole afternoon on the computer’…)
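As a rough idea of what such an annotation looks like, the sketch below attaches a literature reference to a reaction in an SBML file using only Python's standard library. It is not the tooling used for the yeast reconstruction. The model fragment is invented, the PubMed identifier is a placeholder, the function name `add_literature_reference` is hypothetical, and the `urn:miriam:pubmed:` form of the resource URI is assumed here as the MIRIAM convention of the time (other URI schemes exist).

```python
import xml.etree.ElementTree as ET

SBML = "http://www.sbml.org/sbml/level2/version4"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BQBIOL = "http://biomodels.net/biology-qualifiers/"
ET.register_namespace("", SBML)
ET.register_namespace("rdf", RDF)
ET.register_namespace("bqbiol", BQBIOL)

# Minimal, invented model fragment (an assumption): one reaction with a metaid.
MODEL = f"""<sbml xmlns="{SBML}" level="2" version="4">
  <model id="toy">
    <listOfReactions>
      <reaction id="r_0001" metaid="meta_r_0001" name="hexokinase"/>
    </listOfReactions>
  </model>
</sbml>"""

def add_literature_reference(root, reaction_id, pubmed_id):
    """Attach an isDescribedBy annotation citing a PubMed record to a reaction.

    Returns True if the reaction was found and annotated, False otherwise.
    """
    for reaction in root.iter(f"{{{SBML}}}reaction"):
        if reaction.get("id") != reaction_id:
            continue
        annotation = ET.SubElement(reaction, f"{{{SBML}}}annotation")
        rdf = ET.SubElement(annotation, f"{{{RDF}}}RDF")
        description = ET.SubElement(rdf, f"{{{RDF}}}Description")
        # The RDF description points back at the element via its metaid.
        description.set(f"{{{RDF}}}about", "#" + reaction.get("metaid"))
        qualifier = ET.SubElement(description, f"{{{BQBIOL}}}isDescribedBy")
        bag = ET.SubElement(qualifier, f"{{{RDF}}}Bag")
        item = ET.SubElement(bag, f"{{{RDF}}}li")
        item.set(f"{{{RDF}}}resource", f"urn:miriam:pubmed:{pubmed_id}")
        return True
    return False

root = ET.fromstring(MODEL)
# "00000000" is a placeholder, not a real PubMed identifier.
annotated = add_literature_reference(root, "r_0001", "00000000")
annotated_xml = ET.tostring(root, encoding="unicode")
```

The semi-automatic part would be a text mining step that proposes candidate PubMed identifiers for each reaction, leaving a curator to accept or reject them before a function like this one writes them into the model.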
Overall, I am quite sure that these and related techniques are going to change the way that science is done and scientific knowledge acquired.
- Ananiadou, S. & McNaught, J. (2006). Text mining in biology and biomedicine. Artech House, London
- Ananiadou, S., Kell, D. B. & Tsujii, J.-i. (2006). Text Mining and its potential applications in Systems Biology. Trends Biotechnol 24, 571-579
- Breiman, L. (2001). Statistical modeling: The two cultures. Stat Sci 16, 199-215
- Broadhurst, D. & Kell, D. B. (2006). Statistical strategies for avoiding false discoveries in metabolomics and related experiments. Metabolomics 2, 171-196
- Cohen, K. B. & Hunter, L. (2008). Getting started in text mining. PLoS Comput Biol 4, e20
- Handl, J., Knowles, J. & Kell, D. B. (2005). Computational cluster validation in post-genomic data analysis. Bioinformatics 21, 3201-3212
- Harnad, S., Brody, T., Vallieres, F., Carr, L., Hitchcock, S., Gingras, Y., Oppenheim, C., Hajjem, C. & Hilf, E. R. (2008). The access/impact problem and the green and gold roads to open access: An update. Serials Review 34, 36-40
- Herrgård, M. J., Swainston, N., Dobson, P., Dunn, W. B., Arga, K. Y., Arvas, M., Blüthgen, N., Borger, S., Costenoble, R., Heinemann, M., Hucka, M., Le Novère, N., Li, P., Liebermeister, W., Mo, M. L., Oliveira, A. P., Petranovic, D., Pettifer, S., Simeonidis, E., Smallbone, K., Spasic, I., Weichart, D., Brent, R., Broomhead, D. S., Westerhoff, H. V., Kirdar, B., Penttilä, M., Klipp, E., Palsson, B. Ø., Sauer, U., Oliver, S. G., Mendes, P., Nielsen, J. & Kell, D. B. (2008). A consensus yeast metabolic network obtained from a community approach to systems biology. Nature Biotechnol. 26, 1155-1160
- Hucka, M., Finney, A., Sauro, H. M., Bolouri, H., Doyle, J. C., Kitano, H., Arkin, A. P., Bornstein, B. J., Bray, D., Cornish-Bowden, A., Cuellar, A. A., Dronov, S., Gilles, E. D., Ginkel, M., Gor, V., Goryanin, I. I., Hedley, W. J., Hodgman, T. C., Hofmeyr, J. H., Hunter, P. J., Juty, N. S., Kasberger, J. L., Kremling, A., Kummer, U., Le Novère, N., Loew, L. M., Lucio, D., Mendes, P., Minch, E., Mjolsness, E. D., Nakayama, Y., Nelson, M. R., Nielsen, P. F., Sakurada, T., Schaff, J. C., Shapiro, B. E., Shimizu, T. S., Spence, H. D., Stelling, J., Takahashi, K., Tomita, M., Wagner, J. & Wang, J. (2003). The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 19, 524-531
- Jensen, L. J., Saric, J. & Bork, P. (2006). Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet 7, 119-129
- Kell, D. B. (2004). Metabolomics and systems biology: making sense of the soup. Curr. Op. Microbiol. 7, 296-307
- Kell, D. B. & Oliver, S. G. (2004). Here is the evidence, now what is the hypothesis? The complementary roles of inductive and hypothesis-driven science in the post-genomic era. Bioessays 26, 99-105
- Kostoff, R. N. (2002). Overcoming specialization. Bioscience 52, 937-941
- Krallinger, M. & Valencia, A. (2005). Text-mining and information-retrieval services for molecular biology. Genome Biol 6, 224
- Krallinger, M., Valencia, A. & Hirschman, L. (2008). Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol 9 Suppl 2, S8
- Rzhetsky, A., Seringhaus, M. & Gerstein, M. (2008). Seeking a new biology through text mining. Cell 134, 9-13
- Seringhaus, M. & Gerstein, M. (2008). Manually structured digital abstracts: a scaffold for automatic text mining. FEBS Lett 582, 1170
- Spasić, I., Schober, D., Sansone, S.-A., Rebholz-Schuhmann, D., Kell, D. B. & Paton, N. (2008). Facilitating the development of controlled vocabularies for metabolomics with text mining. BMC Bioinformatics 9, S5