Much has been written about the extent of the contribution that is expected of the author of a scientific paper, and I shall not add to it here, since my initial focus is on identifying the people who have been awarded (apparent) coauthorship of an article. Last week I wrote about text mining, which might be one general approach by which to find out. As with most such activities (including those involving machine learning), this is not without its hazards, not least that of identifying authors uniquely. The accurate identification of the authors of scientific publications is one of the most important issues facing us as we begin to develop digital analyses of citation networks, coauthorship networks, scientific productivity, bibliometrics for purposes of the Research Excellence Framework (REF), and the like.

It is not easy even for humans. Witness the paper cited in PubMed as being authored by Mukherjee, B., Chir, B., Moon, J. C., Sandrasagra, M. & Pennell, D. J. – it is rather unlikely that the second author contributed much to this article, since 'B. Chir' is in fact the first author's degree in surgery! Bear in mind that these articles are logged manually rather than automatically (and the eponymous Professor B. Chir has, according to PubMed, apparently authored no fewer than 13 articles, albeit – perhaps unsurprisingly, as a suffix – never achieving first author status…). Dr D. Phil has by the same token authored 22, including one on phage therapy; this is presumably the first author's doctorate from the University of Oxford. To be equally (un)fair to other bibliographic providers, at the time of writing the UK version of Web of Knowledge finds 27 articles by D. Phil, while Scopus finds no fewer than 79 – including some by the doubly doctored D. Litt et Phil. Dr D Ph can also be found as a PubMed author, and the towns Ann Arbor and Milton Keynes have apparently acted as authors on many occasions!
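By way of illustration, here is a minimal sketch (in Python) of how one might strip out degree abbreviations that a naive parse of a PubMed-style author list mistakes for people. The suffix list is my own invention for the example, and is far from exhaustive:

```python
# A minimal sketch: dropping degree abbreviations that a naive parse of a
# PubMed-style author list mistakes for people. The suffix list is an
# assumption for illustration, not an exhaustive resource.

DEGREE_SURNAMES = {"chir", "phil", "litt", "ph", "med"}

def plausible_authors(authors):
    """Keep only names whose 'surname' is not a known degree abbreviation.

    authors: an iterable of 'Surname, Initials' strings.
    """
    kept = []
    for name in authors:
        surname = name.split(",")[0].strip().rstrip(".").lower()
        if surname in DEGREE_SURNAMES:
            continue  # e.g. 'Chir, B.' is a degree in surgery, not a person
        kept.append(name)
    return kept

print(plausible_authors(
    ["Mukherjee, B.", "Chir, B.", "Moon, J. C.",
     "Sandrasagra, M.", "Pennell, D. J."]))
# -> ['Mukherjee, B.', 'Moon, J. C.', 'Sandrasagra, M.', 'Pennell, D. J.']
```

Of course, such a blacklist would also delete any genuine Dr Phil, which is exactly why the problem is hard.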

Things inevitably become much harder (and less accurate) when one tries to automate these kinds of processes more or less completely, as is presumably the case with Google Scholar. This returned (when I wrote this) 636 articles authored by D. Phil, 646 by M. Phil, 389 by B. Chir, 1,410 by D Ph, 14,500 by Ph. D and 29,000 by an author called ‘Prof’. But (as mentioned in Duncan Hull’s recent blog) it is with an author known as Forgotten Password that we find a particularly high level of productivity – 63,300 papers, in fact! I have no idea of the algorithms behind Google Scholar, but possibly they need a little tweaking here. Jacsó finds “suggested authors from a set of purportedly 2,9110,000 (sic) records on the topic of risk factor evaluation with the following names: P Population, R Evaluation, M Data, R Findings and M Results”.
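Again purely illustratively – I repeat that I have no idea what Google Scholar actually does – a crude plausibility filter for scraped 'author' names might look something like this; both stop-lists are assumptions made up for the example:

```python
# Illustrative only: a crude plausibility filter for scraped author names.
# The stop-lists below are assumptions for this example, not real resources.

UI_ARTEFACTS = {"forgotten password", "prof", "login"}
PLACE_NAMES = {"ann arbor", "milton keynes"}
DEGREES = {"dphil", "mphil", "bchir", "phd", "dph"}

def looks_like_an_author(name):
    lowered = " ".join(name.lower().split())      # normalise whitespace
    if lowered in UI_ARTEFACTS | PLACE_NAMES:
        return False                              # web-page or address residue
    if lowered.replace(".", "").replace(" ", "") in DEGREES:
        return False                              # a degree, not a person
    return True

for candidate in ["D. Phil", "Forgotten Password", "Moon, J. C.", "Ann Arbor"]:
    print(candidate, "->", looks_like_an_author(candidate))
# D. Phil -> False; Forgotten Password -> False;
# Moon, J. C. -> True; Ann Arbor -> False
```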

It is even worse with foreign languages, and not only with authors: in one famous case, by dint of the unqualified use of an automatic translator, a Chinese restaurateur managed to name his dining hall ‘Translate Server Error’. Mis-citation is also likely to become more common as authors get lazier and propagate each other’s errors. The current record is probably held by attempts to cite the paper by Oliver Lowry and colleagues that used a modified Folin-Ciocalteu reagent to estimate protein concentrations via their tyrosine content. While almost none of the citing papers will have followed the actual protocol published therein, ca 65,500 do at least manage the correct volume (193) and first page (265); WoK, however, finds attempts to mis-cite that paper in more than 800 different ways, with some variants (such as the one using starting page 256) attracting considerably more citations (1,461 at the time of writing) than most papers achieve as correct ones! One day, Digital Object Identifiers (DOIs) will have the potential to fix these kinds of problems, although not necessarily all the legacy ones.
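For the record, here is a hedged sketch of how one might count such variants, assuming citations have already been parsed into (journal, volume, first page) tuples; the near-miss heuristic is my own, not WoK's:

```python
# A hedged sketch of counting mis-citation variants, assuming citations
# have already been parsed into (journal, volume, first_page) tuples.
# The near-miss heuristic is an assumption, not how WoK actually does it.

from collections import Counter

CANONICAL = ("j biol chem", 193, 265)   # Lowry et al. (1951)

def classify(cite):
    """Label a citation 'correct', 'mis-citation', or 'other'."""
    journal, volume, page = cite
    if cite == CANONICAL:
        return "correct"
    # Same journal with the volume or the page still matching suggests a
    # corrupted copy of the canonical reference rather than another paper.
    if journal == CANONICAL[0] and (volume == CANONICAL[1] or page == CANONICAL[2]):
        return "mis-citation"
    return "other"

cites = [
    ("j biol chem", 193, 265),   # correct
    ("j biol chem", 193, 256),   # transposed digits in the page
    ("j biol chem", 139, 265),   # transposed digits in the volume
    ("nature", 171, 737),        # a different paper altogether
]
print(Counter(classify(c) for c in cites))
# Counter({'mis-citation': 2, 'correct': 1, 'other': 1})
```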

Of course I am being mischievous here, and the wonderful online digital libraries and databases do offer much potential for wordplay. However, these kinds of issues are exactly why we made so much of semantics in the yeast community consensus network reconstruction, and the problem of naming conventions is likely to be even more acute in the case of human metabolic networks.
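A toy illustration of what that semantic discipline buys: resolving the many free-text names of a single metabolite onto one identifier. The synonym table below is a made-up fragment for the example; a real reconstruction would draw on a full controlled vocabulary such as ChEBI:

```python
# A toy illustration of semantic naming: resolving free-text metabolite
# names to a single identifier. The synonym table is a made-up fragment;
# a real reconstruction would draw on a full resource such as ChEBI.

SYNONYMS = {
    "d-glucose": "CHEBI:17634",
    "dextrose": "CHEBI:17634",
    "grape sugar": "CHEBI:17634",
}

def canonical_id(name, table=SYNONYMS):
    """Map a free-text name to one identifier, or None if unrecognised."""
    return table.get(" ".join(name.lower().split()))

for name in ["D-Glucose", "dextrose", "glucose 6-phosphate"]:
    print(name, "->", canonical_id(name))
# D-Glucose -> CHEBI:17634
# dextrose -> CHEBI:17634
# glucose 6-phosphate -> None
```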
