The introduction to most scientific papers will probably contain something along the lines of “It is widely accepted that…”, followed by the citation of a few more or less recent reviews of the topic. Last week’s blog noted the frequency of mis-citation, and this leads, surprisingly naturally, to the question ‘which reviews or papers might one then cite to bolster a view of present-day knowledge on a subject, and on what basis are these chosen?’ A partial linkage between these two issues (mis-citation and choice of material to cite) comes via what Merton (1968) (with a follow-up in 1988) called the Matthew Effect, on the basis of the lines in Matthew’s Gospel (25:29) that read “For unto every one that hath shall be given, and he shall have abundance: but from him that hath not shall be taken away even that which he hath”. As phrased by Goldstone, one variant is that “In scientific journals, and at scientific conferences, new articles and papers by already-prestigious scientists usually receive far more attention than articles by scientists still on the way up, regardless of the intrinsic merit of such contributions”. Strevens suggests a mechanistic explanation of why this happens. (Note that while greater longevity may increase one’s fame, the effect may even work both ways – winning a Nobel Prize can apparently extend one’s longevity (statistically) by 1-2 years!)

The Matthew effect applies to journals and papers too – a highly cited journal or paper is likely to attract more citations (and mis-citations), probably for the simple psychological reasoning that ‘if so many people cite it, it must be a reasonable paper to cite’ (and such a paper is, by definition, more likely to appear in the reference list of another paper). Clearly that reasoning can be applied whether the paper has been read or not. Simkin and Roychowdhury (2005 and 2007) note that a clear pointer to the citation of a paper one has not read is if it copies a mis-citation, and an analysis of the frequency of such serial mis-citations allows one to estimate, statistically, what fraction of cited articles have actually been read – at least at or near the time of writing a paper – by the citing author. Their analyses show (at least for certain physics papers) that “about 70-90% of scientific citations are copied from the lists of references used in other papers”, and that a typical device is to start with a few recent papers and then harvest citations from their reference lists. Some aspects of this tendency in bibliometrics, especially with highly cited papers, can be detected from the power law form of the distribution of citation numbers, as in the Laws of Bradford and Lotka that I discussed before. Of course the mindless propagation of errors without checking sources properly is hardly confined to Science – a famous recent example with spoof data showed how some journalists simply copied obituary material from Wikipedia!
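The logic of the mis-citation argument can be illustrated with a toy simulation (this is my own illustrative sketch, not Simkin and Roychowdhury’s actual estimator; the parameter names and probabilities are purely hypothetical). Each new citer either reads the original paper – occasionally introducing a fresh mis-citation of their own – or simply copies an earlier citation verbatim, mis-citation and all. Repeated copies of the same misprint are then the fingerprint of citation-without-reading:

```python
import random

def simulate_citations(n_papers=2000, p_read=0.2, p_new_misprint=0.05, seed=42):
    """Toy model of citation copying. Each citation is recorded as 0
    (correct) or as a misprint id; copiers duplicate an earlier entry,
    so a misprint id appearing more than once marks a copied citation."""
    rng = random.Random(seed)
    citations = []          # 0 = correct citation, n > 0 = misprint number n
    next_misprint = 1
    for _ in range(n_papers):
        if rng.random() < p_read or not citations:
            # This citer reads the source; occasionally makes a fresh error.
            if rng.random() < p_new_misprint:
                citations.append(next_misprint)
                next_misprint += 1
            else:
                citations.append(0)
        else:
            # This citer copies a random earlier citation, errors and all.
            citations.append(rng.choice(citations))
    return citations

cites = simulate_citations()
misprints = [c for c in cites if c != 0]
copies = len(misprints) - len(set(misprints))
print(f"{len(misprints)} misprinted citations, of which {copies} are copies")
```

In a model like this, comparing the number of *distinct* misprints with the number of *repeated* ones gives a handle on how many citers went back to the source – the intuition behind the published estimate, if not its detailed statistics.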

Modern Web-based data mining tools and databases allow one to find the duplication of scientific (textual) content in papers fairly easily. Scientists citing papers they have not read might do well to remember this. Out of cite, out of mined, one might say <ahem>.
