Ambiguous abbreviations and acronyms are annoyances when it comes to text search and data mining. As a writer-editor, I was always taught to spell out the long form (LF) of a short form (SF) at first mention in a document so that the reader would know that when I mentioned EBV I was referring to Epstein-Barr virus rather than estimated blood volume. Hopefully, I do that on Sciencebase so that no one is confused, but not all authors give the LF for their SF, especially if they are writing in a niche where the readership is unlikely to be confused by an abbreviation.
Short forms of lengthy technical and jargon phrases are commonly made using one of a handful of general rules. Cutting off the end of a word gives us admin for administration or administrator, while first-letter initialisation makes an abdominal aortic aneurysm AAA. In chemical names it is quite useful to abbreviate using the initials of the word’s syllables, so benzodiazepine becomes BZD. There are also combination initialisations, such as ad libitum to ad lib, and other substitutions, so that primum atrial septal defect becomes ASD I. In the field of science the list is almost endless and the potential for overlap a sub-editor’s nightmare.
If an author were to mention AAA, would a reader outside that niche know they were referring to the American Automobile Association, US punk band Against All Authority, a type of electrical battery, the major professional association for anthropologists in America, an abdominal aortic aneurysm or any of dozens of other LFs for the SF AAA? More commonly, authors do give some form of LF for their SFs. They either spell out the phrase in parentheses or use a phrase like “stands for” after the short form is first mentioned, but not always. How can a data miner cope?
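To make the parenthetical convention concrete, here is a minimal Python sketch of the kind of naive pattern matching a data miner might start with. It is an illustration only, not any of the published tools: the function name parenthetical_pairs and the rule of matching one preceding word per SF letter are assumptions made for this example.

```python
import re

def parenthetical_pairs(text):
    """Return (SF, LF) pairs written as 'long form (SF)'.

    Naive assumption: each letter of the SF is the initial of one of the
    words immediately before the parenthesis.
    """
    pairs = []
    for match in re.finditer(r"\(([A-Z][A-Za-z0-9]{1,9})\)", text):
        sf = match.group(1)
        # Words before the parenthesis; hyphenated words are split apart.
        words = re.findall(r"[A-Za-z0-9]+", text[:match.start()])
        candidate = words[-len(sf):]
        if len(candidate) == len(sf) and all(
            word[0].lower() == letter.lower()
            for word, letter in zip(candidate, sf)
        ):
            pairs.append((sf, " ".join(candidate)))
    return pairs

text = "The patient had an abdominal aortic aneurysm (AAA) repaired."
print(parenthetical_pairs(text))  # → [('AAA', 'abdominal aortic aneurysm')]
```

A rule like this fails as soon as the author skips the parentheses, abbreviates by syllable (BZD) or drops a word, which is exactly the gap described above.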
Informatics researcher Min Song of the New Jersey Institute of Technology, in Newark, and Hongfang Liu of the Department of Biostatistics, Bioinformatics, and Biomathematics, at Georgetown University, Washington DC, have come up with a new way to analyse a document and to extract appropriate chunks of words from which the long forms and their corresponding short forms in biomedical text can be deduced. In general, Song and Liu’s work focuses on the discovery of knowledge in large volumes of natural language data such as blogs, medical notes and scientific publications. One aspect of this is the LF-SF conundrum.
Other researchers have tried to extract the SFs and the LFs from text with varying degrees of success in the past. Accuracy has been relatively high for extraction within specific niches, so that algorithms are available that can recognise SFs like 1H-NMR and determine that it means proton nuclear magnetic resonance spectroscopy. But the measure of success in this earlier work may be subject to the closed nature of the studies. The researchers say that their approach differs in that it could be more generally applicable because it incorporates lexical analysis techniques into supervised learning for extracting abbreviations.
Song and Liu add that their proposed technique, known as LFXtractor, uses noun chunking together with a distance metric to detect SF-LF pairs regardless of the presence of parenthetical expressions. “The distance-based matching method proposed is more scalable compared with traditional pattern-matching methods,” they say. “Given a sentence, the text chunker detects grammatical phrases including noun phrases in the sentence and the (SF, LF) pair detector identifies the corresponding LF for an SF given a list of LF candidates by computing a distance between the SF and its LF candidates.”
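The pair-detection step described in that quote can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not LFXtractor itself: the candidate list is typed in by hand to stand in for the output of a noun chunker, and the distance used here (one minus the similarity between the SF and the candidate's word initials) is a placeholder, since the metric the researchers actually use is not given here.

```python
import re
from difflib import SequenceMatcher

def initials(phrase):
    """First letter of each word, splitting on whitespace and hyphens."""
    words = re.split(r"[\s\-]+", phrase.strip())
    return "".join(word[0].lower() for word in words if word)

def distance(sf, candidate):
    """Hypothetical distance: 0.0 when the SF equals the candidate's initials."""
    similarity = SequenceMatcher(None, sf.lower(), initials(candidate)).ratio()
    return 1.0 - similarity

def best_long_form(sf, candidates):
    """Pick the LF candidate with the smallest distance to the SF."""
    return min(candidates, key=lambda candidate: distance(sf, candidate))

# Hand-written noun phrases standing in for a text chunker's output for:
# "Patients infected with Epstein-Barr virus, or EBV, often show
#  elevated antibody levels."
candidates = ["Patients", "Epstein-Barr virus", "elevated antibody levels"]
print(best_long_form("EBV", candidates))  # → Epstein-Barr virus
```

Note that no parentheses are needed: the SF is compared against every candidate phrase in the sentence, which is the property the distance-based approach is claimed to have over pattern matching.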
Song and Liu have trained and tested LFXtractor using PubMed queries and developed a web-based interface. They have carried out comparisons with other tools: ExtractAbbrev, a simple pattern-matching rule-based system; ALICE, a heuristic rule-based system; Acrophile; and collocation. LFXtractor scored higher than all the other approaches on precision, recall and F-measure (a calculated single-value combination of the former two). F-measure values for the various approaches were as follows: ExtractAbbrev (0.59), ALICE (0.62), Acrophile (0.54), collocation (0.63) and LFXtractor (0.68).
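For readers unfamiliar with the F-measure, the short Python sketch below shows how precision and recall are usually combined into one number. It assumes the common balanced form (beta = 1, the harmonic mean of the two); the exact weighting used in the study is not stated here, and the input values in the example are made up for illustration rather than taken from the paper.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta = 1.0 gives the balanced F-measure, 2PR / (P + R).
    """
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when nothing was found
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)

# Made-up values for illustration only.
print(f_measure(0.5, 0.5))  # → 0.5
print(round(f_measure(0.75, 0.6), 3))
```

Because it is a harmonic mean, the score is pulled towards the lower of the two inputs, so a tool cannot score well by being precise but missing most abbreviations, or vice versa.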
In follow-up work, the team will develop an abbreviation server that connects to the PubMed system and retrieves MEDLINE records from which abbreviations are then extracted and analysed using LFXtractor.
Min Song and Hongfang Liu (2010). LFXtractor: Text chunking for long form detection from biomedical text. International Journal of Functional Informatics and Personalised Medicine, 3 (2), 89-102