A Databasic Approach to Chemistry

By: David Bradley

There are almost too many resources for chemists hoping to unearth compound structures, organic reaction schemes, and even failed syntheses. David Bradley digs in to find out what can be mined.

When it comes to chemical facts, there is no better way to harness data than through a database. Across the web, CD-ROM and proprietary systems, countless resources are available, from selective, thematic databases such as those offered by the likes of Accelrys to the broad compilations from Beilstein and the Chemical Abstracts Service (CAS). Whatever the focus, relevance or breadth of coverage required, there is a rifle or a shotgun to choose from.

Accelrys - Cambridge-based Accelrys reckons cheminformatics is a key technology for modern chemistry and offers several solutions for fishing in the data streams. As with most other products, data storage capability provides the core. The Oracle-based information management system can be used for storing, searching and retrieving chemical structures, biological and chemical data, along with experimental data and registration information. With the likes of Accord for Excel and Access, Accelrys' desktop productivity tools give industry-standard tools from Microsoft some semblance of chemical savvy. In addition, the company has a range of chemistry databases, mostly reaction-based, which run on systems such as ISIS and Accord.

Accelrys offers access to the likes of the Royal Society of Chemistry's Methods in Organic Synthesis (MOS). This is a current awareness journal published monthly, which abstracts over 100 internationally recognised organic chemistry journals.

Accelrys' electronic version of MOS is a highly selective designer database, essentially picking up where Springer's ChemReact leaves off, in that it focuses on novel synthetic methods from the literature. Hence, it picks up on functional group changes, carbon-carbon bond-forming reactions, new reagents and synthons, enzymatic and biotransformations, and ways of introducing protecting groups and important chiral, or handed, centres into a structure. The database adds about 3,300 reactions each year, is updated quarterly and currently stands at over 33,000 indexed reactions going back to 1991. The Accelrys version is ISIS and Accord enabled.

Intriguingly, Accelrys also offers the chemist a glimpse of failure in the form of its recently launched Failed Reactions database. This unique compilation lets chemists know about reactions that either reached a dead end with no product or simply produced an entirely unexpected result. It tells them where the reactions were published and helps them avoid other people's mistakes. Coupled with MOS and used in conjunction with ChemReact, it lets you search for a particular target and dig out, from the many possible reaction schemes, the one with the best outcome and the fewest pitfalls. There are already thousands of Failed Reactions archived and Accelrys plans to add tens of thousands more over the next couple of years.

Beilstein - Crossfire Beilstein, a product of MDL, itself a subsidiary of Elsevier Science, offers an index of the chemical literature with structures, physical properties, reactions and literature citations for more than eight million compounds. Included are details on how to make each compound and how it behaves chemically; such information is critical to optimising a synthesis for known compounds and their analogues.

Also embedded in the database are physical properties, useful for identifying unknowns, as well as figuring out the relationship between structure and activity for a given compound. More recently, the publishers have added pharmacology, toxicology, and ecological chemistry information.

While Beilstein holds data on about eight million compounds, there are also over five million chemical reactions and 35 million associated chemical property and bioactivity records. These include data describing pharmacodynamics and environmental toxicology, transport, distribution, and fate, all of which is essential in drug discovery, for instance.

CAS - The Chemical Abstracts Service substance database, CAS Registry, on the other hand, has some 38,238,129 chemical substance registrations, at last count. This makes it the largest file of substance information in the world. It contains structures and chemical names and is an essential stop on the substance identification trail. The SciFinder system from CAS also opens a gateway to a wider world of online databases beyond the CAS Registry itself. Moreover, because each substance has its own unique CAS registry number, it is possible to cross-reference a substance through many public and private databases, chemical inventories and reference works. Cambridge Soft's web-based or standalone ChemFinder, which is part of its wider applications suite, for example, allows searches on various formats. Critically, simply typing in a CAS registry number will find the single compound you require, with no knowledge of its chemistry needed. Tap in 139755-83-2, and see what comes up.
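A registry number also carries its own sanity check: the final digit is a check digit computed from the others, so a mistyped number can often be caught before any database is queried. The following is a minimal sketch of that check in Python. The function name is made up for illustration and is not part of any CAS or ChemFinder product; it only tests the format and the check digit, not whether the number is actually assigned to a substance.

```python
def cas_check_digit_ok(cas_number: str) -> bool:
    """Return True if the string looks like a CAS registry number
    (digits-two digits-one digit) and its check digit is consistent.

    Illustrative helper only: it does not confirm that the number
    has actually been assigned to a substance.
    """
    parts = cas_number.strip().split("-")
    if len(parts) != 3:
        return False
    if not all(p.isascii() and p.isdigit() for p in parts):
        return False
    first, second, check = parts
    if not (2 <= len(first) <= 7) or len(second) != 2 or len(check) != 1:
        return False

    # Weight the digits before the check digit 1, 2, 3, ... counting
    # from the right, sum the products, and compare the sum modulo 10
    # with the check digit.
    body = first + second
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == int(check)


if __name__ == "__main__":
    for number in ("139755-83-2", "139755-83-3", "7732-18-5"):
        print(number, cas_check_digit_ok(number))
```

For the number quoted above, the weighted sum of 1, 3, 9, 7, 5, 5, 8, 3 works out to 172, whose last digit matches the trailing 2.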

The ChemFinder search service is available as a standalone product for accessing the proprietary databases sold by Cambridge Soft, but it is also offered in web server format. This may not seem particularly novel, as there are already countless search engines out there that can perform something similar. But Cambridge Soft reckons it has tightened up its chemical searching by working from a single master list of chemical compounds, so that users avoid the problems of misspelled "mehtyl" groups and of reconciling "aluminium" with "aluminum" compounds, and get their results quickly to boot.
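The general idea of resolving names against one master list can be sketched in a few lines of Python. This is not Cambridge Soft's method, which the company does not describe here. The master list, the spelling-variant table and the function names below are all invented for illustration, and the fuzzy matching uses the standard library's difflib rather than any chemistry-aware algorithm.

```python
import difflib

# Hypothetical master list: one canonical name per compound.
MASTER_LIST = ["methyl acetate", "ethyl acetate", "aluminium chloride"]

# Hypothetical table of spelling variants, mapped to the spelling
# used in the master list.
SPELLING_VARIANTS = {"aluminum": "aluminium"}


def normalise(name: str) -> str:
    """Lower-case the name and swap known spelling variants."""
    words = name.lower().split()
    return " ".join(SPELLING_VARIANTS.get(word, word) for word in words)


def lookup(name: str, cutoff: float = 0.85):
    """Return the closest master-list entry, or None if nothing is
    similar enough (cutoff is difflib's similarity ratio, 0 to 1)."""
    matches = difflib.get_close_matches(
        normalise(name), MASTER_LIST, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


if __name__ == "__main__":
    for query in ("mehtyl acetate", "Aluminum chloride", "unobtainium"):
        print(query, "->", lookup(query))
```

A real system would more likely key the master list on structures or registry numbers than on name strings, but the sketch shows why a single canonical list makes typos and regional spellings tractable.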

If you're after detailed knowledge of chemical behaviour, then Springer's ChemReact will provide several clues, to say the least. It carries data on some 300,000 reactions abstracted from the chemical literature of 1974-1991, which obviously excludes many of the recent developments in chemical synthesis but nevertheless provides the stock in trade for the vast majority of reaction schemes that a chemist would employ. The database carries the reactant and product structures, necessary solvents, required reagents and catalysts, and will also give you an idea of how high a yield you might expect and what side products may form.

Tripos - Tripos offers what it terms "discovery research software" for pharmaceutical and biotechnology researchers. Like other companies, it helps users to bring together chemical data, structure searching and molecular analysis, all within either a single program or a suite of programs. Tripos' databases can thus be used to track down commercially available compounds, molecules with the same or similar structural features, or to identify physically or biologically related compounds or ones that match a particular pharmaceutical model. It is also possible to narrow the search for a new lead by removing already-patented compounds or compounds with known side-effects from a search request.

Tripos produces a database, LeadQuest, of diverse and pure compounds that emerge from the recent advances in combinatorial chemistry and high-throughput screening. It contains some 80,000 compounds and is still growing. The company uses its proprietary ChemSpace decision-making technology to design and select novel compounds at a rate of two trillion per hour, allowing it to sieve out biologically relevant molecules very rapidly from the vast numbers of possible structures.

Tripos also makes available a special edition of the Chapman & Hall Databases in several configurations. The Dictionary of Pharmacological Agents, Dictionary of Organic Compounds, Dictionary of Natural Products, Dictionary of Inorganic and Organometallic Compounds, the US National Cancer Institute's database of structures tested by NCI for carcinogenic activity and the Derwent World Drug Index (WDI) are possible data-mining sources. Tripos offers a 2D structure database that carries associated property data, a 3D coordinates version with no property data, and a complete version, likely the most useful but expensive option, that includes 2D and 3D data and each compound's properties. 3D coordinates are not, however, available for the Dictionary of Inorganic and Organometallic Compounds. The chemical structures within the databases are provided as pre-built and pre-indexed Unity format databases but are also in a format suitable for loading into Oracle Version 7.

NIST - The US government also plays a role in the provision of chemical data to the community through its National Institute of Standards and Technology (NIST) Chemistry WebBook. This freely available resource carries information on about 30,000 compounds, covering formula, name, the ubiquitous CAS registry number, structure, and physical properties such as ion energetics, vibrational and electronic energies and molecular weight. As with all the best databases, it is fully indexed and you can search on chemical formula and, of course, on structure or sub-structure. Simply sketch your molecule in the box; many standard rings are included in the templates dialog box to help. This feature is certainly not unique, and indeed is not the most sophisticated implementation of structure searching, but it provides a quick and easy way to get to a molecule of interest while on the Web.

It is possible to source data from many information arenas: patents, the scientific literature and meetings, commercial sources and companies, the Web, and software companies. The scale is huge and access is easier than ever before. Structure searching, reaction type investigation, and even simple registry number retrieval are all possible with a plethora of tools. There are, of course, many other smaller companies offering one-offs and small specialist ranges of data, along with the software to handle them. Indeed, most of the companies mentioned here provide many more products than there is space to include. There is simply no excuse for not finding the information you need, whether it is for a specific individual compound or a range of targets.


The Beilstein literature abstracts are available free online at www.chemweb.com/databases/belabs. The full product range, which includes physical and chemical data, is available from MDL at www.beilstein.com. www.Google.com provides a quick inroad to many other database companies: directory.google.com/Top/Science/Chemistry/Software/Database/