Gathering Data
Introduction
Gathering data, from a biochemical standpoint, can at first be a daunting task for an inexperienced user. Databases containing information such as chemical structures, gene sequences and protein structures have flourished in recent years. Several of these databases provide APIs that let scripts access and gather the data remotely. Furthermore, scientists have developed libraries whose ready-made functions make the servers easier to access. Here we aim to document several of our favorite databases that support Python scripting and to present the libraries we use. We will try to complement some of the existing libraries with scripts that facilitate database access, so that programmers can quickly get working code. We do not claim to be the first to list such databases, and we will punctuate the document with existing articles to expand the reader's knowledge.
Chemistry
Here we describe the databases that we found to be the most complete and whose data are the most straightforward to access.
From previous authors
- Sixty-Four Free Chemistry Databases: The name says it all. This page describes most of the known chemical databases.
- A review from the Queen Mary University of London
PubChem
The chemical database par excellence, PubChem provides a myriad of chemical structures and properties. It comes with a fairly intuitive Python library, PubChemPy, for scripting data access.
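As a minimal sketch of scripted access, the snippet below looks up compounds by name with PubChemPy and keeps a few fields per hit. It assumes the `pubchempy` package is installed and that the network is reachable; the helper names `compound_summary` and `lookup_by_name` are ours and purely illustrative.

```python
def compound_summary(compound):
    """Collect a few commonly used fields from a PubChemPy Compound-like object."""
    return {
        "cid": compound.cid,
        "formula": compound.molecular_formula,
        "weight": compound.molecular_weight,
        "iupac_name": compound.iupac_name,
    }


def lookup_by_name(name):
    """Query PubChem by compound name (needs pubchempy and network access)."""
    import pubchempy as pcp  # lazy import: compound_summary works without it

    # get_compounds returns a list of Compound objects matching the name.
    return [compound_summary(c) for c in pcp.get_compounds(name, "name")]
```

Calling `lookup_by_name("aspirin")` should then return one dictionary per matching compound.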
ZINC
ZINC (ZINC Is Not Commercial, a nicely recursive acronym) is a database gathering information on a large number of commercially available compounds. Each compound entry comes with links to suppliers and, sometimes, physical, chemical and biological characteristics. The compounds can be downloaded directly as optimized 3D coordinates for virtual docking applications. The database can be accessed with the smilite package.
- Website
- Review of the database functionalities
- The smilite library
Note: I remember this package being impractical; check whether I had scripts for this.
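For batch lookups, a small wrapper around smilite could look like the sketch below. It assumes the `smilite` package is installed and that its `get_zinc_smile` function returns a SMILES string for a ZINC ID, or an empty value when the lookup fails; `collect_smiles` and `fetch_with_smilite` are hypothetical helper names of our own.

```python
def collect_smiles(zinc_ids, fetch):
    """Map each ZINC ID to its SMILES string, skipping IDs that yield nothing.

    `fetch` is any callable taking a ZINC ID and returning a SMILES string
    (or an empty value when the lookup fails).
    """
    smiles_by_id = {}
    for zinc_id in zinc_ids:
        smiles = fetch(zinc_id)
        if smiles:
            smiles_by_id[zinc_id] = smiles
    return smiles_by_id


def fetch_with_smilite(zinc_id):
    """Look up one ZINC ID online (needs the smilite package and network access)."""
    import smilite  # lazy import: only required for real lookups

    return smilite.get_zinc_smile(zinc_id)
```

Passing the fetcher as an argument keeps the collection logic testable offline; a real run would be `collect_smiles(["ZINC01234567"], fetch_with_smilite)`.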
ChemSpider
ChemSpider offers information on 57 million structures gathered from 518 different sources (as of July 2016). The database is owned by the Royal Society of Chemistry (RSC) and has received awards for the quality of its information. A wide range of information is freely available for each molecular entry. The database can be accessed using the ChemSpiPy wrapper.
- ChemSpider website
- The review on the database structure
- The ChemSpiPy library
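A search through ChemSpiPy might be sketched as follows. This assumes ChemSpiPy 2.x (where compounds expose `record_id`, `common_name` and `molecular_formula`) and a personal API key obtained from the RSC; `search_compounds` and `make_client` are illustrative names, not part of the library.

```python
def search_compounds(client, query, limit=5):
    """Return (record_id, common_name, formula) for the first `limit` hits.

    `client` is expected to behave like chemspipy.ChemSpider: its
    `search(query)` method returns an iterable of compound objects.
    """
    hits = []
    for index, compound in enumerate(client.search(query)):
        if index >= limit:
            break
        hits.append(
            (compound.record_id, compound.common_name, compound.molecular_formula)
        )
    return hits


def make_client(api_key):
    """Build a real ChemSpider client (needs chemspipy and a valid API key)."""
    from chemspipy import ChemSpider  # lazy import: only needed for real queries

    return ChemSpider(api_key)
```

A real query would then read `search_compounds(make_client("<your API key>"), "glucose")`.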
Protein Structures
In this part we will describe ways not only to gather PDB files and other files containing crystal structures, but also resources that let you perform a BLAST search or gather information on protein interactions.
Protein Data Bank
The Protein Data Bank, more commonly known in the field as the PDB, is the largest freely available crystal structure repository on the web. Crystallographers are asked, upon paper submission, to deposit quality X-ray structures of their published proteins on this website, constantly fueling it with quality data. With libraries such as BioPython, the user can query this database for .pdb files, making it quick to gather homologues for comparison.
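As a sketch of how .pdb files can be fetched from a script, the snippet below builds the RCSB download URL for a classic four-character PDB ID and, alternatively, delegates the download to BioPython's `PDBList`. The helper names are ours; the BioPython route assumes the `biopython` package is installed and the network is reachable.

```python
import re

# Download endpoint of the RCSB file server for legacy PDB-format files.
RCSB_DOWNLOAD_URL = "https://files.rcsb.org/download/{pdb_id}.pdb"


def pdb_download_url(pdb_id):
    """Return the download URL for a classic four-character PDB ID."""
    pdb_id = pdb_id.strip().upper()
    if not re.fullmatch(r"[0-9][A-Z0-9]{3}", pdb_id):
        raise ValueError(f"not a 4-character PDB ID: {pdb_id!r}")
    return RCSB_DOWNLOAD_URL.format(pdb_id=pdb_id)


def fetch_with_biopython(pdb_id, directory="."):
    """Download one entry in PDB format and return the local file path."""
    from Bio.PDB import PDBList  # lazy import: only needed for real downloads

    return PDBList().retrieve_pdb_file(pdb_id, pdir=directory, file_format="pdb")
```

Looping `fetch_with_biopython` over a list of IDs is one simple way to collect a set of homologues for comparison.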
Expasy (SIB)
This resource is maintained by the Swiss Institute of Bioinformatics (SIB). It is a good gateway to peer-reviewed information on a gene or protein, and it also packs some powerful tools for BLAST, structure and sequence alignment. From the main page you can query several databases that will give you large amounts of data. Again, the very versatile BioPython library gives you scripted access to this material.
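With BioPython, retrieving a Swiss-Prot entry through Expasy can be sketched as below. It relies on `Bio.ExPASy.get_sprot_raw` and `Bio.SwissProt.read`, and assumes `biopython` is installed and the network is reachable; `record_summary` and `fetch_swissprot` are helper names we introduce for illustration.

```python
def record_summary(record):
    """Keep a few fields of a Bio.SwissProt.Record-like object."""
    return {
        "entry_name": record.entry_name,
        "organism": record.organism,
        "length": record.sequence_length,
        "description": record.description,
    }


def fetch_swissprot(accession):
    """Download and parse one Swiss-Prot entry by its UniProt accession."""
    from Bio import ExPASy, SwissProt  # lazy import: only needed for real queries

    handle = ExPASy.get_sprot_raw(accession)
    try:
        return record_summary(SwissProt.read(handle))
    finally:
        handle.close()
```

`fetch_swissprot` can be called with any UniProt accession of your choice and returns the summary dictionary for that entry.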
PubMed
PubMed, like PubChem, is the child of the National Center for Biotechnology Information (NCBI) and hosts an extremely large database of scientific papers (or rather their references; sadly, most are not open access), gene/protein information and
- The PubMed website
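PubMed can be queried from a script through the NCBI E-utilities. The sketch below uses only the standard library and the `esearch` endpoint with JSON output; the function names are ours. BioPython's `Bio.Entrez` module wraps the same service if you would rather not build URLs by hand.

```python
import json
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, max_results=5):
    """Build an E-utilities search URL for PubMed with JSON output."""
    params = {"db": "pubmed", "term": term, "retmax": max_results, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_pmids(response_text):
    """Extract the list of PubMed IDs from an esearch JSON response."""
    return json.loads(response_text)["esearchresult"]["idlist"]


def search_pubmed(term, max_results=5):
    """Run the search online and return the matching PubMed IDs."""
    from urllib.request import urlopen  # only needed for real queries

    with urlopen(esearch_url(term, max_results)) as response:
        return parse_pmids(response.read().decode("utf-8"))
```

NCBI asks heavy users to throttle their requests, so keep `max_results` small and pause between calls when looping over many terms.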