Gathering Data

From Hackuarium
Revision as of 21:50, 31 October 2016 by Hekkcess (Talk | contribs)



Gathering data, from a biochemical standpoint, can at first be a daunting task for an inexperienced user. Databases containing information such as chemical structures, gene sequences and protein structures have flourished in the past years. Several of these databases provide APIs for accessing and gathering the data remotely through scripts. Furthermore, scientists have developed libraries whose ready-made functions make the servers easier to access. Here we aim to document several of our favorite databases supporting Python scripting and to present the libraries we use. We will try to complement some of the existing libraries with scripts that facilitate database access, so that programmers quickly have something ready to use. We do not claim to be the first to list databases, and we will punctuate the document with existing articles to expand the reader's knowledge.


Yann Pierson


Here we describe the databases that we found to be the most complete and the most straightforward to access.

From previous authors


PubChem is the chemical database par excellence: it provides a myriad of chemical structures and properties. Possessing a fairly intuitive library, pubchempy, scripting data
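As a rough illustration of scripted access to PubChem, the sketch below talks to the PUG REST web service directly with the Python standard library, which avoids any extra dependency. The function names (property_url, fetch_properties) and the choice of properties are assumptions made for this example; pubchempy offers a friendlier interface on top of the same service.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(name, properties=("MolecularFormula", "MolecularWeight")):
    """Build a PUG REST URL asking for selected properties of a named compound."""
    wanted = ",".join(properties)
    return f"{PUG_REST}/compound/name/{quote(name)}/property/{wanted}/JSON"


def fetch_properties(name, properties=("MolecularFormula", "MolecularWeight")):
    """Query PubChem (needs network access) and return a list of property records."""
    with urlopen(property_url(name, properties), timeout=30) as response:
        payload = json.load(response)
    return payload["PropertyTable"]["Properties"]
```

Calling fetch_properties("aspirin") should return one dictionary per matching compound; the request goes over the network, so it is worth wrapping in error handling for real use.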


ZINC (Zinc Is Not Commercial; that sick acronym though) is a database gathering information on a large number of commercially available compounds. Each compound entry comes with links to suppliers and, sometimes, physical, chemical and biological characteristics. The compounds can be downloaded directly as optimized 3D coordinates for virtual docking applications. The database can be accessed with the smilite package.

Note to self: I remember this package being impractical; check whether I had scripts for this.


ChemSpider offers information on 57 million structures gathered from 518 different sources (Jul 2016). This database is owned by the RSC (Royal Society of Chemistry) and has received awards for the quality of the information available. A large panel of information is freely available for each molecular entry. The database can be accessed using the wrapper ChemSpiPy.

Biological items

In this part we describe ways not only to gather PDB files and files containing crystal structures or genes, but also resources allowing you to perform a BLAST search or gather information on protein interactions.

Protein Data Bank

The Protein Data Bank, more commonly known in the field as the PDB, is the most massive crystal structure repository freely available on the web. Crystallographers are requested, upon paper submission, to deposit quality X-ray structures of their published proteins on this website, constantly fueling it with quality data. With libraries such as BioPython, the user can query this database for .pdb files, making it possible to quickly gather homologues for comparison.
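To give an idea of what gathering a .pdb file involves, here is a minimal standard-library sketch that downloads an entry from the RCSB file server and reads its ATOM records using the fixed-column PDB layout. The function names are made up for this example, and the parser deliberately ignores HETATM records, alternate locations and multiple models. BioPython's Bio.PDB module (PDBList, PDBParser) handles all of this far more thoroughly.

```python
from urllib.request import urlopen


def pdb_url(pdb_id):
    """Return the RCSB download URL for a four-character PDB identifier."""
    return f"https://files.rcsb.org/download/{pdb_id.upper()}.pdb"


def fetch_pdb_text(pdb_id):
    """Download a PDB entry as text (needs network access)."""
    with urlopen(pdb_url(pdb_id), timeout=30) as response:
        return response.read().decode("utf-8")


def parse_atoms(pdb_text):
    """Extract name, residue, chain and coordinates from ATOM records."""
    atoms = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue  # skip HETATM, REMARK, END and every other record type
        atoms.append({
            "name": line[12:16].strip(),        # columns 13-16
            "residue": line[17:20].strip(),     # columns 18-20
            "chain": line[21],                  # column 22
            "residue_number": int(line[22:26]), # columns 23-26
            "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        })
    return atoms
```

Something like parse_atoms(fetch_pdb_text("1tup")) would then give one dictionary per atom, ready for filtering by chain or residue before comparing homologues.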

Expasy (SIB)

This resource is maintained by the Swiss Institute of Bioinformatics (SIB). It is a good gateway to peer-reviewed information on a gene or protein, and it also packs some powerful tools for BLAST and for structure and sequence alignment. Through the main page you can query several databases that will give you large amounts of data. Again, the very versatile BioPython library can give you access to this material through scripts.


PubMed, like PubChem, is the child of the National Center for Biotechnology Information (NCBI) and hosts an extremely large database of scientific papers (their references; sadly, most are not open access), gene/protein information, bioassays and much more.
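NCBI databases such as PubMed can be queried through the E-utilities web service. The sketch below, using only the standard library, builds an esearch request and returns the matching record identifiers. The helper names and the default of five results are assumptions for illustration; BioPython's Bio.Entrez module wraps the same service and is probably the more comfortable route for larger scripts.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="pubmed", retmax=5):
    """Build an E-utilities esearch URL for a search term in a given database."""
    query = urlencode({"db": db, "term": term, "retmax": retmax, "retmode": "json"})
    return f"{EUTILS}/esearch.fcgi?{query}"


def search_ids(term, db="pubmed", retmax=5):
    """Run the search (needs network access) and return the list of record IDs."""
    with urlopen(esearch_url(term, db, retmax), timeout=30) as response:
        payload = json.load(response)
    return payload["esearchresult"]["idlist"]
```

For example, search_ids("alcohol dehydrogenase") should give up to five PubMed identifiers. NCBI asks heavy users to throttle their requests, so a loop over many terms should pause between calls.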


BRENDA (BRaunschweig ENzyme DAtabase) is one of the most computationally comprehensive databases. It covers and screens several other databases and papers to give the user an amount of information that can sometimes be overwhelming! The database also gives information on protein-ligand interactions, sometimes with their affinities and kinetic information, and has a clever nomenclature. It gathers and compiles data from the PDB, NCBI and EMBL, to name just a few (basically everything that was mentioned above).

Genetic sequences


Here we describe the most important features of our favorite libraries; most of them have been introduced previously.


If there were one library to use for biochemical searches, it would be this one. This library packs most of the functionality a biologist would need. From sequence and structure gathering to alignment and translation, the functions offered in the package are powerful!



The OpenBabel software is one of the most practical resources when it comes to converting file formats. Indeed, in the world of molecules and structures, the development of computer-aided chemistry has produced a large number of file formats, all presenting the information in different ways. Compatibility between files and programs is not always guaranteed, so converting these files can prove beneficial to ensure smooth computation. OpenBabel has a Python wrapper; it is very valuable but can sometimes be tricky to install.
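When the Python wrapper refuses to install, one workaround is to drive the obabel command-line tool from Python instead. The sketch below is written under that assumption: it swaps the wrapper for a subprocess call, and it supposes that obabel is on the PATH and that each file extension matches an OpenBabel format code (pdb, mol2, sdf, ...). The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def obabel_command(source, target):
    """Return the argument list converting `source` to `target` with obabel.

    Input and output formats are taken from the file extensions.
    """
    in_format = Path(source).suffix.lstrip(".").lower()
    out_format = Path(target).suffix.lstrip(".").lower()
    if not in_format or not out_format:
        raise ValueError("both files need an extension naming their format")
    return ["obabel", f"-i{in_format}", str(source), f"-o{out_format}", "-O", str(target)]


def convert(source, target):
    """Run the conversion; raises CalledProcessError if obabel reports a failure."""
    subprocess.run(obabel_command(source, target), check=True)
```

A call such as convert("ligand.pdb", "ligand.mol2") then produces the converted file next to the original.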


In this section we suggest some good Python-compatible molecular viewers allowing the user to model protein or molecular structures.


PyMOL is, in my opinion, one of the most "script friendly" interfaces. Any feature of a protein can be accessed and modified. It has a nice rendering script allowing the user to export professional-grade images. Another powerful feature of PyMOL is its ability to integrate plugins that give it more functions.


MGLTools, also named AutoDock Tools, is part of the AutoDock 4/AutoDock Vina suite. This software was developed by the Scripps Research Institute, better known for its excellent program AutoDock Vina (described later). This tool also offers many structure optimisation algorithms that help minimize and clean protein structures.


Gathering data and possessing structures gets much more interesting once we use them. Comparing chemical properties, or comparing protein structure differences between several micro-organisms, might reveal much more information. Here we try to show some existing applications one can build with the data.


Docking consists of computing the possible conformations a molecule might adopt in the binding pocket of a protein. Such information can be valuable when screening for possible drugs, using in this case protein and 3D structure databases. Some algorithms are good at predicting a binding geometry, whereas others are better suited to binding affinity calculations.
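Most docking programs need to be told where the binding pocket is. As a hedged sketch, assuming AutoDock Vina as the docking engine, the code below derives a search box from the coordinates of a known ligand and writes it out in Vina's "key = value" configuration format. The function names and the 5 Å padding are arbitrary choices for illustration, not recommended settings.

```python
def docking_box(coords, padding=5.0):
    """Return (center, size) of an axis-aligned box around 3D points, in angstroms."""
    axes = list(zip(*coords))  # regroup (x, y, z) points into three axis lists
    center = tuple((max(axis) + min(axis)) / 2 for axis in axes)
    size = tuple((max(axis) - min(axis)) + 2 * padding for axis in axes)
    return center, size


def vina_config(receptor, ligand, center, size, exhaustiveness=8):
    """Format an AutoDock Vina configuration file as a string."""
    lines = [f"receptor = {receptor}", f"ligand = {ligand}"]
    for axis, c, s in zip("xyz", center, size):
        lines.append(f"center_{axis} = {c:.3f}")
        lines.append(f"size_{axis} = {s:.3f}")
    lines.append(f"exhaustiveness = {exhaustiveness}")
    return "\n".join(lines) + "\n"
```

The ligand coordinates could come from the HETATM records of a crystal structure; the resulting text can be saved to a file and passed to Vina with its --config option.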

Geometry Algorithms

Geometry-efficient algorithms

Energy calculation

Mutation Studies

Enzymes have evolved over time, partly through random mutations and evolutionary pressure. These mutations differ between species and can pinpoint sites on the structure that are more flexible for engineering.
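One simple way to spot such sites, sketched here under the assumption that the sequences from the different species have already been aligned to the same length, is to score how conserved each alignment column is and flag the columns that vary. The function names and the threshold are illustrative choices, and the score is a plain majority fraction rather than a substitution-matrix or entropy-based measure.

```python
from collections import Counter


def column_conservation(aligned):
    """Return, per column, the fraction of sequences sharing the most common residue."""
    if len({len(sequence) for sequence in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*aligned):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


def variable_sites(aligned, threshold=1.0):
    """Return 0-based positions whose conservation falls below the threshold."""
    return [i for i, score in enumerate(column_conservation(aligned)) if score < threshold]
```

With the three toy sequences "MKTAY", "MKSAY" and "MRTAY", positions 1 and 2 are reported as variable, which would make them the first candidates to look at on the structure.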

Base alignments

Best algorithms to align DNA sequences
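As a starting point while this part is still a stub, here is a sketch of the classic Needleman-Wunsch global alignment, reduced to computing the optimal score only. It is included as a textbook reference, not as a claim about which algorithm is best; the scoring values (match 1, mismatch -1, gap -1) are arbitrary defaults, and production work would rather rely on BioPython or a dedicated aligner.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b."""
    # previous[j] holds the best score for aligning the prefix of `a` seen so far with b[:j]
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # aligning a[:i] against an empty prefix costs i gaps
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = previous[j] + gap      # a[i-1] aligned to a gap
            gap_in_a = current[j - 1] + gap   # b[j-1] aligned to a gap
            current.append(max(diagonal, gap_in_b, gap_in_a))
        previous = current
    return previous[-1]
```

Only two rows of the dynamic programming table are kept, so memory grows with the length of the second sequence; recovering the alignment itself would require storing the full table and tracing back through it.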

Residue alignments

Best algorithms to align amino acid (AA) sequences

Structure alignment

Best algorithms to align protein folds

Other Sources To Enquire

Here we list other Python-based tools that might be of interest, each with a one-sentence description. We might expand on some of them later, but will most probably leave these resources to your curiosity.

  • PyChem is a viewer of chemical data. It doesn't seem to be maintained.
  • FragIt is a Python-based tool that allows you to quickly fragment "any" molecule.
  • PyQuante is an open-source suite of programs for developing quantum chemistry methods.