B. VAUGHAN and Dr. J. Tony PEMBROKE

Introduction
In the 1960s, the U.S. Department of Defense grappled with the problem of making a decentralised computer network so that it wouldn't have a single "point of failure", such as a network hub which could be targeted and disabled in a nuclear attack. This experiment, administered by the Defense Advanced Research Projects Agency, became known as the ARPANet. Little did they know that ARPANet would become the basis for a more global computer network that is the modern Internet.
On September 1st 1994 the Internet celebrated 25 years of existence. In those 25 years it has been transformed
rather ironically from an obscure by-product of the Cold War into the high profile, high tech, world-wide Information
Super-highway that is pervading all aspects of our lives. This amazing ascendancy of interest in the Internet has
come about due to the advent of the personal computer age, where everyone can be hooked up at minimal cost. Estimates
of the number of people accessing the Internet vary; some statisticians theorise it may be anywhere between
40 and 100 million users. What is certain is that the Internet is growing at an amazing rate and that, as a resource
of information, it is becoming invaluable to more and more people world-wide, and especially to biologists.
What is BioInformatics?
Nowhere has the impact of the Internet been more apparent than in the scientific fields, and in the Biological Sciences this has led to the development of a whole new area of expertise, that of BioInformatics. How do we define BioInformatics? One definition states that:
"BioInformatics, sometimes, is used interchangeably with the term Computational Biology. Precisely, Computational
Biology is defined as the systematic development and application of computing systems and computational solution
techniques to models of biological phenomena; BioInformatics is defined as the systematic development and application
of computing systems and computational solution techniques analysing data obtained by experiments, modelling, database
search, and instrumentation regarding biological aspect."
Putting it more simply, BioInformatics is solving biological problems with computers. Before discussing BioInformatics
itself, a brief overview is required of the medium on which BioInformatics largely depends: the Internet and, in particular,
the World Wide Web.
Internet Basics
What is the Internet?
It is important to remember that the Internet is not one entity based in any one place, nor controlled by any
one group. It is a network of networks of thousands of computers (or Internet sites) across the planet, each holding
only a fraction of the overall information available. These computers use a common protocol or "language"
called Transmission Control Protocol/Internet Protocol (TCP/IP) and this language allows all types of computers
to communicate and transfer information with each other over the Internet.
Where is the information?
For information to reach the right destination on the Internet, each site has a unique numeric address associated
with it (called an IP number). Since remembering IP numbers can be difficult, most computers also have a domain
name which maps onto their IP number, so instead of entering a string of digits, a more recognisable name can
be entered. An example of a domain name is www.ul.ie where ie means the computer is in Ireland (every
country has a two letter code assigned to it), the ul indicates the computer is in the University of Limerick
and the www is the chosen name for the computer (chosen by the owner) which here indicates that it contains
World Wide Web pages. There can be a lot of variation in domain names, but each has to be registered and recognised
as unique before it can connect to the Internet.
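The way the example name www.ul.ie breaks down can be sketched in a few lines of Python. This is only an illustration of the naming scheme described above: the function name split_domain_name is hypothetical, and it assumes a name of the form host.organisation.country, which is not true of every domain name on the Internet.

```python
def split_domain_name(name):
    """Split a domain name into its dot-separated parts.

    Assumes the layout host.organisation.country used in the example
    above; real domain names do not all follow this layout.
    """
    labels = name.lower().split(".")
    return {
        "host": labels[0],             # name chosen by the owner, e.g. "www"
        "organisation": labels[1:-1],  # everything between host and country code
        "country_code": labels[-1],    # two letter country code, e.g. "ie"
    }


parts = split_domain_name("www.ul.ie")
print(parts["host"], parts["organisation"], parts["country_code"])
```

Reading the name from right to left goes from the most general part (the country) to the most specific (the individual computer).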
How do I get access to the information?
There are numerous different ways of accessing information on the Internet.
These are summarised in Box 1.
| Box 1. Accessing Information on the Internet
Email - This is the backbone of communication on the Internet. Analogous to postal mail (which is known as "snail mail" on the Internet), messages can be sent to anyone as long as their unique address is known. The advantage over postal mail is that email gets there almost immediately and the same message can be sent to any number of people at the same time if desired. Email is not limited to text: documents, sounds, pictures and even video images can be attached if desired. An email address is similar to a domain name. For example, in the email address fred@ul.ie, the ie again signifies Ireland and the ul University of Limerick. The @ symbol indicates that it is an email address and the fred is the name of the account on that machine. Everyone on the Internet has a unique email address.
FTP (File Transfer Protocol) - software that enables you to connect to other computers, and retrieve ("download") copies of files to your own computer. Guest or "anonymous" ftp is widely used on the Internet to allow access to software which is public domain and freely available.
Gopher - software that makes accessing text documents, other gophers, telnet and FTP sites easy, because you simply make a choice from a menu.
Telnet - software that enables you to connect to remote computers, and use them as if you were there.
Usenet - Also known simply as "News", this is one of the primary sources of information on the Internet and provides a forum for discussion and questions. Usenet is divided into "newsgroups" which are areas dedicated to discussion of a particular subject. For example bionet.microbiology is a newsgroup where discussion on microbiology is carried out. There are thousands of newsgroups available on all manner of subjects.
WWW (World Wide Web) - The World Wide Web is a multimedia, hypertext system which provides easier access to resources of various types on the Internet.
Multimedia means that the World Wide Web has the capability to present text, graphics, moving images, and sound. Hypertext means that Web documents can contain links to other documents embedded in the text. Clicking on highlighted text will take you directly to another document, or to another kind of Internet resource.
The World Wide Web is an example of client/server computing. In this type of system, you use a client or browser, generally stored on your own computer, to access resources stored on servers around the world. The server supplies the requested file to your client, and the client is then responsible for formatting and displaying the file. Thus, the same document may look different depending on what browser you are using, and how it is currently set up. Popular browsers for the World Wide Web include Netscape, Internet Explorer and HotJava.
Information on the Web itself is stored in documents known as pages. The central page for a given site is known as the home page. Each page can contain a mixture of text, graphics, and links to other resources. Links will display differently from regular text. This display is dependent on the client you are using, but generally speaking, links will be underlined and in a different colour. Graphics can also serve as links.
Most World Wide Web clients can also access and display information in a number of other formats, including gopher, ftp, telnet and Usenet groups. Thus, a Web client can probably be used for most of your Internet exploring. |
It is the advent of the World Wide Web (WWW) which is having the single most profound effect on the Internet and consequently on BioInformatics. The visual aspects of the WWW, incorporating text, images and sounds, have appealed to professionals and the public alike. Its inherent connectivity of data, its structured nature and the fact that it can incorporate other Internet tools (e.g. FTP) have made it an ideal platform for biological analysis, which involves storage and manipulation of large amounts of data in numerous different locations. Complex analysis of data such as alignment of sequences or even theoretical protein structure modelling can be carried out locally using software freely available on the WWW. All that is required is the correct URL (see Box 2), and help usually accompanies any software provided. Most commercial biological journals are online and can be perused; online real-time discussions can be carried out in designated discussion areas; complex three-dimensional visualisations of biomolecules can be accessed and manipulated; information on almost all branches of science can be just a mouse click away.
| Box 2. Glossary of World Wide Web Terms
URL - Uniform Resource Locator. The global address of documents and other resources on the WWW. Each document on the web has its own unique URL. An example of a URL is http://www.ul.ie/ulix.html where again www.ul.ie is the domain name and ulix.html is the name of the document being viewed. All URLs are of the same general format.
http:// - HyperText Transfer Protocol. This standard protocol defines how messages are transmitted and what actions Web servers and browsers should take in response to various commands. In effect, anything beginning with http:// is a WWW address.
Hypertext - A system in which documents contain highlighted links that allow readers to move between areas of the document or on to other documents, following subjects of interest along a variety of different paths.
HTML - HyperText Markup Language. The authoring language used to create pages on the WWW. It contains tags which inform a browser how to format a page e.g. when to italicise text, when to indicate a link etc. |
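As a small illustration of the URL format described in Box 2, the example address can be taken apart with Python's standard urllib.parse module. This is just a sketch to show which part of the address is the protocol, which is the domain name and which is the document.

```python
from urllib.parse import urlsplit

# The example URL from Box 2.
url = "http://www.ul.ie/ulix.html"
parts = urlsplit(url)

print("protocol:", parts.scheme)     # the part before ://
print("domain name:", parts.netloc)  # the computer holding the document
print("document:", parts.path)       # the document on that computer
```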
With such a large amount of information on the web, it follows that some method of searching for desired topics is necessary. This function is carried out by the numerous search engines (see Box 3) available online. Search engines allow the user to enter the desired subject, which the engine will then try to locate in all the WWW sites it knows. Any successful matches or "hits" are then returned to the user in the form of hypertext links. Most engines allow refinement of searches if a single search returns too many hits. Search engines are updated regularly, often daily, to keep up with the pages currently accessible on the WWW.
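The idea of matching a subject against known pages and returning "hits" can be sketched with a toy keyword index in Python. This is a minimal illustration only, not how any particular search engine works: the page addresses and texts below are invented, and the function names build_index and search are hypothetical.

```python
from collections import defaultdict

# Hypothetical pages (address -> text), invented purely for illustration.
pages = {
    "http://example.org/microbes.html": "Microbiology notes on bacteria and viruses",
    "http://example.org/proteins.html": "Protein structure and sequence analysis",
    "http://example.org/dna.html": "DNA sequence analysis tools",
}


def build_index(pages):
    """Map every word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages containing every word of the query, sorted by address."""
    words = query.lower().split()
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


index = build_index(pages)
print(search(index, "sequence"))          # a broad search
print(search(index, "sequence protein"))  # a refined search with fewer hits
```

Adding a second word to the query narrows the result, which is the kind of refinement mentioned above for searches that return too many hits.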
| Box 3. Common WWW Search Engines
Altavista - http://www.altavista.com
Lycos - http://www.lycos.com
Yahoo - http://www.yahoo.com
WebCrawler - http://webcrawler.com
InfoSeek - http://www2.infoseek.com |
Putting Information on the WWW
One of the reasons for the rapid growth of the WWW is that creating a web page is not an overly difficult thing to do. There are a few primary requirements. The first and most obvious of these is that wherever the pages are created, they must be accessible to the Internet public in general. Most Internet Access Providers offer a service which allows their customers to create their own web pages. Alternatively, if you are using a computer system which is directly connected to the Internet and which has a Web Server, your pages can be created there. The next vital ingredient is a knowledge of HTML (HyperText Markup Language). HTML allows formatting of web pages in a manner which a web browser can recognise and interpret. An HTML file is a plain text (or ASCII) file which operates using a system of "labels" or "markup tags". Tags allow structuring of the document text as desired as well as providing for insertion of pictures and sounds and the creation of hypertext links to other documents.
All tags are of the same format: a left angle bracket (<), the tag name and a right angle bracket (>). Most
tags operate in pairs, with the second tag containing a forward slash (/) between the left angle bracket
and the tag name. The second tag cancels the effect of the first, so only text contained within
the pair of tags is affected. For example, to italicise text, the tag <i> would be placed before the desired
text and </i> would follow it. Everything outside that pair of tags would be normal text. There are numerous tags
which perform a multitude of functions, and there are a variety of tutorials online which explain how to use them,
e.g. the NCSA Beginner's Guide to HTML can be found at http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimerAll.html
There are also numerous HTML editors available which greatly simplify creation of a web page: http://dir.yahoo.com/Computers_and_Internet/Software/Reviews/Titles/Internet/Web_Authoring_Tools/HTML_Editors/
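To show how a pair of tags marks out the text it affects, the following Python sketch reads a one-line HTML fragment with the standard html.parser module and collects only the text lying between <i> and </i>. The fragment and the class name ItalicCollector are invented for illustration; a real browser does far more than this.

```python
from html.parser import HTMLParser


class ItalicCollector(HTMLParser):
    """Collect the text that lies between <i> and </i> tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # how many <i> tags are currently open
        self.italic_text = []   # pieces of text found inside <i>...</i>

    def handle_starttag(self, tag, attrs):
        if tag == "i":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "i" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.italic_text.append(data)


# A small HTML fragment: only the species name is inside the tag pair.
page = "<p>The bacterium <i>Escherichia coli</i> is widely studied.</p>"
collector = ItalicCollector()
collector.feed(page)
collector.close()
print("".join(collector.italic_text))
```

Only the words inside the tag pair are collected; everything outside it is treated as normal text, just as a browser would display it.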
The third element required to set up web pages is the one that allows viewing of the page, a Web Browser. There
are quite a number of browsers available and most can be downloaded for minimal charge or free from various web
sites and ftp sites around the world. The foremost browsers are Netscape and Microsoft Internet Explorer,
which between them are used by most of the Internet population.
http://home.netscape.com/comprod/mirror/client_download.html
http://microsoft.com
Since the WWW is a very visual medium, a final item which is very helpful in creating a web page is an image
editing tool. Again there are quite a few available on the Internet either free or with a minimal charge, and one
of the most powerful and easiest to use of these is Paint Shop Pro. http://www.jasc.com/
BioInformatics
Since Biology is such a diverse field of study, it follows that BioInformatics itself subdivides into a large
range of disciplines. BioInformatics encompasses all computational methods and theory applicable to molecular biology,
including software tools, packages and systems; algorithms; mathematical analysis of algorithms, and analysis that
can be expected to lead to new algorithms; software associated with instrumentation; and computer-based techniques
for solving biological problems. It is concerned with the development of techniques for the collection and manipulation
of biological data, and the use of such techniques to make biological discoveries (or at least, predictions).
By far the largest application of BioInformatics is retrieving and analysing DNA and protein sequences. By examining
and comparing sequences, researchers can uncover vital information on the function, structure and even evolutionary
history of a biomolecule. Sequence data acquisition is being driven by the rapid accumulation of data from undertakings
such as the Human Genome Project (co-ordinated internationally by HUGO, the Human Genome Organisation) where scientists
are mapping all the DNA in the human chromosomes. The goal
of this project is nothing less than determining the precise location and molecular details of all the genes and
interconnecting segments which make up the human chromosomes. Since genes are the tiny but complex chemical segments
that control the activities of the cells and hence those of the entire organism, such knowledge would be an extraordinarily
powerful tool to explore what are now unfathomable mysteries of human development and disease.
Databases
Before any of these sequences can be analysed, they must first be located and retrieved. Sequences are stored in vast databases world-wide. These databases are usually either nucleotide (DNA) databases or protein databases. Sequences entered in databases are given their own unique accession numbers which serve as permanent references to the data and a means of citation. Sequences are entered into databases in two main ways. They can be directly submitted, electronically or manually, by the people who discover them or they can be extracted by annotators from the literature. Indeed, quite a few journals now prefer sequence accession numbers to be submitted to them rather than the actual sequence itself, as this cuts down on paper volume and, thus, cost.
By far the biggest nucleotide databases are GenBank in Bethesda, USA, the EMBL in Heidelberg, Germany
and the DNA Data Bank of Japan (DDBJ) in Mishima. Each of the three collects a portion of the total reported
sequence data and exchanges it with the others on a daily basis.
Protein databases available include the SWISS-PROT database, maintained collaboratively by the EMBL Data Library
and Amos Bairoch of the University of Geneva, and the Protein Identification Resource (PIR) in Washington, D.C.
Close collaboration with genome project databases has resulted in refined procedures for automatic inclusion of
genome sequence data into most databases. In fact, genome projects now comprise 20% of all sequence entries into
the EMBL Data Library alone.
Sequence Analysis
Since sequence analysis is the backbone of BioInformatics, it is not surprising that there are a great number of WWW resources dedicated to it (Box 4). Here is an example of a set of analyses which could be carried out on a DNA sequence. Initially, the desired DNA sequence can be downloaded from GenBank or EMBL using the Sequence Retrieval System (SRS) available from the European BioInformatics Institute (EBI). The EBI, an outstation of the EMBL, is based at Hinxton Hall in Cambridge, where it shares the site with the Human Genome Resource (which works on the HUGO project) and the Sanger Centre (which works on HUGO and also on sequencing the genomes of other organisms).
| Box 4. Useful Sites for Sequence Analysis
GenBank - http://www.ncbi.nlm.nih.gov/Genbank/index.html
EMBL - http://www.embl-heidelberg.de
SWISS-PROT - http://www.expasy.ch/sprot/sprot-top.html
PIR - http://www.bis.med.jhmi.edu/Dan/proteins/pir-last.html
SRS - http://srs.ebi.ac.uk:5000/
Gene Finder - http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html
BLAST - http://www2.ebi.ac.uk/blast2/
Clustal - http://www2.ebi.ac.uk/clustalw/
Grail - http://avalon.epm.ornl.gov/Grail-bin/EmptyGrailForm
Translate - http://www.expasy.ch/tools/dna.html
pI/Mw tool - http://www.expasy.ch/tools/pi_tool.html
SAPS - http://ludwig-sun1.unil.ch:8080/software/SAPS_form.html
ProtParam (predicts physicochemical properties of a protein) - http://www.expasy.ch/tools/protparam.html
ProtScale (hydrophobicity etc.) - http://www.expasy.ch/cgi-bin/protscale.pl
GOR (protein secondary structure prediction) - http://molbiol.soton.ac.uk/compute/GOR.html
Swiss-Model (3D molecular modelling) - http://www.expasy.ch/swissmod/SWISS-MODEL.html |
SRS is a powerful retrieval tool which allows interrogation of all the major databases (both nucleotide and
protein) either individually or collectively. Searches can be based on a large number of criteria e.g. the name
of the gene, the accession number, organism name, authors, even sequence length or any combination of these.
Once the sequence has been retrieved, the next step might be to try to characterise it. There are numerous
examinations which could be carried out, including determining where the actual coding regions are and predicting internal
exons, splicing sites, restriction sites etc.
There are a wide variety of WWW sites dedicated to this (e.g. Gene Finder) and there are also a large
number of FTP sites from which free software that can perform these tasks can be downloaded.
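One of the simpler examinations mentioned above, locating restriction sites, amounts to finding every occurrence of a short recognition sequence in the DNA. The Python sketch below does this for GAATTC, the recognition sequence of the enzyme EcoRI. The DNA string is invented for illustration and the function name find_sites is hypothetical; dedicated tools handle many enzymes at once and are far more thorough.

```python
def find_sites(sequence, site):
    """Return the 0-based start position of every occurrence of site."""
    sequence = sequence.upper()
    site = site.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        # Continue one base further on so overlapping sites are not missed.
        start = sequence.find(site, start + 1)
    return positions


# Invented example sequence containing two EcoRI sites.
dna = "ATGAATTCGGCTAGAATTCTAA"
ecori = "GAATTC"
print(find_sites(dna, ecori))
```

Because GAATTC reads the same on both strands, scanning one strand is enough here; for a recognition sequence that is not palindromic, the complementary strand would need to be scanned as well.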
To further characterise the sequence, a homology search can be performed. Homology searching involves comparing
a known sequence against any of the databases to identify similar sequences. This can be done using the BLAST
tool available from several locations including the EBI, the NCBI or Stanford University in the USA. By identifying
similar sequences, it may be possible to draw parallels with your own sequence and predict similar coding regions
(and hence perhaps similar function).
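The principle of comparing one sequence against a database and ranking the most similar entries can be sketched as follows. This is emphatically not the BLAST algorithm: it is a much cruder stand-in which scores two sequences by the proportion of short "words" (three bases long) that they share. The database entries, their labels and the function names are all invented for illustration.

```python
def kmers(sequence, k=3):
    """Return the set of all words of length k found in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def similarity(query, subject, k=3):
    """Shared words divided by total distinct words (0.0 to 1.0)."""
    a, b = kmers(query, k), kmers(subject, k)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical database: labels and sequences invented for illustration.
database = {
    "SEQ001": "ATGGCCATTGTAATGGGCCGC",
    "SEQ002": "TTTTTTTTTTTTTTTTTTTTT",
    "SEQ003": "ATGGCCATTGTTATGGGCCGC",
}


def rank_hits(query, database, k=3):
    """Score the query against every entry, best match first."""
    scores = [(similarity(query, seq, k), name) for name, seq in database.items()]
    return sorted(scores, reverse=True)


query = "ATGGCCATTGTAATGGGCCGC"
hits = rank_hits(query, database)
for score, name in hits:
    print(name, round(score, 2))
```

The identical entry scores highest, the entry differing by a single base comes next, and the unrelated entry scores zero. Real homology searches use statistically grounded scoring and can cope with databases of millions of sequences.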
When a similar sequence (or several similar sequences) has been identified, the original can then be aligned with
one or more of those sequences to identify the regions of similarity. A powerful tool for doing this is Clustal,
which attempts to align the sequences to maximise the similar regions. A consensus sequence can be generated which
prints out the sequence stretches which are identical between all of them, and a phylogenetic tree can be drawn which predicts
the evolutionary relationships between all the sequences.
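The consensus step described above can be illustrated with a short Python sketch. It assumes the sequences have already been aligned (for instance by a tool such as Clustal) so that they are all the same length, and it simply keeps the positions where every sequence agrees. The sequences and the function name consensus are invented for illustration, and the output format is not that of any particular program.

```python
def consensus(aligned):
    """Keep a character where every aligned sequence agrees, else write '-'."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return "".join(
        column[0] if len(set(column)) == 1 else "-"
        for column in zip(*aligned)   # walk through the alignment column by column
    )


# Three invented sequences, assumed to be already aligned.
aligned = ["ATGGCCATT", "ATGGCGATT", "ATGACCATT"]
print(consensus(aligned))
```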
Once such analyses have been carried out on a DNA sequence, the next progression might be to translate the sequence
into the protein it codes for. Analyses of the protein coding potential of a DNA sequence can be carried out using
the Grail server at ORNL. For translation, there are several sites where this can be performed, including
the Translate tool on the ExPASy server (which is where SWISS-PROT is housed).
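Translation itself is a simple table lookup: the DNA is read three bases at a time and each triplet (codon) is replaced by the amino acid it codes for. The Python sketch below uses the standard genetic code and reads a single frame starting at the first base, stopping at the first stop codon. It is a minimal illustration with a hypothetical function name, not a description of how the Translate tool works; such tools typically report several reading frames.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate one reading frame from the first base to the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Unrecognised codons (e.g. containing N) are written as X.
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":   # stop codon: end of the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCATTGTAATGGGCCGCTGA"))
```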
Once the protein has been identified, it too can be characterised for properties such as hydrophobicity, flexibility,
charge distribution, isoelectric point and molecular weight. Again there are numerous sites dedicated to these
tasks (e.g. SAPS - Statistical Analysis of Protein Sequences). The secondary structure of the protein can
then be predicted and analysed to help understand the function of different domains. It can even be compared with
the secondary structure of other proteins to investigate similar function due to structure.
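As an example of one such property, hydrophobicity is commonly estimated by giving each amino acid a score and averaging the scores along the protein. The Python sketch below assumes the widely used Kyte-Doolittle hydropathy scale (the article does not say which scale any given site uses) and computes both an overall average and a sliding-window profile. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale; positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(protein):
    """Average hydropathy score over the whole sequence."""
    protein = protein.upper()
    if not protein:
        raise ValueError("protein sequence is empty")
    return sum(KYTE_DOOLITTLE[residue] for residue in protein) / len(protein)


def hydropathy_profile(protein, window=5):
    """Average hydropathy of each stretch of `window` residues along the protein."""
    protein = protein.upper()
    return [
        mean_hydropathy(protein[i:i + window])
        for i in range(len(protein) - window + 1)
    ]


protein = "MAIVMGR"
print(round(mean_hydropathy(protein), 2))
print([round(value, 2) for value in hydropathy_profile(protein)])
```

Stretches with high values in the profile are candidates for hydrophobic regions, such as segments buried inside the protein or spanning a membrane.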
Even the overall three-dimensional structure can possibly be predicted using Swiss-Model, the SWISS-PROT 3D modelling tool.

Example of a three-dimensional structure of a protein downloaded from SWISS-PROT
Trying to perform such tasks in a traditional laboratory setting would take months of work. With a computer connected
to the WWW, this can be accomplished in a timeframe of hours. Also, use of the software which performs all these
tasks is free!
Beyond Sequence Analysis
However important sequence analysis is in the world of BioInformatics, it is not the only area of interest.
As its name suggests, BioInformatics is chiefly concerned with information, and the WWW provides a platform for access
to information on just about all branches of the Life Sciences.
Here are a few examples of interesting places to visit:
Elementary
An interactive Periodic Table of the Elements can be accessed. Clicking on any element brings up a wealth of information
on it, including electronegativities, radii and properties of some of its compounds: http://www.shef.ac.uk/uni/academic/A-C/chem/web-elements/web-elements-home.html
Units
A complete list of SI units can be viewed: http://www.chemie.fu-berlin.de/chemistry/general/si.html
Learning on the Web
The area of Internet education is a rapidly growing field. Tutorials can be viewed on a variety of subjects
including Virology, http://www.bocklabs.wisc.edu/Tutorial.html
Analytical Chemistry, http://www.scimedia.com/chem-ed/analytic/ac-basic.htm
and Cytology.
Online courses with full texts and recognised certificates upon completion can be attended on everything from
Medical Bacteriology and Microbiology, http://www.medmicro.mds.qmw.ac.uk/underground/
to Protein Structure, http://www.cryst.bbk.ac.uk/PPS/index.html
to Recombinant DNA technology, http://www.kadets.d20.co.edu/~lundberg/index.html
Another very interesting site is the Biology Place. This is a learning centre for both second and third level
education and covers a wide range of biological subjects. There is a very reasonable yearly membership fee, but
visitors can peruse most of the site and there is also a free 7 day trial option.
http://www.biology.com/
Laboratory Aid
Wondering about the exact procedure for a molecular biology experiment? The Materials and Methods site with
a comprehensive list of protocols may be just the place to look.
gopher://ftp.bio.indiana.edu/1m/Molecular-Biology/Materials%2bMethods
Dictionary Online
For explanations of scientific terms, a good place to look is the Access Excellence Glossary of Scientific Terms
which allows the user to input the term and returns a link to an explanation.
http://outcast.gene.com/ae/search.html
Diseases
If you want all the latest information on diseases and their causes, then the Centers for Disease Control and
Prevention is the place to look: http://www.cdc.gov/
Or if you want to keep track of the lethal Ebola Virus, then the Ebola Information Headquarters has all the answers:
http://www.geocities.com/CapeCanaveral/Lab/5738/

The Infamous Ebola Virus
Health
For health matters, there are a multitude of dedicated sites.
AIDS information can be found in numerous sites around the world, one of which is the AIDS Daily Summary Database
which contains thousands of articles on AIDS from numerous sources from 1988 to the present day: http://www.cdcnpin.org/
A sound diet can be devised with the aid of the Food and Nutrition Information Center. This also contains a link
to the Food and Drug Administration's page on foodborne illnesses.
http://www.nal.usda.gov/fnic/
Feeling ill? Wondering if it's worth the bother calling a doctor? Medical advice can be viewed and action suggested
on a large number of problems ranging from back pain to hoarseness to sun burn. http://www.scl.ncal.kaiperm.org/GetCare3.htm
Comprehensive health advice on subjects like allergies, CPR and prenatal care can also be accessed.
http://www.scl.ncal.kaiperm.org/healthinfo/index.html
Anatomy
Information on human anatomy is widespread on the web, but one of the more interesting undertakings is the Visible
Human Project by the U.S. National Library of Medicine. The aim is to create complete, anatomically detailed, three-dimensional
representations of the male and female human body. They even offer you the option of making your own visible woman!
http://www.nlm.nih.gov/research/visible/visible_human.html
Ecology
The environment is also well represented on the Internet, with extensive ecological sites online. Check out the
latest happenings from environmental groups such as Greenpeace, http://www.greenpeace.org
help support the conservation of the Atlantic wild salmon, http://www.asf.ca/
look at EarthKids, a child-friendly website aimed at helping younger people learn how to protect the Earth (the
site is currently maintained by a 12 year old!), http://www.earthkids.com/
or join the World Wildlife Fund's Living Planet campaign to celebrate its 35th anniversary:
http://www.livingplanet.org/
Plants
Anyone with an interest in plants will enjoy browsing through the wealth of botany related pages.
http://www.ou.edu/cas/botany-micro/www-vl/
or http://www.botany.net/IDB/
Why not try Scott's Botanical Links, with information on virtually everything including chocolate?
http://www.ou.edu/cas/botany-micro/bot-linx/
BSE
Worried about your beef consumption? Never fear, the facts on BSE and CJD are on hand.
http://www.gene.ucl.ac.uk/~dcurtis/lectures/bsecjd.html
Journals
There are a large number of commercial scientific journals online. Some even have complete texts of articles
online for non-subscribers and most offer a search facility for back issues. Look for the journal of your choice
at http://www.public.iastate.edu/~pedro/rt_journals.html
Conclusion
With the increasing availability of the Internet to scientists, students and the wider community, it is only
natural that relatively new areas of study such as BioInformatics are set to advance
at a rapid pace. In the future it may be possible to perform most common experimental work on a computer prior
to ever approaching a laboratory, in order to determine its feasibility. Whether that comes about or not, one thing
is certain: the Internet is here to stay and it is a very valuable resource well worth exploiting.
References
Teaching Microbiology with the World Wide Web. Terry, T. ASM News Vol 61 No 8 401-405 (1995)
Bioinformatics and Computational Biology at George Mason University http://www.science.gmu.edu/~michaels/Bioinformatics/index.html
Bioinformatics Internet Resources http://www.expasy.ch/
The World-Wide Web Virtual Library: Biosciences http://mcb.harvard.edu/BioLinks.html
Recent applications of hyperactive chemistry and the World Wide Web: Towards an integrated chemistry information
environment. Rzepa et al http://www.ch.ic.ac.uk/rzepa/cc96/.index.html
Genome Analysis: A Laboratory Manual - Cold Spring Harbour Laboratory http://www.ncbi.nlm.nih.gov/Baxevani/CSH/index.html
DNA Learning Centre at Cold Spring Harbour Laboratory http://vector.cshl.org/
Fox Chase Cancer Centre - Scientific Information and Databases http://www.fccc.edu/
MedWeb: Genetics and molecular biology http://www.medweb.emory.edu/MedWeb/
Human Genome Project Information http://www.ornl.gov/TechResources/Human_Genome/home.html
Computing for Molecular Biology Information Service http://www1.elsevier.nl/journals/genecombis/
PC Webopaedia - hypertext http://www.sandybay.com/pc-web/hypertext.htm
Pedro's BioMolecular Research Tools http://www.public.iastate.edu/~pedro/research_tools.html
Yahoo! - Science:Biology http://www.yahoo.com/Science/Biology/
Note: Since URLs change and sites become obsolete with time, these links may become dated. All links associated
with this article were checked in October 1999.