DNA fingerprinting in forensics: past, present, future
1Department of Forensic Genetics, Institute of Legal Medicine and Forensic Sciences, Charité - Universitätsmedizin Berlin, Berlin, Germany
Lutz Roewer: email@example.com
Received 2013 Oct 8; Accepted 2013 Oct 8.
Copyright © 2013 Roewer; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
DNA fingerprinting, one of the great discoveries of the late 20th century, has revolutionized forensic investigations. This review briefly recapitulates 30 years of progress in forensic DNA analysis, which has helped to convict criminals, exonerate the wrongly accused, and identify victims of crime, disasters, and war. Current standard methods based on short tandem repeats (STRs) as well as lineage markers (Y chromosome, mitochondrial DNA) are covered, and applications are illustrated by casework examples. Benefits and risks of expanding forensic DNA databases are discussed, and we ask what the future holds for forensic DNA fingerprinting.
Keywords: DNA fingerprinting, Forensic DNA profiling, Short tandem repeat, Lineage markers, Ancestry informative markers, Forensic DNA database, Privacy rights
The past - a new method that changed the forensic world
'"I've found it! I've found it", he shouted, running towards us with a test-tube in his hand. "I have found a re-agent which is precipitated by hemoglobin, and by nothing else",' says Sherlock Holmes to Watson in Arthur Conan Doyle's first novel A Study in Scarlet from 1886, and later: 'Now we have the Sherlock Holmes' test, and there will no longer be any difficulty […]. Had this test been invented, there are hundreds of men now walking the earth who would long ago have paid the penalty of their crimes'.
The Eureka shout shook England again and was heard around the world when, roughly 100 years later, Alec Jeffreys at the University of Leicester in the UK found extraordinarily variable and heritable patterns in repetitive DNA analyzed with multi-locus probes. Not being Holmes, he refrained from naming the method after himself and called it 'DNA fingerprinting'. Under this name his invention opened up a new area of science. The technique proved applicable in many biological disciplines, notably in diversity and conservation studies among species and in clinical and anthropological studies. But the true political and social dimension of genetic fingerprinting became apparent far beyond academic circles when the first applications in civil and criminal cases were published. Forensic genetic fingerprinting can be defined as the comparison of the DNA in a person's nucleated cells with that identified in biological matter found at the scene of a crime, or with the DNA of another person, for the purpose of identification or exclusion. The application of these techniques introduces new factual evidence to criminal investigations and court cases. However, the first case (March 1985) was not strictly a forensic case but one of immigration. This first application of DNA fingerprinting saved a young boy from deportation, and the method thus captured the public's sympathy. In Alec Jeffreys' words: 'If our first case had been forensic I believe it would have been challenged and the process may well have been damaged in the courts'. The forensic implications of genetic fingerprinting were nevertheless obvious, and improvements to the laboratory process led as early as 1987 to the first application in a forensic case. Two teenage girls had been raped and murdered on different occasions in nearby English villages, one in 1983 and the other in 1986. Semen was obtained from each of the two crime scenes.
The case was spectacular because it surprisingly excluded a suspected man, Richard Buckland, and matched another man, Colin Pitchfork, who had attempted to evade the DNA dragnet by persuading a friend to give a sample on his behalf. Pitchfork confessed to committing the crimes after he was confronted with the evidence that his DNA profile matched the trace DNA from the two crime scenes. For 2 years the Lister Institute of Leicester, where Jeffreys was employed, was the only laboratory in the world doing this work. But around 1987 companies such as Cellmark, academic medico-legal institutions around the world, national police forces, and law enforcement agencies started to evaluate, improve upon, and employ the new tool. The years after the discovery of DNA fingerprinting were characterized by a mood of cooperation and interdisciplinary research. None of the many young researchers who were there will ever forget the DNA fingerprint congresses held on five continents, in Bern (1990), Belo Horizonte (1992), Hyderabad (1994), Melbourne (1996), and Port Elizabeth (1999), which then wound down with the good feeling that the job was done. Everyone read the Fingerprint News, distributed for free by the University of Cambridge since 1989 (Figure 1). This affectionate little periodical published informal short articles straight from the bench, without impact factors, and fostered networking across the different fields of application. The 1990s were the golden research age of DNA fingerprinting, followed by two decades of engineering, implementation, and high-throughput application.
From the Foreword of Alec Jeffreys in Fingerprint News, Issue 1, January 1989: 'Dear Colleagues, […] I hope that Fingerprint News will cover all aspects of hypervariable DNA and its application, including both multi-locus and single-locus systems, new methods for studying DNA polymorphisms, the population genetics of variable loci and the statistical analysis of fingerprint data, as well as providing useful technical tips for getting good DNA profiles […]. May your bands be variable’ .
Cover of one of the first issues of Fingerprint News from 1990.
Jeffreys' original technology, now obsolete for forensic use, underwent important developments in its basic methodology: from Southern blot to PCR, from radioactive to fluorescent labels, from slab gels to capillary electrophoresis. As the technique became more sensitive, the handling simple and automated, and the statistical treatment straightforward, DNA profiling, as the method was renamed, took forensic routine laboratories around the world by storm. But what counted in the Pitchfork case, and what still counts today, is the process of getting DNA identification results accepted in legal proceedings. Spectacular failures, from the historical 1989 case of People vs. Castro in New York to the case against Knox and Sollecito in Italy (2007–2013), where DNA fingerprinting itself was literally on trial, disclosed severe insufficiencies in the technical protocols and especially in the interpretation of DNA evidence, and raised, nolens volens, doubts about the scientific and evidentiary value of forensic DNA fingerprinting. These cases are rare but frequent enough to remind each new generation of forensic analysts, researchers, and private sector employees that DNA evidence is nowadays an important part of factual evidence and thus needs intense scrutiny at every step of the analysis and interpretation process.
In the following I will briefly describe the development of DNA fingerprinting into a standardized investigative method for court use, which since 1984 has led to the conviction of thousands of criminals and to the exoneration of many wrongfully suspected or convicted individuals. Genetic fingerprinting per se can of course not reduce the crime rate in any of the many countries in the world which employ this method. But DNA profiling adds hard scientific value to the evidence and thus strengthens, in principle, the credibility of the legal system.
The technological evolution of forensic DNA profiling
In the classical DNA fingerprinting method, radio-labeled DNA probes containing minisatellite or oligonucleotide sequences are hybridized to DNA that has been digested with a restriction enzyme, separated by agarose electrophoresis, and immobilized on a membrane by Southern blotting or, in the case of the oligonucleotide probes, immobilized directly in the dried gel. The radio-labeled probe hybridizes to a set of minisatellites or oligonucleotide stretches in genomic DNA contained in restriction fragments whose sizes differ because of variation in the numbers of repeat units. After washing away excess probe, exposure to X-ray film (autoradiography) allows these variable fragments to be visualized and their profiles compared between individuals. Minisatellite probes, called 33.6 and 33.15, were most widely used in the UK, most parts of Europe, and the USA, whereas pentameric (CAC)5/(GTG)5 probes were predominantly applied in Germany. These so-called multilocus probes (MLP) detect sets of 15 to 20 variable fragments per individual, ranging from 3.5 to 20 kb in size (Figure 2). But the multi-locus profiling method had several limitations, despite its successful application to crime and kinship cases until the middle of the 1990s. First, running conditions or DNA quality issues often made exact matching between bands difficult. To overcome this, forensic laboratories adhered to binning approaches, where fixed or floating bins were defined relative to the observed DNA fragment size and adjusted to the resolving power of the detection system. Second, fragment association within one DNA fingerprint profile is not known, leading to statistical errors due to possible linkage between loci. Third, for optimal profiles the method required substantial amounts of high molecular weight DNA, which excluded the majority of crime-scene samples from analysis. To overcome some of these limitations, single-locus profiling was developed.
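The fixed-bin idea can be illustrated with a short sketch. The bin boundaries below are hypothetical, chosen only to mimic how a laboratory would declare two RFLP bands matching when their measured fragment sizes fall within the same tolerance window:

```python
# Illustrative sketch of fixed-bin matching for RFLP fragments: each measured
# fragment size is assigned to a predefined size bin, and two bands "match"
# when they fall into the same bin. The bin boundaries are hypothetical,
# chosen only to mirror the resolving power of a detection system.

FIXED_BINS = [(3500, 4500), (4500, 6000), (6000, 8000),
              (8000, 11000), (11000, 15000), (15000, 20000)]  # sizes in bp

def assign_bin(fragment_bp: float) -> int:
    """Return the index of the fixed bin containing the fragment, or -1."""
    for i, (lo, hi) in enumerate(FIXED_BINS):
        if lo <= fragment_bp < hi:
            return i
    return -1

def bands_match(size_a: float, size_b: float) -> bool:
    """Two bands are declared a match when they fall in the same bin."""
    bin_a, bin_b = assign_bin(size_a), assign_bin(size_b)
    return bin_a != -1 and bin_a == bin_b

print(bands_match(5100, 5600))  # same bin -> True
print(bands_match(5900, 6100))  # adjacent bins -> False
```

Note how two fragments only 200 bp apart can fall into different bins, which is exactly the band-shift problem that motivated floating-bin variants.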
Here a single hypervariable locus is detected by a specific single-locus probe (SLP) using high-stringency hybridization. Typically, four SLPs were used in a reprobing approach, yielding eight alleles from four independent loci per individual. This method requires only 10 ng of genomic DNA and has been validated through extensive experiments and forensic casework; for many years it provided a robust and valuable system for individual identification. Nevertheless, all these restriction fragment length polymorphism (RFLP)-based methods were still limited by the available quality and quantity of DNA, and were also hampered by difficulties in reliably comparing genetic profiles from different sources, labs, and techniques. What was needed was a DNA code which could ideally be generated even from a single nucleated cell and from highly degraded DNA, a code which could be rapidly generated, numerically encrypted, automatically compared, and easily supported in court. Indeed, starting in the early 1990s, DNA fingerprinting methods based on RFLP analysis were gradually supplanted by methods based on PCR because of their improved sensitivity, speed, and genotyping precision. Microsatellites, in the forensic community usually referred to as short tandem repeats (STRs), were found to be ideally suited for forensic applications. STR typing is more sensitive than single-locus RFLP methods, less prone to allelic dropout than VNTR (variable number of tandem repeat) systems, and more discriminating than other PCR-based typing methods, such as HLA-DQA1. More than 2,000 publications now detail the technology, hundreds of different population groups have been studied, new technologies such as, for example, the miniSTRs have been developed, and standard protocols have been validated in laboratories worldwide (for an overview see ).
Forensic DNA profiling is currently performed using a panel of multi-allelic STR markers which are structurally analogous to the original minisatellites but have much shorter repeat tracts and are thus easier to amplify and multiplex with PCR. Up to 30 STRs can be detected in a single capillary electrophoresis injection, generating for each individual a unique genetic code. Basically there are two sets of STR markers complying with the standards requested by criminal databases around the world: the European standard set of 12 STR markers and the US CODIS standard of 13 markers. Due to partial overlap, together they form a standard of 18 STR markers in total. The incorporation of these STR markers into commercial kits has improved their application to all kinds of DNA evidence, with reproducible results from as few as three nucleated cells and even from severely compromised material. The probability that two individuals will have identical markers at each of 13 different STR loci within their DNA is less than one in a billion. If a DNA match occurs between an accused individual and a crime scene stain, the correct courtroom expression would be that the probability of a match, if the crime-scene sample came from someone other than the suspect (considering a random, not closely related man), is at most one in a billion. The uniqueness of each person's DNA (with the exception of monozygotic twins) and its simple numerical codification led to the establishment of government-controlled criminal investigation DNA databases in the developed nations around the world, the first in 1995 in the UK. When a match from such a DNA database links a crime scene sample to an offender who has provided a DNA sample, that link is often referred to as a cold hit. A cold hit is of value as an investigative lead, pointing the police agency to a specific suspect.
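The "one in a billion" figure follows from the product rule for independently inherited autosomal STRs: per-locus genotype frequencies are estimated from population allele frequencies under Hardy-Weinberg assumptions and multiplied across loci. A minimal sketch, using entirely hypothetical allele frequencies:

```python
# Sketch of the product rule behind random match probability (RMP) figures:
# per-locus genotype frequencies (Hardy-Weinberg) are multiplied across
# independently inherited STR loci. The allele frequencies are hypothetical.

def genotype_frequency(p: float, q: float) -> float:
    """Expected genotype frequency under Hardy-Weinberg equilibrium:
    p^2 for a homozygote, 2pq for a heterozygote."""
    return p * p if p == q else 2 * p * q

def random_match_probability(profile):
    """Multiply per-locus genotype frequencies across independent STR loci.
    Each profile entry is (freq of allele 1, freq of allele 2) at one locus."""
    rmp = 1.0
    for p, q in profile:
        rmp *= genotype_frequency(p, q)
    return rmp

# Hypothetical 13-locus profile: six homozygous and seven heterozygous loci
profile = [(0.1, 0.1)] * 6 + [(0.1, 0.2)] * 7
rmp = random_match_probability(profile)
print(f"Random match probability: 1 in {1 / rmp:.3g}")
```

Even with common alleles at every locus, multiplying 13 modest per-locus frequencies drives the combined probability far below one in a billion, which is why closely related individuals (who violate the independence assumption) must be treated separately.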
China (approximately 16 million profiles), the United States (approximately 10 million profiles), and the UK (approximately 6 million profiles) maintain the largest DNA databases in the world. The percentage of databased persons is on the increase in all countries with a national DNA database, but the proportions are not the same by far: whereas in the UK about 10% of the population is in the national DNA database, the percentages in Germany and the Netherlands are only about 0.9% and 0.8%, respectively.
Multilocus DNA Fingerprint from a large family probed with the oligonucleotide (GTG)5 (Courtesy of Peter Nürnberg, Cologne Center for Genomics, Germany).
Lineage markers in forensic analysis
Lineage markers have special applications in forensic genetics. Y chromosome analysis is very helpful in cases where there is an excess of DNA from a female victim and only a low proportion from a male perpetrator. Typical examples include sexual assault without ejaculation, sexual assault by a vasectomized male, male DNA under the fingernails of a victim, and male 'touch' DNA on the skin, clothing, or belongings of a female victim. Mitochondrial DNA (mtDNA) is of importance for the analysis of samples with low levels of nuclear DNA, namely from unidentified (typically skeletonized) remains, hair shafts without roots, or very old specimens where only heavily degraded DNA is available. The unusual non-recombinant mode of inheritance of Y and mtDNA weakens the statistical weight of a match between individual samples but makes the method efficient for the reconstruction of paternal or maternal relationships, for example in mass disaster investigations or in historical reconstructions. A classic case is the identification of two missing children of the Romanov family, the last Russian imperial family. MtDNA analysis combined with additional DNA testing of material from the mass grave near Yekaterinburg gave virtually irrefutable evidence that the two individuals recovered from a second grave nearby are the two missing children of the Romanov family: the Tsarevich Alexei and one of his sisters. Interestingly, a point heteroplasmy, that is, the presence of two slightly different mtDNA haplotypes within an individual, was found in the mtDNA of the Tsar and his relatives, which was in 1991 a contentious finding (Figure 3). In the early 1990s, when the bones were first analyzed, point heteroplasmy was believed to be an extremely rare phenomenon and was not readily explainable. Today the existence of heteroplasmy is understood to be relatively common, and large population databases can be searched for its frequency at certain positions.
The mtDNA evidence in the Romanov case was underpinned by Y-STR analysis, where a 17-locus haplotype from the remains of Tsar Nicholas II matched exactly that from the femur of the putative Tsarevich and also that of a living Romanov relative. Other studies demonstrated that very distant family branches can be traced back to common ancestors who lived hundreds of years ago. Currently forensic Y chromosome typing has gained wide acceptance with the introduction of highly sensitive panels of up to 27 STRs, including rapidly mutating markers. Figure 4 demonstrates the impressive gain in discriminative power with increasing numbers of Y-STRs. The determination of the match probability between Y-STR or mtDNA profiles via the mostly applied counting method requires large, representative, and quality-assessed databases of haplotypes sampled in appropriate reference populations, because multiplying individual allele frequencies is not valid as it is for independently inherited autosomal STRs. Estimators of the haplotype match probability other than the count estimator have been proposed and evaluated using empirical data; however, the biostatistical interpretation remains complicated and controversial, and research continues. The largest forensic Y chromosome haplotype database is the YHRD (http://www.yhrd.org), hosted at the Institute of Legal Medicine and Forensic Sciences in Berlin, Germany, with about 115,000 haplotypes sampled in 850 populations. The largest forensic mtDNA database is EMPOP (http://www.empop.org), hosted at the Institute of Legal Medicine in Innsbruck, Austria, with about 33,000 haplotypes sampled in 63 countries. To date, more than 235 institutes have submitted data to the YHRD and 105 to EMPOP, a compelling demonstration of the level of networking between forensic science institutes around the world.
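The counting method can be sketched in a few lines. The database size below matches the approximate YHRD figure quoted here, but the haplotype counts are hypothetical, and the zero-observation upper bound (1 - alpha^(1/n)) is shown only as one commonly used convention for reporting a haplotype frequency conservatively:

```python
# Sketch of the counting method for lineage-marker match probabilities:
# the frequency of a Y-STR or mtDNA haplotype is estimated as its count in a
# reference database. Because many haplotypes are never observed, an upper
# confidence bound is typically reported alongside the raw proportion.

def count_estimate(observed: int, n: int) -> float:
    """Raw haplotype frequency estimate: count / database size."""
    return observed / n

def upper_bound_unobserved(n: int, alpha: float = 0.05) -> float:
    """(1 - alpha) upper confidence bound on the frequency of a haplotype
    never seen in a database of n haplotypes: 1 - alpha**(1/n)."""
    return 1.0 - alpha ** (1.0 / n)

n = 115_000  # roughly the database size quoted for the YHRD
print(count_estimate(3, n))       # haplotype observed 3 times (hypothetical)
print(upper_bound_unobserved(n))  # haplotype never observed
```

The bound shrinks as the database grows, which is one reason large, quality-assessed reference collections such as the YHRD and EMPOP matter so much for the evidential weight of lineage markers.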
That additional intelligence is potentially derivable from such large datasets becomes obvious when a target DNA profile is searched against a collection of geographically annotated Y chromosomal or mtDNA profiles. Because linearly inherited markers have a highly non-random geographical distribution, the target profile shares characteristic variants with geographical neighbors due to common ancestry. This link between genetics, genealogy, and geography can provide leads for investigators in cases without a suspect, as illustrated in the following case:
Screenshot of the 16169 C/T heteroplasmy present in Tsar Nicholas II using both forward and reverse sequencing primers (Courtesy of Michael Coble, National Institute of Standards and Technology, Gaithersburg, USA).
Correlation between the number of analyzed Y-STRs and the number of different haplotypes detected in a global population sample of 18,863 23-locus haplotypes.
In 2002, a woman was found with a smashed skull and covered in blood, but still alive, in her Berlin apartment. Her life was saved by intensive medical care. Later she told the police that she had let a man into her apartment, and he had immediately attacked her. The man was subletting the apartment next door. The evidence collected at the scene and in the neighboring apartment included a baseball cap, two towels, and a glass. The evidence was sent to the state police laboratory in Berlin, Germany and was analyzed with conventional autosomal STR profiling. Stains on the baseball cap and on one towel revealed a pattern consistent with that of the tenant, whereas two different male DNA profiles were found on a second bath towel and on the glass. The tenant was eliminated as a suspect because he was absent at the time of the offense, but two unknown men (different in autosomal STRs but identical in Y-STRs) who shared the apartment were suspected. Unfortunately, the apartment had been used by many individuals of both European and African nationalities, so the initial search for the two men was very difficult. The police obtained a court order for Y-STR haplotyping to gain information about the unknown men's population affiliation. Prerequisites for such biogeographic analyses are large reference databases containing Y-STR haplotypes also typed for ancestry-informative single nucleotide polymorphism (SNP) markers from hundreds of different populations. The YHRD proved useful for inferring the population origin of the unknown men. The database inquiry indicated a patrilineage of Southern European ancestry, whereas an African descent was unlikely (Figure 5). The police were able to track down the tenant in Italy and, with his help, establish the identity of one of the unknown men, who was also Italian. When questioning this man, the police used the information retrieved from Y-STR profiling that he had shared the apartment in Berlin with a paternal relative.
This relative was identified as his nephew. Because of the close-knit relationship within the family, this information would probably not have been easily retrieved from the uncle without this prior knowledge. The nephew was suspected of the attempted murder in Berlin. He was later arrested in Italy, where he had committed another violent robbery.
Screenshot from the YHRD depicting the radiation of a 9-locus haplotype belonging to haplogroup J in Southern Europe.
Information on the biogeographic origin of an unknown DNA sample can also be retrieved from a number of ancestry-informative SNPs (AISNPs) on autosomes or insertion/deletion polymorphisms [37,38], but perhaps even better from so-called mini-haplotypes with <10 SNPs spanning small molecular intervals (<10 kb) with very low recombination among sites. Each 'minihap' behaves like a locus with multiple haplotype lineages (alleles) that have evolved from the ancestral human haplotype. All copies of each distinct haplotype are essentially identical by descent. Thus they fall, like Y and mtDNA, into the lineage-informative category of genetic markers and are useful for connecting an individual to a family or ancestral genetic pool.
Benefits and risks of forensic DNA databases
The steady growth in the size of forensic DNA databases raises issues regarding the criteria for inclusion and retention, and doubts about the efficiency, proportionality, and privacy infringement of such large personal data collections. In contrast to the past, not only serious crimes but all crimes are subject to DNA analysis, generating millions upon millions of DNA profiles, many of which are stored and continuously searched in national DNA databases. And, as always when big datasets are gathered, new mining procedures based on correlation become feasible. For example, 'familial DNA database searching' is based on near matches between a crime stain and a databased person, who could be a near relative of the true perpetrator. The first successful familial search was conducted in the UK in 2004 and led to the conviction of Craig Harman of manslaughter; Harman was convicted because of a partial match to his brother's profile. The strategy was subsequently applied in some US states but is not conducted at the national level. It was during a dragnet that it first became public knowledge that the German police were also already involved in familial search strategies. In a little town in Northern Germany, the police arrested a young man accused of rape because they had analyzed the DNA of his two brothers, who had participated in the dragnet; partial matches between the crime scene DNA profiles and these brothers identified the suspect. In contrast to other countries, the Federal Constitutional Court of Germany decided in December 2012 against the future court use of this kind of evidence.
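The near-match principle behind familial searching can be illustrated with a simplified allele-sharing screen. Real familial searching relies on kinship likelihood ratios rather than raw allele counts, and the profiles below are hypothetical:

```python
# Illustrative-only sketch of the near-match idea behind familial searching:
# close relatives of the true donor share many alleles with a crime-scene
# profile, so database profiles with unusually high allele sharing can flag
# candidate relatives. Actual casework uses kinship likelihood ratios; this
# allele-sharing count and the profiles below are simplified examples.

def shared_alleles(profile_a, profile_b):
    """Count alleles shared between two STR profiles, locus by locus.
    Each profile is a list of (allele1, allele2) tuples in the same locus order."""
    total = 0
    for (a1, a2), (b1, b2) in zip(profile_a, profile_b):
        remaining = [b1, b2]
        for allele in (a1, a2):
            # match each of a's alleles to b's without reusing an allele
            if allele in remaining:
                remaining.remove(allele)
                total += 1
    return total

crime_scene = [(12, 14), (8, 9), (15, 15), (10, 11)]
unrelated   = [(11, 13), (7, 10), (14, 16), (9, 12)]
sibling     = [(12, 15), (8, 9), (15, 17), (10, 11)]

print(shared_alleles(crime_scene, unrelated))  # prints 0: low sharing
print(shared_alleles(crime_scene, sibling))    # prints 6: candidate relative
```

With only four loci such a screen would flood investigators with chance near matches; across a full database the same trade-off between candidate-list size and the risk of missing the true relative is what makes familial searching both powerful and contested.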
Civil rights and liberties are crucial for democratic societies, and plans to extend forensic DNA databases to whole populations need to be condemned. Alec Jeffreys questioned early on the way the UK police collect DNA profiles, holding not only convicted individuals but also arrestees without conviction, suspects cleared in an investigation, and even innocent people never charged with an offence. He also criticized that large national databases such as the NDNAD of England and Wales are likely skewed socioeconomically. It has been pointed out that most matches refer to minor offences; according to GeneWatch, in Germany 63% of the database matches provided are related to theft while <3% are related to rape and murder. The changes to the UK database came with the 2012 Protection of Freedoms bill, following a major defeat at the European Court of Human Rights in 2008. As of May 2013, 1.1 million profiles (of about 7 million) had been destroyed to remove innocent people's profiles from the database. In 2005 the incoming government of Portugal proposed a DNA database containing samples from every Portuguese citizen; following public objections, the government limited the database to criminals. A recent study on public views on DNA database-related matters showed that a more critical attitude towards wider national databases correlates with the age and education of the respondents. A deeper public awareness of the benefits and risks of very large DNA collections needs to be built, and common ethical and privacy standards for the development and governance of DNA databases need to be adopted, taking citizens' perspectives into consideration.
The future of forensic DNA analysis
The forensic community, as it always has, is facing the question of the direction in which DNA fingerprinting technology should develop. A growing number of colleagues are convinced that DNA sequencing will soon replace methods based on fragment length analysis, and there are good arguments for this position. With the emergence of current next generation sequencing (NGS) technologies, the body of forensically useful data can potentially be expanded and analyzed quickly and cost-efficiently. Given the enormous number of potentially informative DNA loci, which of these should be sequenced? In my opinion there are four types of polymorphisms which deserve a place on the analytic device: an array of 20–30 autosomal STRs which complies with the standard sets used in the national and international databases around the world, a highly discriminating set of Y chromosomal markers, individual and signature polymorphisms in the control and coding regions of the mitochondrial genome, as well as ancestry- and phenotype-inference SNPs. Indeed, a promising NGS approach with the simultaneous analysis of 10 STRs, 386 autosomal ancestry- and phenotype-informative SNPs, and the complete mtDNA genome has been presented recently (Figure 6). Currently, rather high error rates prevent NGS technologies from being used in forensic routine, but it is foreseeable that the technology will be improved in terms of accuracy and reliability. Time is another essential factor in police investigations, and it will be considerably reduced in future applications of DNA profiling. Commercial instruments capable of producing a database-compatible DNA profile within 2 hours exist and are currently under validation for law enforcement use. The hands-free 'swab in - profile out' process consists of automated extraction, amplification, separation, detection, and allele calling without human intervention.
In the US, the promise of on-site DNA analysis has already altered the way in which DNA may be collected in the future. In a recent decision, the Supreme Court of the United States held that 'when officers make an arrest supported by probable cause to hold for a serious offense and bring the suspect to the station to be detained in custody, taking and analyzing a cheek swab of the arrestee's DNA is, like fingerprinting and photographing, a legitimate police booking procedure' (Maryland v. Alonzo Jay King, Jr.). In other words, DNA can be taken from any arrestee, rightly or wrongly arrested, as part of the normal booking procedure. Twenty-eight states and the federal government now take DNA swabs after arrests with the aim of comparing profiles to the CODIS database, creating links to unsolved cases and identifying the person (Associated Press, 3 June 2013). Driven by rapid technological progress, DNA is in fact becoming another metric of quick identification. It remains to be seen whether rapid DNA technologies will alter the way in which DNA is collected by police in other countries. In Germany, for example, DNA collection is still regulated by the code of criminal procedure, and the use of DNA profiling for identification purposes only is excluded. Because national legislations are so different, a worldwide system to interrogate DNA profiles from criminal justice databases currently seems a very distant project.
Schematic overview of Haloplex targeting and NGS analysis of a large number of markers simultaneously. Sequence data are shown for samples from two individuals and the D3S1358 STR marker, the rs1335873 SNP marker, and a part of the HVII region of mtDNA...
At present, forensic DNA technology directly affects the lives of millions of people worldwide. The general acceptance of the technique is still high; reports on the DNA identification of victims of the 9/11 terrorist attacks, of natural disasters such as Hurricane Katrina, and of recent wars (for example, in the former Yugoslavia) and dictatorships (for example, in Argentina) impress the public in the same way as police investigators in white suits securing DNA evidence at a broken door. CSI watchers know, and even professionals believe, that DNA will inevitably solve the case, just following the motto: Do not ask, it's DNA, stupid! But the affirmative view is changing and critical questions are being raised. It should not be assumed that the benefits of forensic DNA fingerprinting will necessarily override the social and ethical costs.
This short article leaves many such questions unanswered. Alfred Nobel used his fortune to institute a prize for work 'in an ideal direction'. What would be the ideal direction in which DNA fingerprinting, one of the great discoveries of recent history, should be developed?
The author declares that he has no competing interests.
- Doyle AC. A study in scarlet, Beeton’s Christmas Annual. London, New York and Melbourne: Ward, Lock & Co; 1887.
- Jeffreys AJ, Wilson V, Thein SL. Individual-specific “fingerprints” of human DNA. Nature. 1985;316:76–79. doi: 10.1038/316076a0.
- Jeffreys AJ, Brookfield JF, Semeonoff R. Positive identification of an immigration test-case using human DNA fingerprints. Nature. 1985;317:818–819. doi: 10.1038/317818a0.
- University of Leicester Bulletin Supplement August/September 2004.
- Jeffreys AJ. Foreword. Fingerprint News. 1989;1:1.
- Lander ES. DNA fingerprinting on trial. Nature. 1989;339:501–505. doi: 10.1038/339501a0.
- Balding DJ. Evaluation of mixed-source, low-template DNA profiles in forensic science. Proc Natl Acad Sci U S A. 2013;110:12241–12246. doi: 10.1073/pnas.1219739110.
- The Innocence Project. [http://www.innocenceproject.org]
- Jeffreys AJ, Wilson V, Thein SL. Hypervariable 'minisatellite’ regions in human DNA. Nature. 1985;314:67–73. doi: 10.1038/314067a0.
- Schäfer R, Zischler H, Birsner U, Becker A, Epplen JT. Optimized oligonucleotide probes for DNA fingerprinting. Electrophoresis. 1988;9:369–374. doi: 10.1002/elps.1150090804.
- Budowle B, Giusti AM, Waye JS, Baechtel FS, Fourney RM, Adams DE, Presley LA, Deadman HA, Monson KL. Fixed-bin analysis for statistical evaluation of continuous distributions of allelic data from VNTR loci, for use in forensic comparisons. Am J Hum Genet. 1991;48:841–855.
- Roewer L, Nürnberg P, Fuhrmann E, Rose M, Prokop O, Epplen JT. Stain analysis using oligonucleotide probes specific for simple repetitive DNA sequences. Forensic Sci Int. 1990;47:59–70. doi: 10.1016/0379-0738(90)90285-7.
- Wong Z, Wilson V, Patel I, Povey S, Jeffreys AJ. Characterization of a panel of highly variable minisatellites cloned from human DNA. Ann Hum Genet. 1987;51:269–288. doi: 10.1111/j.1469-1809.1987.tb01062.x.
- Jobling MA, Hurles ME, Tyler-Smith C. Human Evolutionary Genetics. Abingdon: Garland Science; 2003. Chapter 15: Identity and identification; pp. 474–497.
- Edwards A, Civitello A, Hammond HA, Caskey CT. DNA typing and genetic mapping with trimeric and tetrameric tandem repeats. Am J Hum Genet. 1991;49:746–756.
- Budowle B, Chakraborty R, Giusti AM, Eisenberg AJ, Allen RC. Analysis of the VNTR locus D1S80 by the PCR followed by high-resolution PAGE. Am J Hum Genet. 1991;48:137–144.
- Saiki RK, Bugawan TL, Horn GT, Mullis KB, Erlich HA. Analysis of enzymatically amplified beta-globin and HLA-DQ alpha DNA with allele-specific oligonucleotide probes. Nature. 1986;324:163–166. doi: 10.1038/324163a0.
- Coble MD, Butler JM. Characterization of new miniSTR loci to aid analysis of degraded DNA. J Forensic Sci. 2005;50:43–53.
- Butler JM. Forensic DNA Typing: Biology, Technology, and Genetics of STR Markers. 2. New York: Elsevier Academic Press; 2005.
- Gill P, Fereday L, Morling N, Schneider PM. The evolution of DNA databases - Recommendations for new European STR loci. Forensic Sci Int. 2006;156:242–244. doi: 10.1016/j.forsciint.2005.05.036.
- Budowle B, Moretti TR, Niezgoda SJ, Brown BL. CODIS and PCR-based short tandem repeat loci: law enforcement tools. Madison, WI: Promega Corporation; 1998. pp. 73–88. (Proceedings of the Second European Symposium on Human Identification).
- Nagy M, Otremba P, Krüger C, Bergner-Greiner S, Anders P, Henske B, Prinz M, Roewer L. Optimization and validation of a fully automated silica-coated magnetic beads purification technology in forensics. Forensic Sci Int. 2005;152:13–22. doi: 10.1016/j.forsciint.2005.02.027.
- Martin PD, Schmitter H, Schneider PM. A brief history of the formation of DNA databases in forensic science within Europe. Forensic Sci Int. 2001;119:225–231. doi: 10.1016/S0379-0738(00)00436-9.
- ENFSI survey on DNA Databases in Europe. December 2011, published 2012-08-18. [http://www.enfsi.eu]
- Roewer L, Parson W. In: Encyclopedia of Forensic Sciences. 2. Siegel JA, Saukko PJ, editor. Amsterdam: Elsevier B.V; 2013. Internet accessible population databases: YHRD and EMPOP.
- Calacal GC, Delfin FC, Tan MM, Roewer L, Magtanong DL, Lara MC, Rd F, De Ungria MC. Identification of exhumed remains of fire tragedy victims using conventional methods and autosomal/Y-chromosomal short tandem repeat DNA profiling. Am J Forensic Med Pathol. 2005;26:285–291. doi: 10.1097/01.paf.0000177338.21951.82.
- Coble MD, Loreille OM, Wadhams MJ, Edson SM, Maynard K, Meyer CE, Niederstätter H, Berger C, Berger B, Falsetti AB, Gill P, Parson W, Finelli LN. Mystery solved: the identification of the two missing Romanov children using DNA analysis. PLoS One. 2009;4:e4838. doi: 10.1371/journal.pone.0004838.
- Haas C, Shved N, Rühli FJ, Papageorgopoulou C, Purps J, Geppert M, Willuweit S, Roewer L, Krawczak M. Y-chromosomal analysis identifies the skeletal remains of Swiss national hero Jörg Jenatsch (1596–1639). Forensic Sci Int Genet. 2013;7:610–617. doi: 10.1016/j.fsigen.2013.08.006.
- Ballantyne KN, Keerl V, Wollstein A, Choi Y, Zuniga SB, Ralf A, Vermeulen M, de Knijff P, Kayser M. A new future of forensic Y-chromosome analysis: rapidly mutating Y-STRs for differentiating male relatives and paternal lineages. Forensic Sci Int Genet. 2012;6:208–218. doi: 10.1016/j.fsigen.2011.04.017.
- Budowle B, Sinha SK, Lee HS, Chakraborty R. Utility of Y-chromosome short tandem repeat haplotypes in forensic applications. Forensic Sci Rev. 2003;15:153–164.
- Roewer L, Kayser M, de Knijff P, Anslinger K, Betz A, Caglià A, Corach D, Füredi S, Henke L, Hidding M, Kärgel HJ, Lessig R, Nagy M, Pascali VL, Parson W, Rolf B, Schmitt C, Szibor R, Teifel-Greding J, Krawczak M. A new method for the evaluation of matches in non-recombining genomes: application to Y-chromosomal short tandem repeat (STR) haplotypes in European males. Forensic Sci Int. 2000;114:31–43. doi: 10.1016/S0379-0738(00)00287-5.
- Andersen MM, Caliebe A, Jochens A, Willuweit S, Krawczak M. Estimating trace-suspect match probabilities for singleton Y-STR haplotypes using coalescent theory. Forensic Sci Int Genet. 2013;7:264–271. doi: 10.1016/j.fsigen.2012.11.004.
- Willuweit S, Roewer L. International Forensic Y Chromosome User Group. Y chromosome haplotype reference database (YHRD): update. Forensic Sci Int Genet. 2007;1:83–87. doi: 10.1016/j.fsigen.2007.01.017.
- Parson W, Dür A. EMPOP - a forensic mtDNA database. Forensic Sci Int Genet. 2007;1:88–92. doi: 10.1016/j.fsigen.2007.01.018.
- Roewer L, Croucher PJ, Willuweit S, Lu TT, Kayser M, Lessig R, de Knijff P, Jobling MA, Tyler-Smith C, Krawczak M. Signature of recent historical events in the European Y-chromosomal STR haplotype distribution. Hum Genet. 2005;116:279–291. doi: 10.1007/s00439-004-1201-z.
- Roewer L. Male DNA Fingerprints say more. Profiles in DNA. 2004;7:14–15.
- Phillips C, Fondevila M, Lareu MV. A 34-plex autosomal SNP single base extension assay for ancestry investigations. Methods Mol Biol. 2012;830:109–126. doi: 10.1007/978-1-61779-461-2_8.
- Pereira R, Phillips C, Pinto N, Santos C, dos Santos SE, Amorim A, Carracedo A, Gusmão L. Straightforward inference of ancestry and admixture proportions through ancestry-informative insertion deletion multiplexing. PLoS One. 2012;7:e29684. doi: 10.1371/journal.pone.0029684.
- Pakstis AJ, Fang R, Furtado MR, Kidd JR, Kidd KK. Mini-haplotypes as lineage informative SNPs and ancestry inference SNPs. Eur J Hum Genet. 2012;20:1148–1154. doi: 10.1038/ejhg.2012.69.
- Maguire CN, McCallum LA, Storey C, Whitaker JP. Familial searching: A specialist forensic DNA profiling service utilising the National DNA Database® to identify unknown offenders via their relatives - The UK experience. Forensic Sci Int Genet. 2013;8:1–9.
- Jeffreys A. Genetic Fingerprinting. Nat Med. 2005;11:1035–1039. doi: 10.1038/nm1005-1035.
- Machado H, Silva S. Would you accept having your DNA profile inserted in the National Forensic DNA database? Why? Results of a questionnaire applied in Portugal. Forensic Sci Int Genet. 2013. Epub ahead of print.
- Parson W, Strobl C, Huber G, Zimmermann B, Gomes SM, Souto L, Fendt L, Delport R, Langit R, Wootton S, Lagacé R, Irwin J. Evaluation of next generation mtGenome sequencing using the Ion Torrent Personal Genome Machine (PGM). Forensic Sci Int Genet. 2013;7:632–639. doi: 10.1016/j.fsigen.2013.09.007.
- Budowle B, van Daal A. Forensically relevant SNP classes. Biotechniques. 2008;44:603–608, 610.
- Allen M, Nilsson M, Havsjö M, Edwinsson L, Granemo J, Bjerke M. Haloplex and MiSeq NGS for simultaneous analysis of 10 STRs, 386 SNPs and the complete mtDNA genome. Presentation at the 25th Congress of the International Society for Forensic Genetics, Melbourne, 2–7 September 2013.
- Bandelt HJ, Salas A. Current next generation sequencing technology may not meet forensic standards. Forensic Sci Int Genet. 2012;6:143–145. doi: 10.1016/j.fsigen.2011.04.004.
- Tan E, Turingan RS, Hogan C, Vasantgadkar S, Palombo L, Schumm JW, Selden RF. Fully integrated, fully automated generation of short tandem repeat profiles. Investigative Genet. 2013;4:16. doi: 10.1186/2041-2223-4-16.
- Biesecker LG, Bailey-Wilson JE, Ballantyne J, Baum H, Bieber FR, Brenner C, Budowle B, Butler JM, Carmody G, Conneally PM, Duceman B, Eisenberg A, Forman L, Kidd KK, Leclair B, Niezgoda S, Parsons TJ, Pugh E, Shaler R, Sherry ST, Sozer A, Walsh A. DNA Identifications after the 9/11 World Trade Center Attack. Science. 2005;310:1122–1123. doi: 10.1126/science.1116608.
- Dolan SM, Saraiya DS, Donkervoort S, Rogel K, Lieber C, Sozer A. The emerging role of genetics professionals in forensic kinship DNA identification after a mass fatality: lessons learned from Hurricane Katrina volunteers. Genet Med. 2009;11:414–417. doi: 10.1097/GIM.0b013e3181a16ccc.
- Huffine E, Crews J, Kennedy B, Bomberger K, Zinbo A. Mass identification of persons missing from the break-up of the former Yugoslavia: structure, function, and role of the International Commission on Missing Persons. Croat Med J. 2001;42:271–275.
- Corach D, Sala A, Penacino G, Iannucci N, Bernardi P, Doretti M, Fondebrider L, Ginarte A, Inchaurregui A, Somigliana C, Turner S, Hagelberg E. Additional approaches to DNA typing of skeletal remains: the search for “missing” persons killed during the last dictatorship in Argentina. Electrophoresis. 1997;18:1608–1612. doi: 10.1002/elps.1150180921.
- Levitt M. Forensic databases: benefits and ethical and social costs. Br Med Bull. 2007;83:235–248. doi: 10.1093/bmb/ldm026.
Articles from Investigative Genetics are provided here courtesy of BioMed Central
Among the many benefits of the Human Genome Project are new and powerful tools such as the genome-wide hybridization devices referred to as microarrays. Initially designed to measure gene transcriptional levels, microarray technologies are now used for comparing other genome features among individuals and their tissues and cells. The results provide valuable information on disease subcategories, disease prognosis, and treatment outcome. Likewise, by revealing differences in genetic makeup, regulatory mechanisms, and subtle variations among individuals, they are bringing us closer to the era of personalized medicine. To convey the power and versatility of this tool, and how it is dramatically changing the molecular approach to biomedical and clinical research, this review describes the technology, its applications, a didactic step-by-step review of a typical microarray protocol, and a real experiment. Finally, it calls on the medical community to assemble multidisciplinary teams that can take advantage of this technology and its expanding applications, which on a single slide reveal our genetic inheritance and destiny.
Keywords: Biological Marker, Gene Expression Regulation, Gene Dosage, DNA Methylation, Pathogen Detection
Genomics approaches have changed the way we do research in biology and medicine. Nowadays, we can measure the majority of mRNAs, proteins, metabolites, protein-protein interactions, genomic mutations, polymorphisms, epigenetic alterations, and microRNAs in a single experiment. The data generated by these methods, together with the knowledge derived from their analyses, were unimaginable just a few years ago. These techniques, however, produce such amounts of data that making sense of them is a difficult task. So far, DNA microarray technologies are perhaps the most successful and mature methodology for high-throughput and large-scale genomic analyses.
DNA microarray technologies were initially designed to measure the transcriptional levels of RNA transcripts derived from thousands of genes within a genome in a single experiment. This technology has made it possible to relate physiological cell states to gene expression patterns in the study of tumors, disease progression, cellular responses to stimuli, and drug target identification. For example, subsets of genes with increased and decreased activities (referred to as transcriptional profiles or gene expression “signatures”) have been identified for acute lymphoblastic leukaemia (1), breast cancer (2), prostate cancer (3), lung cancer (4), colon cancer (5), multiple tumour types (6), apoptosis-induction (7), tumorigenesis (8), and drug response (9). Moreover, because the amount of published data increases every day, integrated analyses of several studies, or “meta-analyses”, have been proposed in the literature (10). These approaches detect generalities and particularities of gene expression in diseases.
More recent uses of DNA microarrays in biomedical research are not limited to gene expression. DNA microarrays are being used to detect single nucleotide polymorphisms (SNPs) of our genome (HapMap project) (11), aberrations in methylation patterns (12), alterations in gene copy-number (13), alternative RNA splicing (14), and pathogen detection (15, 16).
In the last 10 to 15 years, high quality arrays, standardized hybridization protocols, accurate scanning technologies, and robust computational methods have established DNA microarrays for gene expression as a powerful, mature, and easy-to-use genomic tool. Although the identification of the most relevant information from microarray experiments is still under active research, well established methods are available for a broad spectrum of experimental setups. In this publication we present the most common uses of DNA microarray technologies, provide an overview of their frequent biomedical applications, describe the steps of a typical laboratory procedure, guide the reader through the processing of a real experiment to detect differentially expressed genes, and list valuable web-based microarray data and software repositories.
2. Technology description
It is well known that complementary single-stranded nucleic acid sequences form double-stranded hybrids. This property is the basis of powerful molecular biology tools such as Southern and Northern blots, in situ hybridization, and the Polymerase Chain Reaction (PCR), in which specific single-stranded DNA sequences are used to probe for their complementary sequence (DNA or RNA), forming hybrids. The same idea underlies DNA microarray technologies. The aim, however, is not only to detect but also to measure the expression levels of not a few but thousands of genes in the same experiment. For this purpose, thousands of single-stranded sequences complementary to target sequences are bound, synthesized, or spotted onto a glass support similar in size to a typical microscope slide. There are mainly two types of DNA arrays, depending on the type of spotted probes. One uses short single-stranded oligonucleotides (~25 nt) synthesized in situ, whose leading provider is Affymetrix®. The other type of array uses complementary DNA (cDNA) obtained by reverse transcription of the genes’ messenger RNAs (mRNAs), completion of the second strand, cloning of the double-stranded DNAs, and typically PCR amplification of their open reading frames (ORFs), which become the bound probes. One limitation of using large ORF or cDNA sequences is an uneven optimal melting temperature caused by differences in their sizes and GC content. A second problem is cross-hybridization of closely related sequences, overlapping genes, and splicing variants. In oligo-based DNA arrays, the targeted nucleic acid species is redundantly detected by designing several complementary oligonucleotides spanning each entire target sequence by segments. The oligonucleotides are designed so as to avoid the drawbacks of cDNA probes and to maximize specificity for the target gene. Initially, DNA arrays were based on nylon membranes, which are still in use.
However, glass provides an excellent support for attaching the nucleotide sequences, is less sensitive to light than membranes, and is non-porous, allowing the use of very small amounts of sample. A more recent and different technology uses designed oligonucleotide probes attached to beads that are deposited randomly on a support. The position of each bead, and hence the sequence it carries, is determined by a complex pseudo-sequencing process. These arrays, provided by Illumina® (www.illumina.com), are mainly used for genotyping, copy-number measurements, sequencing, and the detection of loss of heterozygosity (LOH), allele-specific expression, and methylation. A recent review of this technology has been published elsewhere (17). For clinical research, however, the preferred technology so far is the oligo-based microarray, whose leading provider is Affymetrix®.
The general process of a microarray experiment is depicted in figure 1. Fluorescent dyes are used to label the extracted mRNAs or amplified cDNAs from the tissue or cell samples to be analyzed. The DNA array is then hybridized with the labeled sample(s) by incubating, usually overnight, and then washed to remove non-specific hybrids. A laser excites the attached fluorescent dyes, producing light that is detected by a (confocal) scanner, which generates a digital image of the excited microarray. The digital image is further processed by specialized software to transform the image of each spot into a numerical reading: first the specific location and shape of each spot are found, then the intensities inside the defined spot are integrated (summed), and finally the surrounding background noise is estimated. Background noise is generally subtracted from the integrated signal. This final reading is an integer value assumed to be proportional to the concentration of the target sequence to which the probe in the spot is directed. In competitive two-dye assays, the reading is transformed into a ratio representing the relative abundance of the target sequence in a sample (labeled with one fluorochrome) with respect to a reference sample (labeled with another fluorochrome). In the one-dye Affymetrix technology the fluorescence is commonly yellow, whereas in two-dye technologies the colors used are green for the reference and red for the sample (although a replicate using a dye-swap is common). The choice of the most appropriate technology depends on the experimental design, availability, costs, and the expected number of expression changes. In general, when only a minority of genes is expected to change, a two-dye or reference design is more suitable; otherwise, a one-dye technology may be more appropriate.
Schematic representation of a gene expression microarray assay.
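The image-to-number step described above can be sketched in a few lines of Python. The function name and pixel intensities below are purely illustrative: a minimal sketch of turning background-subtracted foreground readings from a two-dye spot into a log2 sample/reference ratio, clamping negative background-corrected values to a small floor.

```python
import numpy as np

def spot_to_log2_ratio(red_fg, red_bg, green_fg, green_bg, floor=1.0):
    """Background-subtracted two-dye reading for one spot:
    log2(sample / reference), clamping negatives to a small floor."""
    red = max(red_fg - red_bg, floor)        # sample channel
    green = max(green_fg - green_bg, floor)  # reference channel
    return np.log2(red / green)

# hypothetical integrated pixel intensities (fg_red, bg_red, fg_green, bg_green)
spots = [(5200, 200, 2600, 150),   # ~2-fold up in the sample
         (1500, 180, 1450, 170),   # essentially unchanged
         (1250, 150, 2350, 150)]   # ~2-fold down
for fg_r, bg_r, fg_g, bg_g in spots:
    print(round(spot_to_log2_ratio(fg_r, bg_r, fg_g, bg_g), 2))
```

In a real pipeline the foreground and background values come from the image-segmentation software; the log2 transform is what makes up- and down-regulation symmetric around zero.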
Finally, an important issue in statistical testing of microarray data is the real significance of results and the concomitant need to correct for multiple testing. When applying a t-test, for example, the result is the probability that the observed difference arises by chance; commonly, we call a result significant when this probability is smaller than 5%. For large-scale data, a t-test is performed thousands of times (once per gene), which means that from 10,000 t-tests at a 5% significance level we would call 500 genes differentially expressed merely by chance, a number close to, or even higher than, the number of genes actually selected in experiments. Therefore, a correction to control for false positives should be performed. The most common correction method is the False Discovery Rate, proposed originally by Benjamini and Hochberg (18) and extended by Storey and Tibshirani (19).
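The Benjamini-Hochberg procedure cited above is simple to implement. The following is a minimal Python sketch, with hypothetical simulated p-values (100 planted "true signals" among 9,900 nulls); the function name is our own.

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg FDR adjustment: returns a q-value per p-value."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)                         # sort p-values ascending
    ranked = p[order] * m / np.arange(1, m + 1)   # p_(i) * m / i
    # enforce monotonicity from the largest rank downwards
    qvals = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(qvals, 0, 1)
    return out

# 10,000 simulated tests: 9,900 uniform nulls plus 100 very small p-values
rng = np.random.default_rng(0)
pvals = np.concatenate([rng.uniform(size=9900), rng.uniform(0, 1e-6, size=100)])
q = benjamini_hochberg(pvals)
print("genes with q < 0.05:", (q < 0.05).sum())  # roughly the 100 true signals
```

Calling genes at q < 0.05 then controls the expected proportion of false positives among the calls at 5%, rather than the per-test error rate.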
3. Applications in biomedical research
The ultimate output of any microarray assay, independently of the technology, is a measure, for each gene or probe, of the relative abundance of the complementary target in the examined sample. In this section we review the most common applications of data derived from clinical studies using microarrays, irrespective of the technology employed.
a. Relating Gene Expression to Physiology: Differentially Expressed Genes
The most common and basic question in DNA microarray experiments is whether genes appear to be down-regulated (expression has decreased) or up-regulated (expression has increased) between two or more groups of samples. This type of analysis is essential because it provides the simplest characterization of the specific molecular differences associated with a given biological effect. These signatures can be used to generate new hypotheses and guide the design of further experiments. A statistical test is used to assess, for each gene, whether its expression differs significantly between the groups of samples (figure 2). When comparing populations of individuals, a large number of samples per class is needed to avoid interference from inter-individual variation rather than the experimental grouping. For laboratory-controlled samples, such as cell lines or strains, at least three biological replicates are recommended to compute a good estimate of the variance, and hence of the statistical confidence (more replicates give more confidence and fewer false positives). Using a statistical technique called power analysis, it is possible to estimate the number of samples required to identify a high percentage of truly differentially regulated genes. Although this approach is common practice in the design of biological experiments, it is still not widespread in the microarray community.
Detection of differentially expressed genes.
To detect differentially expressed genes, both intuitive and formal statistical approaches have been proposed. The best-known intuitive approach, proposed in early microarray studies, is the fold change in fluorescence intensity (20, 21), expressed as the logarithm (base 2, or log2) of the sample divided by the reference (ratio). In this way, a fold change equal to 1 means that the expression level has increased two-fold (up-regulation), a fold change equal to −1 means that it has decreased two-fold (down-regulation), and 0 means that it has not changed; larger absolute values correspond to larger fold changes. Genes whose fold change exceeds a certain (arbitrary) value are selected for further analysis. Although fold change is a very useful measure, the weaknesses of this criterion are the overestimation for genes weakly expressed in the reference (denominators close to zero tend to inflate the ratio), the subjectivity of the value that defines a “significant” change, and the tendency to miss small but significant changes in expression levels. For these reasons, nowadays the most sensible option is to follow formal statistical approaches to select differentially expressed genes: for two groups of samples, the common t-test is the easiest option, although not the best, for analyzing two-dye microarrays whose log2 ratios generate normal-like distributions after normalization (see next section), and the ANOVA (analysis of variance) test for more than two groups. These options apply to both one- and two-dye microarrays. If the data do not follow a normal distribution, Wilcoxon or Mann-Whitney tests may be applied. A comparison of statistical tests for differential expression, including the t-test, has been published elsewhere (22).
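The two criteria just described, log2 fold change and a per-gene t-test, can be combined in a few lines. This is a minimal sketch on simulated data (the expression values and group sizes are hypothetical), using SciPy's two-sample t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# hypothetical log2 expression values: 6 tumour vs. 6 normal samples, 3 genes;
# gene 0 is truly up-regulated (~2 log2 units), genes 1-2 are unchanged
tumour = rng.normal(loc=[8.0, 5.0, 7.0], scale=0.3, size=(6, 3))
normal = rng.normal(loc=[6.0, 5.0, 7.0], scale=0.3, size=(6, 3))

log2_fc = tumour.mean(axis=0) - normal.mean(axis=0)  # difference of log2 means = log2 ratio
t_stat, p = stats.ttest_ind(tumour, normal, axis=0)  # per-gene two-sample t-test

for i, (fc, pv) in enumerate(zip(log2_fc, p)):
    print(f"gene {i}: log2FC = {fc:+.2f}, p = {pv:.2g}")
```

In practice the per-gene p-values produced this way must still be corrected for multiple testing, as discussed in the previous section.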
The approaches we have described are univariate; that is, each gene is tested one at a time, independently of any other gene. There are, however, multivariate procedures in which genes are tested in combination rather than in isolation. While more powerful (23–26), these approaches require a more complex analysis.
b. Biomarker Detection: Supervised Classification
Disease type and severity are often determined by expert physicians or pathologists on the basis of patient symptoms or by analyzing features of the diseased tissue obtained by biopsy. This categorization may guide the choice of appropriate pharmacological or surgical therapy. In this context, the availability of molecular markers associated with clinical outcome has been useful in enabling the monitoring of disease onset at a very early stage and in complementing clinical and histo-pathological analysis. The more recent application of DNA microarrays in clinical research has been a very important step towards the development of more complex markers based on multi-gene signatures. The identification of gene expression “signatures” associated with disease categories is called biomarker detection or supervised classification (figure 3).
The fundamental difference between identifying differentially expressed genes and identifying a set of genes of real diagnostic or prognostic value is that a biomarker needs to be predictive of disease class or clinical outcome. For this reason it must be possible to associate, with a given set of marker genes, a rule that allows deciding the identity of an unknown sample. The classification accuracy of the biomarker also needs to be determined with robust statistical procedures. Therefore, during the biomarker selection procedure, a substantial fraction of the samples is set aside to evaluate independently the accuracy of the selected biomarkers (in terms of sensitivity and specificity). Thus, such studies require a relatively large number of samples.
We already explained that, unlike differential expression analysis, biomarker selection for diagnostics requires a rule to make predictions. This rule is generated by a classifier: a statistical model that assigns a sample to a certain category based on gene expression values. For example, a sensible classifier for diabetes is whether the sugar level in serum reaches a certain value. In statistics, this classifier is referred to as univariate; that is, only one variable (sugar level) is used in the rule. Nevertheless, in DNA microarray studies it is common to obtain a large list of genes useful for disease discrimination. Multiple genes provide robustness in the estimation and capture potential synergy between genes; therefore, multivariate classifiers are commonly used. For example, it is well known that obesity and parental predisposition to diabetes, in addition to serum sugar levels, provide more precise diagnostic criteria for diabetes. Multivariate classifiers can be designed using genes selected either by a univariate method such as the t-test, ANOVA, Wilcoxon, PAM (27), or Golub’s centroid (1), or by a multivariate method (23–26).
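A toy multivariate classifier in the spirit of the centroid methods cited above can illustrate the train/held-out split described earlier. Everything below is simulated and the function names are our own; this is a sketch of a nearest-centroid rule, not any published method:

```python
import numpy as np

def train_centroids(X, y):
    """Nearest-centroid classifier: store the mean expression profile per class."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict(centroids, X):
    """Assign each sample to the class with the closest centroid (Euclidean)."""
    labels = list(centroids)
    dists = np.stack([np.linalg.norm(X - centroids[l], axis=1) for l in labels])
    return np.array(labels)[dists.argmin(axis=0)]

rng = np.random.default_rng(2)
# hypothetical training cohort: 20 samples x 50 genes, two disease classes
X_train = np.vstack([rng.normal(1.5, 1, (10, 50)), rng.normal(-1.5, 1, (10, 50))])
y_train = np.array([0] * 10 + [1] * 10)
# held-out samples, set aside to estimate accuracy independently
X_test = np.vstack([rng.normal(1.5, 1, (5, 50)), rng.normal(-1.5, 1, (5, 50))])
y_test = np.array([0] * 5 + [1] * 5)

model = train_centroids(X_train, y_train)
acc = (predict(model, X_test) == y_test).mean()
print(f"held-out accuracy: {acc:.2f}")
```

The key design point is that `X_test` never enters training: sensitivity and specificity estimated on held-out samples are what qualify a signature as a biomarker.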
Thus, the possibility of characterizing the molecular state of diseased tissues has led to improvements in prognosis and diagnosis, as well as providing evidence for distinct sub-classes within diseases previously considered homogeneous.
c. Describing the Relationship between the Molecular State of Biological Samples: Unsupervised Classification
One key issue in the analysis of microarray data is finding genes with a similar expression profile across a number of samples. Co-expressed genes are potentially regulated by the same transcription factors or may have similar functions (for example, belonging to the same metabolic or signaling pathways). The detection of co-expressed genes may therefore reveal potential clinical targets, genes with similar biological functions, or novel biological connections between genes. On the other hand, we may want to describe the degree of similarity of biological samples at the transcriptional level (28). We may expect such an analysis to confirm that samples with similar biological properties (for example, samples derived from patients affected by the same disease) tend to have a similar molecular profile. Although this is true, it has also been demonstrated that the molecular profile of samples also reflects disease heterogeneity, and is therefore useful in discovering novel disease sub-classes (5). From a methodological perspective, these questions can be addressed using unsupervised clustering methods.
In this context, hierarchical clustering is, among several options (29), one of the most widely used unsupervised classification methods (figure 4). Other methods are available in several software packages such as R (http://cran.r-project.org), GEPAS (30), TIGR T4 (31), (32), GeneSpring (33) and Genesis (34). The core concept behind hierarchical clustering is the progressive construction of gene or sample clusters by adding one element (a gene, a sample, or a smaller cluster) at a time. In this way, the most similar elements are added early to small clusters, whereas less similar elements are added later, forming larger clusters. In order to decide which element is most similar to another, a similarity or dissimilarity measure is required. Commonly used measures include the Euclidean distance (the geometrical distance between two elements in an n-dimensional space) and the correlation distance. The result of hierarchical clustering is therefore a hierarchical organization of patterns, similar to a phylogenetic tree. For example, in figure 4B the most similar genes, 5 and 6, are merged first to form a cluster; then genes 1 and 2 form a different cluster, which is later enlarged by adding the next most similar gene, gene 3; and the process continues until all genes have been included in a cluster and all clusters have been merged. For large-scale microarray data, it is common to perform simultaneous hierarchical clustering of samples and genes (32). Typically, genes are represented on the y-axis whereas samples are drawn on the x-axis. A color-coded matrix (heatmap), in which samples and genes are sorted according to the results of the clustering, is used to represent the expression value of each gene in each sample. This two-dimensional clustering procedure is particularly suitable for exploring the results of a large microarray experiment (figure 4).
Unsupervised classification and detection of co-expressed genes.
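The clustering steps above, a dissimilarity measure followed by progressive merging and a cut of the tree, can be sketched with SciPy's hierarchical clustering routines. The expression matrix below is simulated (two planted groups of co-expressed genes), and correlation distance with average linkage is one reasonable choice among several:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
# hypothetical expression matrix: 6 genes x 4 samples, two co-expressed groups
genes = np.vstack([rng.normal([5, 5, 1, 1], 0.2, (3, 4)),
                   rng.normal([1, 1, 5, 5], 0.2, (3, 4))])

# correlation distance (1 - Pearson r) between gene profiles, then average linkage
dist = pdist(genes, metric="correlation")
tree = linkage(dist, method="average")

# cut the tree into two clusters: co-expressed genes fall together
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

The same `linkage` output can be rendered as the tree/heatmap shown in figure 4 with a dendrogram-plotting routine; cutting at different heights yields coarser or finer clusters.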
d. Identification of Prognostic Genes Associated with Risk and Survival
In medicine, identifying prognostic factors related to survival times is invaluable. The link between gene expression levels and survival times may provide a useful tool for early diagnosis, prompt therapeutic intervention, and the design of patient-specific treatments. Consequently, the selection of biomarkers that correlate with survival times is a very important objective in the analysis of microarray data. To date, a number of approaches have been developed. The most commonly used procedures incorporate genes into exponential, Poisson, or Cox regression models using a univariate variable selection procedure (testing one gene at a time, independently of any other; the most significant genes are then selected as risk factors) (35). The gene selection procedure is summarized in figure 5. The selected genes, combined with clinical classes, can then be used to detect variations in survival times using the Kaplan-Meier method and statistical tests. Often researchers are interested in finding subgroups of samples (independently of the recorded clinical data) whose survival times are significantly different. This information can then be used to prescribe specific treatments. In previous sections we showed how unsupervised data exploration methods (such as cluster analysis) can be used to identify sub-groups of samples within a previously considered homogeneous disease. Once these sub-groups have been identified, survival analysis can be used to test whether they are characterized by different clinical outcomes (35).
Selection procedure for genes associated with survival times as risk factors.
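The Kaplan-Meier estimator mentioned above is a simple product over event times and is easy to implement directly. Below is a minimal sketch on a hypothetical cohort (the follow-up times and censoring flags are invented for illustration):

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimator.
    times: follow-up times; events: 1 = death observed, 0 = censored.
    Returns (event_time, survival probability) pairs."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    surv = 1.0
    curve = []
    for t in np.unique(times[events == 1]):            # distinct event times
        at_risk = (times >= t).sum()                   # subjects still under observation
        deaths = ((times == t) & (events == 1)).sum()
        surv *= 1 - deaths / at_risk                   # product-limit update
        curve.append((t, surv))
    return curve

# hypothetical cohort: follow-up in months, with censored observations
times  = [2, 3, 3, 5, 8, 8, 12, 14]
events = [1, 1, 0, 1, 1, 0, 1, 0]
for t, s in kaplan_meier(times, events):
    print(f"t={t:>4.0f}  S(t)={s:.3f}")
```

Splitting a cohort by cluster membership (or by expression of a candidate gene) and comparing the resulting curves, for example with a log-rank test, is the usual next step when looking for subgroups with different outcomes.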
e. Association of Genes with Disease Surrogate Markers: Regression Analysis
An interesting question in the analysis of microarray data derived from clinical studies is whether there is an association between gene expression and an ordinal variable that represents a response or, more generally, a measure of disease progression (a surrogate marker). Examples of such variables are the concentration of metabolites or serum proteins, response to treatment or dosage, growth, or any other clinical measure whose numerical representation is meaningful as a progression. The approach, depicted in figure 6, is conceptually similar to that introduced in the survival analysis section of this review. The mathematical model that, in this case, relates the independent variable (such as time, or levels of metabolites, protein, or treatment) to the dependent variables (genes) is commonly a linear regression model. Nevertheless, such a model can be modified to include other available information.
Selection procedure for genes associated with outcome.
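A minimal version of this per-gene regression can be written with `scipy.stats.linregress`. The marker levels and expression values below are invented for illustration; one gene tracks the marker linearly, the other does not.

```python
# Univariate linear regression of each gene's expression on a surrogate
# marker (e.g. a metabolite concentration); genes with small regression
# p-values are candidates for association with disease progression.
import numpy as np
from scipy import stats

marker = np.array([0.5, 1.0, 2.0, 4.0, 8.0])   # disease-progression measure
expr = np.array([
    [2.01, 2.98, 5.02, 9.01, 16.99],           # gene tracking the marker
    [3.10, 2.90, 3.00, 3.20, 2.80],            # unrelated gene
])

results = [stats.linregress(marker, gene) for gene in expr]
pvals = [r[3] for r in results]                # two-sided p-value per gene
print([f"{p:.3g}" for p in pvals])
```

As with the survival case, testing thousands of genes one at a time requires a multiple-testing correction on the resulting p-values.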
f. Genetic Disorders: Gene Copy Number and Comparative Genomic Hybridization
It is well known that several inherited diseases are a consequence of genetic rearrangements such as gene duplications, translocations, and deletions. Moreover, these alterations are also observed in cancer cells. A microarray technique used to detect these abnormalities in a single hybridization experiment is called comparative genomic hybridization (CGH) (Pollack, 1999). The core concept in CGH is the use of genomic DNA (gDNA) in the hybridization, comparing the gDNA from a disease sample with that of a healthy individual. Hence, a typical microarray design can be used in this approach (figure 1). The signal intensity across all probes in the microarray should therefore be very similar for healthy samples, so differences in gene copy number are easily detected as changes in signal intensity. Using this technology, Zhao et al. (2005) recently characterized variations in gene copy number in several cell lines derived from prostate cancer, and Braude et al. (36) confirmed an alteration in chronic myeloid leukaemia.
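The copy-number readout described above reduces to thresholding log2 ratios of disease versus normal signal. The probe intensities and cutoffs in this sketch are invented for illustration; real CGH analysis also smooths ratios along chromosomal position.

```python
# CGH-style calling: log2(disease / normal) per probe. Ratios near 0 mean
# normal copy number; sustained positive shifts suggest gains (duplications)
# and negative shifts suggest losses (deletions).
import numpy as np

disease = np.array([980.0, 1010.0, 2100.0, 1950.0, 480.0, 1020.0])
normal = np.array([1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0])

log_ratio = np.log2(disease / normal)
calls = np.where(log_ratio > 0.5, "gain",
                 np.where(log_ratio < -0.5, "loss", "normal"))
print(calls.tolist())
```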
g. Genetic Disorders: Epigenetics and Methylation
Around 80% of CpG dinucleotides are naturally methylated at the fifth position of the cytosine pyrimidine ring (37). The patterns of cytosine methylation, along with histone acetylation and phosphorylation, control the activation and deactivation of genes without changing the nucleotide sequence (38). These regulatory mechanisms are known as epigenetic phenomena. In particular, genes methylated in their promoters become inactive irrespective of the presence of transcriptional activators. Aberrations in any of these epigenetic patterns cause several syndromes and may predispose carriers to cancer (39). To detect patterns of methylation using microarrays, two main methods have been proposed (40). One is based on enrichment of the unmethylated fraction of CpG islands, and the other focuses on the hypermethylated fraction. Both methods use methylation-sensitive restriction enzymes to generate fragments enriched in either unmethylated or methylated CpG sites (figure 7). In the first method, sample and control gDNA are cleaved with methylation-sensitive enzymes that cut unmethylated CpG sites, generating shorter fragments with protruding ends while leaving methylated CpG sites unaltered. Specific adaptors are then ligated to these protruding ends. Methylated fragments are subsequently cut by a CpG-specific enzyme. The remaining fragments that contain the adaptor, i.e., those that were originally unmethylated, are amplified by PCR using primers complementary to the adaptor sequence. The result is that genes belonging to the unmethylated fraction are associated with higher fluorescence intensities on the microarray. In the second method, gDNA from the sample and the control is cleaved with a restriction enzyme to generate small fragments with protruding ends. The fragments are then ligated to adaptors and cut by methylation-sensitive restriction enzymes, leaving fragments flanked by methylated sites unaltered; these are then amplified by PCR.
The result is that the methylated fraction is amplified and detected on the microarray. The microarrays used in these experiments are therefore specially designed to include such fragments. Using these methods, methylation patterns have been screened in several types of cancer (41–46).
Detection of altered methylated patterns and DNA polymorphisms in genomic DNA.
h. Genetic Disorders and Variability: Gene Polymorphism and Single Nucleotide Polymorphism
The human genome carries at least 10 million nucleotide positions that vary in at least 1 of 100 individuals in a population (47). The identification of these single nucleotide polymorphisms (SNPs) is an important tool for identifying genetic loci linked to complex disorders (47). Although microarrays to detect SNPs are commercially available, these technologies are still in their infancy and widespread adoption is held back by the relatively high cost per sample. So far, more than 2 million SNPs are stored in public databases, whereas the available microarrays for SNP detection cover only about 10,000 SNPs. The three major strategies for SNP genotyping using microarrays are all based on the primer extension techniques depicted in figure 8. The primer included in the microarray probe hybridizes to the target sequence immediately adjacent to the SNP. The first strategy (figure 8a) consists of mini-sequencing with a primer specific for each polymorphism immobilized on the microarray support. PCR products, DNA polymerase, and fluorescent nucleotides labeled with different colors are added, and a one-base extension during hybridization detects the SNPs in parallel; the genotype is read out as a color combination. The second strategy (figure 8b) uses the same concept of polymorphism-specific primer hybridization, but combined with only one dye and an extension of more than one base; the genotype is revealed by signal strength. The third strategy (figure 8c) performs the one-base extension in solution with fluorescent nucleotides labeled in different colors; the extended primers are then captured by hybridization on the microarray, and the genotype is again detected by color combinations. Recent studies have produced genome-wide SNP characterizations for a number of tumor types (48–50).
Major techniques for detection of SNPs using microarrays.
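For the color-combination strategies, calling the genotype amounts to asking which labeled bases carry most of the signal at a position. The function and intensity values below are hypothetical illustration, not part of any vendor's software; the 30% threshold is an arbitrary assumption.

```python
# Toy genotype caller for color-combination readouts: any base whose
# fluorescence exceeds a fraction of the brightest channel is called an
# allele. One passing base -> homozygote; two -> heterozygote.
def call_genotype(intensities, ratio=0.3):
    """intensities: dict base -> fluorescence; returns e.g. 'A/G' or 'C/C'."""
    top = max(intensities.values())
    alleles = sorted(b for b, v in intensities.items() if v >= ratio * top)
    if len(alleles) == 1:
        alleles = alleles * 2      # homozygote: report the base twice
    return "/".join(alleles)       # three or more calls would flag a bad spot

print(call_genotype({"A": 950, "C": 30, "G": 880, "T": 25}))   # heterozygote
print(call_genotype({"A": 40, "C": 1020, "G": 35, "T": 20}))   # homozygote
```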
i. Chromatin Immunoprecipitation: Genetic Control and Transcriptional Regulation
Transcription factors (TFs) are regulatory proteins that bind specific DNA sequences (usually promoters) to control the level of gene expression. Mutations or alterations in the expression or activation of TFs are known in several diseases (51). For example, abnormal over-expression of the TF c-Myc is found in 90% of gynaecological cancers, 80% of breast cancers, 70% of colon cancers, and 50% of hepatocarcinomas (52). Establishing the link between TFs and their targets is therefore essential for characterizing and designing better cancer therapies. To identify these targets, DNA fragments are incubated with a selected TF that has been tagged (figure 9). The DNA-TF complex is precipitated using a highly specific antibody against the tag peptide. The precipitated DNA is then labeled and hybridized to DNA microarrays to reveal genome-wide targets of the selected TF (figure 9). An experimental overview and computational methods for the analysis of these data have been reviewed elsewhere (53, 54).
Chromatin immune-precipitation (ChIP-on-chip) assay.
j. Pathogen Detection
Classically, pathogen detection is achieved through a series of clinical tests, each of which generally detects a single pathogen. A battery of clinical assays is therefore needed to characterize a sample. A radically different recent approach uses DNA microarrays to test for the presence of hundreds of pathogens in a single experiment (15, 16). For this, known sequences from each pathogen are collected and those that are pathogen-specific are selected (figure 10). The collection of specific sequences is used to build a purpose-specific microarray. Genomic DNA is then extracted from a patient biopsy, or from a food sample suspected of being contaminated, and hybridized to the microarray. Pathogen detection is simply revealed by spot intensity.
Multi-pathogen detection using DNA microarrays.
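The probe-selection step above can be caricatured with set operations: keep only subsequences found in one pathogen and no other. Here short k-mers stand in for real probe design (which also considers melting temperature and cross-hybridization), and the two toy "genomes" are invented.

```python
# Select pathogen-specific probe candidates: k-mers present in exactly one
# pathogen's sequence. Spots built from these k-mers would light up only
# when that pathogen's gDNA is present in the hybridized sample.
def kmers(seq, k=8):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

genomes = {
    "pathogen_A": "ATGCGTACGTTAGCAT",
    "pathogen_B": "ATGCGTACCCGGGTTA",
}

specific = {}
for name, seq in genomes.items():
    others = set().union(*(kmers(s) for n, s in genomes.items() if n != name))
    specific[name] = kmers(seq) - others   # unique to this pathogen

print({n: len(p) for n, p in specific.items()})
```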
4. An Overview of a Typical Microarray Experiment
In this section we provide a brief description of the typical workflow of a microarray experiment and its data analysis (see figure 1).
RNA can be extracted from tissue or cultured cells using standard molecular biology procedures (several commercial kits are also available). The amount of mRNA required is about 0.5 μg, equivalent to roughly 20 μg of total RNA, though this varies with the microarray technology. When the amount of mRNA (or DNA) is scarce, an amplification step, for example PCR amplification of reverse-transcribed cDNA, is needed before labeling.
mRNA is reverse-transcribed using reverse transcriptase to generate cDNA. Labeling is achieved by including in the reaction (or in a separate reaction) modified nucleotides that fluoresce when excited at the appropriate wavelengths. The most commonly used fluorescent dyes are Cy3 (green) and Cy5 (red). Unincorporated dyes are usually removed by column chromatography or ethanol precipitation.
Hybridization is carried out according to conventional protocols. The hybridization solution contains saline sodium citrate (SSC); sodium dodecyl sulphate (SDS) as detergent; non-specific DNA such as yeast DNA, salmon sperm DNA, or repetitive sequences; blocking reagents such as bovine serum albumin (BSA) or Denhardt's reagent; and the labeled cDNA from the samples. Hybridization temperatures range from 42°C to 45°C for cDNA-based microarrays and from 42°C to 50°C for oligo-based microarrays. Hybridization volumes vary between 20 μl and 1 ml depending on the microarray technology. A hybridization chamber is usually needed to keep temperature and humidity constant.
After hybridization, the microarray is washed in salt buffers of decreasing concentration and dried by slide centrifugation or by blowing air after immersion in alcohol. The slide is then read by a scanner, a device similar to a fluorescence microscope coupled with a laser, robotics, and a digital camera to record the fluorescent signal. The robotics scans the slide row by row with the lens, camera, and laser, much like a common desktop scanner. The amount of signal detected is presumed to be proportional to the amount of dye at each spot on the microarray, and hence proportional to the concentration of RNA complementary to that sequence in the sample. The output is, for each fluorescent dye, a monochromatic (grayscale) digital image, typically a TIFF file. False-color images (red, green, and yellow) are reconstructed by specialized software for visualization purposes only.
The goal of this step is to identify the spots in the microarray image, quantify the signal, and record the quality of each spot. Depending on the software used, this step may need some degree of human intervention. The digital images are loaded into specialized software together with a pre-defined design of the microarray (grid layout), which tells the software the number, position, shape, and dimensions of each spot. The grid is then fitted to the actual image, automatically or manually. Fine-tuning of spot positions and shapes is usually performed to compensate for any bias in the robotic construction of the microarray. Human intervention is needed to mark spots that could be artifacts, such as bubbles or scratches, which are common. Finally, an automated integration function converts the actual spot readings to numerical values, taking the signal and background noise of each spot into account. The output of the image analysis is commonly a tab-delimited text file or a specific file format. Common image analysis software includes ScanArray® (PerkinElmer), GenePix® (Axon), TIGR-SpotFinder/TM4 (www.tigr.org), and GeneChip® (Affymetrix). The process varies, from automatic or semi-automatic to manual, depending on the microarray technology, scanner, and software used.
Systematic errors are introduced in the labeling, hybridization, and scanning procedures. The main aims of normalization are to correct for these errors while preserving the biological information, and to generate values that can be compared between experiments, especially when they were generated at different times or places, or with different reagents, microarrays, or technicians. There are two types of normalization: within-array and between-array. Within-array normalization is applied within a single slide and is generally applicable to two-dye technologies. For this, let us define M = log2(R/G) and A = log2(R·G)/2, where R and G are the red and green readings, respectively. Under the assumption that the majority of genes are not differentially expressed, most M values should oscillate around zero. Within-array normalization is then performed by shifting the imaginary curve formed by the M values (vertical axis) to zero along the A values (horizontal axis). This kind of normalization, often based on loess regression, is usually performed per spatial block to avoid biases from the microarray printing process (print-tip loess). Between-array normalization is necessary when at least two slides are analyzed, to guarantee that both slides are measured on the same scale and that their values are independent of the parameters used to generate the measurements. The goal is to transform the data in such a way that all microarrays have the same distribution of values. For two-dye technologies this is optional and is commonly done by scaling or standardizing the values after within-array normalization. For one-dye microarrays, between-array normalization is usually performed using methods that equalize distributions, such as quantile normalization (55), after log2 transformation. There are, however, many normalization methods, and the right choice is usually data-dependent.
A comparison of the results of different normalization methods is recommended.
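The quantile normalization mentioned above is short enough to show in full. This is a minimal sketch with NumPy on a tiny made-up matrix; it handles ties crudely via double `argsort`, which is adequate for illustration.

```python
# Quantile normalization: replace each array's sorted values with the mean
# of the sorted values across arrays, so every array ends up with exactly
# the same distribution while each gene keeps its rank within its array.
import numpy as np

def quantile_normalize(X):
    """X: genes x arrays matrix of log2 intensities."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # rank within each array
    mean_sorted = np.sort(X, axis=0).mean(axis=1)      # reference distribution
    return mean_sorted[ranks]

X = np.array([[5.0, 4.0],
              [2.0, 1.0],
              [3.0, 8.0]])
Xn = quantile_normalize(X)
print(Xn)
```

After normalization the two columns contain the same set of values (here 1.5, 3.5, 6.5), just in different gene orders.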
The image analysis process (particularly for spotted microarrays) does not always generate a value for a gene, because a spot may be defective or manually flagged as faulty. This is not a major issue when genes are replicated in several spots on the microarray, because the gene's reading can still be estimated from the remaining spots. If the value of a spot is systematically missing across several arrays, the spot should be removed from the analysis. If the number of missing values is low, the corresponding spots can simply be excluded from all arrays. However, when the number of arrays is large this could lead to the removal of many spots. To avoid these problems, one can either use only methods that tolerate missing values, or use algorithms to infer the missing values (30). Results should then be interpreted keeping in mind that some values were inferred.
Current microarrays contain more than 10,000 genes, spots, or probes. Dealing with such large amounts of data may require expensive computational resources and long processing times. A common practice is to remove genes that show no significant changes across samples, genes with many missing values, or genes whose average expression is very low (because weakly expressed genes are more susceptible to noise). The most common filtering criteria are based on statistical tests (keeping lower p-values), signal-to-noise estimates (keeping higher values), variability (higher), and average expression (higher).
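Such non-specific filtering is a one-liner once the expression matrix is in memory. The data and thresholds below are arbitrary illustration values; in practice cutoffs are chosen per dataset.

```python
# Filter a (genes x arrays) log2-expression matrix: keep genes with enough
# variability across arrays AND a high enough average expression level.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(8.0, 1.0, size=(1000, 10))        # variable, well-expressed genes
X[:100] = rng.normal(2.0, 0.05, size=(100, 10))  # low, flat genes to discard

keep = (X.var(axis=1) > 0.1) & (X.mean(axis=1) > 4.0)
filtered = X[keep]
print(filtered.shape)
```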
The numerical values from image analysis are commonly integers between 1 and about 32,000 for both signal and background, and the background is normally subtracted from the signal. The distribution of these values is concentrated in a narrow range and is therefore transformed using logarithms (generally base 2), which generates normal-like distributions. Negative values resulting from the subtraction cause problems in the transformation; these are resolved by restricting (flooring) the values or by performing more robust transformations such as the generalized logarithm.
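The two workarounds for negative background-subtracted values can be contrasted in a few lines. The flooring constant and the glog tuning constant below are arbitrary illustration choices, and the glog shown is one common variant (it agrees with log2 up to an additive constant for large values).

```python
# Flooring vs. generalized log for background-subtracted intensities:
# flooring clips negatives before log2; glog is defined for all reals.
import numpy as np

signal = np.array([32000.0, 1500.0, 10.0, -25.0])  # after background subtraction

floored = np.log2(np.clip(signal, 1.0, None))      # restrict values, then log2

c = 50.0                                           # glog tuning constant
glog = np.log2(signal + np.sqrt(signal**2 + c**2)) # finite even for negatives

print(np.round(floored, 2), np.round(glog, 2))
```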
The procedure after image analysis and data processing depends mainly on the particular biological question and the data available. These procedures have been described in section 3 (Applications).
5. Illustrating the detection of differentially expressed genes: the case of term placenta
In previous sections we have introduced the experimental and data analysis methods used in common microarray experiments. To illustrate these procedures we will use a case study designed to identify genes that are preferentially expressed in placenta. This study, currently ongoing in our laboratory, is part of a larger project whose results are expected to assist further research revealing the molecular mechanisms involved in fetal development, placental function, and pathologies related to pregnancy. To identify genes specific for human placenta, we used two-color microarrays. In this experiment, mRNA extracted from two normal human placentas was compared with a pool of mRNA extracted from several normal tissues not including placenta. To gain information on the variability expected from experimental error, we also compared two aliquots of the reference mRNA on the same array. An overview of the process is depicted in figure 11. A brief description of the detailed procedure follows.
Experimental design of the placenta microarray experiment.
Step 1: mRNA extraction and Microarray Hybridization
Human total term placenta RNA (isolated using a proteinase K-phenol based protocol as described in (56)) and a pool of total RNAs from several human tissues not including placenta (commercially available) were part of the set of reagents used in the EMBO-INER Advanced Practical Course 2005 (http://www.embo.org/courses_workshops/mexico.html and http://chipskipper.emblde/iner-embo-course/index.htm). They were quality-controlled on an Agilent RNA 6000 Nano Assay. First-strand cDNA was synthesized from each RNA sample (5 μg) by reverse transcription using an oligo-dT primer with a T7-promoter sequence attached to its 5′ end, while the second strand resulted from treating the first strand with RNase H plus DNA polymerase I (MessageAmp aRNA kit, Ambion). Column-purified double-stranded cDNAs were transcribed in vitro with T7 RNA polymerase, and the amplified RNAs (aRNAs) were also purified by column binding and subsequent elution. Fluorescent labels were attached to the hybridization probes indirectly, by a two-step procedure. The first step consisted of a reverse transcription of the aRNA, this time using a mixture of all four deoxyribonucleotides including aminoallyl-dUTP. In the second step, N-hydroxysuccinimide-activated fluorescent dyes (Cy3 and Cy5) were coupled to the cDNAs by reaction with the amino functional groups. Probes were preincubated with blocking reagents (human Cot DNA at 1 μg/ml and poly-dA DNA also at 1 μg/ml) and then hybridized to prehybridized (6X SSC, 0.5% SDS, and 1% BSA) slides in hybridization buffer (50% formamide, 6X SSC, 0.5% SDS, and 5X Denhardt's solution). Slides were washed (once in 2X SSC/0.1% SDS at 65°C for 5 minutes; twice in 0.1X SSC/0.1% SDS, first at 65°C for 10 minutes and then at room temperature for 2 minutes; and finally in isopropanol at room temperature, with slide centrifugation between washing steps) and stored in the dark until scanning.
The fluorescent probes were hybridized to cDNA microarrays (laboratory-made oligo-based arrays, with half of the probes on each of two slides).
Step 2: Microarray Scanning, Spot Finding and Image Processing
Microarrays were scanned using a ScanArray Express scanner (PerkinElmer). The images obtained were analyzed using ChipSkipper® (http://www.embl-em.de) to obtain a single value for each spot representing the ratio (in log2 scale) of the mRNA expression level in placenta to the reference mRNA from the pool of non-placenta tissues. A value of 0 represents a similar expression level in both mRNA samples; a value of 1 represents two-fold over-expression in placenta, whereas a value of −1 represents two-fold down-regulation in placenta. One placental sample was hybridized in duplicate on two microarrays using a dye-swap design, in which the labeling scheme is reversed between two separate microarrays. To gain information on the variability associated with experimental error, two aliquots of the reference pool mRNA were compared on the same microarray. As with the comparison between experimental and control samples, the comparison between the two control samples was performed in duplicate using the dye-swap design. To summarize, the experiment used six microarrays (two placenta samples each compared with the reference in duplicate, and two reference-versus-reference controls; see figure 11).
Step 3: Quality Assessment, Processing and Normalization
To ensure that all microarrays were comparable in scale, we performed print-tip loess normalization (shifting the imaginary M line to zero, see figure 12). We then processed the dataset, removing all control and empty spots from the analysis. Representative plots before and after within-array normalization and processing, for both the placenta and the control experiments, are shown in figure 12. Note that, as expected, there are important differences in ratio values (M values in figure 12c–d) for highly expressed genes (high A values) in placenta compared with the reference (figure 12c), whereas ratios in the control experiment are very close to zero (figure 12d), indicating very high reproducibility of the technology.
Step 4: Detection of Differentially Expressed Genes
Duplicated spots were averaged to generate a single measure per gene per array. To detect differentially expressed genes we used a one-sample t-test under the null hypothesis of no differential expression (mean ratio equal to zero). The resulting p-values were adjusted for multiple testing using the false discovery rate (FDR) approach (18, 57). Because of the small number of samples, we treated the replicated biological samples as independent* to increase the power of the statistical tests, and we limited the selection of differentially expressed genes to those fulfilling two conditions: first, an FDR below 0.10 (10%, corresponding to raw p-values below 0.0000118); and second, an absolute fold change of at least two. Using these criteria, 350 genes (out of 21,456) were selected. A subset of 205 genes is depicted in figure 13 (see step 5).
Genes differentially expressed in placenta compared to other tissues.
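The selection logic of step 4 can be reproduced in miniature: a one-sample t-test per gene, Benjamini-Hochberg FDR adjustment, then a joint FDR and fold-change cutoff. The log2 ratios below are simulated, with five genes deliberately spiked, purely to illustrate the mechanics.

```python
# Per-gene one-sample t-test (null: mean log2 ratio = 0), BH step-up FDR
# adjustment, then selection at FDR < 10% AND at least two-fold change.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
ratios = rng.normal(0.0, 0.2, size=(50, 6))  # 50 genes x 6 arrays, mostly null
ratios[:5] += 2.0                            # 5 genes truly over-expressed

t, p = stats.ttest_1samp(ratios, popmean=0.0, axis=1)

# Benjamini-Hochberg adjusted p-values: sort, scale by m/rank, then make
# the sequence monotone by taking running minima from the largest rank down.
order = np.argsort(p)
m = len(p)
adj = np.minimum.accumulate((p[order] * m / np.arange(1, m + 1))[::-1])[::-1]
fdr = np.empty(m)
fdr[order] = np.minimum(adj, 1.0)

selected = (fdr < 0.10) & (np.abs(ratios.mean(axis=1)) >= 1.0)  # >= 2-fold
print(int(selected.sum()))
```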
Step 5: Validation
To verify the selection process, we made two comparisons. First, as a negative control, we applied the same selection criteria to the control microarrays that used the reference sample in both channels; no genes matched the criteria. Second, we performed a comparison using the Tissue Specific Expression Tool of T1dbase (http://www.t1dbase.org/page/TissueHome) (58). This tool makes use of the Gene Expression Atlas (59), SAGEmap (60), and TissueInfo (58), integrating all measurements into a single score (58). This score, estimated for several tissues, indicates whether the expression of a gene is tissue-specific: scores closer to 1 indicate tissue specificity, whereas scores closer to 0 indicate no tissue specificity. From the 350 genes selected in step 4, we retained only those also included in this database, resulting in 201 genes. Several genes that appear over-expressed in the placentas processed here (darker colors in figure 13a) consistently show higher placenta-specific scores in T1dbase (darker colors in figure 13b). These results suggest that the experiment is coherent and valid.
Step 6: Analysis
Once genes have been selected, further computational, literature, and laboratory analyses are needed to confirm, expand, or refine the results. Here, the analysis consisted only of comparing the results with the T1dbase Tissue Specific Expression Tool. However, queries to Gene Ontology, KEGG pathways, PubMed, BLAST, or any other pertinent database resource should be considered a compulsory step.
6. Conclusions and Trends
DNA microarrays are a powerful, mature, versatile, and easy-to-use genomic tool for biomedical and clinical research, and the research community keeps extending the approach to novel applications. The main advantage is genome-wide information at reasonable cost. Biological interpretation, however, requires the integration of several sources of information. In this context, a new discipline referred to as systems biology is emerging, which integrates biological knowledge, clinical information, mathematical models, computer simulations, biological databases, imaging, and high-throughput "omics" technologies such as microarray experiments. Multidisciplinary groups involving clinicians, biologists, statisticians, and, more recently, bioinformaticians are therefore being formed and expanded in all major research institutions. Consequently, virtually all biology-related research areas are moving from merely describing cellular and molecular components in a qualitative manner towards a more quantitative approach. These new teams are generating huge amounts of data and ever more convincing models to ultimately reveal hidden pieces of the biological puzzle. This new knowledge is having a crucial impact on the treatment of disease since, among other things, it distinguishes subtypes of pathologies, disease risks and survival, and treatment prognosis and outcome, rapidly bringing biomedical research into the era of personalized medicine.
HABS thanks the staff of the Microarray Technology EMBO-INER Advanced Practical Course for enjoyable course lessons, materials, and results; Peter Davies, and Nancy and Greg Shipley of UT Medical School, for additional laboratory training; Albert Sasson for critical reading of the manuscript; and the offices of the Dean of his school and of the President of his University for support. VT thanks the Darwin Trust of Edinburgh and CONACyT for his PhD scholarship, and ITESM for support.
Microarray-Related Public Resources
To date, hundreds of microarray studies have been published. Researchers tend to make microarray data available through the Internet, generally on their own web sites or in public microarray database repositories (Table 1). These public repositories generally follow the Minimum Information About a Microarray Experiment (MIAME) standard (61) or the MicroArray Gene Expression Markup Language (MAGE-ML) (62). As these repositories may contain unpublished data, it is worth consulting them before embarking on a new microarray project. There are dozens of software tools for analyzing microarray data, and new software continues to be published almost daily. A list of common general-purpose software is provided in Table 2. These tools are created by multidisciplinary groups across the world to solve particular problems. There are, however, several software "flavours", that is, several tools that solve the same problem in slightly different ways (statistical model, data format, or user interface). A pitfall is that such software is often isolated and not easy to connect to other tools in a pipeline, so users commonly end up with a large collection of tools and many manual or complex steps to transfer information between them.
Common public microarray repositories
Common free software containing collections of tools
*For preliminary purposes only. The effect of this exercise is a slight underestimation of the variance, in favor of more sensitive tests.
1. Golub TR, Slonim DK, Tamayo P, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–537.
2. van't Veer LJ, Dai HY, van de Vijver MJ, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415:530–536.
3. Singh D, Febbo PG, Ross K, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell. 2002;1:203–209.
4. Wang T, Hopkins D, Schmidt C, et al. Identification of genes differentially over-expressed in lung squamous cell carcinoma using combination of cDNA subtraction and microarray analysis. Oncogene. 2000;19:1519–1528.
5. Alon U, Barkai N, Notterman DA, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America. 1999;96:6745–6750.
6. Ramaswamy S, Tamayo P, Rifkin R, et al. Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of the National Academy of Sciences of the United States of America. 2001;98:15149–15154.
7. Brachat A, Pierrat B, Brungger A, Heim J. Comparative microarray analysis of gene expression during apoptosis-induction by growth factor deprivation or protein kinase C inhibition. Oncogene. 2000;19:5073–5082.
8. Bonner AE, Lemon WJ, You M. Gene expression signatures identify novel regulatory pathways during murine lung development: implications for lung tumorigenesis. Journal of Medical Genetics. 2003;40:408–417.
9. Brachat A, Pierrat B, Xynos A, et al. A microarray-based, integrated approach to identify novel regulators of cancer drug response and apoptosis. Oncogene. 2002;21:8361–8371.