Tum Chalo (LeadIndia - New Anthem)
Thursday, October 23, 2008
Wednesday, October 22, 2008
Genetic markers: How accurate can genetic data be?
Molecular markers come in different flavours—blood groups, allozymes, RFLPs, AFLPs, RAPDs, STRs, SNPs, you name them. Whether the focus is on specific populations or on worldwide patterns (Cavalli-Sforza et al., 1994), genetic data have become prominent in recent decades and have fundamentally changed our views on human evolution and prehistory. But what if some of these markers were biased? What if genetic markers, far from being more objective than other types of data, were producing a distorted view of human diversity, and, as a consequence, of human origins? And if that were the case, would it be possible to identify the best and least biased data sets around? These important questions are at the heart of an article by Romero et al. (2008) recently published in Heredity.
In technical terms, the issue addressed by Romero et al. is called ascertainment bias and it has been around for some time (Garrod, 1902). It refers to a statistical bias introduced during the collection (or ascertainment) of data, and started to catch the eye of human population geneticists some 15 years ago (Bowcock et al., 1994). In population genetic studies the main cause of ascertainment bias is an economic one. Genetic markers are usually selected on the basis that they should be polymorphic (that is, variable) in a reference sample. Understandably, their costly development is rarely carried out on large samples, and once markers have been identified, few colleagues would be happy to spend their research budget genotyping whole populations at markers for which most individuals will be identical.
The first and most obvious consequence of this selection process is that, by eliminating the least variable markers, genetic diversity is overestimated. In itself this is not necessarily a major problem, if one keeps track of the markers that were eliminated. A second consequence is that genetic diversity is usually inflated in the reference population(s), as shown by Bowcock et al. (1994) in humans. This effect was particularly strong in Europe compared to other regions with nuclear RFLPs (restriction fragment length polymorphisms), allozymes and blood groups, but weak or absent in microsatellites. They wrote that 'a reasonable explanation [...] is the bias introduced by their initial selection in Europeans.' They added that this 'bias is likely to be less serious for markers with large numbers of alleles such as microsatellites'. Interestingly, this second ascertainment problem is very general. Using cattle and sheep, Ellegren et al. (1997) elegantly showed that microsatellite markers developed in one species produced shorter repeats and lower diversity estimates in the other species. Importantly, this explained why humans appeared to have longer microsatellites than other apes without invoking directional selection in humans.
A third and more subtle consequence of ascertainment bias arises even when the reference sample comprises individuals from the whole species range, as is the case in the protocols used for single-nucleotide polymorphism (SNP) discovery in humans. The critical issue is that the number of individuals in the so-called 'discovery panel' is usually very small. Thus, rare alleles tend to be missed and selected SNPs typically have alleles with similarly high or medium frequencies (SNPs are typically biallelic). This is problematic because many demographic events leave specific signatures in the allele frequency distribution. For instance, population bottlenecks tend to eliminate rare alleles, whereas expanding populations exhibit more loci with rare alleles. Similarly, directional or balancing selection also either favour one allele or maintain the allele frequencies at some equilibrium value, respectively. In other words, this type of ascertainment bias can mimic balancing selection or demographic bottlenecks. It can thus either generate false signatures or mask existing ones.
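The effect of a small discovery panel on rare alleles can be illustrated with a short calculation. The sketch below is only an illustration, not the procedure used by Romero et al.: it assumes a biallelic SNP, a panel of chromosomes drawn independently from the population, and a rule that a SNP is "discovered" only if both alleles appear in the panel. The function names and the panel size of 8 are hypothetical choices.

```python
import random


def discovery_probability(freq: float, panel_chromosomes: int) -> float:
    """Probability that both alleles of a biallelic SNP are seen in the panel.

    The SNP is missed if every sampled chromosome carries the same allele,
    which happens with probability freq**n + (1 - freq)**n.
    """
    other = 1.0 - freq
    return 1.0 - freq ** panel_chromosomes - other ** panel_chromosomes


def simulate_discovery(freq: float, panel_chromosomes: int,
                       trials: int = 20000, seed: int = 1) -> float:
    """Monte Carlo check of discovery_probability under the same assumptions."""
    rng = random.Random(seed)
    discovered = 0
    for _ in range(trials):
        copies = sum(rng.random() < freq for _ in range(panel_chromosomes))
        # The SNP counts as polymorphic only if both alleles were sampled.
        if 0 < copies < panel_chromosomes:
            discovered += 1
    return discovered / trials


# Hypothetical panel of 8 chromosomes: rare alleles are found far less often.
for allele_freq in (0.01, 0.05, 0.20, 0.50):
    print(allele_freq, round(discovery_probability(allele_freq, 8), 3))
```

Under these assumptions the discovery probability rises steeply with allele frequency, which is why a set of SNPs chosen this way is depleted of rare alleles compared with the population it came from.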
What makes the study of Romero et al. important is that they not only try to identify biases in genomic data sets but also suggest a way to identify 'unbiased' data sets. As an example, Romero et al. cite a study by Ray et al. (2005), who tried to infer the region of origin of modern humans using a large short tandem repeat (STR) data set and massive spatial simulations. Ray et al. (2005) found that the most likely region of origin was North Africa, a region for which there was no known support from archaeological or anthropological data. Ray et al.'s guess was that a bias similar to that identified by Bowcock et al. (1994) was somehow shifting the centre of origin towards Europe or regions genetically close to Europe. After correcting for this bias, East Africa became the most likely region. Although Ray et al. (2005)'s final result is very sensible, Romero et al. were not fully convinced that the STR markers used were biased in any particular way.
Romero et al.'s results can be divided into three main points. First, by comparing three existing genomic data sets, namely 783 STRs, 2834 SNPs and 210 insertion-deletion polymorphisms (indels), they showed that there are significant differences between them, and hence that not all of them may properly reflect human neutral diversity. Then they generated a new set of 16 STR markers in the least biased way possible and used these new STRs as a benchmark against which the three genomic data sets could be compared. Finally, their comparisons showed that the least biased genomic data set was the STR data that Ray et al. had used.
Does that mean, as the authors claim, that the 783 STR markers 'suffer no discernable bias'? We need here to go back to the selection process followed to generate the 16 STRs. Romero et al. actually started by identifying 70 independent STRs. The difficulty of obtaining reliably amplifying loci led to the elimination of 46 loci. Among the 24 remaining loci, eight (one-third) proved to be nearly monomorphic, and were discarded from the rest of the analyses. It is thus fair to ask whether discarding these loci would affect parameter inference beyond the obvious overestimation of genetic diversity in human populations. In fact, there are good reasons to think that this would create a bias when populations have gone through either a bottleneck or a population expansion, because the very proportion of monomorphic loci provides information on such events, as I noted above. This had already been noticed by Beaumont (1999) in a bottlenecked population and has since been confirmed on other real data sets. As a quick test I also performed some simulations (not shown), in which I had a set of 24 loci from which I then selected two sets of 16 loci: one by discarding the eight least variable loci, and the other by discarding eight loci randomly. I found that in an admixture model the admixture proportions did not seem to be biased, whereas in the population size change models the selection of the 16 most variable loci seemed to produce biases for some parameters, but not all. Altogether the previous studies and these (admittedly very limited) simulations thus suggest that even the STRs identified by Romero et al. are likely to produce some biases.
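The most direct effect of the two selection schemes, on diversity itself, can be sketched in a few lines. This is not a reconstruction of the simulations mentioned above, which involved admixture and population size change models. It is a minimal illustration under stated assumptions: per-locus allele frequencies are drawn from an arbitrary Dirichlet distribution, diversity is measured as expected heterozygosity, and the names `random_locus` and `compare_selection` and all parameter values are hypothetical.

```python
import random


def expected_heterozygosity(freqs):
    """Expected heterozygosity of one locus: 1 minus the sum of squared frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def random_locus(rng, n_alleles=6, concentration=0.5):
    """Draw allele frequencies from a symmetric Dirichlet (assumed, for illustration).

    A small concentration yields many loci dominated by a single allele,
    i.e. nearly monomorphic loci.
    """
    weights = [rng.gammavariate(concentration, 1.0) for _ in range(n_alleles)]
    total = sum(weights)
    return [w / total for w in weights]


def compare_selection(n_loci=24, n_keep=16, replicates=2000, seed=42):
    """Mean heterozygosity for all loci, the most variable loci, and a random subset."""
    rng = random.Random(seed)
    totals = {"all": 0.0, "top": 0.0, "random": 0.0}
    for _ in range(replicates):
        h = [expected_heterozygosity(random_locus(rng)) for _ in range(n_loci)]
        top = sorted(h, reverse=True)[:n_keep]   # discard the least variable loci
        rand = rng.sample(h, n_keep)             # discard loci at random
        totals["all"] += sum(h) / n_loci
        totals["top"] += sum(top) / n_keep
        totals["random"] += sum(rand) / n_keep
    return {key: value / replicates for key, value in totals.items()}


result = compare_selection()
print(result)
```

In this toy setting, discarding loci at random leaves the average heterozygosity essentially unchanged, whereas keeping only the 16 most variable loci inflates it. It says nothing about the biases in demographic parameters discussed above, which would require simulating the demographic models themselves.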
To conclude, Romero et al. have clearly demonstrated that significant problems exist with both indels and SNPs, and they have also shown that the STRs are probably the best loci available today (but see Nielsen et al. (2004) for possible corrections for SNPs). One should probably take with a pinch of salt their claim that their STRs were unbiased or that the biases identified by Ray et al. (2005) were not real. But clearly, Romero et al.'s study is a significant step towards proper population genetics inference.
Friday, October 3, 2008
Tissue sample suggests HIV has been infecting humans for a century
48-year-old lymph node biopsy reveals the history of the deadly virus.
A biopsy taken from an African woman nearly 50 years ago contains traces of the HIV genome, researchers have found. Analysis of sequences from the newly discovered sample suggests that the virus has been plaguing humans for almost a century.
Although AIDS was not recognized until the 1980s, HIV was infecting humans well before then. Researchers hope that by studying the origin and evolution of HIV, they can learn more about how the virus made the leap from chimpanzees to humans, and work out how best to design a vaccine to fight it.
In 1998, researchers reported the isolation of HIV-1 sequences from a blood sample taken in 1959 from a Bantu male living in Léopoldville — now Kinshasa, the capital of the Democratic Republic of the Congo. Analysis of that sample and others suggested that HIV-1 originated sometime between 1915 and 1941.
Now, researchers report in Nature that they have uncovered another historic sample, collected in 1960 from a woman who also lived in Léopoldville.
It took evolutionary biologist Michael Worobey of the University of Arizona in Tucson and his colleagues eight years of searching for suitable tissue collections originating in Africa before they tracked down the 1960 lymph node biopsy at the University of Kinshasa.
Drenched in glue
The samples had all been treated with harsh chemicals, embedded in paraffin wax and left at room temperature for decades. The acidic chemicals had broken the genome up into small fragments. Formalin, a chemical used to prepare samples for microscopy, had crosslinked nucleic acids with protein. "It's as if you had a nice pearl necklace of DNA and RNA and protein and you clumped it together, drenched it in glue and then dried it out," says Worobey.
The team worked out a combination of methods that would allow them to sequence DNA and RNA from the samples; another lab at Northwestern University in Chicago, Illinois, confirmed the results, also finding traces of the HIV-1 genome in the lymph node biopsy.
This photo shows Kinshasa around 1885, shortly after its founding. The growth of Kinshasa and other cities in the region may have been crucial to the emergence of HIV/AIDS. (Credit: Royal Museum for Central Africa)
Using a database of HIV-1 sequences and an estimate of the rate at which these sequences change over time, the researchers modelled when HIV-1 first surfaced. Their results showed that the most likely date for HIV's emergence was about 1908, when Léopoldville was emerging as a centre for trade.
Although that date will not surprise most HIV researchers, the new data should help persuade those who were unconvinced by the 1959 sample, says Beatrice Hahn, an HIV researcher at the University of Alabama at Birmingham.
The sequences of the 1959 and 1960 samples, the earliest that have ever been found, show a difference of about 12%. "This shows very clearly that there was tremendous variation even then," says Simon Wain-Hobson, a virologist at the Pasteur Institute in Paris.
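The link between a sequence difference and a date can be sketched with a strict molecular clock. This is a deliberately simplified illustration and not the analysis reported by the researchers, who worked from a database of HIV-1 sequences. It assumes two sequences sampled at about the same time, a constant substitution rate, and a Jukes-Cantor correction for multiple substitutions at the same site. The rate used in the example is a placeholder chosen only to show the arithmetic, not an estimate from the study.

```python
import math


def jukes_cantor_distance(observed_difference: float) -> float:
    """Correct an observed proportion of differing sites for multiple hits."""
    if not 0.0 <= observed_difference < 0.75:
        raise ValueError("observed difference must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * observed_difference)


def years_to_common_ancestor(distance: float, subs_per_site_per_year: float) -> float:
    """Strict clock: two lineages each accumulate changes, so t = d / (2 * rate)."""
    if subs_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    return distance / (2.0 * subs_per_site_per_year)


# About 12% difference between the 1959 and 1960 samples (from the article).
corrected = jukes_cantor_distance(0.12)

# HYPOTHETICAL rate, for illustration only; the article does not give one.
placeholder_rate = 0.002
print(round(corrected, 4), round(years_to_common_ancestor(corrected, placeholder_rate), 1))
```

The result scales inversely with the assumed rate, so the output of this sketch should not be read as a date for the origin of HIV-1.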
A virus ready for its close-up
However, it may never be possible to pinpoint exactly how HIV crossed from chimpanzees into humans, Hahn cautions. She and her collaborators previously tracked the likely source of HIV-1 to chimpanzees living in southeast Cameroon, hundreds of kilometres from Kinshasa, and it is tempting to hypothesize that trade routes contributed to the virus's infiltration of the city. But even by 1960, HIV-1 had infected only a few thousand Africans. It is unlikely that it will be possible to track down samples from the very earliest victims, Hahn notes.
Meanwhile, Worobey plans to continue his search through old tissue collections in the hope of finding additional samples. In time, he says, it may even be possible to reconstruct the historic HIV viruses for further study.
Collecting information about old strains of HIV — even those that disappeared over time — can help researchers learn how successful strains broke through, says Wain-Hobson. "For every star in Hollywood there are fifty starlets," he says. "We would love to know what it was that caused this strain to move out of starlet phase and to the big time."