FHS 100K Genome Scan Genotyping
Genotyping was performed by Dr. Norman Gerry in the Genetics and Genomics Department at Boston University using GeneChip Human Mapping 100K microarrays from Affymetrix. All of the genotypes generated by our study have been returned to the National Heart, Lung and Blood Institute (NHLBI) and are freely available for any IRB-approved project. More information about requesting the 100K Scan genotypes and other Framingham Heart Study limited access data can be found at http://www.nhlbi.nih.gov/about/framingham/policies/index.htm
Introduction to the Affymetrix 100K Platform
The Affymetrix GeneChip Human Mapping 100K microarrays were used in the Framingham Heart Study Offspring Cohort 100K Genome Scan project. Each 11-micron-wide position on the GeneChips contains a DNA sequence that matches and hybridizes to either the A or B allele of a given SNP. Thus, for each SNP, an individual's genomic DNA will show hybridization to only the A position (AA homozygote), only the B position (BB homozygote) or to both (AB heterozygote).
The genotype of each SNP is determined by the hybridization pattern to forty different DNA sequences, which allows for a high degree of accuracy. The arrays contain over four million DNA sequences in total. The 100K GeneChips have a median inter-SNP distance of 8 kb and a mean inter-SNP distance of 22.5 kb.
The assays are performed as outlined in the figure below: genomic DNA is digested, amplified by PCR, labeled with a fluorescent tag and hybridized to the GeneChips. The hybridization signals are then read using a laser scanner.


Affymetrix Genotyping Assay
For every sample genotyped, the strengths of the fluorescent signal emanating from each DNA probe for a particular SNP are combined to give a genotype "call" for that SNP. Since the SNPs are all biallelic, an individual genotyped with the array can be either AA (A allele from both parents), AB or BB.

Genotyping Assay Readout
Genotype Cleaning
In studies run within our department, non-polymorphic SNPs and SNPs on the X and Y chromosomes were removed. PedCheck [O'Connell & Weeks, 1998] and Merlin [Abecasis et al., 2002] were used to identify genotyping errors as Mendelian errors and unlikely recombinants, respectively. SNPs with more than two such errors were removed.
GTYPE Genotyping Performance
The utility of these genotypes depends critically on their completeness. We have analyzed the fraction of callable genotypes on both a per-SNP and a per-sample basis, starting with the genotypes as called by the GTYPE algorithm. Because the SNPs are distributed across two arrays (one uses the restriction enzyme Hind III to generate the genomic representation and the other uses Xba I), we have considered the two arrays separately. Overall, 94.5% of the SNPs and 96.2% of the samples have genotype call rates > 90%. We also observe, as others have found, that the Hind III array performs less well than the Xba I array.
Call rate performance by SNP. The mean call rate per SNP that we achieved was 97.1%, with 5.5% of SNPs having a call rate < 90%.
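The per-SNP and per-sample call rates discussed in this section can be sketched as follows. This is a minimal illustration, not the pipeline used for the FHS data: the genotype encoding ("AA", "AB", "BB", with None for a no-call), the sample-major matrix layout and all function names are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Optional

# Assumed encoding: "AA", "AB" or "BB" for a call, None for a no-call.
Genotype = Optional[str]


def call_rate(calls: List[Genotype]) -> float:
    """Fraction of entries that received a genotype call."""
    if not calls:
        raise ValueError("no genotypes supplied")
    return sum(c is not None for c in calls) / len(calls)


def call_rates_by_sample(matrix: Dict[str, List[Genotype]]) -> Dict[str, float]:
    """matrix maps sample ID -> calls, one entry per SNP in a fixed order."""
    return {sample: call_rate(calls) for sample, calls in matrix.items()}


def call_rates_by_snp(matrix: Dict[str, List[Genotype]]) -> List[float]:
    """Transpose the sample-major matrix and compute one rate per SNP."""
    return [call_rate(list(column)) for column in zip(*matrix.values())]


def fraction_below(rates: Iterable[float], threshold: float = 0.90) -> float:
    """Share of SNPs (or samples) whose call rate falls below the threshold."""
    rates = list(rates)
    return sum(r < threshold for r in rates) / len(rates)


# Toy data: two samples, three SNPs.
toy = {"s1": ["AA", None, "AB"], "s2": ["AA", "BB", None]}
per_sample = call_rates_by_sample(toy)
per_snp = call_rates_by_snp(toy)
```

Running the same functions separately on the Hind III and Xba I genotypes, and then on their union, would give the three per-sample distributions described here.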

% Genotypes called per SNP
Call rate performance by sample. There is some variability in the genotype call rate achieved per sample. As seen in the distributions below, the call rate on the Xba I arrays is higher than on the Hind III arrays. Affymetrix has set 90% as the routinely achievable combined call rate with these arrays, and our mean call rates on the Hind III and Xba I arrays using the GTYPE algorithm are 95.4% and 97.3%, respectively. For the FHS samples, approximately 6.2% of the Hind III arrays and 2.9% of the Xba I arrays failed to achieve a 90% call rate. However, the combined call rate for both arrays has a mean of 96.4%, with 3.8% of the samples having a combined call rate of < 90%. The relatively low call rate for the Hind III arrays has been seen by many groups working with the Affymetrix 100K Array set. The fraction of FHS samples with an overall call rate < 90% is also similar to that seen by other groups.

% Genotypes called per sample, Hind III arrays

% Genotypes called per sample, Xba I arrays

% Genotypes called per sample, both arrays combined
GTYPE Genotyping Accuracy
There are three situations in which we have determined the genotype of the same SNP in the same DNA sample more than once. We have analyzed these genotypes for discordant calls made by GTYPE and used the results to estimate how often genotypes are miscalled and what types of errors the discordant calls represent. The three situations are: 1) a single reference DNA sample that we have genotyped repeatedly throughout the course of sample processing; 2) several SNPs that are genotyped twice per sample on the genotyping arrays; and 3) several FHS samples for which we have repeated the genotyping process. We can also detect some errors as Mendelian inconsistencies and unlikely recombination events.
Repeated genotyping of reference DNA. We have repeatedly genotyped a common reference DNA alongside the genotyping of the FHS samples. The primary purpose of genotyping this reference DNA is to determine whether failures are due to the quality of the new DNA samples being processed or to the sample processing itself. To date, we have genotyped this reference DNA twelve times. As such, it also provides a rich dataset for analyzing discordant calls. The overall error rate for the genotyping of this DNA is 0.25%.
| | Hind III | Xba I | TOTAL |
| Genotype Calls | 643,441 | 667,670 | 1,311,111 |
| Errors | 2,415 | 890 | 3,305 |
| Error Rate | 0.38% | 0.13% | 0.25% |
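As a worked check of the arithmetic, the error rates can be recomputed from the per-array counts of genotype calls and errors reported in the table. The sketch below is illustrative only; the function names are made up for the example. It pools the two arrays by summing their counts, so the combined rate is weighted by the number of calls on each array.

```python
# Per-array (genotype calls, errors) for the reference DNA, from the table.
COUNTS = {"Hind III": (643_441, 2_415), "Xba I": (667_670, 890)}


def error_rate(calls: int, errors: int) -> float:
    """Errors as a fraction of genotype calls."""
    return errors / calls


def pooled_error_rate(counts: dict) -> float:
    """Sum calls and errors over the arrays before dividing."""
    total_calls = sum(calls for calls, _ in counts.values())
    total_errors = sum(errors for _, errors in counts.values())
    return error_rate(total_calls, total_errors)


for array, (calls, errors) in COUNTS.items():
    print(f"{array}: {100 * error_rate(calls, errors):.2f}%")
print(f"Pooled: {100 * pooled_error_rate(COUNTS):.2f}%")
```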
Further analysis of this data shows that the most common GTYPE error is heterozygote genotypes being miscalled as homozygotes. To determine this, we've assumed that the most frequent genotype across all twelve hybridizations of the reference DNA is the true genotype.
| Error Type | Hind III | Xba I | TOTAL |
| Heterozygote called as homozygote | 81.2% | 80.9% | 81.1% |
| Homozygote called as heterozygote | 17.2% | 18.9% | 17.9% |
| Homozygote called as wrong homozygote | 1.6% | 0.2% | 1.0% |
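The majority-vote assumption and the three error classes in the table can be sketched as follows. This is a hedged illustration rather than the analysis code behind the table: the genotype encoding, the function names and the decision to skip SNPs whose most frequent genotype is tied are all assumptions.

```python
from collections import Counter
from typing import List, Optional

Genotype = Optional[str]  # assumed: "AA", "AB", "BB", or None for a no-call


def consensus_genotype(calls: List[Genotype]) -> Genotype:
    """Most frequent called genotype across replicates; None if empty or tied."""
    called = [c for c in calls if c is not None]
    if not called:
        return None
    (top, top_n), *rest = Counter(called).most_common()
    if rest and rest[0][1] == top_n:
        return None  # assumption: a tie gives no usable "true" genotype
    return top


def classify_error(truth: str, observed: str) -> Optional[str]:
    """Return the error class, or None when the call matches the consensus."""
    if observed == truth:
        return None
    if truth == "AB":
        return "het_as_hom"
    if observed == "AB":
        return "hom_as_het"
    return "wrong_hom"


def tally_errors(replicates: List[List[Genotype]]) -> Counter:
    """replicates holds, per SNP, the calls from each hybridization."""
    tally: Counter = Counter()
    for calls in replicates:
        truth = consensus_genotype(calls)
        if truth is None:
            continue
        for call in calls:
            if call is None:
                continue
            kind = classify_error(truth, call)
            if kind is not None:
                tally[kind] += 1
    return tally


# Toy data: four SNPs, twelve hybridizations each; the last SNP is tied.
toy = [
    ["AB"] * 10 + ["AA", "BB"],
    ["AA"] * 11 + ["AB"],
    ["BB"] * 11 + ["AA"],
    ["AA"] * 6 + ["AB"] * 6,
]
errors = tally_errors(toy)
```

Dividing each count by the total number of errors gives percentages of the kind shown in the table.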
Comparison of replicate SNPs. There are 31 SNPs in common between the Hind III and Xba I arrays, so these SNPs are genotyped twice for each sample. We have looked for discordant calls across the FHS samples for these 31 SNPs to determine whether the error rate for the FHS samples is grossly dissimilar from the error rate observed with the reference DNA. For this analysis, a genotype must be determined for the SNP on both the Hind III and Xba I arrays for each sample. We observe a replicate SNP error rate (0.23%) that is quite similar to the reference DNA error rate (0.25%). This indicates that any difference in DNA quality between the reference DNA and the FHS sample DNAs does not greatly affect the accuracy of the genotyping.
| Genotypes | 55,118 |
| Informative Genotypes | 52,152 |
| Errors | 118 |
| Error Rate | 0.23% |
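The replicate-SNP comparison can be sketched as below. This is a minimal example under stated assumptions: each replicate is represented as a (Hind III call, Xba I call) pair, None marks a no-call, and the function name is hypothetical. The sketch counts pairs; the source does not say whether the table counts pairs or individual genotypes.

```python
from typing import List, Optional, Tuple

Genotype = Optional[str]  # assumed: "AA", "AB", "BB", or None for a no-call


def replicate_discordance(
    pairs: List[Tuple[Genotype, Genotype]]
) -> Tuple[int, int, float]:
    """Return (informative pairs, discordant pairs, discordance rate).

    A pair is informative only when the SNP was called on both arrays.
    """
    informative = [(h, x) for h, x in pairs if h is not None and x is not None]
    discordant = sum(h != x for h, x in informative)
    rate = discordant / len(informative) if informative else float("nan")
    return len(informative), discordant, rate


# Toy data: one discordant pair and one uninformative pair out of four.
toy_pairs = [("AA", "AA"), ("AB", "AA"), (None, "BB"), ("BB", "BB")]
n_informative, n_discordant, rate = replicate_discordance(toy_pairs)

# The table's own figures: 118 errors among 52,152 informative genotypes.
table_rate = 118 / 52_152
```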
Repeat genotyping of FHS samples with low call rates. We have encountered a small number of samples where either the Hind III or Xba I array gave a low call rate and we have repeated the genotyping for that array. Determining the error rate for those SNPs that were genotyped in both rounds of genotyping allows us to determine if the error rate increases as a function of decreased call rate. We repeated the Hind III array genotyping for three samples and the Xba I array genotyping for two samples. For these genotypes we observe an overall error rate of 0.13%. This indicates that low call rates per sample do not result in an elevated error rate.
| | Hind III | Xba I | TOTAL |
| Informative Genotypes | 253,636 | 224,244 | 477,880 |
| Errors | 556 | 67 | 623 |
| Error Rate | 0.22% | 0.03% | 0.13% |
Detection of genotyping errors by genetic consistency. We have done a family-based analysis of error rates using 328 pedigrees that have multiple family members from a set of 889 genotyped samples. This represents greater than 75.4 million genotypes. Within this analysis we find 9,923 Mendelian inconsistencies (where the observed offspring genotype is impossible given the observed parental genotypes) and 21,160 unlikely recombinants (where the observed offspring genotype requires a double-crossover event between nearby SNPs). These correspond to observed error rates of < 0.015% and < 0.03% respectively. These genetic error rates are much lower than the concordance-derived error rates due to the more limited number of conditions in which these errors can be detected.
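The idea of a Mendelian inconsistency can be illustrated for a single autosomal, biallelic SNP in a parent-offspring trio. This sketch is not how PedCheck works internally, since PedCheck handles general pedigrees and missing individuals. The function names and genotype encoding are assumptions. The last lines recompute the quoted rates; because the denominator is given as "greater than 75.4 million", the results are upper bounds.

```python
from itertools import product
from typing import Set


def possible_child_genotypes(father: str, mother: str) -> Set[str]:
    """All offspring genotypes obtainable from one allele of each parent."""
    return {"".join(sorted(a + b)) for a, b in product(father, mother)}


def is_mendelian_error(father: str, mother: str, child: str) -> bool:
    """True when the child's genotype is impossible given both parents."""
    return child not in possible_child_genotypes(father, mother)


# An AA x AB mating cannot produce a BB child.
assert is_mendelian_error("AA", "AB", "BB")
# An AA x BB mating can only produce AB children.
assert possible_child_genotypes("AA", "BB") == {"AB"}

# Upper bounds on the genetic-consistency error rates quoted in the text.
GENOTYPES = 75.4e6  # lower bound on the number of genotypes analyzed
mendelian_rate = 9_923 / GENOTYPES
recombinant_rate = 21_160 / GENOTYPES
print(f"Mendelian: < {100 * mendelian_rate:.4f}%")
print(f"Unlikely recombinants: < {100 * recombinant_rate:.4f}%")
```

A check of this kind only flags an error when the miscall produces an impossible genotype. For example, a heterozygous child of two heterozygous parents miscalled as a homozygote remains consistent with the parents, which is one reason the genetic-consistency rates are lower than the concordance-derived rates.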
To address whether the genetic consistency errors arise from the same sort of genotyping errors as detected through discordancy, we determined that there is a general positive relationship between the two error rates on a per SNP basis. The following graph shows the distribution of unlikely recombinant errors detected during the genotyping of the FHS samples (y-axis) as a function of the concordancy errors detected during the repeated genotyping of the Affymetrix reference DNA (x-axis). The high independently determined error rate of some SNPs indicates that not all SNPs are equally likely to be genotyped correctly and that these can be detected both as a result of concordancy errors and genetic consistency errors.

Genetic consistency errors as a function of concordancy errors
We are in the process of performing similar analyses of genotypes called by the BRLMM algorithm and hope to post this analysis as well as a comparison of the genotype calls from GTYPE and BRLMM in the near future.
References
Abecasis GR, Cherny SS, Cookson WO, Cardon LR: Merlin--rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 2002, 30:97-101.
O'Connell JR, Weeks DE: PedCheck: a program for identification of genotype incompatibilities in linkage analysis. Am J Hum Genet 1998, 63:259-266.