For example, the SNP 6.0 array from Affymetrix has a resolution of 100-200kb. What are the benefits of using high resolution cytogenetic microarrays? Can you give any examples from Affymetrix platforms?
For oligo and cDNA microarrays, the average resolution would be the size of the genome divided by the number of non-redundant cDNA clones or oligos on your array platform. It can vary depending on the gene density of a particular chromosome. For BAC arrays, if the clones cover both genic and intergenic regions, resolution is independent of gene density. If the BAC collection covers the whole genome, resolution is maximal or absolute, and clones overlap at some loci.
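To make that arithmetic concrete, here is a rough back-of-the-envelope sketch in Python; the genome size and probe counts are illustrative assumptions, not figures for any particular platform.

```python
# Rough average resolution = genome size / number of non-redundant probes.
# Illustrative numbers only; real platforms space probes non-uniformly.

GENOME_SIZE_BP = 3_200_000_000  # approximate human genome size

def average_resolution_bp(n_probes: int, genome_size: int = GENOME_SIZE_BP) -> float:
    """Average inter-probe spacing in base pairs."""
    return genome_size / n_probes

print(average_resolution_bp(60_000))     # hypothetical 60K oligo array: ~53 kb
print(average_resolution_bp(1_800_000))  # hypothetical 1.8M-marker SNP array: ~1.8 kb
```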
I don't know where your numbers come from (100-200kb resolution). The median SNP + CNV inter-marker distance is about 680 base pairs on the SNP 6.0 array. We have identified aberrations down to 5kb using the SNP 6.0 array, but of course this needs very good profiles and the use of specific algorithms, as you don't see such small aberrations using the standard Affymetrix GTC software.
I am not an expert on microarrays, but what I have gathered from the discussion is that since microarrays came onto the scene around a decade back, the technique has made significant progress and improved, so the 5 kb aberration size mentioned by Holzmann seems plausible and realistic, while the 100-200 kb mentioned by Aziz appears to be on the higher side, which I guess comes from some older microarray work published in cytogenetics. Since genetic databases, probe design methods and probing techniques have also improved data reliability, reproducibility and sensitivity, the aberration sizes that can be reported per genomic interval must certainly have improved, and it is anybody's guess what they might be in the coming decade. I am sure any microarray performed 10 years ago would show significant differences in its numbers, and also in data interpretation, if performed today using current technology. Kindly correct me if I have any misunderstanding regarding this.
I think you are getting caught up on this word 'resolution'. The resolution of any array platform will be a direct reflection of several experimental conditions, including DNA quality and experimental noise. Furthermore, the resolution of your platform will depend on the size and degree of copy number change present in your sample. In the context of cancer, the degree of clonal heterogeneity and the level of normal cell contamination will also be important. The larger the copy number change (defined by the inclusion of more array features), the easier it will be to detect. A small change (defined by the inclusion of only a few array features) will be more difficult to detect. The degree of copy number change is also important. Amplifications and bi-allelic deletions are easier to detect because the change in copy number is larger relative to the reference genome. A single-copy duplication is only a 50% change in copy number and hence will be more difficult to detect.
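As a rough illustration of that last point, below is a small Python sketch of the expected log2 ratios for different copy number states against a diploid reference, including the effect of a hypothetical 50% normal-cell contamination; the values are idealized and ignore noise entirely.

```python
import math

def expected_log2_ratio(tumor_copies: float, normal_fraction: float = 0.0,
                        reference_copies: float = 2.0) -> float:
    """Idealized log2 ratio of test vs. reference, optionally diluted by normal cells."""
    mixed_copies = (1 - normal_fraction) * tumor_copies + normal_fraction * reference_copies
    return math.log2(mixed_copies / reference_copies)

print(expected_log2_ratio(3))                       # single-copy gain: ~ +0.58
print(expected_log2_ratio(1))                       # single-copy loss: -1.0
print(expected_log2_ratio(8))                       # high-level amplification: +2.0
print(expected_log2_ratio(1, normal_fraction=0.5))  # same loss, 50% contamination: ~ -0.42
```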
Resolution is the smallest change that can be confidently detected, and it is therefore not a direct reflection of the number of array features. After all, with oligonucleotide arrays, more than one feature is used to define an aberration. The number of array features used to define an aberration will vary based on data quality and the software algorithm: some use a minimum number of features, some average over a genomic region. The type of bioinformatic approach will also have a huge effect on resolution.
So my take home message is that it is all variable, and resolution will differ within and between samples. It is critical that you confirm uncertain results with other approaches.
100-200Kb seems reasonable. We have run about 400 SNP6 arrays, and with good quality data we can identify 40Kb deletions that we have confirmed with other molecular approaches.
The resolution of a microarray is given not by the number of probes but by their spacing, i.e. the distance between the genomic position of one probe and that of the next. It is also very important to distinguish between the real and the theoretical resolution of an array.
If the average spacing between the probes is, for example, about 20 kb (theoretical resolution), and we choose to report only the microdeletions shown by at least three consecutive probes, the effective resolution of our array will be about 60 kb. There may also be a difference between the resolution for microdeletions and for microduplications: if we choose to report only the microduplications shown by at least four consecutive probes, the resolution in that case will be 80 kb.
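For what it is worth, that calculation can be written as a one-line helper; the spacing and probe counts below are simply the values from the example.

```python
def effective_resolution_kb(avg_spacing_kb: float, min_consecutive_probes: int) -> float:
    """Effective resolution ~ average probe spacing x minimum consecutive probes required."""
    return avg_spacing_kb * min_consecutive_probes

print(effective_resolution_kb(20, 3))  # deletions called on >=3 probes   -> 60 kb
print(effective_resolution_kb(20, 4))  # duplications called on >=4 probes -> 80 kb
```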
Resolution indicates the number of probes on a microarray platform. The number of probes determines the resolution with which you can view the genome; that is, as the number of probes increases, the coverage of the genome increases and the space between the probes decreases. When the space between the probes decreases, it enables you to detect microdeletions and amplifications (focal aberrations). For example, Agilent arrays are available on many platforms which differ in resolution: the 60K arrays have 60,000 probes with a spacing of about 40 kb between probes, while a 40K array has 40,000 probes and a spacing of about 70 kb. The concept is quite simple; hope it is of some help.
Coming to your question, your SNP 6.0 arrays can detect focal aberrations (very small, micro aberrations) as small as 100 kilobase pairs. The benefit is that this enables you to look for copy number changes involved in those focal aberrations (which might harbor putative oncogenes, tumor suppressor genes or genes of your interest).
Dear Sujatha Rao, unfortunately things are not as simple as you say. The resolution of an array expresses its ability to detect aberrations of a given size; for example, a “resolution of 75 Kb” means that the system is capable of detecting aberrations from a size of 75 Kb upwards. The number of probes present on an array is only one of the elements that contributes to determining the resolution, but it is not the only one. More important than the absolute number is the spacing between the probes: in many platforms the probes are not spaced uniformly, but are concentrated in the regions of greatest interest (so we often speak of “average” resolution). In some platforms the probes are replicated to increase diagnostic confidence; in others, probes for different applications (e.g. SNP + CNV) coexist on the same array.

Another factor that contributes to the resolution is the number of probes that we decide to require in order to validate an aberration. Usually, with oligo-based platforms, only deletions involving at least three consecutive probes are reported, while only duplications involving at least four consecutive probes are reported. These parameters, however, may vary for technical reasons or due to a personal choice of the researcher, so each laboratory should declare its specific resolution with the platform it uses.
Thanks for the information you have provided. I work with CGH arrays (8x60K). I know a bit about them, but I really have no idea how SNP+CNV arrays work. I have not fully understood your point about a “resolution of 75 kb”; I could make out that it enables detection of aberrations of 75 kb in size. Does that mean even the size of each probe matters? Because the idea I have is that as the number of probes increases, the coverage of the genome increases and the average space between the probes decreases, and, as you said, at the end we need to apply a specific change-point method to define gains and deletions, which has to be decided based on the size of the aberrations we are interested in.
So I understand from your reply that the resolution of an array is defined by 1. the number of probes, 2. the average spacing between the probes, 3. ? (Could you please fill in the numbers?)
It was a very informative reply. If possible, could you please explain the other resolution factors?
Sorry for the delay in answering, but I was out of the office for a conference.
I will try to explain the unclear points. First of all, your idea that as the number of probes increases, the spacing decreases (and thus the resolution increases) is correct, but the matter is complicated by the fact that more recent arrays tend to have non-uniform spacing. The probes are laid out with a smaller spacing in areas of greater interest (important genes, regions whose imbalance is known to cause pathologies, etc.). There is a consortium (ISCA) which has established these regions of interest, and many arrays have been standardized to study in greater detail precisely the regions indicated by this consortium. The rest of the genome is covered with probes at a lower density (wider spacing, lower resolution) to provide a “backbone” that allows even imbalances in less critical regions to be identified, while using a smaller number of probes.
A further complication is the fact that some probes on an array can act as quality controls, or the probes may be used in replicate, or some may be used for genotyping, as occurs in CGH + SNP platforms. For example, on an Agilent 4x180K CGH + SNP array, each of the 4 areas of the slide carries about 180,000 probes, of which 110,712 are for CGH (600 of these replicated 5x), 59,647 are for SNPs, and 8,121 are for quality control. With these arrays it is possible to detect copy number changes as well as copy-neutral aberrations, such as loss of heterozygosity (LOH) and uniparental disomy (UPD), but the resolution is lower than that of a 4x180K "CGH only" array.
“Resolution of 75 Kb” means precisely that the system used is capable of detecting aberrations from a size of 75 Kb upwards. The presence of a deletion or a duplication is indicated by the deviation of each probe with respect to a baseline, which corresponds to 0 (this deviation in turn depends on the ratio between red and green fluorescence in each single spot of the array). Consider that the most common probes used in arrays are oligomers (60-mers), so they are very small, and that, for technical reasons, a single probe may show a deviation from the baseline even in the absence of rearrangements. For this reason we can consider as true rearrangements only those imbalances indicated by several consecutive probes that show the same deviation from the baseline, both in the positive range (duplications) and in the negative range (deletions). Some guidelines recommend considering only deletions indicated by at least three consecutive probes and duplications indicated by at least four consecutive probes. Depending on the quality of the experiment, or on personal choices of the individual researcher, this minimum number can vary, and consequently so does the resolution, which is given by the average spacing of the probes multiplied by the minimum number of probes that we decide to require. For example, if the average spacing of the probes in our array is 25 Kb, and we decide to require three probes for deletions and four probes for duplications, the resolution will be 25x3 = 75 Kb for deletions and 25x4 = 100 Kb for duplications.
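To make the “minimum consecutive probes” rule concrete, here is a deliberately simplified Python sketch; it assumes the probes are already sorted by genomic position, uses an arbitrary log2 ratio threshold of 0.3, and is of course no substitute for the segmentation or HMM-based algorithms that real analysis software uses.

```python
def call_aberrations(log2_ratios, threshold=0.3, min_probes_del=3, min_probes_dup=4):
    """Return (start_index, end_index, 'del'/'dup') for runs of consecutive probes
    deviating from the baseline in the same direction."""
    calls, run_start, run_sign = [], None, 0
    for i, r in enumerate(list(log2_ratios) + [0.0]):  # trailing 0 closes the last run
        sign = 1 if r > threshold else (-1 if r < -threshold else 0)
        if sign != run_sign:
            if run_sign == -1 and i - run_start >= min_probes_del:
                calls.append((run_start, i - 1, "del"))
            elif run_sign == 1 and i - run_start >= min_probes_dup:
                calls.append((run_start, i - 1, "dup"))
            run_start, run_sign = i, sign
    return calls

example = [0.0, -0.9, -1.1, -0.8, 0.1, 0.6, 0.5, 0.0]
print(call_aberrations(example))  # [(1, 3, 'del')] -- the 2-probe gain is not reported
```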
For these reasons the number of probes on the array is important, but the spacing is even more important.
I hope I have been able to explain the situation a little better.
Thanks for the reply. Sorry, I was pretty busy all these days; I will read your write-up above soon. I just have a few queries regarding normalization of CGH data (8x60K). I was actually given feature-extracted and normalized data and was told that the normalization used was linear global normalization. If by any chance you have any idea about it, could you please explain what this normalization does and what its advantages are compared to others? Could you also let me know the different steps of feature extraction in brief? I am in urgent need of answers to these questions and will be thankful if you let me know as soon as possible.
Data normalization is a really complicated topic! It is necessary because the ratio between red and green fluorescence, and the overall fluorescence intensity of the individual spots, reflect not only the quantitative ratio between test DNA and reference DNA but also technical factors. There are several methods, each with advantages and disadvantages. Sorry, I cannot say anything more precise about the linear global normalization method.
Feature Extraction consists of two major processes: image analysis, to place the grid and locate spots, and data analysis, to define and measure spot features. For all images, the protocols define features and background regions, remove outlier pixels, and flag features and background regions that may affect the reliability of the results. In addition, the protocols set up the extraction to subtract background information from features and make a background adjustment on signals of low intensity. The software then calculates a reliable log ratio, p-value, and log ratio error for each feature to give you a confidence measure in the measured log ratio.
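As a very rough illustration of those steps (not the actual Agilent Feature Extraction algorithms), here is a toy Python sketch of background subtraction, per-spot log2 ratios, and a simple global median centering as one generic example of normalization; all signal values are made up.

```python
import math

def log2_ratios(red, green, red_bg, green_bg):
    """Background-subtracted log2(red/green) per spot, floored to avoid log(0)."""
    out = []
    for r, g, rb, gb in zip(red, green, red_bg, green_bg):
        r_net = max(r - rb, 1.0)
        g_net = max(g - gb, 1.0)
        out.append(math.log2(r_net / g_net))
    return out

def median_center(ratios):
    """Shift all log2 ratios so their (crude) median sits at 0."""
    m = sorted(ratios)[len(ratios) // 2]
    return [x - m for x in ratios]

raw = log2_ratios([900, 2100, 450], [1000, 1000, 1000], [50, 50, 50], [60, 60, 60])
print(median_center(raw))
```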
I agree with Dr Lonardo that data analysis is one of the trickiest parts of aCGH analysis (apart from the perfect wet-lab part, and apart from the perfect operator-dependent retrieval and extraction of samples :-) ).
Indeed, all calculations based on average spacing are a bit misleading. If a sample is noisy, you may well have to use more probes to call a variant even though fewer probes may be needed for very clean (i.e. high signal/noise ratio) specimens.
Also, algorithms vary according to the application you want to use aCGH microarrays for (germline, somatic, tumor samples with subclones, sorted samples, cell lines, sorted cell lines, etc.). For example, while segmentation algorithms generally work better with messy biology (e.g. cell lines or somatic tumors, where you have a plethora of alterations), hidden Markov model-based methods are considered superior for germline variant detection, in a context where the vast majority of probes is expected to give what corresponds to a diploid signal (assuming you're working with mammals), or where density is variable (classic case --> Agilent chips).
The truth is, you should run preliminary experiments with a true positive (i.e. an amplified or deleted sample, according to what you're looking at) and a true negative (i.e. a truly diploid sample, maybe a universal reference), and try different dilutions with replicates to see what threshold of probe sets and other parameters gives you the real result most of the time. This is seldom done, however, for economic and practical (this often means hurry) reasons; this is also the reason why agreement between different analyses of the same raw file is so low (please see this discomforting paper, which is, however, intriguing: I attached it).
Thanks for your reply. Your information has been of help to me.
If anyone is familiar with p-values, I would like to clear up a query. I have an observation where I did a survival analysis (univariate log-rank test) and the survival curves were represented by Kaplan-Meier plots. During this I ended up with a confusing observation: the two groups were completely separated at a p-value of 0.065, which would mean the probability of the result occurring by chance is 6.5%, yet the curves are widely separated. So what conclusion can be drawn from this? Should I conclude that the survival difference was not significant, or that it was non-significantly associated with better survival at a value of 0.065? I will be thankful if someone provides an appropriate answer.
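In case it helps, here is a minimal sketch of a two-group Kaplan-Meier / log-rank comparison using the Python lifelines package; the survival times and event indicators below are invented purely for illustration, and whether 0.065 counts as significant ultimately depends on the alpha level chosen in advance.

```python
# Minimal two-group survival comparison; all data below are made up.
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

group_a_time  = [5, 8, 12, 20, 24, 30]
group_a_event = [1, 1, 1, 0, 1, 0]      # 1 = event observed, 0 = censored
group_b_time  = [10, 15, 22, 28, 35, 40]
group_b_event = [1, 0, 1, 0, 0, 0]

kmf_a = KaplanMeierFitter().fit(group_a_time, group_a_event, label="group A")
kmf_b = KaplanMeierFitter().fit(group_b_time, group_b_event, label="group B")

result = logrank_test(group_a_time, group_b_time,
                      event_observed_A=group_a_event,
                      event_observed_B=group_b_event)
print(result.p_value)  # compare with the pre-specified alpha (e.g. 0.05)
```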