I am solving a structure for which I have X-ray data to 1.5 A resolution. However, there are not many spots in the highest resolution region, so when I solve the structure (with a good model from a closely related protein) everything looks very good in terms of density fitting and stats, but the overall R and Rfree factors are still quite high (0.23 and 0.26).
The only way I have found to lower the R factors is to discard the higher-resolution data. For instance, if I do not include the highest-resolution data, both R and Rfree come down. So, when solving a protein structure, is it preferable to have higher resolution or lower R factors?
Higher resolution is always better (although I still have nightmares about the 1A structure I refined).
However, you need to give more details. Look carefully at the processing statistics and use them to judge what cutoff to apply - it is not just about where you can see spots. What are the Rmerge, completeness and redundancy in the highest resolution shell?
The aim of refinement is not to reduce the R factors. When solving a structure (for example using anomalous scattering), it is preferable to process the data so that you obtain the best possible signal. Once the structure is solved, you can go back to the original frames and try to extract as much information as possible (without integrating noise in the belief that the noise consists of Bragg peaks). For this, a couple of tools are useful: the relatively new statistical indicator CC1/2 (reported by XDS and Aimless), plus a plot of Rsym as a function of resolution - there is an inflexion point where the curve shoots up, which indicates the high-resolution limit.
But again, the aim of refinement is not to reduce the R factors. The aim is to generate a model that best fits all the available data (including the biochemical data, the restraints obtained from structural libraries, etc.).
Have a look at this article:
http://www.sciencedirect.com/science/article/pii/S0969212602007438
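To make the "inflexion point" idea above concrete, here is a minimal Python sketch. It assumes you have copied per-shell resolution, Rsym and CC1/2 values out of the XDS or Aimless summary table; the shell values and both thresholds below are made up for illustration, not community standards.

```python
# Minimal sketch: flag the resolution shell where Rsym "shoots up" and CC1/2
# collapses. The shell values are hypothetical; in practice paste them from
# the XDS CORRECT.LP or Aimless summary table.

shells = [
    # (d_min in Angstrom, Rsym as a fraction, CC1/2)
    (2.5, 0.06, 0.998),
    (2.1, 0.09, 0.995),
    (1.9, 0.15, 0.980),
    (1.7, 0.30, 0.900),
    (1.6, 0.55, 0.650),
    (1.5, 0.95, 0.300),
]

def suggest_cutoff(shells, cc_half_min=0.5, rsym_jump=2.0):
    """Return the last d_min that still looks like signal.

    A shell is rejected once CC1/2 drops below cc_half_min or Rsym more than
    doubles relative to the previous shell (a crude 'inflexion point' test).
    Both thresholds are illustrative choices only.
    """
    keep = shells[0][0]
    for (_, r_prev, _), (d, r, cc) in zip(shells, shells[1:]):
        if cc < cc_half_min or r > rsym_jump * r_prev:
            break
        keep = d
    return keep

print(f"Suggested high-resolution cutoff: {suggest_cutoff(shells):.2f} A")
```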
Last year I published a structure at 1.85 A with an Rfree of 0.22. For me that was the best balance I thought I could get.
Hi Matteo,
The quick answer to your question is that it is preferable to have higher resolution rather than lower R factors. Limiting your resolution will lower the R factors because the data in the lower resolution shells generally have a higher signal-to-noise ratio. Your R factors are not excessively high, but there may be things missing from the model. Have you added all solvent molecules to your model? Have you done TLS refinement? Have you modelled disordered residues?
The recommended high-resolution cutoff of the X-ray data depends on the crystallographer, and it has been discussed recently on the CCP4BB. Some say cut when I/sigI < 2, others say do not cut at all, others cut more severely.
What is clear is that if the maps look poorer after cutting to lower resolution, then you are throwing out good data! My advice is to keep the high-resolution data unless it makes the maps look worse. The higher-resolution data are generally weaker and should only improve the map. Ice rings and powder rings can pollute maps if they are integrated into the reflections and not discarded during scaling.
For more information read: Acta Cryst D 69 1204-1214 "How good are my data and what is the resolution?" Philip R. Evans and Garib N. Murshudov
Good luck with the rebuilding and refinement.
Cheers,
Bill
Contrary to your suggestion, they are not in conflict! Good resolution and good R factor go hand in hand.
If you have usable high resolution data, it will be much easier to make a good structural model, and due to the better model, your R-factor will be lower than for a bad crystal that does not diffract far.
Of course, after obtaining a good high resolution model you can make it "seem" even better by removing the weakest observations and thereby further lowering R. But that is cheating and does not improve the structure.
Have a look at this publication and then reprocess your data to see what the true resolution cutoff is. Also, by IUCr standards, your overall completeness should be >93% and at least >70% in the last resolution shell.
http://www.ncbi.nlm.nih.gov/pubmed/22628654
The R factor reflects the statistical goodness of fit, i.e. the agreement between your proposed structural model and the electron density calculated from your refined data. Resolution, on the other hand, depends on the quality of the data you collected, not on the proposed molecular model. Hope this helps.
You are still working on your structure, so what do your current R factors really mean? You could first finish your model at a modest resolution using your favourite cutoff on I/sigI, Rmerge, Rpim or whatever, and then use the method described by Karplus and Diederichs to see how far you can push your resolution.
Note that if you use higher resolution data, you may see more waters and alternate conformations, which you then need to build to lower your R-factors. You probably also need to change your refinement parameters. You may even be able to move to anisotropic B-factors.
If it were my dataset, I would make the model feature complete (i.e. build loops and obvious waters, solvent molecules, etc) to 1.9 or 1.8 A and then throw the thing into PDB_REDO with all the data. It will use the Karplus-Diederichs method to find a proper resolution cut-off and then optimize your refinement parameters for this new cut-off. It will also do some rebuilding with the new (potentially higher-resolution) maps. Then continue from there. Note that this is a shameless plug of my own work ;)
Are you talking about solving the structure or refining it? Those two are not the same.
As Alice pointed out correctly: higher resolution is always better, and you should include ALL the data you collected rather than cutting at an arbitrary resolution limit just to improve your residual values. Of course one should not refine against noise either, so a sensible high-resolution cutoff is indeed needed. You should base this on I/sigma and the merging R-values. For example, it seems generally accepted that if the I/sigma values for all reflections beyond a certain resolution are below 1.5, the data should be cut at that resolution. Similarly, merging R-values (Rsigma, Rint, Rpim) consistently higher than about 40-60% indicate noise rather than data.
Never cut your data for the sole reason of improving the R-values of the refinement. That is not good science. Frederik said quite correctly that reducing the R-values is not the primary goal of the refinement.
If you want to see what effect excluding the high-resolution data has, you can check its impact on the standard uncertainties and on the terms of the correlation matrix. You will notice that even though the R-values go down when the high-resolution data are excluded, the standard uncertainties (if calculated) increase and the correlation effects become stronger. This clearly indicates that the model based on all the data is better than the one refined against only the low-resolution reflections, in spite of the lower R-values of the latter model.
@Peter, I disagree with the 1.5 sigma or 40% R-value statement. Most datasets do contain true data down to 0.5 sigma when judged by CC(1/2). The R factors may be sky-high, as much as 300%, but you may still have 100% completeness in that shell. Would you throw away that data?
@ Juergen: Yes, I would throw this away, but I would not consider it "data" - rather "noise". I can collect 100% completeness to ANY resolution even without an actual crystal present, if I am satisfied with I/sigma of less than 1 and merging R-values above 60%.
I asked a similar question about a week ago on the CCP4BB.
I was made aware of the 2012 Karplus and Diederichs paper and started to reprocess all my unpublished data with XDS and re-solve them from scratch, to keep the Rfree column clean.
Even the heavily anisotropic structures (
It is always a compromise. When you say that there are not a lot of reflections at high resolution, what exactly do you mean? What is the completeness? One way to decide the cutoff is the completeness in the last resolution range. You also need to look at the R factors of the last resolution range; I would not use data with an R factor exceeding 0.6 in the last resolution bin. Hope this helps.
Matteo,
Your R factors are really a little high, even if you cut down the resolution. If you have NCS, don't restrain it any more above ~2.0 A. Also, try refining the B factors anisotropically without TLS; you should have enough data to do so, and this should give you a more accurate fit. Also make sure you build all solvent atoms and all alternate side-chain conformations that you see in the map.
I would first make sure the space group is correct and discard all data whose I/sigma is lower than 1 (although more conservative crystallographers might say use 2 as the I/sigma cutoff); this will fix your highest resolution. Once the I/sigma in the highest resolution bin is acceptable, make sure you have enough reflections in the Rfree set (>1000). During refinement, R factors can be higher than usual for many reasons (crystal quality, cryoprotection, crystal handling), but I wouldn't worry about it (the R values you mention are on the high side but not ridiculously so), as long as there is no over-refinement (Rfree > Rwork + 5%).
Check out this article http://www.sciencedirect.com/science/article/pii/S0969212602007438 to have a better idea of what is "too high" and "too low".
At the end of the day what counts is the quality of the density, since that will determine how you build the model which then ultimately guides your science. One can judge the quality of the density visually quite well. Focus in on a critical area, see at what resolution you get the best density and go with that.
@Stefan: With local NCS (at least in the Buster and Refmac implementations), it is not sensible to switch off NCS restraints. Even at high resolution they still help, and any genuine difference will cause the related restraints to be removed automatically.
As for anisotropic B-factors, I agree that they may work here, but they should only be used if they cause a genuine improvement of the fit with the data. That should be tested, see http://scripts.iucr.org/cgi-bin/paper?dz5234
Hi Matteo,
This is a very difficult question... But still there are rules, at least at our center.
One has to check everything together: completeness, redundancy, I/sigma(I) and Rmerge... because these parameters are correlated... For example, Rmerge is higher for highly redundant data, and Rmerge is lower for incomplete data (let's talk about the last shells)...
So, first I would run the scaling with 20 shells...
And I would not keep any shell with I/sigma(I) below 2.0... As for Rmerge, if the redundancy in the last shell is below 4, I would not trust data with an Rmerge above 50%. But if the redundancy is 8 or higher, then the low 60s are fine...
As for completeness, I would prefer above 90% for the last shell, but I would accept high 80s if that helps with the refinement...
So, in general... the higher the resolution, the lower the R factor one should reach. And if cutting the resolution improves the R factor, that is an indication of problems in the high resolution shells...
And there is not a big difference in the density between 1.8 A and 1.5 A structures if they have the same solvent content. But in the data statistics with 20 shells, that will be three shells... so cutting just one shell is not so critical...
I hope this helps.
best wishes... )))
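The rules of thumb in the post above can be written down as a small checklist. A minimal Python sketch, with thresholds taken from that post (they are one lab's conventions, not universal standards):

```python
# A literal transcription of the rules of thumb above into a checklist.
# Thresholds follow the post (I/sigma >= 2, Rmerge limits depending on
# redundancy, completeness ~90% or high 80s).

def last_shell_ok(i_over_sigma, r_merge, redundancy, completeness):
    """Return True if the last resolution shell passes these rules of thumb."""
    if i_over_sigma < 2.0:
        return False
    r_merge_limit = 0.50 if redundancy < 4 else 0.62   # laxer for redundant data
    if r_merge > r_merge_limit:
        return False
    return completeness >= 0.85   # "above 90%, but high 80s acceptable"

# Example: I/sigma 2.3, Rmerge 55%, redundancy 8, completeness 91% in the last shell
print(last_shell_ok(2.3, 0.55, 8, 0.91))   # True under these assumptions
```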
@ Alice and Danial: my Rmerge in the high-resolution shell is 0.501, with 96.1% completeness and a multiplicity of 2.5
@ William and Peter: yes, I have added all the solvent molecules and looked for residues with alternative conformations. I have done TLS refinement (that helped a bit!). I will try playing with the I/sigI parameter to see if it makes any difference.
@Rob: I agree with you... I was surprised that my R factors are not lower, even though my model seems to be very good.
@ Jurgen: thank you for the useful publication. My completeness values are all OK: 96.1% in the outer shell, 82.8% in the inner shell, and 92.3% overall.
@ Stefan: thanks a lot, I will try to reprocess my data as you suggested!
Thank you so much for all your useful comments!
Hello Matteo,
Have you checked your data set for images/frames with significantly lower signal quality, either single ones or several in a row? Often the last frames are the worst, due to radiation damage. If you have not done so already, I would check, e.g., the SCALA log file for BATCH numbers with Imean/SIGM0 and Rmerge too high above average. Discard these batches and scale the data again, as long as the completeness is not reduced too much (about 80% in the highest resolution shell is considered the limit). If your Rpim (and probably the other R factors as well) goes down, you can expect better R factors in the refinement - but use the same test set for the Rfree calculation when you continue working with the current model.
One additional remark; the sentence should read: If you have not done so already, I would check, e.g., the SCALA log file for BATCH numbers with Imean/SIGM0 too far BELOW and Rmerge too high above the average.
Of course, high I/sigma is desired!
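A minimal sketch of the batch check described above, assuming the per-batch Imean/sd and Rmerge columns have already been extracted from the SCALA/Aimless log into plain lists (the parsing step depends on the exact log layout and is omitted; the 2-sigma threshold is arbitrary and purely illustrative):

```python
# Sketch of the batch check described above. It assumes the per-batch Imean/sd
# and Rmerge values have already been pulled out of the SCALA/Aimless log into
# two lists; the parsing itself depends on your log layout and is omitted.
import statistics

def flag_damaged_batches(i_over_sd, r_merge, n_sigma=2.0):
    """Return 1-based batch numbers whose Imean/sd is well below the average
    or whose Rmerge is well above it ("well" = n_sigma standard deviations,
    an arbitrary illustrative threshold, not a SCALA default)."""
    mean_i, sd_i = statistics.mean(i_over_sd), statistics.pstdev(i_over_sd)
    mean_r, sd_r = statistics.mean(r_merge), statistics.pstdev(r_merge)
    bad = []
    for batch, (s, r) in enumerate(zip(i_over_sd, r_merge), start=1):
        if s < mean_i - n_sigma * sd_i or r > mean_r + n_sigma * sd_r:
            bad.append(batch)
    return bad

# Hypothetical run of 240 images with radiation damage towards the end:
i_over_sd = [20.0] * 180 + [12.0, 9.0, 7.0] * 20
r_merge   = [0.05] * 180 + [0.15, 0.25, 0.40] * 20
print(flag_damaged_batches(i_over_sd, r_merge))
```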
Out of curiosity, which data reduction program are you using ? HKL2000, (i)Mosflm, d*trek or XDS ?
@ Peter. You were right... I had a look at the SCALA log file and realized that Imean/SIGM0 goes down and R goes up over the course of the frames. I collected 240 images, and from image 180 onward radiation damage is clearly pushing the R up and the ratio down.
So I tried integrating only the first 180 frames and obtained much better Rmerge values in my SCALA summary table, but the completeness dropped too much (78% overall). I will now try using more images (maybe 210) and see if I can get the completeness back on track. Thanks!
Matteo,
the true test for the resolution cutoff is simple. It is the refinement.
First, refine your model at the highest possible resolution, using the CC(1/2) criterion or the one you like most.
Then refine your model with the SAME strategy, but using data extending 0.1-0.2 A less far in resolution. Compare the R factors of the lower-resolution refinement with the R factors of the higher-resolution model recalculated against that same lower-resolution data. If including that extra shell adds useful information to your model (lower R factors, smaller Rwork to Rfree difference), you should use that data, regardless of the Rmerge, I/sigma, high-shell completeness, etc. If not, make the cut.
Using an I/sigma criterion is well-intended advice, but be aware that the sigma estimate (it is not a measurement, it is an estimate) influences the I/sigma criterion more than anything else in high-resolution bins. Don't let people tell you that you are fitting noise based on a parameter that is guessed.
Especially since there is no such thing as a _real_ isotropic diffraction pattern. All diffraction patterns _are_ anisotropic to some extent (at least in my opinion), although I'd like to be lectured by someone who knows better if there are exceptions, e.g. in cubic space groups.
S.
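For what it is worth, the bookkeeping behind the paired-refinement test described above can be reduced to a few lines. This sketch does not run any refinement; it only formalises the comparison between the two runs, and the function and variable names are hypothetical:

```python
# Bookkeeping for the paired-refinement test described above (Karplus &
# Diederichs, Science 2012). The four R values come from two refinements done
# with the SAME strategy and SAME free set.

def keep_extra_shell(r_work_low, r_free_low,
                     r_work_high_at_low, r_free_high_at_low,
                     tolerance=0.0):
    """Decide whether the extra 0.1-0.2 A of data added useful information.

    *_low:          model refined WITHOUT the extra shell, R factors against
                    the lower-resolution data.
    *_high_at_low:  model refined WITH the extra shell, R factors recalculated
                    against the same lower-resolution data.
    Keep the shell if the higher-resolution model explains the common data at
    least as well (within 'tolerance') and the Rfree-Rwork gap does not widen.
    """
    better_free = r_free_high_at_low <= r_free_low + tolerance
    gap_ok = (r_free_high_at_low - r_work_high_at_low) <= \
             (r_free_low - r_work_low) + tolerance
    return better_free and gap_ok

# Hypothetical example: the extra shell leaves Rfree (on the common reflections)
# essentially unchanged and narrows the gap slightly -> keep it.
print(keep_extra_shell(0.215, 0.245, 0.214, 0.243))   # True
```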
Hi Matteo,
Since you are using SCALA, you could try the "Reject Outliers" option, stepwise reducing the SD values of the rejection criterion by one unit (the defaults are > 6.0 and > 8.0 SD from the mean, respectively). Again, this could reduce your completeness, but sometimes a few spots that increase your R factors significantly can be eliminated. Also, a compromise of discarding several batches and perhaps limiting the resolution to 1.6 Angstrom may suffice to obtain reasonable completeness and better R factors.
If I could change one thing about modern structural biology, I would change its over-reliance on numerology. Numbers are useful, but they are no substitute for common sense. You are worried because your completeness dropped to 78% overall. Who said this was a magic number?
Look, your goal is simple: to obtain the most accurate atomic model possible, with a precision that is justified by the quality of the data you are able to measure. Here are some mantras for you to recite.
There is no substitute for high resolution.
No method known can convert bad data into good data.
When you have enough experience, the quality of your electron density map will be a better guide as to how you're doing than any other measure.
Don't throw high resolution data away until your structure is finished. Even relatively imprecise measurements at high resolution can help your map. At the end, if you need to get the numbers better to satisfy some arbitrary criterion of "correctness", you can truncate the data to lower resolution then. But when you are trying to figure out what your molecule looks like and what it is doing, anything at high resolution except truly dreadful data is likely to be helpful.
@Gregory: Well, numerology is pseudo-science, while a signal-to-noise criterion of 1.5 or even 2 is simply a physical, common-sense limit. If the completeness in the highest shell is only 60-70% when the overall completeness is 78% (in addition to I/sigma < 2), there is not much information left, and the structure is then refined against a lot of noise.
Bad data cannot be turned into good data, but there is no need to include them; they can be filtered or separated from the good data.
The problem with using the map as a quality measure is that it is rather subjective, and the difference between 1.5 and 1.6 Angstrom is most likely marginal, in particular if the noise increases significantly in the 1.5-1.6 A range. Also, maps are not equally good in every part of the asymmetric unit and depend largely on the model (phases and Fcalc), not only on Fobs, unless you are using a purely experimental map. Therefore, I prefer purely experimental parameters as cutoff criteria, such as I/sigma, completeness, Rpim, or the Wilson plot.
The Wilson plot (output as a graph from SCALA) is another helpful indicator of the high-resolution limit of the data: a linear extrapolation of the falling curve ln(I/I_th) to the x-axis gives a good estimate of a reasonable data cutoff.
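A minimal sketch of that Wilson-plot extrapolation, assuming the (1/d^2, ln(<I>/I_th)) pairs have been read off the SCALA Wilson-plot graph; the data points below are invented, only the fit-and-extrapolate step is the point:

```python
# Sketch of the Wilson-plot extrapolation mentioned above.
import numpy as np

x = np.array([0.10, 0.15, 0.20, 0.25, 0.30, 0.35])   # 1/d^2 in A^-2
y = np.array([3.2, 2.6, 2.0, 1.4, 0.8, 0.2])         # ln(<I>/I_th), falling part

slope, intercept = np.polyfit(x, y, 1)   # least-squares line through the tail
x_cross = -intercept / slope             # where the line meets the x-axis (y = 0)
d_cutoff = 1.0 / np.sqrt(x_cross)        # convert 1/d^2 back to a resolution in A

print(f"Wilson B estimate: {-2.0 * slope:.1f} A^2  (with x = 1/d^2 the slope is -B/2)")
print(f"Extrapolated cutoff: ~{d_cutoff:.2f} A")
```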
Peter Goettig,
I have a dataset that, after processing in XDS or HKL2000, has I/sigma ~0.7 and ~60% completeness in the highest shell.
After anisotropic truncation and scaling I get ~1.7 I/sigma and ~18% completeness in that shell.
Rpim and Rrim are way beyond 100% in both cases, and the multiplicity is 2.3.
I can literally see those reflections. Why should I reject them?
What makes you so sure the 1.5-2.0 I/sigma limit has any relevance? How do you know there are no useful reflections in that resolution bin? Why should I exclude it a priori without testing it in refinement first? How do you even know that the sigmas that determine the I/sigma value are correct in that shell? What makes you so sure this issue does not apply to spherical data the same way it applies to elliptical data?
Please help me out here. I am lost.
S.
Hello Stefan,
the relevance of signal intensity is very simple: at 0.7 I/sigma your average signals are even smaller than the errors of the measurements. They disappear into the background noise, and you will get no useful information from such unreliable reflections. The 2:1 I/sigma criterion holds true as a detection limit for all sorts of physical measurements. The reflections that you can still see in the high-resolution bin must have I/sigma values significantly larger than 2.
I suggest that you keep all these data for one refinement and in parallel do a refinement with the strict cutoff criteria (I/sigma = 2 and at least 80% completeness; if you like, try I/sigma = 1.5 as well for comparison). By comparing the R factors and the quality of the models (rmsd bonds/angles, geometric outliers) you should be able to find the most suitable approach.
Good luck!
Dear All,
Despite Greg's warning the discussion strayed into numerology.
I give only two examples of the many confusions that exist in the discussion above. I/sigma equal to 1 may sound like something not so reasonable. But most crystal structures are refined against F, not I. Therefore, very roughly speaking, I/sigma(I) equal to 1 corresponds to F/sigma(F) of about 2. This means that such data are more than 50% reliable as judged by the normal distribution and chi-square tests (as a reminder, 2 sigma already covers about 95% of a normal distribution). If one realizes that anybody can change the I/sigma value almost arbitrarily (certainly by a factor of 2-3) through changes in the integration methods (different exposure time, different program, different cutoffs, different integration box, different profiles, etc.), then one realizes that what Greg is saying is like GOLD. What counts is how well the structure is represented in the data and how well the data represent the real protein.
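(The rough conversion above, I/sigma(I) = 1 corresponding to F/sigma(F) of about 2, follows from first-order error propagation; a short sketch, ignoring the complications of weak and negative intensities:)

```latex
% With I = k F^2, first-order error propagation gives
%     sigma(I) ~ |dI/dF| * sigma(F) = 2 k F sigma(F),
% so that
\[
  \frac{I}{\sigma(I)} \;=\; \frac{kF^{2}}{2kF\,\sigma(F)} \;=\; \frac{1}{2}\,\frac{F}{\sigma(F)},
  \qquad\text{i.e.}\quad \frac{I}{\sigma(I)} = 1 \;\Longleftrightarrow\; \frac{F}{\sigma(F)} \approx 2 .
\]
% The approximation breaks down for very weak or negative intensities,
% which is exactly the regime under dispute in this thread.
```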
It is hence worth being guided by a few crude, not universal, but helpful reminders.
1) Collect as much as you can; going down to I/sigma of 1 would definitely be recommended.
2) Collect to the maximal resolution, which means that in 90% of cases you cannot see the reflections by eye. Let the software you are using decide.
3) Accept the data at the common-sense level, not on the basis of community slogans. For solved structures a redundancy of 1 is frequently much better than 10. For solving new structures, high redundancy and low Rmerge are the key. Completeness of 100% is outstanding, but completeness of 50% is better than 0.
4) Check the internal consistency of your data reduction: your low-resolution data have to be, by definition, very accurate (low chi-square and very low Rmerge), while your high-resolution data at I/sigma of 1 can go up to around 0.8 in Rmerge. Even if the completeness there is 100%, it just means that every other reflection is less reliable by a factor of 2.
5) If one remembers that setting all intensities to 1, given reliable phases, recovers the density completely, then the discussion becomes a bit redundant.
6) Most of the errors in structures are human-made, not data-originated. If anybody needs convincing, I have published at least three papers correcting obvious mistakes.
And please do not become an extremist. There are a lot of them in refereeing circles, and this is precisely why we have so many subpar structures published. Numerology is not a substitute for common sense.
The attempts to defame reasonable physical values as "numerology" are a bit tasteless, in my opinion. By the way, the so-called community slogans basically represent the experience of hundreds of crystallographers over several decades, who avoided collecting too much fool's gold. The next "anti-numerology" step might be to abolish these horrible numbers, which are called the Miller indices ...
Seemingly, some of the participants in the discussion are not aware that I/sigma = 1 simply means a +/- 100% error of the measurement, i.e. the signal disappears in the noise (once proper scaling has been performed). Of course, the I/sigma values change depending on the program, the settings of the user, etc. Since the algorithms for the calculation of sigmaF are a bit dubious, one cannot rely on the idea that the true F/sigmaF ratio is automatically about 2:1 after truncation (when I = sigI). In fact, high-resolution structures are better refined against intensities and sigma(I), e.g. in SHELX, resulting in much more reliable structures due to the use of the real standard deviations.
In the majority of cases, 50% overall completeness will not allow solving any structure, and 50% completeness in a shell in the range of 1 to 1.5 I/sigma contains little useful information that could improve the model quality in terms of accuracy of the atom positions, bond lengths or angles. However, there is an interesting and most likely exceptional example by Karplus & Diederichs (Science, 25 May 2012, vol. 336, no. 6084, pp. 1030-1033), which appears to support the "only high resolution counts" hypothesis. I suspect that the white noise of the high-resolution shell did not do any harm, while the model was not substantially improved either. And, quite problematic for some, you have to watch other numbers, namely CC1/2 and CCtrue! Unfortunately, this approach does not work in most cases, and you are forced to cut off your refinement at the traditional standard numbers.
It is also a misconception that data sets are always nearly perfect and that only humans introduce the errors. Crystals are imperfect and degrade during data collection; not only do the cell constants change, but also the chemical composition of the biomolecules, not to mention all sorts of molecular flexibility that cannot be modelled properly to date.
Finally, "It is better to have a good 1.8 Å structure than a poor 1.637 Å structure", as Kleywegt and Jones say (Good Model-building and Refinement Practice, Methods in Enzymology (1997) 277, 208-230).
I did not see anyone mention the recent Science paper (Science. 2012 May 25;336(6084):1030-3. doi: 10.1126/science.1218231. Linking crystallographic model and data quality. Karplus PA, Diederichs K.) which addresses this point very definitively.
Jacob,
This article was mentioned before, but I think it takes some time before the field accepts those findings.
Also, we are facing a new generation of detectors that are already installed at beamlines right now. The paper Peter was quoting is from 1997, when technology was far behind what we have today.
Re-thinking concepts is a fundamental part of the scientific method and I would urge the crystallographic field to do this. Before I read that paper (*) I would have been on Peter's side in this discussion.
Peter,
Every dataset can be 100% "complete" if I wish it to be, just by integrating all predicted reflections and not having overlaps generated by the experiment.
I agree that using low-intensity shells is labour-intensive in refinement, but I never accepted "it is too much work" as a valid argument. Discarding data a priori based on an arbitrary parameter gives you... what exactly?
Why should I discard _shells_ that have an Imean/sd lower than 1?
How do I know there are no useful reflections in that _shell_, before even trying the refinement?
S.
EDIT:
(*) [for clarity] the 2012 Karplus & Diederichs paper
Just a quick reply to Jacob: I have mentioned it!?! Just read my previous answer thoroughly.
Hello Stefan.
The crystal in the Karplus & Diederichs 2012 paper seemed to have the potential to diffract much better, and was consequently far from its diffraction limit. Thus, I assume that they integrated just noise in the high-resolution shell, while the accurate predictions from the imposed crystal constants ensured that these pseudo-reflections were consistent with the rest of the data. By scaling them, they simply adopted values fitting the rest of the data. Probably these reflections represent nothing other than Fcalc-derived-from-Fobs. Obviously, the data of the "high resolution range" did not improve the R factors (see their supplement). Significantly, the optimum of Rwork/Rfree was exactly at the traditional cutoff of 1.8 Angstrom. Whether the model itself was really improved is not clear, since no statistics are shown. Also, they do not provide a comparison of the maps calculated at 1.8 and 1.4 Angstrom.
As long as there are no other convincing cases, I consider this study an extreme exception using an ideally designed situation, without further significance for basically all real protein crystals.
Even the best detectors and most brilliant beams cannot change the quality of a crystal, so I do not see any reason to dismiss the Kleywegt & Jones 1997 paper as invalid for refinement in the early 21st century. Concerning completeness as a manipulable number: that is why the criterion of I/sigma 2:1 is so reasonable. Your measurements then have an error of only 50% and may still contain substantial information. In the case of I/sigma = 0.28 (the limit of Karplus & Diederichs 2012) they have an error of more than 350%. By the way, a fundamental problem occurs here: the standard deviation extends into the negative range, which makes no physical sense after the scaling has been performed. Nobody would sell any other result with such an error in the natural sciences, like "we determined the mass of this protein: it is 35801 +/- 120000 g/mol". Could the protein have a mass of -90000 g/mol?
Back to completeness: there are some nice and informative movies by James Holton (http://ucxray.berkeley.edu/~jamesh/movies/); in particular, "Completeness in Oscillation Photography", from a real crystal, shows you what a map with 50% completeness looks like.
Most people who have refined protein structures have experienced that they had to cut their data at a resolution with I/sigma > 2, since the data beyond that did not help the model or maps at all. Anyway, even the traditional way is to refine at the highest possible ("common sense") resolution first and cut later, in order to obtain an acceptable structure with good geometric parameters.
Dear Peter,
For a young scientist you certainly show a substantial level of certitude and self-confidence (as you should for a person with Martinsried roots). However, I am afraid there is no substitute for experience and for what we call wisdom. With all the follies of age that Greg and possibly I may show, we represent a lot of experience in practical science. In order to establish the factual (and physical) basis for your opinions, I took the liberty of having a quick, cursory look at your structures in the PDB (from Bode's times to more recent times). With somewhat mixed feelings I have to report that your structures as deposited in the PDB do not allow me to place full confidence in your views, as they show substantial weaknesses. For more detailed information I will be happy to connect with you and discuss it offline, but to provide at least a rudimentary example, I cite the most obvious observations from my very brief look at your 1 Å structure of kallikrein (2QXI): GLN30 has a flipped head group, ARG77 is out of obvious density, and the disorder modelling of VAL45, 47 and 121 in the core (and many others) is simply incorrect. The modelling of the solvent is of relatively low quality, which at this resolution should be rather obvious. And you have to trust me that some other structures are in much worse shape.
I am writing this not to hurt you in any dimension, because a vast majority of PDB structures have substantial defects. I am writing it to instill a sense of decency that comes with humility towards the subject we are all studying.
As a final warning I would only cite George Sheldrick, who recommended many, many years ago running refinements against the measured I's with all the data (including negative I's), which does not square well with your convictions whatsoever. I do not require you to know all the literature and all the old people, but at least have some flexibility in using numerical arguments that, as I showed in my post, are as uncertain as financial planning at a time of crisis (not worth much without people supporting them). As a personal experience I can only report that I routinely provide a good modelling explanation for electron density maps down to the 0.4 sigma contouring level. The errors show up best at around the 0.7 sigma level, and good phases usually support the structure down to below the 0.5 sigma level. At this contouring level one can easily see disordered fragments with B factors up to 300 Å2. But, I guess, for numerologists/purists this world simply does not exist.
As a final remark for the general audience, I would only warn that proteins get messier and messier as you go to higher resolution (the picture does not in any way get clearer, as is commonly assumed), since we are reaching the boundaries of the physical world, in which proteins are glassy systems that by design are liquid-like. So there is no surprise that at a certain resolution everything gets messier and the statistics become more complicated to interpret.
Dear Boguslaw,
First of all, I am really surprised that you attempt such a petty diversionary manoeuvre, in an inappropriately patronizing way, involving personal attacks ("so young, so inexperienced") on me and my work. You must be quite desperate to check my biography or even single atom positions in PDB entries in order to find some "weaknesses" that you could use against me. I think your wisdom has left you at this point. My only reply concerning the 2QXI data is that my recent inspection and the PDB-REDO results do not support your views: your claims are mostly incorrect or greatly exaggerated (e.g. one atom of Arg77 is slightly out of the density!), and PDB-REDO results in a full optimization gain of only 0.64% for R and 1.07% for Rfree, while no significant errors were detected.
However, I hope to continue the discussion here with you in a rational and scientific manner. The question of the negative intensities is certainly interesting enough. I found one of George Sheldrick's statements (November 2007) on the CCP4BB: "... the SHELX manual discusses the question of refinement against amplitudes or intensities. If you are refining against intensities, there is no need to "truncate" the data, indeed it would be definitely counter-productive to use TRUNCATE to convert I and sig(I) to F and sig(F) and then to convert these back to I and sig(I). For SHELXL it is also not necessary to scale the data so that they are on an absolute scale. I personally believe in refining against the data you actually measured without compromising them in any way, but I appreciate that I am in a small minority." OK, it is acceptable that on a relative scale negative intensities down to -3 sigma can be employed. The problems arise when the data are scaled and the reflections and/or the error are in the negative intensity range. A literally negative intensity would mean that photons are sucked from the detector into the crystal. Anyway, how do these observations fit the Fcalcs, which cannot be negative? If the ratio I/sigma (scaled) is below 1, how can you distinguish a true observation from noise?
Peter,
I just finished the refinement of one of the structures I am redoing right now (see above). By adding 0.4 A more data (with Imean/sd below 1), the model and the maps clearly improved. Which brings me back to your question, "If the ratio I/sigma (scaled) is below 1, how can you distinguish a true observation from noise?" - by the refinement itself.
Dear Stefan,
Congratulations. That is exactly my experience.
Dear Peter,
Nobody was trying to be patronizing. I did not say that you are "so young and so inexperienced". Please read my post again instead of foaming. If you do not consider yourself young, that is your privilege.
Your opinions are fully respected. I just pushed back against your push on what other people are supposed to do, and provided a rational basis for my views. I often get demands from referees to cut the data. I sometimes even get the most ridiculous demands, that I should have refined with this and not that refinement tool (or reduced the data in a particular way). Sometimes I comply, sometimes I fight back. This discussion reminds me of a story. In 1994 in Atlanta, during the ACA meeting, in a very important session, a very important crystallographer announced that he was finally convinced that cutting out the entire low-resolution shell (below 6 Å resolution) was bad for the structures. I could not restrain a smile. Stating the obvious that is not obvious to others is what we call education. He educated himself.
Nothing is going to change my opinion about your structures, and you do not want me to post pictures from the electron density maps of several of your structures. So, even though this might be somewhat educational, it would be somewhat embarrassing too. In general, there is nothing embarrassing in that, as I find errors in my own structures every single time I come back to them. But fighting about this publicly... I do not see any need for it. By the way, read my newest letter to Protein Science, which clearly spells out my doubts about the PDB-REDO efforts.
I was just making the point that forceful views must be supported by experience. So again, calm down, and let other people have their opinions without your ridiculous examples like "Nobody would sell any other result with such an error in the natural sciences, like 'we determined the mass of this protein: it is 35801 +/- 120000 g/mol'. Could the protein have a mass of -90000 g/mol?" We are arguing legitimate issues here, and even if you do not respect me, you should at least slightly tip your hat towards Greg or George instead of fighting a losing battle.
Besides, I hold quite a lot of unorthodox opinions.
1) For instance, we should refine structures against the entire diffraction pattern, not only the pseudo-Bragg peaks that we arbitrarily select by our indexing and integration procedures, which "create" the data rather than "record" them. The amount of information in the inter-Bragg space is just staggering.
2) We should collect all data in P1 and determine the "real" symmetry after refining all the copies of the molecules, even if we solved and initially refined the structure in a particular space group.
3) We should always refine hydrogens (including those of water molecules).
4) We should refine the solvent, as it has structure many angstroms away from the protein surface.
5) We should never use dummy atoms (particularly with occupancies of 0) or truncate residues.
6) We should refine ensembles (for instance, invisible loops should have "NMR-like" multiple representations with correct stereochemistry).
And quite a few others that are too controversial to spell out.
I do not understand Boguslaw Stec's way of discussing a scientific topic, with repeated diversions and personal attacks, in particular this sort of psychological blackmail attempt ("you do not want me to post pictures from your electron density maps ..."), which is hardly a rational argument. In fact, such things should be banned from a forum like ResearchGate. The topic brought up by Matteo here was nearly forgotten: "When solving a protein structure, is it preferable to have higher resolution or lower R factors?"
If Stefan's refinement really shows additional density that gives better insight into the active site, it would be a great achievement. However, I still have doubts whether data above 3 Angstrom resolution can be handled with exactly the same parameters (0.4 Angstrom better, CC* < 0.2 etc.) as in the original scheme of Karplus & Diederichs. So I wonder what the density correlation of this arginine looks like.
Up to now no one has presented evidence that scaled reflections in the range I/sigma 1.0-0.28 carry any true information. Here the fundamental problem remains that the sigma extends into the physically meaningless negative intensity range, unless an offset was present in the intensity. At least a theoretical explanation and a reference should be given, supporting the usage and confirming the information content of such reflections. A more recent publication by Diederichs & Karplus (Acta Cryst D, 2013) contains more examples of data sets with CC below 0.2 used for paired refinement, but they do not discuss the topic of I/sigma < 1. For me it would be convincing if, for example, in the presence of an anomalous scatterer the anomalous differences were detectable in these weak reflections, since this property could not be imposed by the indexed pattern. As long as such evidence is missing, I can only favour the traditional approach to data processing and refinement. If you follow the discussions on the CCP4 bulletin board, you will find that others think in a similar way.
Post #6 contains the PubMed link to the often-mentioned and by now mandatory Karplus & Diederichs paper for this thread. It seems a bit weird that one cannot see all contributions at once without clicking somewhere to unravel them. Perhaps, with the short attention spans these days, ten answers are too much?
Dear Peter,
You won. I do not intend to convince you of anything anymore, and that includes being tolerant. You failed to address many of the multiple, purely scientific points (maybe with the exception of I/sigI = 1) and call my remarks personal.
In a sense they are personal. This is like talking to a kid who, after driving in a go-kart race, is convinced that he can win a Formula 1 race. The kid is clearly endangering himself by his convictions, but...
By checking your structures I established that they do not utilize the statistical information from low-level contouring of ED maps. That is quite equivalent to not handling the weak data well and not appreciating their importance. This is the end of the discussion. The convictions that you inherited from the people who trained you are panning out in this discussion, and in your structures. You opted not to question them. The best proof of this is that your 1 Å structure was refined with a 10 Å cutoff on the low-resolution side. Unwittingly, I provided a criticism of this practice in one of my previous posts, without prior knowledge of the applied cutoffs.
Instead of arguing we should go back and produce better structures. I just did that with your 2QXI structure; I will send it to you offline. I do not expect you to recognize any of the points I am making, or to accept my model. I only make the point that there are alternatives. Whether this has any influence on the biology remains to be seen, but certainly applying my recommendations and improving the models leads to better descriptions of reality. The pattern of correlated disorder that exists in this structure is interesting and certainly has functional importance. Alternative backbone conformations cannot be picked up by PDB-REDO, because that method is not designed to do so.
@ Peter,
>> So I wonder how the density correlation of this arginine looks like. a-priori
Hi,
Here is the refinement at 3.4A.
That is where Peter recommended making the cut (I/sigma dropping close to 1).
R-work 21.2
R-Free 24.7
Hi,
Here is the refinement at 3.2A, after building the arginine into a map calculated at 3.1A.
R-work 21.4
R-Free 24.8
Stefan,
Sorry to quote Wikipedia, but the following holds true for anything that claims to be based on physical laws: "Signal-to-noise ratio (often abbreviated SNR or S/N) is a measure used in science and engineering that compares the level of a desired signal to the level of background noise. It is defined as the ratio of signal power to the noise power, often expressed in decibels. A ratio higher than 1:1 (greater than 0 dB) indicates more signal than noise. While SNR is commonly quoted for electrical signals, it can be applied to any form of signal (such as isotope levels in an ice core or biochemical signaling between cells)." If you cannot accept this simple FACT, you are leaving the realm of the natural sciences. I have the impression that some people, including Karplus and Diederichs, are trying to cheat the laws of nature.
Now regarding your density. First of all, I would like to know the density correlation for both cases; you could calculate it with SFCHECK. Secondly, what happened to the glutamate from picture 1? Since I am a numerologist, I would also like to know what I/sigma was used for the second picture, and the respective completeness. Which contour level, which program? The model in the lower left corner looks strange to me.
Another remark: why do you refine at 3.2 Angstrom but calculate the map at 3.1? My suggestion was, whatever was discussed later on, to cut the data after scaling at an I/sigma of 2:1.
I think that the right questions would be: why am I solving a structure? What do I expect to find from high resolution data? Returning to your question, it seems that you have a very low signal-to-noise ratio in the high-resolution data, which should be reflected in a very high Rmerge or Rsym. So the best thing to do is to use your experimental data up to a value where (more or less) I/sigma > 1.5 and Rmerge < 0.75. If you use data at higher resolution, the risk is that you will be trying to model your 3D structure with just noise and not with actual scattered intensities. In a word, look very carefully at the statistics of your scaled data to decide what the real resolution of your data is.
@ Mario
>> why am I solving a structure?
To gain insight into the molecular mechanisms of my system.
>> What do I expect to find from high resolution data?
A better-resolved structural model with a lower degree of overfitting.
>> I/sigma > 1.5 and Rmerge < 0.75
>> In a word, look very carefully at the statistics of your scaled data to decide what the real resolution of your data is.
I have very good reasons to include shells below 0.3 Imean/sd and Rmeas ~1000% in a refinement. I am talking about 6.0 to 4.0 A data here. I can provide tables and diffraction images if you want.
Why shouldn't I do it?
I did it yesterday and improved my model beyond anything I thought possible a few months ago.
@ Peter
I'd gladly answer any of your questions when you first, at last, answer one of mine:
>>Why should I, a-priori, exclude data below 1 Imean/sd in the refinement?
Dear Mario, Peter and all others who really need a rigid recommendation for how to do the job.
There is no a priori good measurement of error, as I stated at least twice in my previous posts, and I have also commented on why this is so. Assuming that an individual researcher knows what this error should be is a bit arrogant. The statistical methods for its determination are unreliable. A short example: if you measure more times, your sigma automatically goes down. However, if you systematically miss your reflections in your profiles or boxes, you will be measuring the same trash at the end of the process as at the beginning, despite having much better sigmas. The real trouble starts when one wants to get more data and lower the measurement errors, but instead adds different conformational states of the protein (and therefore different intensities) caused by internal changes (radiation damage, heating, solvent rearrangement). I have had several cases where crystals simply exploded from internal gas build-up or were dead after 3 frames. There is far too much mythology about so-called overfitting, as well as about the dangers of molecular replacement and so-called bias. When the structure is right, it clicks like clockwork. I have had the luck in my career to work on protein X-ray data from 0.5 Å to around 10 Å. I have seen very lousy structures at 1.8 Å with Rmerge 4%, and outstanding structures at 4 Å with Rmerge 17%. How do I know that? Because of multiple subsequent structure determinations.
Two additional lessons that are long forgotten. The Luzzati plots (even though incorrectly used as a measure of coordinate error, which they really are not) teach us that both the low-resolution and the high-resolution data make large contributions to the overall level of the R factor. So cutting your data on both sides always (by definition) lowers your eventual R factor. Is your structure better this way? Obviously not. Cutting off the low-resolution data distorts the envelope of the molecule and destroys the strength of the solvent contribution; only correctly modelled solvent brings the low-resolution Luzzati curve down. Cutting the high resolution lowers the molecular detail of your structure determination. I would estimate that around 40% of the ligands in the PDB have at least some defects. I have personally seen (and confirmed by re-refinement) large organic molecules fitted to water networks, as well as additional atoms in ligands despite strong negative density around them.
So in effect, when one has a good model and fits it to a lower-resolution dataset, the resulting R factor is very low (sometimes around 12%), not higher. Our attachment to the correlation between resolution and R factor is just biased by history, not by what we know now.
If we combine this relatively bleak picture with the fact that a respectable map can be produced from intensities all set to the same numerical value provided the phases are good, and that an almost uninterpretable map can be obtained with intensities calculated from the model if the phase error is as large as 35 degrees, then this discussion appears to be completely empty.
Come to the drawing board and produce as good structures as we can from as good data as we can. Using strict numerical guidelines is good for graduate school, not for serious science. Greg Petsko can tell you something about that: he published a paper in Nature with approximately 50% data completeness, but the science in it was beautiful.
@Stefan
>I have very good reasons to include shells below 0.3 Imean/sd and Rmeas ~1000% in a refinement. Talking 6.0 to 4.0 A data here. I can provide tables and diffraction images if you want to.
The values I suggested (I/sigma > 1.5 and Rmerge < 0.75) are only rough guidelines that must be evaluated in real-life cases. Anyway, I think that with Imean/sd of 0.3 and Rmeas ~1000% you are using just noise; and if so, why not include I/sd = 0.001 and Rmeas = 10^7? In this way every crystal would diffract at 0.1 A resolution or better, and the only resolution limit would be the wavelength of the X-rays used: such an approach neglects the reality of the X-rays scattered (or not) by your sample (or, in better words, their interaction with the detector). Anyway, you must remember that the X-ray structure is a static ensemble of protein structures that are moving in the cell; therefore, even if the final 3D model tells you the distance between two atoms to 1/100 of an Angstrom, you must bear in mind that the two atoms are moving, so such precision does not have much meaning. What I am trying to say is that 3D models can be used to do wonderful science even when you have low-resolution data: the important thing is the interpretation of the data coming from different techniques in the light of the scientific question you are trying to address. The most important thing is the question.
@ Mario
I understand your concerns. Please understand mine.
As pointed out several times before, a meaningful a priori resolution cutoff cannot be made based on merging statistics or I/sigma estimates. For this reason, Karplus and Diederichs proposed the use of the CC(1/2) criterion, which is supposedly more sensitive to the information content of the diffraction pattern than any other criterion. The final decision on the true resolution is then made during refinement.
Look at this diffraction pattern and the scaling table. I have already truncated it to more meaningful boundaries - as determined by refinement.
Where would you make the cut and why?
FYI: I used the entire range for building and will probably make a final refinement cut around 4.0A.
best regards,
S.
SUBSET OF INTENSITY DATA WITH SIGNAL/NOISE >= -3.0 AS FUNCTION OF RESOLUTION
LIMIT(A)  OBSERVED  UNIQUE  POSSIBLE  COMPLETENESS  R-FACTOR(obs)  R-FACTOR(exp)  COMPARED  I/SIGMA  R-meas  CC(1/2)  Anomal-Corr  SigAno  Nano
11.37 23059 3676 3729 98.6% 6.6% 6.4% 23059 24.87 7.2% 99.6* 5 0.814 1534
8.15 40776 6347 6347 100.0% 10.4% 9.2% 40776 19.70 11.3% 99.6* -8 0.779 2876
6.69 53605 8190 8190 100.0% 47.8% 46.0% 53605 7.42 51.9% 95.9* -3 0.815 3791
5.80 63095 9619 9623 100.0% 105.2% 104.9% 63095 2.60 114.5% 87.1* -3 0.730 4505
5.20 71710 10907 10909 100.0% 132.0% 134.6% 71710 1.82 143.6% 89.2* -4 0.663 5154
4.75 79095 12036 12038 100.0% 120.7% 123.3% 79095 1.90 131.4% 90.4* -1 0.672 5717
4.40 86050 13094 13097 100.0% 161.4% 167.1% 86050 1.36 175.6% 89.2* 1 0.635 6246
4.12 92205 14068 14071 100.0% 278.6% 291.9% 92205 0.79 302.9% 77.2* 0 0.589 6732
3.88 69364 14344 14920 96.1% 675.2% 715.0% 69097 0.29 758.0% 29.6* 0 0.546 6726
total 578959 92281 92924 99.3% 45.7% 46.0% 578692 4.10 49.9% 98.9* -1 0.664 43281
NUMBER OF REFLECTIONS IN SELECTED SUBSET OF IMAGES 583251
NUMBER OF REJECTED MISFITS 4121
NUMBER OF SYSTEMATIC ABSENT REFLECTIONS 0
NUMBER OF ACCEPTED OBSERVATIONS 579130
NUMBER OF UNIQUE ACCEPTED REFLECTIONS 92310
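As an aside, the CC(1/2) column in the table above is conceptually simple to reproduce: the unmerged observations of each reflection are split randomly into two halves, each half is merged, and the Pearson correlation between the two half-dataset means is taken (per resolution shell in practice). A toy Python sketch, with invented data and without the resolution binning:

```python
# Toy sketch of the CC(1/2) calculation: random half-dataset split, merge each
# half, correlate the two sets of merged intensities. Requires Python 3.10+
# (for statistics.correlation). Data are invented.
import random
from statistics import correlation, mean

def cc_half(observations, rng=random.Random(0)):
    """observations: dict mapping hkl -> list of repeated intensity measurements."""
    half1, half2 = [], []
    for obs in observations.values():
        if len(obs) < 2:
            continue                      # need at least two measurements to split
        obs = obs[:]
        rng.shuffle(obs)
        mid = len(obs) // 2
        half1.append(mean(obs[:mid]))
        half2.append(mean(obs[mid:]))
    return correlation(half1, half2)

# Toy example: 200 reflections, each measured 4 times with Gaussian noise
rng = random.Random(1)
data = {hkl: [rng.gauss(true_i, 5.0) for _ in range(4)]
        for hkl, true_i in enumerate(rng.uniform(1, 100) for _ in range(200))}
print(f"CC(1/2) ~ {cc_half(data):.3f}")
```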
Wow, this thread is a real blast!
The question of information content when I/sig < 1 can be illuminated by considering that some strong spots may still be visible, but get averaged out in the I/sig calculation. Conversely, the weaker spots rule out high intensity values.
But I think this discussion died out ~4 years ago!
Jacob