At which stage do I add variability when performing multiple imputation?

I think you have a bigger question here. You probably should not be aggregating the data at all. What you probably have is a multilevel modeling (also called hierarchical linear modeling, or HLM) problem. In that case you won't be collapsing the data, so the imputations would stay at the levels they are at now. I don't know enough about your data to know exactly what the multilevel structure would be. The top level, which I guess would be census tracts, needs a minimum number of units, roughly 30, though it can go lower depending on what your model looks like. Otherwise, you would go down a level and use dummy variables (K-1 for K census tracts) to represent that level, which is called fixed effects modeling. Either way, you would be able to keep individual children nested within parents even though there are few children for each parent (a stretch of my career was dyadic modeling, with just two persons in a married couple). So if you were not planning to do MLM, you need to get up to speed on it. As for something like census tracts, you could look at the work many of us did on the Project on Human Development in Chicago Neighborhoods (PHDCN) to see HLM applied to that kind of data. I have a few of those papers up on ResearchGate, but there are many of them; some have the name of the project in the title, though many don't. Robert Sampson, Felton Earls, Stephen Raudenbush, and Jeanne Brooks-Gunn are frequently among the authors.

So, as to the missing data. Until somewhat recently, we tended to impute the data sets separately, using something like SAS PROC MI, although I like IVEware (free from the University of Michigan's Survey Research Center). IVEware runs within SAS or in its own front end; I just learned it also runs within MicrOsiris, a free stats package, also from the Survey Research Center, written by Neal Van Eck, who is on ResearchGate. However, it turns out that because of the dependencies in the data, the imputation should be based on the relationships within the multilevel data. I believe there is some free software out there for this, but I don't know the names or the web sites, so if you are inclined to go that way you will need to Google it and track it down on your own. As far as I understand, MLwiN, a British multilevel modeling package, has the capability built in. HLM, the leading American one, is more complicated: persons who attended a training workshop have a version that does the multilevel imputations, but the regular version for sale doesn't have it, basically because of a lack of documentation. So my answer to the question of before or after aggregation is "no." Bob
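For readers unfamiliar with the fixed effects alternative mentioned above, here is a minimal sketch of K-1 dummy coding in plain Python. The tract labels, the function name `dummy_code`, and the choice of the first tract (in sorted order) as the reference category are all assumptions made for illustration; nothing here comes from the actual data discussed in this thread.

```python
def dummy_code(tract_ids):
    """Build K-1 indicator columns for K distinct tracts.

    The first tract in sorted order is treated as the reference
    category, so it gets no column of its own (an assumption of
    this sketch; any tract could serve as the reference).
    """
    levels = sorted(set(tract_ids))
    others = levels[1:]  # every level except the reference
    names = [f"tract_{level}" for level in others]
    rows = [[1 if tract == level else 0 for level in others]
            for tract in tract_ids]
    return names, rows


# Hypothetical example: five children living in three census tracts.
tracts = ["A", "B", "A", "C", "B"]
names, rows = dummy_code(tracts)

print(names)
for tract, row in zip(tracts, rows):
    print(tract, row)
```

With three tracts this yields two indicator columns; children in the reference tract have zeros in both.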

Rick Massatti

Thank you for taking the time to answer my questions. While I am not that familiar with multilevel modeling, I think I understand where you are going. My limited understanding of the procedure makes me wonder whether you can account for spatial weights in a multilevel model. I read your 2003 article “a multi-level study of neighborhood…” and it looks like you overcame this issue by nesting individuals within neighborhoods for the model and analyzing the data at the individual level. My hesitation is twofold: analyzing the data at the individual level, and using a statistic that doesn’t use spatial weighting. For example, airborne neurotoxicant solvents account for 11+ percent of the variance in mental health outcomes in an aspatial OLS regression model (individual level), but only 3 percent in a spatially weighted regression model (census tract level). Part of me wonders whether the larger value in the aspatial model would carry over to the HLM model simply because it doesn’t have a spatial weight. Or perhaps I am losing too much individual variation by not looking at individual-level data. I don’t know… Either way, I think I am stuck aggregating the individuals to the tract level, because my HIPAA waiver request was to aggregate individual-level data to the census tract.

Robert Thomas Brennan

Hi Rick, only some time later did I see your other posting on the spatial weighting. Another PHDCN (Project on Human Development in Chicago Neighborhoods) researcher, Jeff Morenoff, did a multilevel spatial model, which would really be the way for you to go with your data. You should be able to find his article, but if you can't, I can track it down for you; just let me know. Also, version 7 of the HLM software has a spatial component (I don't know whether its main competitor, MLwiN, does or not). I have used the spatial model in an (unpublished) study of day care pricing in Massachusetts using ZIP codes. The data entry is tedious, however, because each ZIP code has to be on a line of data with all the adjacent ZIP codes; you can't just input a map or map coordinates. Once that was done, it worked very well. Jeff Morenoff's approach (and I haven't read his paper in some time) combined models using different pieces of software. I don't remember what university he is at now, but you might be able to track him down for some advice on doing a multilevel spatial model. I am sure the options have improved a lot since his paper. I wouldn't know much about the restrictions of the HIPAA waiver, but you wouldn't be reporting any individual results. Usually the informed consent agreements we have our subjects sign pertain to the reporting of the results, not the handling of the data.

Back to the original question: if you are aggregating, it seems it would only make sense to do the multiple imputations after aggregation. If you impute first, then assuming you do a reasonable number of imputations and you average them, the one value you get is going to be basically the singly imputed value (let's say from EM or regression equations), because the added errors should average out to zero, so you are going to end up with a value in the upper-level data set that looks like a reliable value. Aggregate everything up first and then impute. Unfortunately, aggregating precludes being able to do the correct method (for example, the "joint model") for imputing multilevel data. Best, Bob
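The point above, that the added errors average out to zero, can be illustrated with a small simulation. This is only a sketch under stated assumptions: the predicted value, the residual standard deviation, the normal error distribution, and the deliberately large number of draws are all hypothetical, and `impute_once` is an illustrative helper, not part of any imputation package.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible


def impute_once(predicted, residual_sd):
    """One stochastic imputation: the deterministic prediction plus
    a random error (normal errors are an assumption of this sketch)."""
    return predicted + random.gauss(0.0, residual_sd)


# Hypothetical numbers, not taken from the thread.
predicted_value = 50.0  # e.g. what a regression equation would predict
residual_sd = 10.0      # spread of the added error
m = 2000                # far more draws than usual, to make the point

draws = [impute_once(predicted_value, residual_sd) for _ in range(m)]

mean_of_draws = statistics.mean(draws)  # lands close to predicted_value
sd_of_draws = statistics.stdev(draws)   # stays close to residual_sd

print(f"single (deterministic) imputation: {predicted_value:.2f}")
print(f"mean of {m} stochastic imputations: {mean_of_draws:.2f}")
print(f"spread across the imputations:      {sd_of_draws:.2f}")
```

The individual draws carry the variability that multiple imputation is meant to preserve, but their average collapses back to roughly the deterministic prediction, which is why an averaged value placed in the upper-level data set looks more reliable than it is.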

Rick Massatti

Thanks again. I found Jeff Morenoff at the University of Michigan, so I'll email him about the article. It's too bad you had to enter all of the contiguity-based relationships by hand. It sounds like the program you used only took *.gwt files and not binary contiguity *.gal files, which would have been much easier. If it does take *.gal files, then you can use freeware like GeoDa to easily develop the relationships (assuming you have already created a shapefile).

Robert Thomas Brennan

Rick, no, it's just the way HLM handles it. It's designed only to take the input as a list of clusters, let's say census tracts, with all the bordering clusters listed on the same line. If I had had a file like that (for Massachusetts ZIP codes), I could have input it, but who has a file like that? I hope I described the paper well enough that Jeff will be able to send you the right one. Even if you don't go that route, it's helpful because it fully details his method for combining HLM and spatial stats, which is worth knowing about (not that I remember it). Bob
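As a rough bridge between the two formats discussed here, the sketch below reads GAL-style contiguity text (a header line, then for each unit a line with its ID and neighbor count followed by a line listing the neighbor IDs) and writes one line per cluster with its bordering clusters after it. The tract IDs, the header contents, and the space-separated output layout are assumptions for illustration; the exact layout HLM expects is not spelled out in this thread and should be checked against the HLM documentation.

```python
def parse_gal(text):
    """Parse GAL-style contiguity text into {unit_id: [neighbor_ids]}."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    body = lines[1:]  # skip the header line
    neighbors = {}
    i = 0
    while i < len(body):
        unit_id, count = body[i][0], int(body[i][1])
        if count == 0:
            # A unit with no neighbors has no neighbor line to read.
            neighbors[unit_id] = []
            i += 1
            continue
        ids = body[i + 1]
        if len(ids) != count:
            raise ValueError(f"unit {unit_id}: expected {count} neighbors")
        neighbors[unit_id] = ids
        i += 2
    return neighbors


def one_line_per_cluster(neighbors):
    """Each output line: the cluster ID followed by its bordering clusters."""
    return [" ".join([unit] + ids) for unit, ids in neighbors.items()]


# Hypothetical four-tract example in GAL layout.
sample_gal = """0 4 tracts TRACT_ID
101 2
102 103
102 3
101 103 104
103 3
101 102 104
104 2
102 103
"""

adjacency = parse_gal(sample_gal)
output_lines = one_line_per_cluster(adjacency)
for line in output_lines:
    print(line)
```

A contiguity file exported from a tool such as GeoDa could be passed through something like this instead of typing each row of neighbors by hand.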
