Large-scale international assessments, including PISA, use a two-stage sampling design. We therefore need to evaluate within- and between-school variation when analyzing the PISA database.
There are two key issues: one is the two-level structure you note, the second is plausible values. HLM 8 (https://ssicentral.com/index.php/products/hlm-general/) handles both.
See section 11.2.3, "Working with plausible values in HLM", in the HLM 8 manual: https://www.hearne.software/getattachment/a104646a-d78e-4ccd-88cf-cf04a78b47f2/HLM8-Manual.aspx
" Without HLM, these procedures could be performed by producing HLM estimates for each plausible value, and then averaging the estimates and calculating the standard errors using another computer program. These procedures are tedious and time-consuming, especially when performed on many models, grades, and dependent variables. HLM takes the plausible values into account in generating the HLM estimates. For each HLM model, the program runs each of the five (or the number specified) plausible values internally, and produces their average value and the correct standard errors. There will seem to be one estimate, but the five HLM estimates from the five plausible values are produced and their average and measurement error calculated correctly, thus ensuring an accurate treatment of plausible value data. The output is similar to the standard HLM program output, except that all the components are averaged over estimates derived from the five plausible values. In addition, the output from the five plausible value runs is available in a separate output file. "
On plausible values more generally, see https://www.rasch.org/rmt/rmt182c.htm
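
If you don't have HLM, the manual procedure quoted above (one estimate per plausible value, then averaging and computing the standard errors) is commonly done with Rubin's combining rules. Here is a minimal Python sketch of that combining step only. It assumes you have already fitted the same two-level model once per plausible value in whatever software you use, and have each run's coefficient and its standard error. The function name `pool_plausible_values` and the numbers in the usage lines are hypothetical, not from the HLM manual or from PISA data.

```python
import math


def pool_plausible_values(estimates, standard_errors):
    """Combine one coefficient across M plausible-value runs (Rubin's rules).

    estimates       -- the coefficient from each run (one per plausible value)
    standard_errors -- the matching standard error from each run
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(standard_errors):
        raise ValueError("need at least two runs and one SE per estimate")

    # Pooled point estimate: plain average over the plausible values.
    pooled = sum(estimates) / m

    # Within-run variance: average of the squared standard errors.
    within = sum(se ** 2 for se in standard_errors) / m

    # Between-run variance: sample variance of the estimates,
    # which reflects the measurement error carried by the plausible values.
    between = sum((e - pooled) ** 2 for e in estimates) / (m - 1)

    # Total variance, with the (1 + 1/M) correction for a finite number of runs.
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical coefficients and standard errors from five model runs.
est, se = pool_plausible_values(
    [0.42, 0.45, 0.40, 0.44, 0.43],
    [0.05, 0.05, 0.06, 0.05, 0.05],
)
print(f"pooled estimate = {est:.3f}, pooled SE = {se:.3f}")
```

The pooled standard error is only as good as the per-run standard errors you feed in, so those should already account for the school-level clustering in whatever way your software handles it.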