17 April 2025

Hi everyone,

I'm working with a panel dataset in Stata that includes variables such as type 2 diabetes cases, smoker density, and obesity prevalence across different regions and time periods. Some of these variables contain zero values, which represent actual observations (i.e., no reported cases) in certain areas.

As part of our model testing, we tried using log-level and log-log functional forms, but applying the natural logarithm to these variables resulted in missing values (.) due to ln(0) being undefined. This caused several issues during regression and especially with our Hausman test, where the note said:

“The rank of the differenced variance matrix (0) does not equal the number of coefficients being tested (3)...”
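For context, here is a minimal sketch of the kind of workflow that produced this note. The variable names (region, year, diabetes, smokers, obesity) are placeholders rather than our actual names, the specification is simplified, and region is assumed to be a numeric panel identifier:

```stata
* Hypothetical variable names; region must be numeric for xtset
xtset region year

* ln(0) is undefined, so Stata returns missing (.) for every zero observation
generate ln_diabetes = ln(diabetes)
generate ln_smokers  = ln(smokers)
generate ln_obesity  = ln(obesity)

* Count the zeros, then the observations with a missing log
* (zeros plus any values that were already missing)
count if diabetes == 0
count if missing(ln_diabetes)

* Fixed effects vs. random effects, followed by the Hausman test
xtreg ln_diabetes ln_smokers ln_obesity, fe
estimates store fe
xtreg ln_diabetes ln_smokers ln_obesity, re
estimates store re
hausman fe re
```

Since xtreg drops any observation with a missing value in the dependent or independent variables, every zero in the original data removes a region-year from the estimation sample.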

Our R-squared is also very low, only 40-50. To address this, we are considering transforming our variables using ln(x + 1) instead. I understand this is a common workaround in many contexts, but I would like to ask:

  • Is ln(x + 1) an acceptable transformation in this case, particularly for disease prevalence and behavioral variables like smoking, where zero indicates no incidence?
  • Are there any published studies or datasets that use this method, especially ones using Stata or from health economics or epidemiology research?
  • Will this approach help preserve the integrity of the sample when running tests like the Hausman test or fixed/random effects models?
Any references, insights, or recommendations would be greatly appreciated!
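To make the question concrete, this is roughly how the ln(x + 1) transformation we are considering would be generated. Again, the variable names are placeholders, and this is only a sketch of the idea, not our final specification:

```stata
* Hypothetical variable names
generate ln1_diabetes = ln(diabetes + 1)
generate ln1_smokers  = ln(smokers + 1)
generate ln1_obesity  = ln(obesity + 1)

* Zeros now map to ln(1) = 0 instead of missing, so those rows are kept
count if diabetes == 0 & ln1_diabetes == 0
count if missing(ln1_diabetes)

* Compare the raw and transformed variables
summarize diabetes ln1_diabetes
```

With this version, the only missing values left should be those that were already missing in the raw data.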

Thank you in advance.
