I am working on three different survey datasets, each with 20 binary variables, for Hierarchical Cluster Analysis (HCA) in SPSS. Of the 441, 2869 and 2543 total cases in the three surveys, only 42 (9.5%), 563 (19.6%) and 547 (21.5%), respectively, were valid in the HCA. The data consist of individual, behavioural and community characteristics coded as 1 and 0. For the behavioural variables, 1 stands for 'presence of risk' and 0 for 'absence of risk'. The individual and community variables are coded similarly (as in logistic regression). Is it appropriate to replace missing values with 0 (absence of risk) rather than 1 (presence of risk), since the latter may exaggerate the risk in the study? I am aware that the former may equally introduce a similar error.
How do I manage the missing values and the potential errors?
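
To make the concern concrete, here is a minimal Python sketch on made-up toy data (not my survey data). It assumes the low valid counts come from listwise deletion, i.e. a case is dropped if any of its variables is missing, and it shows how coding missing values as 0 or as 1 shifts the apparent prevalence of risk in opposite directions. The helper names `complete_case_share` and `prevalence` are hypothetical, chosen only for this illustration.

```python
from typing import Optional, Sequence

# 1 = risk present, 0 = risk absent, None = missing
Row = Sequence[Optional[int]]


def complete_case_share(rows: Sequence[Row]) -> float:
    """Fraction of rows with no missing value (what listwise deletion keeps)."""
    complete = sum(1 for r in rows if all(v is not None for v in r))
    return complete / len(rows)


def prevalence(rows: Sequence[Row], col: int, fill: Optional[int] = None) -> float:
    """Share of 1s in a column; missing values are skipped, or replaced by `fill`."""
    values = [r[col] if r[col] is not None else fill for r in rows]
    values = [v for v in values if v is not None]
    return sum(values) / len(values)


# Hypothetical toy data: 5 respondents, 3 binary variables.
toy = [
    (1, 0, 1),
    (1, None, 0),
    (None, 1, 1),
    (0, 0, None),
    (1, 1, 0),
]

share = complete_case_share(toy)      # only rows without any None are kept
observed = prevalence(toy, 0)         # prevalence among observed values only
as_zero = prevalence(toy, 0, fill=0)  # missing coded as 'absence of risk'
as_one = prevalence(toy, 0, fill=1)   # missing coded as 'presence of risk'

print(share, observed, as_zero, as_one)
```

In this toy example the observed prevalence of the first variable lies between the two imputed versions, which is the error I am worried about in either direction.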