The number of malware samples continues to grow rapidly, and malware is becoming very complex and sophisticated. Distributed malware contributes to loss or invasion of privacy, with a negative impact on the confidentiality, integrity and availability of private data.
I would download and try the WEKA toolkit (https://www.cs.waikato.ac.nz/ml/weka/). You may also be interested in the slides on deep learning: http://www.cs.waikato.ac.nz/ml/weka/slides/Chapter10.pptx
If you are asking about the contribution of distributed malware to loss or invasion of privacy and so on in the context of ML, I would measure detection performance with cost-dependent performance metrics and map each penalty (malus) onto its corresponding cost. You might be interested in the solution in my thesis, Opinion Mining and Lexical Affect Sensing.
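To make the idea of a cost-dependent metric concrete, here is a minimal sketch in plain Python. The cost values, the label names and the helper names (`COST`, `total_cost`, `mean_cost`) are my own illustrative assumptions, not taken from WEKA or from the thesis; the only point is that a missed malware sample can be penalised more heavily than a false alarm.

```python
# Hypothetical cost matrix: COST[(true_label, predicted_label)].
# The numbers are illustrative assumptions, not measured values.
COST = {
    ("malware", "malware"): 0.0,   # correct detection
    ("benign", "benign"): 0.0,     # correct pass
    ("malware", "benign"): 10.0,   # missed malware (false negative)
    ("benign", "malware"): 1.0,    # false alarm (false positive)
}


def total_cost(y_true, y_pred, cost=COST):
    """Sum the cost of every (true, predicted) label pair."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(cost[(t, p)] for t, p in zip(y_true, y_pred))


def mean_cost(y_true, y_pred, cost=COST):
    """Average misclassification cost per sample."""
    if not y_true:
        raise ValueError("need at least one sample")
    return total_cost(y_true, y_pred, cost) / len(y_true)


y_true = ["malware", "malware", "benign", "benign"]
y_pred = ["malware", "benign", "benign", "malware"]

print(total_cost(y_true, y_pred))  # one miss (10.0) + one false alarm (1.0)
print(mean_cost(y_true, y_pred))
```

Two classifiers with the same accuracy can have very different mean cost under such a matrix, which is why I would compare them on cost rather than on accuracy alone.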
Recurrent Convolutional Neural Networks (RCNNs) are an influential tool for solving various problems in machine learning and computer vision. Traditional text classifiers often rely on many human-designed features, such as dictionaries, knowledge bases and special tree kernels. With the RCNN method, by contrast, text classification is done without human-designed features. It captures contextual information when learning word representations and can introduce less noise than the older traditional methods. With this method one can also judge which words play key roles in text classification, capturing the key components of the texts. It has been evaluated on commonly used datasets.
To implement it for malware classification, you can use either of two kinds of methods: count-based methods, which compute how often a word co-occurs with its neighbors in a corpus and then map these count statistics to a vector for each word, and predictive methods, which try to predict a word from its neighbors in terms of learned small, dense embedding vectors. A standard predictive model is the Skip-Gram model.
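As a rough illustration of the two families, the sketch below treats a sequence of opcodes as the "words". The opcode sequence, the window size and the function names are assumptions I made up for the example; it only shows the raw co-occurrence counts a count-based method starts from and the (center, context) pairs a Skip-Gram model would be trained on, not the training itself.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=1):
    """Count-based view: how often each token appears within
    `window` positions of another token."""
    counts = Counter()
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(center, tokens[j])] += 1
    return counts


def skipgram_pairs(tokens, window=1):
    """Predictive view: (center, context) training pairs of the
    kind a Skip-Gram model learns from."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical opcode sequence from a disassembled sample.
opcodes = ["push", "mov", "xor", "mov", "nop"]

counts = cooccurrence_counts(opcodes)
pairs = skipgram_pairs(opcodes)

print(counts[("mov", "xor")])  # "xor" is a neighbor of "mov" twice
print(pairs[:3])
```

In practice you would hand the token sequences to an existing embedding library rather than train from these pairs by hand; the sketch is only meant to show what the input to each kind of method looks like.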
The CNN without retrained word embedding achieved an improvement of 98.56% with respect to the equal-probability benchmark, while the CNN with retrained word embedding achieved an improvement of 98.33%. Both models are very close in terms of performance, and neither can be declared better than the other. The cause may be that the vector representations of words were generated from samples other than malware. If the input used to learn the word embedding had been samples of malware, the results would probably have been quite different. That is because instructions such as XOR or nop are commonly used by malware for obfuscation purposes. On the one hand, the exclusive OR (XOR) operation is commonly used to obfuscate particular sensitive strings in the code, such as URLs or registry keys.
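To show why XOR is attractive for hiding strings, here is a minimal sketch of single-byte XOR encoding. The URL and the key are placeholders I chose for the example, and real samples may use longer or changing keys; the point is only that applying the same XOR twice restores the original bytes, so the decoder is as cheap as the encoder.

```python
def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte of `data` with a single-byte key."""
    if not 0 <= key <= 255:
        raise ValueError("key must fit in one byte")
    return bytes(b ^ key for b in data)


# Placeholder string and key, chosen only for illustration.
plain = b"http://example.com"
key = 0x5A

hidden = xor_bytes(plain, key)     # what would be stored in the binary
restored = xor_bytes(hidden, key)  # what the code recovers at run time

print(hidden)
print(restored)
```

Because the encoded bytes no longer contain the readable URL, a plain string search over the binary will not find it, while the XOR instructions that decode it still show up in the opcode sequence.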
Your problem with the signature-based approach is the management of a large database: the number of malware samples continues to grow, and each new one may have a new signature. I think we could create a method to classify it and to retrieve malware from the database quickly. Since the size of the database keeps increasing, we can classify the database using the “room based” concept and use this concept to manage the database. Each room contains the prohibition privileges of the signatures of malware files, or patterns of collections of signatures of malware files.
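One possible reading of the room idea, sketched below, is to partition the signature database into buckets so that a lookup only searches one small room instead of the whole database. Everything here is my assumption for illustration: the class name `RoomedSignatureStore`, the use of a SHA-256 hash as the signature, and the choice of the first hex characters of the hash as the room key. It does not model the privileges attached to each room.

```python
import hashlib
from collections import defaultdict


class RoomedSignatureStore:
    """Hypothetical signature store split into rooms by hash prefix."""

    def __init__(self, prefix_len=2):
        self.prefix_len = prefix_len
        self.rooms = defaultdict(set)  # room key -> set of signatures

    @staticmethod
    def signature(data: bytes) -> str:
        # Assumption: a file's signature is simply its SHA-256 hex digest.
        return hashlib.sha256(data).hexdigest()

    def _room_key(self, sig: str) -> str:
        return sig[: self.prefix_len]

    def add(self, data: bytes) -> str:
        """Store the signature of `data` in its room and return it."""
        sig = self.signature(data)
        self.rooms[self._room_key(sig)].add(sig)
        return sig

    def contains(self, data: bytes) -> bool:
        """Check only the one room the signature would live in."""
        sig = self.signature(data)
        room = self.rooms.get(self._room_key(sig))
        return room is not None and sig in room


store = RoomedSignatureStore()
store.add(b"known malicious payload")

print(store.contains(b"known malicious payload"))
print(store.contains(b"some other file"))
```

With a two-character hex prefix there are at most 256 rooms, so each room stays small as the database grows; rooms keyed by malware family or by signature pattern would be another way to apply the same idea.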