I am training a neural network, but the feature vectors do not have the same size. This problem may be fixed by adding some zeros or removing some values, but the greater problem would be data loss or generating meaningless data.
So, is there any approach to make them equal in size without the mentioned weaknesses? Maybe a transformation to another dimensionality?
I do not want to use random values or "NA".
Keep in mind what each component of the feature vector represents. Adding a zero generally represents "no information available" and hence will not prejudice the outcome one way or the other. Depending on your implementation, it may "water down" the outcome, because "no information available" training-vector elements add no useful information while increasing the uncertainty of the outcome.
The question you need to ask is: "Do the feature vector components add useful information?" If certain components are redundant or represent "noise", they can probably be eliminated.
Thanks Joachim & George for your answers, but:
The feature vectors here are not standard feature vectors. I called them FVs because they are a kind of characteristic of a given file.
For example, consider the voice signals of different persons, with different lengths.
Each of them can be represented by a different number of bytes.
With this new view, how do you see the problem? How can I prepare them for good training?
In classification, imputation is usually applied for missing values. There are various methods for imputation, such as the mean/median value across observations or a nearest neighbor imputation using the feature space (of remaining features).
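As a minimal illustration of these two imputation strategies (assuming the vectors have already been padded to a common length, with NaN marking the missing entries; the toy data below is purely hypothetical):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy data: rows are observations, NaN marks missing feature values.
X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])

# Mean imputation: each NaN is replaced by the column (feature) mean.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Nearest-neighbour imputation: each NaN is replaced by the average of
# that feature over the k most similar rows (similarity computed on the
# non-missing features).
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(X_mean)
print(X_knn)
```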
Although I am looking for a transformation, something like PCA but in the opposite direction (to increase the FV length), I think adding some values (anything except zero) would be better than using zero for them.
Ali, for audio files of different lengths, you are usually faced with time-related features (a single number per frame), so you need to compress the variable number of frame-based features into a compact, fixed-size representation. Statistical moments are very welcome here (mean, standard deviation, mode, quartiles, min, max, skewness, kurtosis); autocorrelation coefficients could also help you obtain useful information from variable-length signals. Another solution could be a fixed-size description of the distribution, either through a histogram (predefined number of bins) or a GMM (predefined number of centers). If you don't want to lose the time-related patterns, then an HMM should be considered instead of a GMM.
For example, on page 21 of the thesis here [ http://www.ifs.tuwien.ac.at/mir/pub/Pflugfelder_DiscriminationAnalysis_Thesis.pdf ], 24 Bark coefficients, which are frame-based features, are compressed by the proposed SSD (Figure 2.1, c and d), so your feature vector for each wav would contain 24*7=168 elements. Please remember that each element of a feature vector fed to a neural network is very important: it cannot mean one thing for one wav and another thing for another wav just because the wav files differ in length.
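A minimal sketch of this idea in Python: compress each variable-length signal into a fixed-size vector of statistical moments. The particular set of statistics below is just an illustrative choice:

```python
import numpy as np
from scipy import stats

def summary_features(x):
    """Compress a variable-length 1-D sequence into a fixed-size
    vector of order statistics and moments."""
    x = np.asarray(x, dtype=float)
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    return np.array([
        x.mean(),
        x.std(),
        x.min(),
        x.max(),
        q1, q2, q3,
        stats.skew(x),
        stats.kurtosis(x),
    ])

# Two "files" of different lengths map to vectors of identical size.
a = summary_features([3, 1, 4, 1, 5, 9, 2, 6])
b = summary_features([2, 7, 1, 8, 2, 8, 1, 8, 2, 8, 4, 5, 9])
assert a.shape == b.shape == (9,)
```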
Thank you Evaldas for your very helpful answer.
Actually, time does not matter here at all. There are some one-dimensional series of numbers (between 0 and 255) of different sizes which are to be mapped to some classes. Of course, they should be normalized, and the range will be changed.
I know each node in the NN must be meaningful. But I hope the NN learns the training set by adapting its weights according to these numbers. It may not be considered a typical learning problem, but the test set is very similar to the training set (by the nature of the problem), and I hope the NN can find the proper classes using the adjusted weights.
Regarding your answer, I will consider the features you introduced and use them alongside statistical approaches to make the feature vectors equal in size, although I do not yet know whether they can be applied to this kind of problem.
Are you sure neural networks are the best option? You might want to use frequent-set analysis algorithms, for instance, or apply, as said above, some kind of preprocessing so that vectors of the same size are eventually used.
How about using cascaded SVMs? This would give you a higher degree of flexibility for a variable-length FV.
You could use recurrent nets (e.g., simple recurrent networks), but they are not so successful on long series (there are some methods to overcome this). Speech is usually preprocessed in frames (e.g., 25 ms) which you encode on some frequency scale (Mel spectrum or so). Direct acoustic processing by neural nets is a waste of resources, unless you use the network to preprocess the data. Then I don't see a problem with zero-padding your speech segments if you want to process the data with feed-forward nets.
I'm not aware whether people in this domain do length normalization, which is very useful in other domains (e.g., kinematics). So, you could interpolate your data to a certain length. The obvious problem there is that by doing so you change the frequency characteristics of the speech elements, so this could only work for small length adjustments. Moreover, in order for this to work, you need a good number of training instances with different lengths, so that the (NN) classifier learns to grasp the invariances you need. This approach should work in simple speech recognition with a limited number of classes, i.e., words.
Please see our papers on binary neural networks:
1) Universal Perceptron and DNA-Like Learning Algorithm for Binary Neural Networks: LSBF and PBF Implementations, IEEE Transactions on Neural Networks, vol. 20, no. 10, October 2009, pp. 1645-1658;
2) Universal Perceptron and DNA-Like Learning Algorithm for Binary Neural Networks: Non-LSBF Implementation, IEEE Transactions on Neural Networks, vol. 20, no. 8, August 2009, pp. 1293-1301.
Hello,
Choose the maximum size among the training vectors as the size for all vectors. Complete the missing values with interpolated values, calculated as the average of the two neighboring values.
p.s. should you want to time-scale your data and do speech perception, you might first do frequency transformation and only then do time-scaling. Thus, you preserve the frequency domain.
You said "the time does not matter here at all", so I'm confused about the nature of the variable length vectors. If it really is the case that the vector length is nothing to do with time, I'd be tempted to try putting the length as one input and the mean value as another...(and/or the median, and/or the standard deviation ... , depending on just what you're looking at) ... again, it's going to depend on the volume of training data (if you don't have a lot of long vectors, then some of your inputs will have their weights underspecified). Otherwise, I'd go with Evaldas's comment!
Normalize the data before analysis.
You are worried about retaining the meaning of the data, so you must use a method of normalization that applies to your type of data. I am thinking in particular of four possible types of scalar data: nominal, ordinal, interval, ratio.
The idea of normalization assumes that it is okay to recode your feature values using a different scale or unit of measurement. It may also assume that it is okay to shift the scale (for instance, to achieve a mean of zero for the transformed feature values).
If you want to transform to other dimensions, you need to be careful not to lose information. Going to fewer dimensions can mean a loss of information, but this can be beneficial if the lost information can be attributed to noise or error in the data.
Why do some of your feature vectors have a different size? Are they damaged?
Every feature vector must be the same size and the feature definitions must be the same, pattern by pattern.
Normally one is comparing apples with apples instead of apples with oranges. But if you have a problem of missing or unequal features in the feature vectors that you are training with, and you want to use all of the features x1,...,xN even though some vectors have only x1,...,xM (where M < N), then I would extend each feature vector x1,...,xM to x1,...,xN by filling in the missing values with the value 0.5. If the features that are NOT missing distinguish the classes that are to be learned from the training, then the value 0.5 will provide the smallest expected error overall. Cheers. Have fun! -- Carl
OMG, I just wanted to give you a tangible example by mentioning different persons' voices. I am sorry, but the problem is neither a voice nor a speech classification problem at all. It is not even related to them.
As I mentioned previously, there are some series of numbers with different lengths. I want to map them to two classes, for example TRUE or FALSE.
The reason I chose an NN as a tool is that NNs are flexible and able to learn this complicated and strange problem.
But after implementation, I found that fixing the different sizes of the incoming vectors by adding zeros harmed the performance and made learning hard for the NN.
So, I am looking for the best approach to make the vectors equal in size.
Normalization must be done, but only after making them equal in size.
@Michael Manry, I know. I just said that maybe they cannot be considered FVs, but the problem is as I described.
@Christopher Ward, the normalization is not a serious problem; I have to make them equal in size. Maybe normalization in the horizontal direction would be helpful.
@Carl, maybe the best way is what you mentioned. Furthermore, I want to extract some features from these numbers to make meaningful feature vectors.
@Ridha Ejbali, interpolation is not a good idea here, because there are a lot of missing values.
@Stefano Rovetta, you are totally right about not working on the data directly.
I have two ideas for solving your problem:
a) Bipropagation
b) Border Pairs Method
With these two methods you can get a higher-dimensional feature vector which is also better separable than its predecessor. Both methods a) and b) are used in an MLP. Its first layer transforms the input vector into higher-dimensional feature vectors (if necessary) which are noise resistant and well separable in the case of classification.
For this type of vector, two techniques can be used:
1. From the set of vectors, take the ones that carry the most information and build a Kohonen map with them; then, when a vector lacks an entry, complete it with the information of the cluster to which the vector belongs, based on the components that do have information.
2. A second option is the family of algorithms that estimate the distribution: from the complete set of vectors, compute the mean and standard deviation of each component, so that to complete any feature vector you only need to generate a random number from that distribution.
In both cases it is ensured that the missing component carries information from the training set of feature vectors. Given this, it may be better to opt for a radial basis function neural network.
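A small sketch of the second option (per-component mean and standard deviation, then sampling from that distribution to fill missing entries); the data here is a made-up toy example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Rows are padded feature vectors; NaN marks the entries that were
# originally missing (hypothetical toy data).
X = np.array([[1.0, 2.0, 3.0],
              [2.0, np.nan, 4.0],
              [0.0, 3.0, np.nan]])

# Per-component mean and standard deviation over the observed values.
mu = np.nanmean(X, axis=0)
sigma = np.nanstd(X, axis=0)

# Fill each missing entry with a random draw from N(mu_j, sigma_j).
X_filled = X.copy()
missing = np.isnan(X)
cols = np.where(missing)[1]
X_filled[missing] = rng.normal(mu[cols], sigma[cols])
```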
Is it that each sequence is in its entirety either an instance of the target class or else is not a member of the class?
If the answer to this question is yes, then I would need more information as to the nature of these sequences before suggesting an answer.
In the meantime, you may wish to peruse:
Arvind S. Mohais, Rosemarie Mohais, Christopher Ward, Christian Posthoff: Earthquake classifying neural networks trained with random dynamic neighborhood PSOs. GECCO 2007: 110-117
@Bojan, thank you, but:
How can I feed the first layer when the feature vectors have different lengths? Can I use some of the approaches mentioned in previous comments, for example adding 0.5 for any missing value?
@Alejandro Peña, thank you, but:
Your approach may be helpful if the series followed a particular pattern or distribution. But, as I mentioned before, they are totally unrelated to each other. So, collecting information from other tuples cannot be meaningful.
@Christopher, yes, every single tuple belongs to one of the True or False classes. Their nature is something like the machine-language code of many files, for example in the x86 architecture. Each file is represented by some numbers (opcodes) and has its own number of lines.
You can first train the NN with the long feature vectors (FVs) only. Then you can put a short FV on the input with the missing values set to 0.5. After that, you can correct the missing values of the FV until the output value is correct. Finally, you can search for rules in these corrections. If there are any rules, then you can use them to transform the short FVs into long FVs.
Yes, you can use 0 or 0.5 for missing values.
(The above is in answer to: "@Bojan, thank you, but: How can I feed the first layer when the feature vectors have different lengths? Can I use some of the approaches mentioned in previous comments, for example adding 0.5 for any missing value?")
You can construct a polynomial P(n) in the variable n, such that P(n) equals the nth number in the sequence. The polynomial coefficients can equal the new input features.
Alternatively, in the original inputs, I'd use zero-valued inputs after the sequence ends.
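One way to keep the number of coefficients fixed is to fit a polynomial of a fixed, low degree by least squares (an exact fit through every point would need a degree that grows with the sequence length). A sketch, with the degree chosen arbitrarily:

```python
import numpy as np

def poly_features(seq, degree=5):
    """Fit a fixed-degree polynomial P(n) to the sequence values
    (least squares) and return its coefficients as features.
    The number of coefficients is degree + 1, independent of the
    sequence length."""
    seq = np.asarray(seq, dtype=float)
    n = np.linspace(0.0, 1.0, len(seq))  # normalised index
    return np.polyfit(n, seq, degree)

short = poly_features([10, 12, 9, 14, 20, 18, 7])
long_ = poly_features(list(range(50)))
assert short.shape == long_.shape == (6,)
```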
If the sequences are like machine language code files, then you may be able to use a bag-of-words approach to vectorising each sequence. In essence, what you do is count for each sequence the frequency of occurrence of each opcode in the language.
Since the dictionary of opcodes is the same for each sequence, the frequency vector (one count per opcode) will have the same dimensionality for each sequence.
Smaller sequences may yield vectors with shorter norms/lengths (to use a geometrical interpretation) than larger sequences, but may be characterised by similar directions (to continue the geometrical interpretation).
An extension of this is to count frequency of occurrence of subsequences of opcodes - which yields vectors of larger dimensionality.
P.S. I realise that your data may not be literally about opcodes...
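A minimal sketch of the bag-of-opcodes idea (plus the subsequence extension), assuming each file is simply a list of byte values 0-255; normalizing the counts to frequencies makes files of different lengths more directly comparable:

```python
import numpy as np

def opcode_histogram(opcodes, n_symbols=256, normalize=True):
    """Bag-of-opcodes: count how often each opcode value occurs.
    Every file maps to a vector of length n_symbols, regardless of
    how many opcodes the file contains."""
    counts = np.bincount(np.asarray(opcodes), minlength=n_symbols).astype(float)
    if normalize:
        counts /= max(counts.sum(), 1.0)  # frequencies, so file length cancels out
    return counts

def opcode_bigrams(opcodes, n_symbols=256):
    """Extension: count pairs of consecutive opcodes (n_symbols**2 bins)."""
    ops = np.asarray(opcodes)
    pairs = ops[:-1] * n_symbols + ops[1:]
    return np.bincount(pairs, minlength=n_symbols * n_symbols)

# Two files of different lengths yield vectors of the same dimensionality.
f1 = opcode_histogram([0x55, 0x8B, 0xEC, 0x83, 0xEC, 0x10])
f2 = opcode_histogram([0x55, 0x8B, 0xEC, 0x5D, 0xC3])
assert f1.shape == f2.shape == (256,)
```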
Need to look a bit more at the data. Are 99% of the features present in all vectors and only 1% of the features present in just a few vectors? Then it may be better to reduce dimensions by eliminating the troublesome 1% of the features. On the other hand, if 99% of your features are present in just a few vectors then neural networks are not the right tools; this case is like natural language where most words occur rarely and each sentence is of different length. The way your features are distributed in your vector population will determine the right tool and its configuration.
@Christopher Ward,
No, I am working on opcodes exactly!
I attached a short version of my data, including some files' opcodes padded with zeros to make their sizes equal. They all belong to class True, but the same scenario holds for class False.
@Stefano Rovetta, thank you, but the problem here is something more than just a feature vector. Actually, the problem is completely different from a usual learning problem. Look at the file enclosed in the above comment: these values should be mapped to a particular class, and they are completely unrelated to each other.
Och, they're 8-bit opcodes, presumably from the execution of some program on a processor with 8-bit opcodes. Now, depending on the CPU decode architecture, there may be a relationship between the bits set and specific aspects of the opcode (e.g. memory fetch, memory store, register operation, etc.), or not. Is this what you are trying to find out? But given that they are 8-bit opcodes, you need all 8 bits as input. Code each as 0/1 or -1/1 (unlikely to make any difference). I note a large number of 0s: is 0 the opcode for no-op?
@Leslie Smith, as I mentioned, these codes should be mapped to just two classes, True or False. The zeros at the end of each vector were added to make its size equal to the others'.
I considered each byte separately to increase the number of nodes in my NN and give it more generalization power.
Maybe replacing the zeros by 0.5 would be satisfactory, but I do not know whether there is any better solution.
@Michael Manry: Could you please describe this approach in more detail?
@Stefano, if I understood you correctly, I have to say you might be right if the vectors in different classes were not similar to each other; but consider that they are just files, and hence opcodes!
There may be vectors in the two classes that are very similar.
If the meaning is in the executional semantics of the opcodes then you may have to preprocess your data through a suitable parser as suggested by Stefano. Your goal in so doing would be to make the executional semantics explicit in the sequence of opcodes (i.e. no redirection of any sort)... this will not solve your problem of variable length sequences though.
For instance, you may have to unroll loops to create a representative sequence of execution in which the body of the loop is repeated. Also function calls may have the same opcode, but very different effects on execution depending on what function is called, so you may have to replace function calls with inline copies of the function code.
@Christopher, you are totally right. Maybe working on the data directly is not a good idea. I am looking for any feature extractable from a file's opcodes.
You can formalize your problem as a Multi-Instance Learning (MIL) problem. In the simplest case you can use a bag-of-words representation (as in most text-processing problems) or a bag of SIFTs as in computer vision.
There is also some work that, instead of generating histograms, parametrically fits a probability distribution to each sample and then uses statistical distance metrics (such as the KL divergence) to check how far two points are from each other.
What kind of data are you using? Some kind of data encoding or transformation is often necessary for NNs, since they cannot handle strings, variable-length vectors, etc. Using different kinds of encoding has a direct impact on the accuracy of NNs. You can always come up with encodings that are customised for your problem. If you provide me with more information about the problem that you are trying to solve and the type of data, I might be able to help.
You can consider hashing your input into an integer and then applying a modulo operation so that it lies within a certain range. Then set that index of the input vector to one. This can be repeated for each component of your input; e.g., if there are multiple opcodes within a given training example, you would set multiple elements of the vector to one. This is a common approach for categorical data. The book Mahout in Action gives a good overview of this.
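A rough sketch of this hashing trick. Note that for single 8-bit opcodes the byte value could serve directly as the index, so hashing mainly pays off for larger or composite tokens such as opcode pairs; zlib.crc32 is used here only as a convenient stable hash:

```python
import numpy as np
import zlib

def hashed_features(tokens, dim=64):
    """Feature hashing: map each (possibly composite) token to a bucket
    index via hash modulo dim and count it there. The output
    dimensionality is fixed no matter how many tokens a sample has."""
    v = np.zeros(dim)
    for tok in tokens:
        idx = zlib.crc32(str(tok).encode()) % dim
        v[idx] += 1.0  # use "= 1.0" instead for a purely binary encoding
    return v

# Example: hash opcode bigrams of two files of different lengths.
file_a = [0x55, 0x8B, 0xEC, 0x83, 0xEC, 0x10]
file_b = [0x55, 0x8B, 0xEC, 0x5D, 0xC3]
fa = hashed_features(zip(file_a[:-1], file_a[1:]))
fb = hashed_features(zip(file_b[:-1], file_b[1:]))
assert fa.shape == fb.shape == (64,)
```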
@Nona Kermani, a sample of the data was attached to one of the answers. Let me do it again; you can find it enclosed. These files belong to class True. The same story goes for the False class: different lengths and unrelated data in each class :-s
@Mohamad Ali Torkamani, take a look at the data attached in the above comment. Do you have any idea whether they can be used in an MIL approach or not?
Definitely, adding zeros is not a good idea. Where does the data come from? What do the repeated values mean (like the two 255s in the last sample)?
A good representation (which is also the simplest one, by the way) is to form a histogram of the features and then use the histogram as your new feature vector.
@Mohamad Ali Torkamani
Using a histogram as the new feature vector is the same as using a "bag-of-words", or in this case a "bag-of-opcodes", approach. This has been suggested by myself and others. On the face of it, this will solve the variable-length problem.
However, there is a big problem with the fact that a sequence of opcodes may not represent the potential sequence of execution if some of the opcodes specify a branch or redirection to a position in the code that is out of sequence. It gets worse if some of the opcodes denote conditional branches, as this means that the same sequence of opcodes can result in different sequences of execution.
@Mohamad Ali & @Christopher,
All values, even zero and even the repeated 255s, come from opcodes and are meaningful, except the zeros at the end of each line. Those trailing zeros were added to make the opcode sequences equal in size.
@Christopher, thanks for your precise explanations.
Another option, instead of using direct vectors, is to use the distance (which you can probably define meaningfully) from each vector to the rest. You'll have as many dimensions as vectors, but at least you'll have proper vectors that can be handled easily by kernel methods, clustering and whatever.
Not a bad idea. Ali could also:
(1) use zero-fill to make the sequences the same length, N;
(2) find the N by N autocovariance matrix;
(3) take its SVD and get the KLT transform matrix U^T;
(4) use the KLT of each sequence to produce N1 inputs, where N1 is a fixed integer < N.
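A rough numpy sketch of this recipe, assuming the sequences have already been zero-filled to a common length N (the toy data and the choice of N1 are illustrative only):

```python
import numpy as np

# Zero-filled sequences stacked as rows: shape (num_samples, N).
X = np.array([[1, 2, 3, 0, 0],
              [4, 5, 6, 7, 0],
              [2, 1, 0, 0, 0],
              [3, 3, 3, 3, 3]], dtype=float)

N1 = 2  # fixed, reduced number of inputs fed to the network

# (2) N x N autocovariance matrix of the zero-filled data.
Xc = X - X.mean(axis=0)
C = (Xc.T @ Xc) / X.shape[0]

# (3) SVD of the (symmetric) covariance gives the KLT basis U.
U, S, _ = np.linalg.svd(C)

# (4) Project each sequence onto the first N1 basis vectors.
X_klt = Xc @ U[:, :N1]          # shape (num_samples, N1)
```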
Fill the missing features with zeros; after that, transform the data to a subspace (PCA) and keep the principal components corresponding to the largest eigenvalues. That should solve your problem.
Cluster the data into a standard number of clusters. You can use the k-means algorithm or any other VQ-based algorithm. I think this will both compress and standardize your data into a fixed number of clusters.
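One possible reading of this suggestion: learn a small k-means codebook over short windows of the opcode streams, then describe each file by how often it uses each code word, which gives a fixed-length histogram. A sketch, with the window size and k chosen arbitrarily:

```python
import numpy as np
from sklearn.cluster import KMeans

def windows(seq, w=4):
    """Slice a 1-D sequence into overlapping windows of length w."""
    seq = np.asarray(seq, dtype=float)
    return np.array([seq[i:i + w] for i in range(len(seq) - w + 1)])

# Toy "opcode" files of different lengths.
files = [np.random.randint(0, 256, size=n) for n in (30, 55, 17)]

# Learn a codebook from all windows of all training files.
all_windows = np.vstack([windows(f) for f in files])
k = 8
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_windows)

# Each file becomes a fixed-length histogram over the k code words.
def vq_histogram(f):
    labels = km.predict(windows(f))
    h = np.bincount(labels, minlength=k).astype(float)
    return h / h.sum()

features = np.array([vq_histogram(f) for f in files])   # shape (num_files, k)
```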
Hi Ali Fakhari,
Have you tried using biclustering to perform imputation of the missing values?
By considering multi-faceted correlations between the rows and columns of a dataset, the biclustering technique allows deeper inference of missing values from the available data.
For more details about biclustering, you can refer to [1].
[1] F.O. de França, G.P. Coelho, F.J. Von Zuben, Predicting missing values with biclustering: A coherence-based approach, Pattern Recognition, Volume 46, Issue 5, May 2013, Pages 1255-1266.
Regards,
Salomão Madeiro
First of all, why do the vectors not have the same size? Because of missing data? If so, then no matter what method you use, you are "predicting" values, and you will never know whether the prediction is right or not, so it is a hard problem to solve. My suggestion is to play a bit with a smaller data set without missing values, try to mimic the missing-data behaviour, and see what happens. Another solution is to use an SVM; in this case (using a correlation-like kernel) the missing data are simply not used.
The simplest method is to use linear interpolation to resize the feature dimension, but it depends on what type of feature you are training the NN on.
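For instance, a variable-length sequence can be resampled to a fixed length with np.interp (the target length below is arbitrary):

```python
import numpy as np

def resample_linear(seq, target_len=64):
    """Resize a variable-length sequence to target_len samples by
    linear interpolation (stretches or compresses the index axis)."""
    seq = np.asarray(seq, dtype=float)
    old_x = np.linspace(0.0, 1.0, len(seq))
    new_x = np.linspace(0.0, 1.0, target_len)
    return np.interp(new_x, old_x, seq)

a = resample_linear([1, 5, 3, 8])          # short file, stretched
b = resample_linear(np.arange(500))        # long file, compressed
assert a.shape == b.shape == (64,)
```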
One possible approach could be to assign random vectors to the features, as is done in random projection. The vector size in this case is fixed. This approach is based on the Johnson-Lindenstrauss lemma:
http://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma
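A sketch of how this could look in practice: since a Johnson-Lindenstrauss projection assumes a common input dimension, one would first zero-pad to the maximum length and only then project down to a fixed number of components (the padding step, the toy data, and k are my own assumptions here):

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

# Toy variable-length sequences, zero-padded to the maximum length first
# (the padding is only a device to define a common input dimension for
# the projection).
seqs = [np.random.randint(0, 256, size=n) for n in (40, 120, 75)]
N = max(len(s) for s in seqs)
X = np.array([np.pad(s, (0, N - len(s))) for s in seqs], dtype=float)

# Project to a fixed, lower dimension k with a Gaussian random matrix.
k = 16
proj = GaussianRandomProjection(n_components=k, random_state=0)
X_k = proj.fit_transform(X)          # shape (num_sequences, k)
```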
All of these solutions seem to revolve around adjusting the data to the size of the neural network. However, an equally good if not better solution would be to make the network resizeable, in the sense that it would resize itself to the length of the feature vector, provided the vector's length falls within a reasonable range.
An R package is available for performing random projection using the Johnson-Lindenstrauss lemma:
https://cran.r-project.org/package=RandPro
Feature vectors depend on the application; they refer to the input-output pairs used for supervised learning.