I am training a neural network, but the feature vectors do not have the same size. This problem may be fixed by adding some zeros or removing some values, but the greater problem would be data loss or generating meaningless data.
So, is there any approach to make them equal in size without the mentioned weaknesses? Maybe a transformation to another dimensionality?
I do not want to use random values or "NA".
Keep in mind what each component of the feature vector represents. Adding a zero generally represents "no information available" and hence will not prejudice the outcome one way or the other. Depending on your implementation, it may "water down" the outcome, because "no information available" training-vector elements add no useful information while increasing the uncertainty of the outcome.
The question you need to ask is: "Do the feature vector components add useful information?" If certain components are redundant or represent "noise", they can probably be eliminated.
Thanks Joachim & George for your answers, but:
The feature vectors here are not standard feature vectors. I called them FVs because they are a kind of characteristic of a given file.
For example, consider the voice signals of different persons, with different lengths.
Each of them can be represented by a different number of bytes.
With this new view, how do you see the problem? How can I prepare them for good training?
In classification, imputation is usually applied for missing values. There are various methods for imputation, such as the mean/median value across observations or a nearest neighbor imputation using the feature space (of remaining features).
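As a minimal illustration of these two imputation strategies (assuming the vectors have already been padded to a common length, with NaN marking the missing entries; the toy data below is purely hypothetical):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy data: rows are observations, NaN marks missing feature values.
X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])

# Mean imputation: each NaN is replaced by the column (feature) mean.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Nearest-neighbour imputation: each NaN is replaced by the average of
# that feature over the k most similar rows (similarity computed on the
# non-missing features).
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(X_mean)
print(X_knn)
```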
Although I am looking for a transformation, something like PCA but in the opposite direction (to increase the FV length), I think adding some values (anything except zero) would be better than using zero for them.
Ali, for audio files of different lengths, you are usually faced with time-related features (a single number per frame), so you need to compress the variable number of frame-based features into a compact, fixed-size representation. Statistical moments are very welcome here (mean, standard deviation, mode, quartiles, min, max, skewness, kurtosis); autocorrelation coefficients could also help you obtain useful information from variable-length signals. Another solution could be a fixed-size description of the distribution, either through a histogram (predefined number of bins) or a GMM (predefined number of centers). If you don't want to lose the time-related patterns, then an HMM should be considered instead of a GMM.
For example, on page 21 of the thesis here [ http://www.ifs.tuwien.ac.at/mir/pub/Pflugfelder_DiscriminationAnalysis_Thesis.pdf ], 24 Bark coefficients, which are frame-based features, are compressed by the proposed SSD (Figure 2.1, c and d), so your feature vector for each wav would contain 24*7=168 elements. Please remember that each element of a feature vector fed to a neural network is very important: it cannot mean one thing for one wav and another thing for another wav just because the wav files differ in length.
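A minimal sketch of this idea in Python: compress each variable-length signal into a fixed-size vector of statistical moments. The particular set of statistics below is just an illustrative choice:

```python
import numpy as np
from scipy import stats

def summary_features(x):
    """Compress a variable-length 1-D sequence into a fixed-size
    vector of order statistics and moments."""
    x = np.asarray(x, dtype=float)
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    return np.array([
        x.mean(),
        x.std(),
        x.min(),
        x.max(),
        q1, q2, q3,
        stats.skew(x),
        stats.kurtosis(x),
    ])

# Two "files" of different lengths map to vectors of identical size.
a = summary_features([3, 1, 4, 1, 5, 9, 2, 6])
b = summary_features([2, 7, 1, 8, 2, 8, 1, 8, 2, 8, 4, 5, 9])
assert a.shape == b.shape == (9,)
```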
Thank you Evaldas for your very helpful answer.
Actually, time does not matter here at all. There are some one-dimensional series of numbers (between 0 and 255) of different sizes which are to be mapped to some classes. Of course, they should be normalized, and the range will be changed.
I know each node in the NN must be meaningful. But I hope the NN learns the training set by adapting its weights according to these numbers. It may not be considered a typical learning problem, but the test set is very similar to the training set (by the nature of the problem), and I hope the NN can find the proper classes using the adjusted weights.
Regarding your answer, I will consider the features you introduced and use them alongside statistical approaches to make the feature vectors equal in size, although I do not yet know whether they can be applied to this kind of problem.
Are you sure neural networks are the best option? You might want to use frequent-set analysis algorithms, for instance, or apply, as said above, some kind of preprocessing so that vectors of the same size are eventually used.
How about using cascaded SVMs? This would give you a higher degree of flexibility for a variable-length FV.
You could use recurrent nets (e.g., simple recurrent networks), but they are not so successful on long series (there are some methods to overcome this). Speech is usually preprocessed in frames (e.g., 25 ms) which you encode on some frequency scale (Mel spectrum or so). Direct acoustic processing by neural nets is a waste of resources, unless you use the network to preprocess the data. Then I don't see a problem with zero-padding your speech segments if you want to process the data with feed-forward nets.
I'm not aware whether people in this domain do length normalization, which is very useful in other domains (e.g., kinematics). So, you could interpolate your data to a certain length. The obvious problem there is that by doing so you change the frequency characteristics of the speech elements, so this could only work for small length adjustments. Moreover, in order for this to work, you need a good number of training instances with different lengths, so that the (NN) classifier learns to grasp the invariances you need. This approach should work in simple speech recognition with a limited number of classes, i.e., words.
Please see our papers on binary neural networks:
1) Universal Perceptron and DNA-Like Learning Algorithm for Binary Neural Networks: LSBF and PBF Implementations, IEEE Transactions on Neural Networks, vol. 20, no. 10, October 2009, pp. 1645-1658;
2) Universal Perceptron and DNA-Like Learning Algorithm for Binary Neural Networks: Non-LSBF Implementation, IEEE Transactions on Neural Networks, vol. 20, no. 8, August 2009, pp. 1293-1301.
Hello,
Choose the maximum size among the training vectors as the size for all vectors. Complete the missing values with interpolated values, calculated as the average of the two neighboring values.
p.s. should you want to time-scale your data and do speech perception, you might first do frequency transformation and only then do time-scaling. Thus, you preserve the frequency domain.
You said "the time does not matter here at all", so I'm confused about the nature of the variable length vectors. If it really is the case that the vector length is nothing to do with time, I'd be tempted to try putting the length as one input and the mean value as another...(and/or the median, and/or the standard deviation ... , depending on just what you're looking at) ... again, it's going to depend on the volume of training data (if you don't have a lot of long vectors, then some of your inputs will have their weights underspecified). Otherwise, I'd go with Evaldas's comment!
Normalize the data before analysis.
You are worried about retaining the meaning of the data, so you must use a method of normalization that applies to your type of data. I am thinking in particular of four possible types of scalar data: nominal, ordinal, interval, ratio.
The idea of normalization assumes that it is okay to recode your feature values using a different scale or unit of measurement. It may also assume that it is okay to shift the scale (for instance, to achieve a mean of zero for the transformed feature values).
If you want to transform to other dimensions, you need to be careful not to lose information. Going to fewer dimensions can mean a loss of information, but this can be beneficial if the lost information can be attributed to noise or error in the data.
Why do some of your feature vectors have a different size? Are they damaged?
Every feature vector must be the same size and the feature definitions must be the same, pattern by pattern.
Normally one is comparing apples with apples instead of apples with oranges. But if you have a problem of missing or unequal features in the feature vectors that you are training with, and you want to use all of the features x1,...,xN even though some vectors have only x1,...,xM (where M < N), then I would extend each feature vector x1,...,xM to x1,...,xN by filling in the missing values with the value 0.5. If the features that are NOT missing distinguish the classes that are to be learned from the training, then the value 0.5 will provide the smallest expected error overall. Cheers. Have fun! -- Carl
OMG, I just wanted to give you a tangible example by mentioning different persons' voices. I am sorry, but the problem is neither a voice nor a speech classification problem at all. It is not even related to them.
As I mentioned previously, there are some series of numbers with different lengths. I want to map them to two classes, for example TRUE or FALSE.
The reason I chose an NN as a tool is that NNs are flexible and able to learn this complicated and strange problem.
But after implementation, I found that fixing the different sizes of the incoming vectors by adding zeros harmed the performance and made learning hard for the NN.
So, I am looking for the best approach to make the vectors equal in size.
Normalization must be done, but only after making them equal in size.
@Michael Manry, I know. I just said that maybe they cannot be considered FVs, but the problem is as I described.
@Christopher Ward, the normalization is not a serious problem; I have to make them equal in size. Maybe normalization in the horizontal direction would be helpful.
@Carl, maybe the best way is what you mentioned. Furthermore, I want to extract some features from these numbers to make meaningful feature vectors.
@Ridha Ejbali, interpolation is not a good idea here, because there are a lot of missing values.
@Stefano Rovetta, you are totally right about not working on the data directly.
I have two ideas for solving your problem:
a) Bipropagation
b) Border Pairs Method
With these two methods you can get a higher-dimensional feature vector which is also better separable than its predecessor. Both methods a) and b) are used in an MLP. Its first layer transforms the input vector into higher-dimensional feature vectors (if necessary) which are noise resistant and well separable in the case of classification.
For this type of vector, two techniques can be used:
1. From the set of vectors, take the ones that carry the most information and build a Kohonen map with them; then, when a vector lacks an entry, complete it with the information of the cluster to which the vector belongs, based on the components that do have information.
2. A second option is the family of algorithms that estimate the distribution: from the complete set of vectors, compute the mean and standard deviation of each component, so that to complete any feature vector you only need to generate a random number from that distribution.
In both cases it is ensured that the missing component carries information from the training set of feature vectors. Given this, it may be better to opt for a radial basis function neural network.
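A small sketch of the second option (per-component mean and standard deviation, then sampling from that distribution to fill missing entries); the data here is a made-up toy example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Rows are padded feature vectors; NaN marks the entries that were
# originally missing (hypothetical toy data).
X = np.array([[1.0, 2.0, 3.0],
              [2.0, np.nan, 4.0],
              [0.0, 3.0, np.nan]])

# Per-component mean and standard deviation over the observed values.
mu = np.nanmean(X, axis=0)
sigma = np.nanstd(X, axis=0)

# Fill each missing entry with a random draw from N(mu_j, sigma_j).
X_filled = X.copy()
missing = np.isnan(X)
cols = np.where(missing)[1]
X_filled[missing] = rng.normal(mu[cols], sigma[cols])
```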
Is it that each sequence is in its entirety either an instance of the target class or else is not a member of the class?
If the answer to this question is yes, then I would need more information as to the nature of these sequences before suggesting an answer.
In the meantime, you may wish to peruse:
Arvind S. Mohais, Rosemarie Mohais, Christopher Ward, Christian Posthoff: Earthquake classifying neural networks trained with random dynamic neighborhood PSOs. GECCO 2007: 110-117
@Bojan, thank you, but:
How can I feed the first layer when the feature vectors have different lengths? Can I use some of the approaches mentioned in previous comments, for example adding 0.5 for any missing value?
@Alejandro Peña, thank you, but:
Your approach may be helpful if the series followed a particular pattern or distribution. But, as I mentioned before, they are totally unrelated to each other. So, collecting information from other tuples cannot be meaningful.
@Christopher, yes, every single tuple belongs to one of the True or False classes. Their nature is something like the machine-language code of many files, for example in the x86 architecture. Each file is represented by some numbers (opcodes) and has its own number of lines.
You can first train the NN with the long feature vectors (FVs) only. Then you can put a short FV on the input with the missing values set to 0.5. After that, you can correct the missing values of the FV until the output value is correct. Finally, you can search for rules in these corrections. If there are any rules, then you can use them to transform the short FVs into long FVs.
Yes, you can use 0 or 0.5 for missing values.
(The above is in answer to: "@Bojan, thank you, but: How can I feed the first layer when the feature vectors have different lengths? Can I use some of the approaches mentioned in previous comments, for example adding 0.5 for any missing value?")
You can construct a polynomial P(n) in the variable n, such that P(n) equals the nth number in the sequence. The polynomial coefficients can equal the new input features.
Alternatively, in the original inputs, I'd use zero-valued inputs after the sequence ends.
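One way to keep the number of coefficients fixed is to fit a polynomial of a fixed, low degree by least squares (an exact fit through every point would need a degree that grows with the sequence length). A sketch, with the degree chosen arbitrarily:

```python
import numpy as np

def poly_features(seq, degree=5):
    """Fit a fixed-degree polynomial P(n) to the sequence values
    (least squares) and return its coefficients as features.
    The number of coefficients is degree + 1, independent of the
    sequence length."""
    seq = np.asarray(seq, dtype=float)
    n = np.linspace(0.0, 1.0, len(seq))  # normalised index
    return np.polyfit(n, seq, degree)

short = poly_features([10, 12, 9, 14, 20, 18, 7])
long_ = poly_features(list(range(50)))
assert short.shape == long_.shape == (6,)
```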
If the sequences are like machine language code files, then you may be able to use a bag-of-words approach to vectorising each sequence. In essence, what you do is count for each sequence the frequency of occurrence of each opcode in the language.
Since the dictionary of opcodes is the same for each sequence, the frequency vector (one count per opcode) will have the same dimensionality for each sequence.
Smaller sequences may yield vectors with shorter norms/lengths (to use a geometrical interpretation) than larger sequences, but may be characterised by similar directions (to continue the geometrical interpretation).
An extension of this is to count frequency of occurrence of subsequences of opcodes - which yields vectors of larger dimensionality.
P.S. I realise that your data may not be literally about opcodes...
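A minimal sketch of the bag-of-opcodes idea (plus the subsequence extension), assuming each file is simply a list of byte values 0-255; normalizing the counts to frequencies makes files of different lengths more directly comparable:

```python
import numpy as np

def opcode_histogram(opcodes, n_symbols=256, normalize=True):
    """Bag-of-opcodes: count how often each opcode value occurs.
    Every file maps to a vector of length n_symbols, regardless of
    how many opcodes the file contains."""
    counts = np.bincount(np.asarray(opcodes), minlength=n_symbols).astype(float)
    if normalize:
        counts /= max(counts.sum(), 1.0)  # frequencies, so file length cancels out
    return counts

def opcode_bigrams(opcodes, n_symbols=256):
    """Extension: count pairs of consecutive opcodes (n_symbols**2 bins)."""
    ops = np.asarray(opcodes)
    pairs = ops[:-1] * n_symbols + ops[1:]
    return np.bincount(pairs, minlength=n_symbols * n_symbols)

# Two files of different lengths yield vectors of the same dimensionality.
f1 = opcode_histogram([0x55, 0x8B, 0xEC, 0x83, 0xEC, 0x10])
f2 = opcode_histogram([0x55, 0x8B, 0xEC, 0x5D, 0xC3])
assert f1.shape == f2.shape == (256,)
```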
Need to look a bit more at the data. Are 99% of the features present in all vectors and only 1% of the features present in just a few vectors? Then it may be better to reduce dimensions by eliminating the troublesome 1% of the features. On the other hand, if 99% of your features are present in just a few vectors then neural networks are not the right tools; this case is like natural language where most words occur rarely and each sentence is of different length. The way your features are distributed in your vector population will determine the right tool and its configuration.
@Christopher Ward,
No, I am working on opcodes exactly!
I attached a short version of my data, including some files' opcodes padded with zeros to make their sizes equal. They all belong to class True, but the same scenario holds for class False.
@Stefano Rovetta, thank you, but the problem here is something more than just a feature vector. Actually, the problem is completely different from a usual learning problem. Look at the file enclosed in the above comment: these values should be mapped to a particular class, and they are completely unrelated to each other.
Och, they're 8-bit opcodes, presumably from the execution of some program on a processor with 8-bit opcodes. Now, depending on the CPU decode architecture, there may be a relationship between the bits set and specific aspects of the opcode (e.g. memory fetch, memory store, register operation, etc.), or not. Is this what you are trying to find out? But given that they are 8-bit opcodes, you need all 8 bits as input. Code each as 0/1 or -1/1 (unlikely to make any difference). I note a large number of 0s: is 0 the opcode for no-op?
@Leslie Smith, as I mentioned, these codes should be mapped to just two classes, True or False. The zeros at the end of each vector were added to make its size equal to the others'.
I considered each byte separately to increase the number of nodes in my NN and give it more generalization power.
Maybe replacing the zeros by 0.5 would be satisfactory, but I do not know whether there is any better solution.
@Michael Manry: Could you please describe this approach in more detail?
@Stefano, if I understood you correctly, I have to say you might be right if the vectors in different classes were not similar to each other; but consider that they are just files, and hence opcodes!
There may be vectors in the two classes that are very similar.
If the meaning is in the executional semantics of the opcodes then you may have to preprocess your data through a suitable parser as suggested by Stefano. Your goal in so doing would be to make the executional semantics explicit in the sequence of opcodes (i.e. no redirection of any sort)... this will not solve your problem of variable length sequences though.
For instance, you may have to unroll loops to create a representative sequence of execution in which the body of the loop is repeated. Also function calls may have the same opcode, but very different effects on execution depending on what function is called, so you may have to replace function calls with inline copies of the function code.
@Christopher, you are totally right. Maybe working on the data directly is not a good idea. I am looking for any feature extractable from a file's opcodes.
You can formalize your problem as a Multi-Instance Learning (MIL) problem. In the simplest case you can use a bag-of-words representation (as in most text-processing problems) or a bag of SIFTs as in computer vision.
There is also some work that, instead of generating histograms, parametrically fits a probability distribution to each sample and then uses statistical distance metrics (such as the KL divergence) to check how far two points are from each other.
What kind of data are you using? Some kind of data encoding or transformation is often necessary for NNs, since they cannot handle strings, variable-length vectors, etc. Using different kinds of encoding has a direct impact on the accuracy of NNs. You can always come up with encodings that are customised for your problem. If you provide me with more information about the problem that you are trying to solve and the type of data, I might be able to help.
You can consider hashing your input into an integer and then applying a modulo operation so that it lies within a certain range. Then set that index of the input vector to one. This can be repeated for each component of your input; e.g., if there are multiple opcodes within a given training example, you would set multiple elements of the vector to one. This is a common approach for categorical data. The book Mahout in Action gives a good overview of this.
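A rough sketch of this hashing trick. Note that for single 8-bit opcodes the byte value could serve directly as the index, so hashing mainly pays off for larger or composite tokens such as opcode pairs; zlib.crc32 is used here only as a convenient stable hash:

```python
import numpy as np
import zlib

def hashed_features(tokens, dim=64):
    """Feature hashing: map each (possibly composite) token to a bucket
    index via hash modulo dim and count it there. The output
    dimensionality is fixed no matter how many tokens a sample has."""
    v = np.zeros(dim)
    for tok in tokens:
        idx = zlib.crc32(str(tok).encode()) % dim
        v[idx] += 1.0  # use "= 1.0" instead for a purely binary encoding
    return v

# Example: hash opcode bigrams of two files of different lengths.
file_a = [0x55, 0x8B, 0xEC, 0x83, 0xEC, 0x10]
file_b = [0x55, 0x8B, 0xEC, 0x5D, 0xC3]
fa = hashed_features(zip(file_a[:-1], file_a[1:]))
fb = hashed_features(zip(file_b[:-1], file_b[1:]))
assert fa.shape == fb.shape == (64,)
```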
@Nona Kermani, a sample of the data was attached to one of the answers. Let me do it again; you can find it enclosed. These files belong to class True. The same story goes for the False class: different lengths and unrelated data in each class :-s
@Mohamad Ali Torkamani, take a look at the data attached in the above comment. Do you have any idea whether they can be used in an MIL approach or not?
Definitely, adding zeros is not a good idea. Where does the data come from? What do the repeated values mean (like the two 255s in the last sample)?
A good representation (which is also the simplest one, by the way) is to form a histogram of the features and then use the histogram as your new feature vector.
@Mohamad Ali Torkamani
Using a histogram as the new feature vector is the same as using a "bag-of-words", or in this case a "bag-of-opcodes", approach. This has been suggested by myself and others. On the face of it, this will solve the variable-length problem.
However, there is a big problem with the fact that a sequence of opcodes may not represent the potential sequence of execution if some of the opcodes specify a branch or redirection to a position in the code that is out of sequence. It gets worse if some of the opcodes denote conditional branches, as this means that the same sequence of opcodes can result in different sequences of execution.
@Mohamad Ali & @Christopher,
All values, even zero and even the repeated 255s, come from opcodes and are meaningful, except the zeros at the end of each line. Those trailing zeros were added to make the opcode sequences equal in size.
@Christopher, thanks for your precise explanations.
Another option, instead of using direct vectors, is to use the distance (which you can probably define meaningfully) from each vector to the rest. You'll have as many dimensions as vectors, but at least you'll have proper vectors that can be handled easily by kernel methods, clustering and whatever.
Not a bad idea. Ali could also:
(1) use zero-fill to make the sequences the same length, N;
(2) find the N by N autocovariance matrix;
(3) take its SVD and get the KLT transform matrix U^T;
(4) use the KLT of each sequence to produce N1 inputs, where N1 is a fixed integer < N.
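A rough numpy sketch of this recipe, assuming the sequences have already been zero-filled to a common length N (the toy data and the choice of N1 are illustrative only):

```python
import numpy as np

# Zero-filled sequences stacked as rows: shape (num_samples, N).
X = np.array([[1, 2, 3, 0, 0],
              [4, 5, 6, 7, 0],
              [2, 1, 0, 0, 0],
              [3, 3, 3, 3, 3]], dtype=float)

N1 = 2  # fixed, reduced number of inputs fed to the network

# (2) N x N autocovariance matrix of the zero-filled data.
Xc = X - X.mean(axis=0)
C = (Xc.T @ Xc) / X.shape[0]

# (3) SVD of the (symmetric) covariance gives the KLT basis U.
U, S, _ = np.linalg.svd(C)

# (4) Project each sequence onto the first N1 basis vectors.
X_klt = Xc @ U[:, :N1]          # shape (num_samples, N1)
```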
Fill the missing features with zeros; after that, transform the data to a subspace (PCA) and keep the principal components corresponding to the largest eigenvalues. That should solve your problem.
Cluster the data into a standard number of clusters. You can use the k-means algorithm or any other VQ-based algorithm. I think this will both compress and standardize your data into a fixed number of clusters.
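One possible reading of this suggestion: learn a small k-means codebook over short windows of the opcode streams, then describe each file by how often it uses each code word, which gives a fixed-length histogram. A sketch, with the window size and k chosen arbitrarily:

```python
import numpy as np
from sklearn.cluster import KMeans

def windows(seq, w=4):
    """Slice a 1-D sequence into overlapping windows of length w."""
    seq = np.asarray(seq, dtype=float)
    return np.array([seq[i:i + w] for i in range(len(seq) - w + 1)])

# Toy "opcode" files of different lengths.
files = [np.random.randint(0, 256, size=n) for n in (30, 55, 17)]

# Learn a codebook from all windows of all training files.
all_windows = np.vstack([windows(f) for f in files])
k = 8
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_windows)

# Each file becomes a fixed-length histogram over the k code words.
def vq_histogram(f):
    labels = km.predict(windows(f))
    h = np.bincount(labels, minlength=k).astype(float)
    return h / h.sum()

features = np.array([vq_histogram(f) for f in files])   # shape (num_files, k)
```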
Hi Ali Fakhari,
Have you tried using biclustering to perform imputation of the missing values?
By considering multi-faceted correlations between the rows and columns of a dataset, the biclustering technique allows deeper inference of missing values from the available data.
For more details about biclustering, you can refer to [1].
[1] F.O. de França, G.P. Coelho, F.J. Von Zuben, Predicting missing values with biclustering: A coherence-based approach, Pattern Recognition, Volume 46, Issue 5, May 2013, Pages 1255-1266.
Regards,
Salomão Madeiro
First of all, why do the vectors not have the same size? Because of missing data? If so, then no matter what method you use, you are "predicting" values, and you will never know whether the prediction is right or not, so it is a hard problem to solve. My suggestion is to play a bit with a smaller data set without missing values, try to mimic the missing-data behaviour, and see what happens. Another solution is to use an SVM; in this case (using a correlation-like kernel) the missing data are simply not used.
The simplest method is to use linear interpolation to resize the feature dimension, but it depends on what type of feature you are training the NN on.
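For instance, a variable-length sequence can be resampled to a fixed length with np.interp (the target length below is arbitrary):

```python
import numpy as np

def resample_linear(seq, target_len=64):
    """Resize a variable-length sequence to target_len samples by
    linear interpolation (stretches or compresses the index axis)."""
    seq = np.asarray(seq, dtype=float)
    old_x = np.linspace(0.0, 1.0, len(seq))
    new_x = np.linspace(0.0, 1.0, target_len)
    return np.interp(new_x, old_x, seq)

a = resample_linear([1, 5, 3, 8])          # short file, stretched
b = resample_linear(np.arange(500))        # long file, compressed
assert a.shape == b.shape == (64,)
```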
One possible approach could be to assign random vectors to the features, as is done in random projection. The vector size in this case is fixed. This approach is based on the Johnson-Lindenstrauss lemma:
http://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma
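A sketch of how this could look in practice: since a Johnson-Lindenstrauss projection assumes a common input dimension, one would first zero-pad to the maximum length and only then project down to a fixed number of components (the padding step, the toy data, and k are my own assumptions here):

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

# Toy variable-length sequences, zero-padded to the maximum length first
# (the padding is only a device to define a common input dimension for
# the projection).
seqs = [np.random.randint(0, 256, size=n) for n in (40, 120, 75)]
N = max(len(s) for s in seqs)
X = np.array([np.pad(s, (0, N - len(s))) for s in seqs], dtype=float)

# Project to a fixed, lower dimension k with a Gaussian random matrix.
k = 16
proj = GaussianRandomProjection(n_components=k, random_state=0)
X_k = proj.fit_transform(X)          # shape (num_sequences, k)
```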
All of these solutions seem to revolve around adjusting the data to the size of the neural network. However, an equally good if not better solution would be to make the network resizeable, in the sense that it would resize itself to the length of the feature vector, provided the vector's length falls within a reasonable range.
An R package is available for performing random projection using the Johnson-Lindenstrauss lemma:
https://cran.r-project.org/package=RandPro
Feature vectors depend on the application; they refer to the input-output pairs used for supervised learning.