I have been reading papers on face recognition but I am unsure of how the accuracy of the algorithm is determined. Do I simply input several known faces and see how many of them are matched?
Firstly, I assume the task is to match a name (or other identifier) to a particular face, rather than to detect things that are faces compared to things that are not.
There are three ways you could measure accuracy in a face recognition task. The most appropriate one depends to an extent on what your end goal is.
1 - How accurate is the algorithm at detecting one person in a data set containing many images of that person and many images of different people?
2 - How accurate is the algorithm at learning a set of faces from training images and then correctly identifying the same people in a test set of different images, where both image sets contain the same people?
3 - How accurate is the algorithm at detecting multiple people in a dataset containing images of these people and other people?
For case 1, you would want to train the algorithm on a set of images of one person's face and test on a set that contained different images of the target person as well as an equal number of other people. This would effectively be a binary classification task, and you could use precision and recall (along with the associated F1 score and accuracy) to evaluate performance (see http://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers for a good overview of this area). The test could be repeated with different people to give more generalised results.
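As a minimal sketch of the case 1 evaluation (all labels and numbers below are invented for illustration), precision, recall and F1 can be computed from the binary decisions like this:

```python
def precision_recall_f1(y_true, y_pred):
    """Binary-classification metrics from 0/1 ground truth and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)       # target correctly matched
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)   # other person matched as target
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)   # target missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy test set: 4 images of the target person, 4 of other people
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]   # hypothetical recogniser output
p, r, f = precision_recall_f1(y_true, y_pred)
# precision = 0.75, recall = 0.75, F1 = 0.75
```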
For case 2, you would want to train on multiple images of several people and then test on different images of the same people (a leave-one-out methodology might be useful here if your dataset is limited). This would be a multi-class classification problem, and confusion matrices would be helpful in evaluating this sort of test.
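A confusion matrix for the case 2 multi-class test only takes a few lines; the identities and predictions below are made up for illustration:

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels, classes):
    # Rows are the true class, columns the predicted class.
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]

true_ids = ["alice", "alice", "bob", "bob", "carol", "carol"]
pred_ids = ["alice", "bob",   "bob", "bob", "carol", "alice"]   # hypothetical output
cm = confusion_matrix(true_ids, pred_ids, ["alice", "bob", "carol"])
# cm == [[1, 1, 0],
#        [0, 2, 0],
#        [1, 0, 1]]
# Overall accuracy is the diagonal sum over the total: 4/6 here.
accuracy = sum(cm[i][i] for i in range(3)) / len(true_ids)
```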
For case 3, you would want to train the algorithm on a labelled training set of images of several people and then test on a set containing different images of the same people along with images of other faces (the proportions in the test mix would vary depending on the target use; if the idea is to recognise a few people in a crowd, you would use a larger number of 'false' images). This could be formulated as a multi-class problem (where each person is a separate class, along with an 'other' class) or as a binary classification problem (person of interest vs others). You would need to be wary of any accuracy measure dominated by true negatives if your test set is unbalanced, though.
Hope the above is helpful.
Edited to clear up some points and expand on case 3.
I have so much to learn! I have one last question to ask though. How many images of each individual should I include in my training database? Does it depend on any conditions? I have been searching through Google for an answer but to no avail.
I downloaded the Yale database where there are 65 images of each individual. I am trying out the LDA and I am unsure of the number/type of images that should be included in the training database.
The answer to "how many images of each face do I need" depends on, as Asaim has said, if you want to do recognition from different angles. If you are doing frontal recognition only, then you could try training on a single image and testing on the rest. You could also try training on more than one image and testing on the remainder. You may find that your system performs better with more training data, but then again, you may not.
If you want to do recognition from different angles, you would want to train the system on images of this type and construct a training set accordingly, with at least one image for each pose per person.
If you vary the number of images in the training set and test on the remainder (do multiple randomized runs too) then you will get a feel for how much training data your system needs. I would expect that a database of 65 images per person would contain more than enough for you to build good training and testing data sets.
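A sketch of that experiment might look like the following; the `evaluate` function here is a placeholder for whatever train-and-test routine your own recogniser uses, and the structure of the data is assumed, not prescribed:

```python
import random

def random_split(images, n_train, seed):
    """Randomly split one person's images into (train, test) of sizes n_train and the rest."""
    rng = random.Random(seed)
    shuffled = images[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

def learning_curve(images_per_person, train_sizes, evaluate, n_runs=10):
    """Average accuracy per training-set size over several randomized splits.

    images_per_person: dict identity -> list of images
    evaluate(train, test) -> accuracy, supplied by the caller
    """
    results = {}
    for n in train_sizes:
        accs = []
        for run in range(n_runs):
            split = {p: random_split(imgs, n, run) for p, imgs in images_per_person.items()}
            train = {p: s[0] for p, s in split.items()}
            test = {p: s[1] for p, s in split.items()}
            accs.append(evaluate(train, test))
        results[n] = sum(accs) / len(accs)
    return results
```

With 65 images per person you could, for example, sweep `train_sizes = [1, 5, 10, 20]` and plot the resulting averages.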
To answer your first question, there are basically two types of metrics that measure face recognition performance (as we agreed, we are not speaking of second-level measures like failure-to-enroll). See http://en.wikipedia.org/wiki/Biometrics for a full range of metrics.
The first measure is the Recognition Rate, which is probably the simplest measure. It relies on a list of gallery images (usually one per identity) and a list of probe images of the same identities. For each probe image, the similarity to all gallery images is computed, and it is determined whether the gallery image with the highest similarity (or the lowest distance value) has the same identity as the probe image. Finally, the Recognition Rate is the total number of correctly identified probe images divided by the total number of probe images.
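A small sketch of that computation, using made-up one-dimensional "features" in place of real face templates (any real system would compare feature vectors instead):

```python
def recognition_rate(gallery, probes, similarity):
    """gallery: dict identity -> template; probes: list of (identity, sample).

    A probe counts as correct when its most similar gallery entry
    has the same identity.
    """
    correct = 0
    for true_id, sample in probes:
        best_id = max(gallery, key=lambda g: similarity(gallery[g], sample))
        if best_id == true_id:
            correct += 1
    return correct / len(probes)

# Toy 1-D "features": the best match is the closest value.
gallery = {"alice": 0.1, "bob": 0.5, "carol": 0.9}
probes = [("alice", 0.15), ("bob", 0.48), ("carol", 0.2)]
rate = recognition_rate(gallery, probes, lambda g, p: -abs(g - p))
# carol's probe lands closest to alice's template, so rate = 2/3
```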
The second measure is the Verification Rate. It relies on a list of image pairs, where pairs with the same identity and pairs with different identities are compared. Given the lists of similarities of both types, the Receiver Operating Characteristic can be computed, and from it the Verification Rate; see http://en.wikipedia.org/wiki/Receiver_operating_characteristic for details.
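To make this concrete, here is a rough sketch that sweeps a decision threshold over two hypothetical score lists (genuine = same-identity pairs, impostor = different-identity pairs) and reports FAR and verification rate at each point; the scores are invented:

```python
def roc_points(genuine_scores, impostor_scores):
    """For each candidate threshold, accept a pair if score >= threshold.

    FAR = fraction of impostor pairs accepted;
    VR (verification rate, i.e. 1 - FRR) = fraction of genuine pairs accepted.
    """
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    points = []
    for t in thresholds:
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        vr = sum(s >= t for s in genuine_scores) / len(genuine_scores)
        points.append((t, far, vr))
    return points

genuine = [0.9, 0.8, 0.7, 0.4]    # invented same-identity pair scores
impostor = [0.6, 0.3, 0.2, 0.1]   # invented different-identity pair scores
for t, far, vr in roc_points(genuine, impostor):
    print(f"threshold={t:.1f}  FAR={far:.2f}  VR={vr:.2f}")
```

Plotting VR against FAR over the thresholds gives the ROC curve.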
There are even more measures, for example the Half Total Error Rate and similar, which rely on independent development and evaluation sets, but I guess that this is a little bit too advanced for a start.
BTW: We provide default implementations for all metrics and plots through our open source project Bob: https://pypi.python.org/pypi/bob.measure
For recognition problems, it is also interesting to plot the Cumulative Match Characteristic (CMC) curve. Assuming that your classifier returns a vector of scores for an input test face, this curve shows the rank at which the true identity appears in the sorted score vector. Sometimes a system returns the top three positions of the score vector in order to give more robust results.
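To make the CMC idea concrete, here is a small sketch, again with toy one-dimensional "features" standing in for real face templates:

```python
def cmc_curve(gallery, probes, similarity):
    """gallery: dict identity -> template; probes: list of (identity, sample).

    Returns a list where entry k is the rank-(k+1) identification rate:
    the fraction of probes whose true identity appears within the top k+1
    gallery entries when sorted by similarity.
    """
    n = len(gallery)
    hits = [0] * n
    for true_id, sample in probes:
        ranked = sorted(gallery, key=lambda g: similarity(gallery[g], sample),
                        reverse=True)
        rank = ranked.index(true_id)          # 0-based position of true identity
        for r in range(rank, n):
            hits[r] += 1
    return [h / len(probes) for h in hits]

gallery = {"alice": 0.1, "bob": 0.5, "carol": 0.9}
probes = [("alice", 0.15), ("bob", 0.48), ("carol", 0.2)]
cmc = cmc_curve(gallery, probes, lambda g, p: -abs(g - p))
# rank-1 rate is 2/3; carol's probe ranks its true identity third,
# so the rank-3 rate is 1.0
```

Note that the rank-1 point of the CMC curve is exactly the Recognition Rate described above.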
You can use a validation test. This kind of test is used in medicine, but it can also be applied to any research; many scientific papers use this procedure, so it is absolutely acceptable. In your case, you identify faces, not diseases. I wrote a research paper in Spanish about this topic, with an example of how it can be done. If you need further help, contact me!
There are several metrics you can adopt to evaluate the performance of a face identification system. Some of them are: Genuine Acceptance Rate (GAR), False Acceptance Rate (FAR), False Rejection Rate (FRR), ...
However, you have to pay attention to which kind of biometric system you are testing. Some measures are better suited to a verification system (e.g. the Equal Error Rate), while others are usually adopted for recognition systems (e.g. the Recognition Rate). Moreover, you also have to distinguish between the closed-set and the open-set context. In the first case, all the identities submitted to the system are enrolled in it, while in the second case there may be persons who are not enrolled. The following paper may help you get a clearer idea of some of the most common situations:
Introduction to face recognition and evaluation of algorithm performance
G.H. Givens, J.R. Beveridge, P.J. Phillips, B. Draper, Y.M. Lui, D. Bolme
Received 12 July 2012, Revised 27 February 2013, Accepted 31 May 2013, Available online 6 June 2013
The recognition rate is how many images are correctly matched against the training images, while false acceptances are how many images (from outside the dataset) are wrongly matched to dataset images. True rejections are how many images from outside the dataset are correctly not matched.
In biometrics evaluation, one could use measures like the equal error rate, half total error rate or weighted error rate. Let's assume you have a biometric face evaluation system that assigns every authentication attempt a score in the closed interval [0, 1], where 0 means no match at all and 1 means a full match. If the threshold is set to 0, then all users, including the genuine users (positives) and the impostors (negatives), are authenticated. If the threshold is set to 1, there is a high risk that no one is authenticated. Therefore, in real systems the threshold is kept somewhere between 0 and 1. Such a threshold may sometimes fail to authenticate genuine users, which is measured by the FRR (False Reject Rate), but it may also authenticate impostors, which is measured by the FAR (False Accept Rate).
Note that the FRR is an increasing function of the decision threshold, whereas the FAR is a decreasing function of it. By plotting FRR versus FAR, one obtains a receiver operating characteristic (ROC) curve. From this curve one can derive several point-estimate criteria that are useful for finding the operational decision threshold, namely the equal error rate (EER), the weighted error rate (WER) and its special case, the half-total error rate (HTER).
The EER is the point where the two error-rate curves cross each other; a weighted average of FAR and FRR gives the WER, and their plain average gives the HTER.
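As a small worked sketch (all scores are invented), FAR, FRR and the EER point can be estimated from genuine and impostor score lists like this; in practice the curves are discrete, so the threshold where |FAR - FRR| is smallest is taken, and the HTER there approximates the EER:

```python
def far_frr(genuine, impostor, threshold):
    """Accept an attempt when score >= threshold."""
    far = sum(s >= threshold for s in impostor) / len(impostor)  # impostors accepted
    frr = sum(s < threshold for s in genuine) / len(genuine)     # genuine users rejected
    return far, frr

def eer_point(genuine, impostor):
    """Threshold minimizing |FAR - FRR|, plus the HTER at that threshold."""
    candidates = sorted(set(genuine) | set(impostor))
    best = min(candidates,
               key=lambda t: abs(far_frr(genuine, impostor, t)[0]
                                 - far_frr(genuine, impostor, t)[1]))
    far, frr = far_frr(genuine, impostor, best)
    return best, (far + frr) / 2

genuine = [0.9, 0.8, 0.7, 0.6, 0.4]    # invented genuine-user scores
impostor = [0.5, 0.45, 0.3, 0.2, 0.1]  # invented impostor scores
t, hter = eer_point(genuine, impostor)
# at threshold 0.5, FAR = FRR = 0.2, so the HTER (and EER estimate) is 0.2
```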
The first thing to do is define whether your system is for verification or identification. Then you need to run a series of tests to measure the performance of your system. This is done with the confusion matrix: it is just counting successes and failures. The wiki article on the confusion matrix is very understandable.