Do you mean the technical means for recording a speaker in background noise, or voice activity detection in background noise?
In any case, if you know a priori the characteristics of the ambient noise, and assume it to be stationary, you can try subtracting the sampled noise signal from the total recorded signal (since the noise is statistically constant over time). The main drawback is that this also degrades the speech quality in the overlapping frequency bands, as it does not discriminate noise from speech.
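For concreteness, here is a minimal sketch of that idea as frame-by-frame spectral subtraction. All names (`recording`, `noise_sample`, `spectral_subtract`) and parameter choices are illustrative, and it assumes float mono arrays at the same sample rate, with a noise-only sample at least one frame long:

```python
import numpy as np

def spectral_subtract(recording, noise_sample, frame_len=512, hop=256):
    """Subtract an average noise magnitude spectrum, frame by frame."""
    window = np.hanning(frame_len)

    # Estimate the stationary noise magnitude spectrum from the noise-only sample.
    noise_frames = [noise_sample[i:i + frame_len] * window
                    for i in range(0, len(noise_sample) - frame_len, hop)]
    noise_mag = np.mean([np.abs(np.fft.rfft(f)) for f in noise_frames], axis=0)

    out = np.zeros(len(recording))
    for i in range(0, len(recording) - frame_len, hop):
        frame = recording[i:i + frame_len] * window
        spectrum = np.fft.rfft(frame)
        mag = np.abs(spectrum)
        # Subtract the noise estimate; floor at zero to avoid negative magnitudes.
        clean_mag = np.maximum(mag - noise_mag, 0.0)
        # Keep the noisy phase, as is standard in basic spectral subtraction.
        clean = clean_mag * np.exp(1j * np.angle(spectrum))
        out[i:i + frame_len] += np.fft.irfft(clean, frame_len)
    return out
```

Note how the drawback mentioned above shows up directly in the code: the subtraction is applied to every frame, so speech energy sitting in the same bins as the noise estimate is attenuated too.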
Alternative and more sophisticated solutions require a specific microphone design. The most common microphone for rejecting diffuse noise is the cardioid microphone, which is basically a directional microphone whose sensitivity is highest when the speaker talks directly into the front of the microphone. The trade-off is a strong boost of low frequencies when the speaker gets closer to the microphone (the proximity effect), which limits the recording performance in terms of intelligibility.
Another technique employs an array of microphones, so as to increase the angular selectivity of the recording and to filter out contributions arriving from other directions (for which a significant amount of diffuse energy can be rejected). An even more sophisticated solution uses a 2D arrangement of linear microphone sub-arrays: one linear array to capture the speaker's voice (endfire configuration), and another to capture the diffuse noise, which is afterwards subtracted from the total array output.
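To illustrate the basic array idea, here is a minimal delay-and-sum beamformer sketch. It assumes a uniform linear array whose channel signals are the rows of `mics` (shape: n_mics x n_samples), a look direction `theta` in radians measured from broadside, microphone spacing `d` in metres, sample rate `fs`, and speed of sound c = 343 m/s; all names are illustrative:

```python
import numpy as np

def delay_and_sum(mics, theta, d, fs, c=343.0):
    n_mics, n_samples = mics.shape
    freqs = np.fft.rfftfreq(n_samples, 1.0 / fs)
    out_spectrum = np.zeros(len(freqs), dtype=complex)
    for m in range(n_mics):
        # Geometric delay of microphone m relative to the array origin
        # for a plane wave arriving from direction theta.
        tau = m * d * np.sin(theta) / c
        # Compensate the delay as a phase shift in the frequency domain.
        out_spectrum += np.fft.rfft(mics[m]) * np.exp(2j * np.pi * freqs * tau)
    # Average the aligned channels; signals from the look direction add
    # coherently, while diffuse noise adds incoherently and is attenuated.
    return np.fft.irfft(out_spectrum / n_mics, n_samples)
```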
I hope this answers (at least partially) your question.
Ron, it is not very clear what you really want to do, so I am answering based on what I understand.
Acoustically, speech is quite different from noise, and that difference can help you detect when the speaker begins to speak in ambient noise. As Herve has pointed out, microphones are transducers, so they capture sound waves without discriminating between noise and speech. If the speech has already been recorded, you can inspect it with an oscilloscope (or any spectral analyzer) that shows you the waveform of the recording. Assuming the recording starts with the ambient noise alone, the display shows the noise as a band of frequencies without any well-defined envelope. When the speaker begins to speak, the spectral structure of the signal changes: you will then find well-defined envelopes representing vowel sounds and voiced consonants that are clearly distinguishable from the ambient noise. So oscillograms should be right for the purpose.
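The same envelope idea can be automated as a simple energy-based onset detector. This is a minimal sketch assuming the recording opens with noise only, which is used to calibrate the noise floor; the names, the 20 ms frames, and the 3x threshold are illustrative choices, not canonical values:

```python
import numpy as np

def find_speech_onset(signal, fs, frame_ms=20, noise_ms=500, factor=3.0):
    """Return the onset time in seconds, or None if no onset is found."""
    frame_len = int(fs * frame_ms / 1000)
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, frame_len)]
    # Short-time energy per frame (assumes a float signal array).
    energies = np.array([np.mean(f ** 2) for f in frames])

    # Calibrate the noise floor on the assumed noise-only opening segment.
    n_noise_frames = max(1, int(noise_ms / frame_ms))
    noise_floor = np.mean(energies[:n_noise_frames])

    # Speech onset: first frame whose energy rises well above the floor.
    above = np.where(energies > factor * noise_floor)[0]
    if len(above) == 0:
        return None
    return above[0] * frame_len / fs
```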
However, I think you will find it difficult to remove the background noise from your recording, because the speech signals are "mixed up" with the noise. But because speech sounds are very resistant to distortion (which I cannot go into here), they retain their distinctive traits even in noise. You might try filtering: if you know the frequency range of the ambient noise, you can apply a band-pass filter and remove all the noise that lies outside the frequency band of your recorded speech. If the ambient noise falls within the frequency range of the speech signal, you will struggle, because you cannot remove the noise without also removing the speech.
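As one possible realization of that band-pass suggestion, here is a short sketch using scipy.signal. The 300-3400 Hz telephone band is just one conventional choice that preserves intelligible speech; it only helps if the noise energy lies mostly outside that band:

```python
from scipy.signal import butter, sosfiltfilt

def bandpass_speech(signal, fs, low_hz=300.0, high_hz=3400.0, order=4):
    # Design a Butterworth band-pass as second-order sections for stability.
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    # Zero-phase filtering, so the speech envelope is not shifted in time.
    return sosfiltfilt(sos, signal)
```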
When you say that the ambient noise is known in advance to the algorithm, I tend to think that you are superposing recordings. The ambient noise may be recorded on one track and the speech on a second track. In that case you can keep them separate, but if you mix the two tracks into a monaural recording, you will have the same problem described above when trying to separate them.
Thanks Herve. Your comments help. I was mostly thinking about an acoustic signal processing approach like your first paragraph, but your suggestions about microphones are also helpful.
For now, more elaboration on the first approach would help. I only want to know how to detect the start of speech; there is no need to actually remove the ambient noise from the speech (that would not hurt, but it is not my main goal). Does that simplify the problem? I was thinking of subtracting some model of the noise. Is this better done in the frequency domain? What are the computational difficulties?
In another case, the background sound is more predictable: for example, it might be music played by the same system, so the system knows the exact signal as originally produced; it just doesn't know exactly what it sounds like at the microphone after volume changes, echoes, and any other effects introduced by the loudspeaker, the environment, or the microphone. Does that make the problem easier?
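(For what it's worth, that known-reference case is essentially what acoustic echo cancellers handle: an adaptive filter can learn the loudspeaker-room-microphone path from the reference and subtract the predicted music from the microphone signal. A minimal NLMS sketch, with all names, the filter length, and the step size as illustrative choices:

```python
import numpy as np

def nlms_cancel(mic, ref, n_taps=256, mu=0.5, eps=1e-8):
    """mic: microphone signal (speech + filtered music); ref: music as played out."""
    w = np.zeros(n_taps)      # adaptive estimate of the echo path
    buf = np.zeros(n_taps)    # most recent reference samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = ref[n]
        echo_estimate = np.dot(w, buf)
        e = mic[n] - echo_estimate            # residual: ideally speech only
        # Normalised LMS update, scaled by the reference power in the buffer.
        w += mu * e * buf / (np.dot(buf, buf) + eps)
        out[n] = e
    return out
```

In practice the adaptation is usually frozen while the speaker talks (double-talk detection), but even this basic form suggests why knowing the reference signal makes the problem considerably more tractable than the unknown-noise case.)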