"Deep" is more of a marketing term, used to make something sound more professional than it otherwise would. A CNN is one type of deep neural network, and there are many other types. CNNs are popular because they are very useful for image recognition.
Artificial Intelligence has been witnessing monumental growth in bridging the gap between the capabilities of humans and machines. Researchers and enthusiasts alike work on numerous aspects of the field to make amazing things happen. One of many such areas is the domain of Computer Vision.
I agree with Qamar's comment. "Deep" is just a fancy name to distinguish this group of NNs, and CNNs (or ConvNets) are members of the multilayer perceptron family. The depth of a CNN can vary and depends on the network architecture. The link below provides a great description of CNN models.
I found a very thorough review paper on Convolutional Neural Networks at: 1901.06032.pdf (arxiv.org)
Even in this article, "CNN" and "deep CNN" are used interchangeably. The original 1989 model by LeCun should answer whether he used the "deep" concept there or not.
I think "deep" is a rather loose term. It may denote that the network is deeper because a deep CNN uses a large number of layers (stacks/stages), compared with only 4-5 layers in a plain CNN. After all, the word "deep" still cannot be justified all that much.
In principle, the CNN is a fundamental method of deep learning, used particularly for image classification, object detection, and many other tasks. However, the term "deep" for a CNN purely indicates the size of the network in terms of the number of layers and so on.
Convolutional Neural Networks (CNNs) are one of the most popular neural network architectures. They are extremely successful at image processing, and also at many other tasks (such as speech recognition, natural language processing, and more). State-of-the-art CNNs are pretty deep (dozens of layers at least), so they are part of Deep Learning. But you can build a shallow CNN for a simple task, in which case it's not (really) Deep Learning.
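To make the shallow-versus-deep distinction concrete, here is a minimal sketch that counts the layers of two networks. Both layer lists and the `depth` helper are hypothetical and made up for illustration, and counting only layers with learned weights is one common convention for stating depth, not a formal definition.

```python
# Layer types that carry learned weights; pooling and softmax do not
WEIGHT_LAYERS = {"conv", "dense"}


def depth(layers):
    """Count the layers with learned weights, one common way to state depth."""
    return sum(1 for kind in layers if kind in WEIGHT_LAYERS)


# Hypothetical shallow CNN for a simple task
shallow_cnn = ["conv", "pool", "dense", "softmax"]

# Hypothetical deeper CNN: ten conv-conv-pool blocks, then a small classifier
deep_cnn = ["conv", "conv", "pool"] * 10 + ["dense", "dense", "softmax"]

print(depth(shallow_cnn))  # 2
print(depth(deep_cnn))     # 22
```

Both are CNNs in exactly the same sense; only the number of stacked layers differs, and there is no agreed cut-off at which a network becomes "deep".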
A convolutional network (CNN) is a type of neural network that uses masks (filters) to extract features automatically. Features are not extracted by a single mask, though: there may be multiple low-level and mid-level features, which means multiple masks have to be applied. Consequently, you have to take the basic convolutional layer, repeat it many times, and stack the copies into layers before you can successfully extract features. This makes a deep CNN, or in other words a lot of convolutional layers working together in a layered fashion. A CNN usually also has other layers before and after the convolutional ones, for example max-pooling, dense layers, softmax, etc.
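As a rough illustration of masks being applied and then stacked, here is a small pure-Python sketch. The 6x6 image, the two hand-picked masks, and the helper names `conv2d` and `relu` are assumptions for this example only. A real CNN learns its masks during training and uses many of them per layer.

```python
def conv2d(image, mask):
    """Slide a 2D mask over a 2D image (no padding, stride 1)."""
    kh, kw = len(mask), len(mask[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * mask[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Keep positive responses and zero out the rest."""
    return [[max(0, v) for v in row] for row in feature_map]


# Hypothetical 6x6 image: dark left half, bright right half
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]

# Hand-picked masks for illustration; a real CNN learns them
vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
smooth = [[1, 1], [1, 1]]

layer1 = relu(conv2d(image, vertical_edge))  # low-level feature: the edge
layer2 = relu(conv2d(layer1, smooth))        # computed from layer 1's output

print(layer1[0])  # [0, 3, 3, 0]
print(layer2[0])  # [6, 12, 6]
```

The second layer never sees the raw pixels, only the feature map produced by the first. Repeating this step many times is what stacking into a deep CNN amounts to.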
You can pose similar questions about other models, for example: what is the difference between a recurrent neural network (RNN) and a deep RNN? Similarly, an RNN is a chain of RNN cells that have feedback and can store state. If you stack multiple layers of RNNs, that makes it deep.
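The same idea can be sketched for RNNs. The code below is a toy example with scalar states and hand-picked weights. The names `rnn_layer` and `deep_rnn` and all the numbers are hypothetical, and real RNNs use weight matrices learned from data.

```python
import math


def rnn_layer(inputs, w_x, w_h):
    """Run one scalar RNN layer over a sequence and return every hidden state."""
    h = 0.0  # initial state
    states = []
    for x in inputs:
        # New state depends on the current input and the previous state (feedback)
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


def deep_rnn(inputs, layer_weights):
    """Stack RNN layers: each layer reads the state sequence of the layer below."""
    sequence = inputs
    for w_x, w_h in layer_weights:
        sequence = rnn_layer(sequence, w_x, w_h)
    return sequence


sequence = [1.0, 0.0, 0.0]

# One layer: a plain RNN
single = deep_rnn(sequence, [(0.5, 0.8)])

# Three stacked layers: a (small) deep RNN
stacked = deep_rnn(sequence, [(0.5, 0.8), (1.0, 0.3), (1.0, 0.3)])
```

In `single`, the states stay non-zero after the input has dropped to zero, which is the stored state at work. `stacked` runs the same kind of cell three times, one layer on top of another.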