After reading this post: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ I wondered about generating images from a formal description. I have only a basic understanding of deep learning, but it seems that an RNN trained on sequential descriptions of images could output new images.

SVG provides a format that is simpler than an actual programming language, with a smaller alphabet. It seems that one could build training data from a collection of SVG files.
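To make the idea concrete, here is a minimal sketch (hypothetical, with made-up sample strings) of the preprocessing step: treating SVG markup as plain character sequences and encoding them as integers, which is exactly the input a char-level RNN like the one in Karpathy's post consumes.

```python
# Hypothetical sketch: turn SVG markup into the integer character
# sequences a char-level RNN would train on.
svg_samples = [
    '<svg><circle cx="10" cy="10" r="5"/></svg>',
    '<svg><rect x="0" y="0" width="20" height="8"/></svg>',
]

# Build the character vocabulary over the corpus; since SVG uses a
# small alphabet, the model's output layer stays small too.
chars = sorted(set("".join(svg_samples)))
stoi = {c: i for i, c in enumerate(chars)}  # char -> id
itos = {i: c for c, i in stoi.items()}      # id -> char

def encode(s):
    """Map an SVG string to the integer sequence the RNN consumes."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map sampled integer ids back to SVG text."""
    return "".join(itos[i] for i in ids)

# Round-trip check: encoding then decoding recovers the original markup.
example = svg_samples[0]
assert decode(encode(example)) == example
print("vocabulary size:", len(chars))
```

At sampling time, the trained model would emit one character id at a time, and `decode` would turn the sequence back into (hopefully well-formed) SVG that a renderer can display.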

I am genuinely curious whether this is reasonable, or whether there is a strong reason why it wouldn't work.
