Keyword identification problem?

Huy Ngoc Pham @Huy_Pham14

10 October 2018

Hi guys,

I am working on an NLP problem: identifying the keyword in a sentence.

Eg:

Input: "I love playing PUBG - amazing game".

Output: "PUBG"

(This is just an example; the real data is not in English.)

I built a vocabulary (bag of words) over the whole input and encoded the data. The input is the vector of each word's index in the vocabulary, and the output is a one-hot vector that indicates the position of the keyword in the sentence.

E.g., for the data pair above:

Input: [121, 148, 224, 240, 88, 101]

Output: [0, 0, 0, 1, 0, 0]
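The encoding described above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `build_vocab` and `encode` are hypothetical, the vocabulary is a toy one built from a single sentence, and the real indices (121, 148, ...) come from the full word list, so the numbers here differ.

```python
def build_vocab(sentences):
    """Map each distinct token to a 1-based index (0 is left free for padding)."""
    tokens = sorted({tok for s in sentences for tok in s.split()})
    return {tok: i for i, tok in enumerate(tokens, start=1)}


def encode(sentence, keyword, vocab):
    """Return (word indices, one-hot keyword position) for one sentence."""
    tokens = sentence.split()
    indices = [vocab[tok] for tok in tokens]
    target = [1 if tok == keyword else 0 for tok in tokens]
    if sum(target) != 1:
        raise ValueError("keyword must occur exactly once in the sentence")
    return indices, target


# Toy data: the dash from the original example is dropped as punctuation.
sentences = ["I love playing PUBG amazing game"]
vocab = build_vocab(sentences)
indices, target = encode(sentences[0], "PUBG", vocab)
print(indices)  # [1, 5, 6, 2, 3, 4]
print(target)   # [0, 0, 0, 1, 0, 0]
```

Index 0 is kept free on the assumption that sequences are later zero-padded to a fixed length, as the names `input_train_pad` and `maxlen` in the model code suggest.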

My data has about 10,000 records. I tested some simple recurrent neural network models on this data; the one below is the best.

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense

model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(64, return_sequences=True))
model.add(LSTM(64))
model.add(Dropout(0.5))
model.add(Dense(maxlen, activation='relu'))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])
model.summary()
res = model.fit(input_train_pad, y_train_pad, epochs=10, batch_size=128, validation_split=0.2)

This is the output of the above code:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_7 (Embedding)      (None, None, 32)          192000
_________________________________________________________________
lstm_13 (LSTM)               (None, None, 64)          24832
_________________________________________________________________
lstm_14 (LSTM)               (None, 64)                33024
_________________________________________________________________
dropout_7 (Dropout)          (None, 64)                0
_________________________________________________________________
dense_7 (Dense)              (None, 300)               19500
=================================================================
Total params: 269,356
Trainable params: 269,356
Non-trainable params: 0

Train on 7719 samples, validate on 1930 samples
Epoch 1/10
7719/7719 [==============================] - 47s 6ms/step - loss: 3.6325 - acc: 0.1965 - val_loss: 3.1061 - val_acc: 0.3187
Epoch 2/10
7719/7719 [==============================] - 45s 6ms/step - loss: 3.0287 - acc: 0.2731 - val_loss: 3.0797 - val_acc: 0.3187
Epoch 3/10
7719/7719 [==============================] - 49s 6ms/step - loss: 3.0226 - acc: 0.3311 - val_loss: 2.9471 - val_acc: 0.3187
Epoch 4/10
7719/7719 [==============================] - 48s 6ms/step - loss: 2.9734 - acc: 0.3342 - val_loss: 3.0742 - val_acc: 0.3187
Epoch 5/10
7719/7719 [==============================] - 49s 6ms/step - loss: 2.9737 - acc: 0.3342 - val_loss: 2.9441 - val_acc: 0.3187
Epoch 6/10
7719/7719 [==============================] - 56s 7ms/step - loss: 2.9568 - acc: 0.3342 - val_loss: 2.9393 - val_acc: 0.3187
Epoch 7/10
7719/7719 [==============================] - 57s 7ms/step - loss: 2.9641 - acc: 0.3342 - val_loss: 2.9424 - val_acc: 0.3187
Epoch 8/10
7719/7719 [==============================] - 54s 7ms/step - loss: 2.9629 - acc: 0.3342 - val_loss: 2.9524 - val_acc: 0.3187
Epoch 9/10
7719/7719 [==============================] - 55s 7ms/step - loss: 2.9641 - acc: 0.3342 - val_loss: 2.9429 - val_acc: 0.3187
Epoch 10/10
7719/7719 [==============================] - 54s 7ms/step - loss: 2.9554 - acc: 0.3342 - val_loss: 2.9375 - val_acc: 0.3187
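In this log the validation accuracy is exactly 0.3187 in every epoch, which is the pattern one would expect if the model predicted the same position for every sentence. One way to check this, sketched here under the assumption that each target is a one-hot list as described earlier, is to compute the accuracy of a trivial baseline that always predicts the most frequent keyword position. The function name `majority_position_accuracy` and the toy data are hypothetical.

```python
from collections import Counter


def majority_position_accuracy(train_targets, val_targets):
    """Accuracy on val of always predicting the most common keyword position in train."""
    train_pos = [t.index(1) for t in train_targets]
    val_pos = [t.index(1) for t in val_targets]
    most_common_pos, _ = Counter(train_pos).most_common(1)[0]
    hits = sum(1 for p in val_pos if p == most_common_pos)
    return most_common_pos, hits / len(val_pos)


# Toy one-hot targets standing in for y_train_pad and its validation split.
train = [[0, 1, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1]]
val = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0]]
pos, acc = majority_position_accuracy(train, val)
print(pos, acc)  # 1 0.5
```

If the baseline on the real data comes out close to 0.3187, the network has probably learned nothing beyond the most common position; if it does not, the plateau has some other cause.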

The performance is not very good: the validation accuracy stays at 0.3187 in every epoch.

Could you please give me some advice?

Should I change my approach?

Thank you so much!
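One detail in the model above that may be worth checking: the output layer uses activation='relu', while categorical_crossentropy is normally paired with a softmax output, because the loss treats the output as a probability distribution over positions. The pure-Python sketch below only illustrates the difference on made-up scores; it is not the Keras implementation, and the score values are arbitrary.

```python
import math


def relu(xs):
    """Clamp negative scores to zero; the result is not a probability distribution."""
    return [max(0.0, x) for x in xs]


def softmax(xs):
    """Turn arbitrary scores into positive values that sum to 1."""
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(target, probs, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))


scores = [-1.2, 0.3, -0.5, 2.0, -3.0, 0.1]  # arbitrary raw outputs for six positions
target = [0, 0, 0, 1, 0, 0]

print(sum(relu(scores)))     # about 2.4, so not a distribution
print(sum(softmax(scores)))  # 1.0 up to rounding
print(relu([-1.0, -2.0]))    # [0.0, 0.0]: every position can end up at zero
print(categorical_crossentropy(target, softmax(scores)))
```

In Keras this would correspond to Dense(maxlen, activation='softmax') as the last layer. Whether that change alone removes the plateau cannot be told from the log.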

Huy Ngoc Pham

Dear Sir,

Thank you for your comment. Actually, there is a misunderstanding here.

Let me explain more clearly.

First of all, I have an ordered vocabulary list:

E.g.: 1 - apple, 2 - bee, ..., 5000 - zebra (1, 2, ..., 5000 are indices into the list, not occurrence counts).

The example input ([121, 148, 224, 240, 88, 101]) is the list of the indices of the words in the sentence: 121 - I, 148 - love, ...

Regards,

Huy

Eugene Veniaminovich Lutsenko

Your task belongs to intelligent text analysis. In essence, you need to select the most characteristic words for different texts. To do this, you need to build a model in which each word is characterized by the amount of information it carries about belonging to each of the texts. I have solved such problems and have a number of articles on the subject; some of them have detailed abstracts in English. Here are the links to these articles:

http://ej.kubagro.ru/2003/02/pdf/13.pdf

http://ej.kubagro.ru/2004/03/pdf/03.pdf

http://ej.kubagro.ru/2014/06/pdf/07.pdf

http://ej.kubagro.ru/2014/09/pdf/32.pdf

http://ej.kubagro.ru/2017/01/pdf/01.pdf

And this is my website: http://lc.kubagro.ru/
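The idea described above, scoring each word by how much information it carries about a text belonging to a class, can be illustrated with pointwise mutual information, log2(P(class | word) / P(class)). This is a standard measure used here as a stand-in; it is an assumption and not necessarily the formula used in the Eidos system or in the linked articles. The function name and the toy texts are hypothetical.

```python
import math
from collections import Counter


def word_class_information(docs):
    """docs: list of (class_label, text) pairs.

    Returns {(word, label): log2(P(label | word) / P(label))},
    with probabilities estimated from token counts.
    """
    class_tokens = Counter()  # tokens seen per class
    word_tokens = Counter()   # occurrences of each word overall
    pair_tokens = Counter()   # occurrences of each word within each class
    for label, text in docs:
        for tok in text.lower().split():
            class_tokens[label] += 1
            word_tokens[tok] += 1
            pair_tokens[(tok, label)] += 1
    total = sum(class_tokens.values())
    info = {}
    for (tok, label), n in pair_tokens.items():
        p_label_given_word = n / word_tokens[tok]
        p_label = class_tokens[label] / total
        info[(tok, label)] = math.log2(p_label_given_word / p_label)
    return info


docs = [
    ("games", "i love playing pubg"),
    ("games", "pubg is fun"),
    ("food", "i love eating pizza"),
    ("food", "pizza is fun"),
]
info = word_class_information(docs)
print(info[("pubg", "games")])  # 1.0: "pubg" occurs only in the games texts
print(info[("love", "games")])  # 0.0: "love" is equally common in both classes
```

Words with a high score for a class are candidates for its characteristic words, while words that score near zero carry little information about the class.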

There are many artificial intelligence systems. The universal cognitive analytical system "Eidos-x++" differs from them in the following respects:

- it was developed in a general formulation, independent of any subject area, so it is universal and can be applied in many subject areas (http://lc.kubagro.ru/aidos/index.htm);

- it is fully open and free to access (http://lc.kubagro.ru/aidos/_Aidos-X.htm), together with its source code (http://lc.kubagro.ru/__AIDOS-X.txt);

- it is one of the first domestic personal-level artificial intelligence systems, i.e. it does not require the user to have special training in artificial intelligence technologies (there is a 1987 act of implementation for the "Eidos" system) (http://lc.kubagro.ru/aidos/aidos02/PR-4.htm);

- it reliably detects, in a comparable form, the strength and direction of cause-and-effect dependencies in incomplete, noisy, interdependent (nonlinear) data of very high dimension, both numerical and non-numerical, measured in different types of scales (nominal, ordinal and numerical) and in different units of measurement;

- it contains a large number of local (supplied with the installation) and cloud-based educational and scientific applications (currently about 30 and 128, respectively) (http://lc.kubagro.ru/aidos/Presentation_Aidos-online.pdf);

- it provides a multi-language interface in 44 languages. The language databases are included in the installation and can be extended automatically;

- it supports an online learning environment and is widely used all over the world (http://aidos.byethost5.com/map3.php).

If you like, we can work through an example on your texts and see what happens. The language the texts are written in does not matter, but preferably it should not use Arabic script or hieroglyphs, because the Eidos system does not yet display them.

Huy Ngoc Pham

Dear Mr. Eugene Lutsenko,

I really appreciate your suggestion. However, the data is private and in-house, so we cannot share it with anyone else.

Also, I imagine that your system needs a lot of information, while our data set is moderate (about 10,000 short sentences), which may not suit your intelligent system.

Thank you so much!

Regards,

Huy

Eugene Veniaminovich Lutsenko

I don't need your data myself; any data is suitable as an example. If you want, I can show you, and you can try it yourself. It should work. The Eidos system even includes a laboratory work with similar tasks: laboratory work 3.03 in mode 1.3. The data can be of any size, from small to big data; the system works with all of them. Modules with parallel computation are now being finalized; they speed up problem solving by up to 3000 times and will soon be publicly available.
