Keyword identification problem?

Huy Ngoc Pham @Huy-Pham-6

09 October 2018

Hi guys,

I am working on an NLP problem: identifying the keyword in a sentence.

E.g.:

Input: "I love playing PUBG - amazing game".

Output: "PUBG"

(This is just an example; the real data is not in English.)

I built a vocabulary (bag of words) from the whole input and encoded the data. The input is the vector of word indices in the vocabulary, and the output is a one-hot vector that indicates the position of the keyword in the sentence.

E.g., for the data pair above:

Input: [121, 148, 224, 240, 88, 101]

Output: [0, 0, 0, 1, 0, 0]
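The encoding described above can be sketched in a few lines of Python. This is only an illustration: the helpers `build_vocab`, `encode` and `keyword_one_hot` are hypothetical names, whitespace tokenization is an assumption, and the toy vocabulary will not reproduce the indices 121, 148, ... from the real data.

```python
def build_vocab(sentences):
    """Map each distinct token to an integer index, starting at 1.

    Index 0 is left unused so it can serve as a padding value (assumption).
    """
    vocab = {}
    for sentence in sentences:
        for token in sentence.split():
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def encode(sentence, vocab):
    """Replace each token by its vocabulary index."""
    return [vocab[token] for token in sentence.split()]


def keyword_one_hot(sentence, keyword):
    """One-hot vector marking the first position of the keyword."""
    tokens = sentence.split()
    position = tokens.index(keyword)  # raises ValueError if the keyword is absent
    return [1 if i == position else 0 for i in range(len(tokens))]


# Simplified version of the example sentence (punctuation removed).
sentence = "I love playing PUBG amazing game"
vocab = build_vocab([sentence])
encoded = encode(sentence, vocab)
target = keyword_one_hot(sentence, "PUBG")

print(encoded)  # [1, 2, 3, 4, 5, 6]
print(target)   # [0, 0, 0, 1, 0, 0]
```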

My data has about 10,000 records. I tested some simple recurrent neural network models on this data; the one below is the best.

model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(64, return_sequences=True))
model.add(LSTM(64))
model.add(Dropout(0.5))
model.add(Dense(maxlen, activation='relu'))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])
model.summary()
res = model.fit(input_train_pad, y_train_pad, epochs=10, batch_size=128, validation_split=0.2)

This is the output of the above code:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_7 (Embedding)      (None, None, 32)          192000
_________________________________________________________________
lstm_13 (LSTM)               (None, None, 64)          24832
_________________________________________________________________
lstm_14 (LSTM)               (None, 64)                33024
_________________________________________________________________
dropout_7 (Dropout)          (None, 64)                0
_________________________________________________________________
dense_7 (Dense)              (None, 300)               19500
=================================================================
Total params: 269,356
Trainable params: 269,356
Non-trainable params: 0

Train on 7719 samples, validate on 1930 samples
Epoch 1/10
7719/7719 [==============================] - 47s 6ms/step - loss: 3.6325 - acc: 0.1965 - val_loss: 3.1061 - val_acc: 0.3187
Epoch 2/10
7719/7719 [==============================] - 45s 6ms/step - loss: 3.0287 - acc: 0.2731 - val_loss: 3.0797 - val_acc: 0.3187
Epoch 3/10
7719/7719 [==============================] - 49s 6ms/step - loss: 3.0226 - acc: 0.3311 - val_loss: 2.9471 - val_acc: 0.3187
Epoch 4/10
7719/7719 [==============================] - 48s 6ms/step - loss: 2.9734 - acc: 0.3342 - val_loss: 3.0742 - val_acc: 0.3187
Epoch 5/10
7719/7719 [==============================] - 49s 6ms/step - loss: 2.9737 - acc: 0.3342 - val_loss: 2.9441 - val_acc: 0.3187
Epoch 6/10
7719/7719 [==============================] - 56s 7ms/step - loss: 2.9568 - acc: 0.3342 - val_loss: 2.9393 - val_acc: 0.3187
Epoch 7/10
7719/7719 [==============================] - 57s 7ms/step - loss: 2.9641 - acc: 0.3342 - val_loss: 2.9424 - val_acc: 0.3187
Epoch 8/10
7719/7719 [==============================] - 54s 7ms/step - loss: 2.9629 - acc: 0.3342 - val_loss: 2.9524 - val_acc: 0.3187
Epoch 9/10
7719/7719 [==============================] - 55s 7ms/step - loss: 2.9641 - acc: 0.3342 - val_loss: 2.9429 - val_acc: 0.3187
Epoch 10/10
7719/7719 [==============================] - 54s 7ms/step - loss: 2.9554 - acc: 0.3342 - val_loss: 2.9375 - val_acc: 0.3187
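One detail of the model above that may be worth checking: a categorical cross-entropy loss expects the final layer to output a probability distribution over the positions, which is normally obtained with a softmax activation rather than ReLU. The plain-Python sketch below only illustrates the difference on made-up scores; it is not a claim about what is limiting this particular model, and the helper names are hypothetical.

```python
import math


def relu(scores):
    """Clip negative scores to zero; the result need not sum to 1."""
    return [max(0.0, s) for s in scores]


def softmax(scores):
    """Turn arbitrary scores into a probability distribution."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Negative log of the value assigned to the true position."""
    p = probs[target_index]
    return math.inf if p <= 0.0 else -math.log(p)


scores = [2.0, -1.0, 0.5, -3.0]  # made-up raw scores for 4 positions
target_index = 3                 # assumed true keyword position

relu_out = relu(scores)
softmax_out = softmax(scores)

print(sum(relu_out))                             # 2.5, not a probability distribution
print(cross_entropy(relu_out, target_index))     # inf, the true position received exactly 0
print(round(sum(softmax_out), 6))                # 1.0
print(cross_entropy(softmax_out, target_index))  # finite
```

In Keras terms, this would correspond to using Dense(maxlen, activation='softmax') as the last layer.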

The performance is not very good.
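A useful reference point when judging these accuracy figures is a majority-class baseline: the accuracy obtained by always predicting the most frequent keyword position. The validation accuracy in the log stays at 0.3187 for all ten epochs, which would be consistent with (though it does not prove) the model predicting the same position for every sentence. The sketch below uses made-up positions, and `majority_baseline` is a hypothetical helper.

```python
from collections import Counter


def majority_baseline(positions):
    """Return the most frequent position and the accuracy of always predicting it."""
    counts = Counter(positions)
    position, count = counts.most_common(1)[0]
    return position, count / len(positions)


# Made-up keyword positions for ten sentences (not the real data).
toy_positions = [3, 0, 3, 1, 3, 2, 0, 3, 5, 3]
best_position, accuracy = majority_baseline(toy_positions)

print(best_position, accuracy)  # 3 0.5
```

With the real targets, the positions could be obtained by taking the argmax of each one-hot row before calling the helper.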

Could you please give me some advice?

Should I change my approach?

Thank you so much!

Joachim Pimiskern

It is not obvious what you intend to do or why you are using a neural network. In C you can use strtok(...) to iterate through the components of a sentence very quickly.

If I understand your example correctly, the inputs of the network are integers, and 240 indicates the occurrence of the word PUBG. You can detect 240 without any learning, by testing for a value greater than 239 and less than 241.
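The rule-based check described here amounts to an exact comparison against a known index. A minimal sketch, assuming the keyword's index (240 in the example) is known in advance; `find_index_positions` is a hypothetical helper:

```python
def find_index_positions(encoded_sentence, keyword_index):
    """Return every position whose word index equals the known keyword index."""
    return [i for i, idx in enumerate(encoded_sentence) if idx == keyword_index]


encoded = [121, 148, 224, 240, 88, 101]
print(find_index_positions(encoded, 240))  # [3]
```

This only helps when the keyword is already known; it cannot pick out a keyword that was never listed.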

Regards,

Joachim

Huy Ngoc Pham

Dear Sir,

Thank you for your comment. Actually, there is a misunderstanding here.

Let me explain more clearly.

First of all, I have an ordered vocabulary list:

E.g.: 1 - apple, 2 - bee, ..., 5000 - zebra (1, 2, ..., 5000 are the indices in the list, not occurrence counts).

The example input ([121, 148, 224, 240, 88, 101]) is the list of the indices of the words in the sentence: 121 - I, 148 - love, ...

Regards,

Huy

Eugene Veniaminovich Lutsenko

Your task belongs to the field of intelligent text analysis. In essence, you need to select the most characteristic words for different texts. To do this, you need to build a model in which words are characterized by the amount of information they carry about belonging to each of the texts. I have solved such problems and have written a number of articles on the subject. Some of them have detailed abstracts in English. Here are the links to these articles:

http://ej.kubagro.ru/2003/02/pdf/13.pdf

http://ej.kubagro.ru/2004/03/pdf/03.pdf

http://ej.kubagro.ru/2014/06/pdf/07.pdf

http://ej.kubagro.ru/2014/09/pdf/32.pdf

http://ej.kubagro.ru/2017/01/pdf/01.pdf

And this is my website: http://lc.kubagro.ru/
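The idea of scoring words by how much they say about which text they belong to can be illustrated with TF-IDF, a standard weighting that is used here only as a stand-in; it is not the measure used in the linked articles or in the Eidos system. The toy documents and the helper `tf_idf_scores` are hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(documents):
    """Return one {token: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each token appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            tok: (cnt / len(tokens)) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        })
    return scores


# Made-up English documents, for illustration only.
docs = [
    "i love playing pubg amazing game",
    "i love playing chess",
    "amazing game i love",
]
scores = tf_idf_scores(docs)
top_word = max(scores[0], key=scores[0].get)

print(top_word)  # pubg
```

Words that occur in every document ("i", "love") score zero, while the word unique to the first document scores highest.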

There are many artificial intelligence systems. The universal cognitive analytical system "Eidos-x++" differs from them in the following respects:

- It was developed in a universal setting, independent of any subject area, so it can be applied in many subject areas (http://lc.kubagro.ru/aidos/index.htm);

- It is fully open and freely accessible (http://lc.kubagro.ru/aidos/_Aidos-X.htm), together with the relevant source code (http://lc.kubagro.ru/__AIDOS-X.txt);

- It is one of the first domestic artificial intelligence systems of the personal level, i.e. it does not require special training of the user in artificial intelligence technologies (there is an act of introduction of the "Eidos" system dated 1987) (http://lc.kubagro.ru/aidos/aidos02/PR-4.htm);

- It provides stable detection, in a comparable form, of the strength and direction of cause-and-effect dependencies in incomplete, noisy, interdependent (nonlinear) data of very large dimension, of numerical and non-numerical nature, measured on different types of scales (nominal, ordinal and numerical) and in different units of measurement;

- It contains a large number of local (supplied with the installation) and cloud-based educational and scientific applications (currently about 30 and 128, respectively) (http://lc.kubagro.ru/aidos/Presentation_Aidos-online.pdf);

- It provides multi-language interface support for 44 languages. The language databases are included in the installation and can be extended automatically;

- It supports an online learning environment and is widely used all over the world (http://aidos.byethost5.com/map3.php).

If you want, we can work through an example on your texts and see what happens. The language in which the texts are written does not matter, but preferably it should not be Arabic or a hieroglyphic script, because these are not yet displayed in the Eidos system.

Huy Ngoc Pham

Dear Mr. Eugene Lutsenko,

I greatly appreciate your suggestion, but the data is private and in-house, so we cannot share it with anyone else.

Also, I imagine that your system needs a lot of information, whereas our data is moderate in size (about 10,000 short sentences), which may not suit your intelligent system.

Thank you so much!

Regards,

Huy
