How to go from word similarity to overall sentence similarity?

Pankajeshwara Sharma @Pankajeshwara-Sharma

22 March 2016

I have implemented a sentence similarity method using WS4J.

I have read articles about sentence similarity that is based on the word similarities between two sentences, but I couldn't find a method that computes and returns a single value for the overall sentence similarity based on those word similarities.

A similar question was asked on Stack Overflow under "sentence-similarity-using-ws4j".

As you can see, I have managed to code with WS4J to the point where a match message is returned whenever a word in sentence A finds a synset match in the other sentence (with a matching value above 0.9). But I guess this is not a good approach.

I have found the article by Yuhua Li et al. [1] very useful, but I cannot figure out the method they used for overall sentence similarity.

public static String sentenceSim(String se1, String se2, RelatednessCalculator rc) {
    if (se1 == null || se2 == null) {
        return "null";
    }
    if (nlp == null) {
        nlp = OpenNLPSingleton.INSTANCE;
    }

    String[] words1 = nlp.tokenize(se1); // base
    String[] words2 = nlp.tokenize(se2); // sentence
    String[] postag1 = nlp.postag(words1);
    String[] postag2 = nlp.postag(words2);

    StringBuilder similarityMessage = new StringBuilder();
    boolean[] baseMatched = new boolean[words1.length]; // one flag per base word
    int matchCount = 0; // number of distinct base words that found a match

    for (int j = 0; j < words2.length; j++) { // sentence
        String pt2 = postag2[j];
        String origWord2 = words2[j];
        String w2 = MorphaStemmer.stemToken(origWord2.toLowerCase(), pt2);
        POS p2 = mapPOS(pt2);

        for (int i = 0; i < words1.length; i++) { // base
            String pt1 = postag1[i];
            String origWord1 = words1[i];
            String w1 = MorphaStemmer.stemToken(origWord1.toLowerCase(), pt1);
            POS p1 = mapPOS(pt1);

            // Skip pairs where either POS tag could not be mapped.
            if (p1 == null || p2 == null) {
                continue;
            }

            double r = wordSim(w1, w2, rc);
            if (r > 0.9) {
                // Count each base word once, even if it matches several sentence words.
                if (!baseMatched[i]) {
                    baseMatched[i] = true;
                    matchCount++;
                }
                similarityMessage.append("\t\t Similarity Found (Base : sentence) ('Base Word: ")
                        .append(origWord1).append("=").append(w1).append(" ").append(p1)
                        .append("', Sentence Word: '")
                        .append(origWord2).append("=").append(w2).append(" ").append(p2)
                        .append("') = ").append(r).append("\n");
            }
        }
    }

    // All words from the base have to match; if one does not, it is not a match.
    if (matchCount == words1.length) {
        System.out.println("\t\tBase " + se1);
        System.out.println("\t\tFound all matches for base in sentence: ");
        System.out.println(similarityMessage);
    }

    // Return the collected match messages (empty if no pair scored above 0.9).
    return similarityMessage.toString();
}

I have written my code in Java, so I am looking for Java implementations.

[1]: Li, Y., McLean, D., Bandar, Z. A., O'Shea, J. D., & Crockett, K. (2006). Sentence similarity based on semantic nets and corpus statistics. IEEE Transactions on Knowledge and Data Engineering, 18(8), 1138-1150.

Madhuri Tayal

You can refer to this paper for your question. It describes one possible approach.

"Word net based Method for Determining Semantic Sentence Similarity through various Word Senses"
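As a rough illustration of how pairwise word scores can be collapsed into one number, a common aggregation is to take, for each word, its best score against the other sentence, average those values, and then average the two directions. The sketch below assumes that scheme; it is not taken from the paper above. The class name `SentenceSimilaritySketch` and its methods are hypothetical, and the word similarity function is passed in so that a WordNet-based measure (for example the `wordSim(w1, w2, rc)` helper from the question) can be plugged in.

```java
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper class: aggregates word-level scores into one sentence-level score.
public class SentenceSimilaritySketch {

    /** Mean, over the words of 'from', of each word's best similarity to any word of 'to'. */
    public static double directedScore(String[] from, String[] to,
                                       ToDoubleBiFunction<String, String> wordSim) {
        if (from.length == 0 || to.length == 0) {
            return 0.0; // nothing to compare
        }
        double sum = 0.0;
        for (String a : from) {
            double best = 0.0;
            for (String b : to) {
                best = Math.max(best, wordSim.applyAsDouble(a, b));
            }
            sum += best;
        }
        return sum / from.length;
    }

    /** Symmetric score: the average of the two directed scores. */
    public static double sentenceScore(String[] s1, String[] s2,
                                       ToDoubleBiFunction<String, String> wordSim) {
        return 0.5 * (directedScore(s1, s2, wordSim) + directedScore(s2, s1, wordSim));
    }

    public static void main(String[] args) {
        // Placeholder word similarity (exact match only); swap in a semantic measure.
        ToDoubleBiFunction<String, String> exact =
                (a, b) -> a.equalsIgnoreCase(b) ? 1.0 : 0.0;
        String[] s1 = {"the", "cat", "sleeps"};
        String[] s2 = {"the", "cat", "runs"};
        System.out.println(sentenceScore(s1, s2, exact));
    }
}
```

The result stays in [0, 1] only if the word similarity function does, so a measure with a different range would need normalizing first. Unlike a fixed 0.9 cut-off, this keeps partial matches in the final value.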

Krzysztof Wołk

In Python you may use the difflib library.

Levenshtein distance is quite good, just like perplexity. Please refer to http://www.hindawi.com/journals/tswj/2014/745485/
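Since the question asks for Java, here is a minimal sketch of Levenshtein distance with a normalised similarity on top. The class name `LevenshteinSketch` and the normalisation by the longer string's length are assumptions made for illustration, not something taken from the linked article.

```java
// Hypothetical helper class: character-level edit distance and a [0, 1] similarity.
public class LevenshteinSketch {

    /** Classic dynamic-programming edit distance using two rolling rows. */
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Maps the distance to [0, 1]; 1.0 means the strings are identical. */
    public static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) distance(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
        System.out.println(similarity("kitten", "sitting"));
    }
}
```

Keep in mind that edit distance compares character sequences, so it measures surface overlap rather than meaning: two sentences that say the same thing in different words will still score low.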

Udit Chakraborty

I don't think any such similarity measure exists. You may use Levenshtein distance, but then again you would need to justify the choice.

You may have a look at this:

https://www.researchgate.net/publication/272488572_A_Novel_Semantic_Similarity_Based_Technique_for_Computer_Assisted_Automatic_Evaluation_of_Textual_Answers?ev=prf_pub


Krystian Wojtkiewicz

You might take a look at the following article:

Chapter Estimating Semantics Distance of Texts Based on Used Terms Analysis
