What is the best method for finding document similarity?

Emilija Gjorgjevska

Thank you for the answer. I spent all of yesterday reading and decided to use word2vec because it lets me find words that are semantically similar. What do you think about this? :)

Peter Ludemann

What method of summarization are you using? It might already use some of your other techniques, and consequently bias your results in unexpected ways.

Also, I would do stemming first, not last.

Valery Kugler

Suppose you use word2vec.

Say you have a word X and words A, B, C. If your word2vec model contains all of these words, you can calculate the distance from X to each of them. You then learn that B is the most similar to X, followed by A, and then C. But are there any approaches for finding out whether X is really close in meaning to B, or whether A, B, and C are all semantically far from X? Even when they are all far away, they can always be ordered.
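To make the point concrete, here is a minimal sketch of the difference between a relative ranking and an absolute similarity check. The vectors are invented toy values, not output from a trained word2vec model, and the `THRESHOLD` cut-off is a hypothetical number that would need tuning on real data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors, made up for illustration only.
vectors = {
    "X": [1.0, 0.0, 0.0],
    "A": [0.05, 1.0, 0.1],
    "B": [0.1, 1.0, 0.0],
    "C": [0.0, 0.0, 1.0],
}

def rank_neighbours(query, candidates, vectors):
    """Return (word, similarity) pairs, most similar first."""
    scores = [(w, cosine(vectors[query], vectors[w])) for w in candidates]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranked = rank_neighbours("X", ["A", "B", "C"], vectors)

# Hypothetical absolute cut-off: a ranking always exists, but a word only
# counts as "really close" if its similarity clears this value.
THRESHOLD = 0.5
close = [w for w, score in ranked if score >= THRESHOLD]

print([w for w, _ in ranked])  # relative order
print(close)                   # words that pass the absolute check
```

With these toy values the ranking is B, A, C, yet no candidate clears the threshold, which is exactly the case described: the words can be ordered even though none is close in meaning to X.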

Thank you for your response, Peter Ludemann.

I'm using the Centroid-based Text Summarization through Compositionality of Word Embeddings approach (the dilemma was whether to use TextRank or this one and I chose this, since it looks after diversity and coverage).

When it comes to the second part, I'm aware that some operations can overlap: I've cleaned the text and extracted all the data that I consider relevant. In practice, word2vec performs these procedures too. With this in mind, I was looking into using only specific functions (finding similarity) and NOT repeating what I've done before. I haven't tested it yet, so I don't know whether this is possible :)

Lastly, I planned to use the Porter stemmer, and in that case doing the stemming first can be a real problem. Here's the scenario:

"For example, 'was' turns into 'wa' by porter stemmer and when you stemmed first before removing stopwords 'wa' remains in your vector after filtering stopwords which has 'was' as a stopword."

I'm still researching how to do this in the best way I can.
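The ordering problem in the quoted scenario can be reproduced with a small sketch. The `toy_stem` function below is a hypothetical stand-in that only drops a final "s"; it is not the real Porter algorithm, and the stop-word list and sentence are invented for illustration.

```python
def toy_stem(word):
    """Rough stand-in for a stemmer: drop a final 's' (but keep 'ss')."""
    if len(word) > 2 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

# Invented stop-word list and token sequence.
STOPWORDS = {"was", "the", "a"}
tokens = ["the", "cat", "was", "chasing", "birds"]

# Order 1: stem first, then filter. 'was' becomes 'wa', which is no longer
# in the stop-word list, so it survives the filter.
stem_first = [t for t in (toy_stem(w) for w in tokens) if t not in STOPWORDS]

# Order 2: filter first, then stem. 'was' is removed before it can be mangled.
filter_first = [toy_stem(w) for w in tokens if w not in STOPWORDS]

# Order 3: stem first, but also stem the stop-word list so both sides match.
stemmed_stops = {toy_stem(w) for w in STOPWORDS}
stem_first_fixed = [
    t for t in (toy_stem(w) for w in tokens) if t not in stemmed_stops
]

print(stem_first)        # contains the stray 'wa'
print(filter_first)
print(stem_first_fixed)
```

So stemming first is workable as long as the stop-word list goes through the same stemmer; otherwise removing stop words before stemming avoids the mismatch.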

NOTE: I'm pretty new to NLP, so my approaches might not be the #1 thing to do, but I read and consult with people who do NLP every day to improve.

Since I guess that this thread will be read by future newcomers to NLP, I highly recommend starting with good literature (not blogs, not courses). Why? Because blogs and courses often fail to point out that some cleaning steps (stemming, lemmatization, etc.) are not default rules: each problem is specific and should be approached differently. Also talk with experienced peers who can balance advanced strategies against your level of knowledge. For me, these two approaches are the best starting point.

I switched to a different strategy than this one. Thanks to everyone who contributed to this discussion :)

I recommend Schütze & Manning's books.

Porter stemmer isn't very good; you should be able to find something better.

Stop words are an optimization. TF/IDF will give low values to stop words anyway. (And stop words can create some corner cases, such as throwing away phrases like "to be or not to be".)
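A small sketch of why TF/IDF already down-weights stop words, using a made-up three-document corpus and the plain, unsmoothed formula tf × log(N / df). This is one common variant chosen for illustration; libraries often use a smoothed IDF, in which case very frequent words get low rather than exactly zero weights.

```python
import math
from collections import Counter

# Invented toy corpus, already tokenized.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

def tf_idf(docs):
    """Return one {term: weight} dict per document (unsmoothed TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(docs)

# 'the' occurs in every document, so log(N / df) = log(1) = 0.
print(weights[0]["the"])
# 'mat' occurs in one document only, so it gets the largest weight here.
print(weights[0]["mat"], weights[0]["cat"])
```

Because "the" appears in every document its weight is zero without any stop-word list, while rarer terms such as "mat" keep a positive weight.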

Peter Ludemann, I will check them and give feedback after some time to share what kind of experience I had. Thank you for everything.