What are the pros and cons of Longest Common Subsequence using as similarity function?

Fabrice Clerot Popular answer

both are quadratic, both can be speeded up with similar techniques BUT they do not do the same thing !

the following might be an interesting read :

The Influence of Global Constraints on Similarity Measures for Time-Series Databases

Vladimir Kurbalija et al

Abstract. A time series consists of a series of values or events obtained over repeated measurements in time. Analysis of time series represents and important tool in many application areas, such as stock market analysis, process and quality control, observation of natural phenomena, medical treatments, etc. A vital component in many types of time-series analysis is the choice of an appropriate distance/similarity measure. Numerous measures have been proposed to date, with the most successful ones based on dynamic programming. Being of quadratic time complexity, however, global constraints are often employed to limit the search space in the matrix during the dynamic programming procedure, in order to speed up computation. Furthermore, it has been reported that such constrained measures can also achieve better accuracy. In this paper, we investigate two representative time-series distance/similarity measures based on dynamic programming, Dynamic Time Warping (DTW) and Longest Common Subsequence (LCS), and the effects of global constraints on them. Through extensive experiments on a large number of time-series data sets, we demonstrate how global constrains can significantly reduce the computation time of DTW and LCS. We also show that, if the constraint parameter is tight enough (less than 10–15% of time-series length), the constrained measure becomes significantly different from its unconstrained counterpart, in the sense of producing qualitatively different 1-nearest neighbor graphs. This observation explains the potential for accuracy gains when using constrained measures, highlighting the need for careful tuning of constraint parameters in order to achieve a good trade-off between speed and accuracy.

http://arxiv.org/pdf/1107.0134

Michael Kleen

This problem is usually solved using https://en.wikipedia.org/wiki/Dynamic_programming and always involves some sort of similarity function. Maybe you can explain a little bit more in detail what kind of answer you would expect here ?

Tran Thai

I want to compare 2 methods: Longest Common Subsequence and Dynamic Time Warping in term of Complexity.

Which is better?

Fabrice Clerot

both are quadratic, both can be speeded up with similar techniques BUT they do not do the same thing !

the following might be an interesting read :

The Influence of Global Constraints on Similarity Measures for Time-Series Databases

Vladimir Kurbalija et al

http://arxiv.org/pdf/1107.0134

Tran Thai

I have read that paper!

Thanks for your help.

Tran Thai

I try to find an example of Edit distance with real penalty (ERP) and Edit distance on real sequences (EDR) but I cannot. Both measures are for sequence of number, right?

My data set is symbolic sequences. What is the best measure for this type of data? I think LCSS is better than ERP or EDR. I am right?

my data:

s1: ABCD

s2: CDAFAG

s3: FAGEBCE

s4: ABC

....

I want to know that ERP and EDR can apply for symbolic sequences or not?

Which is better among LCSS, ERP, and EDR?

Thanks all.

Fabrice Clerot

there is no "better" solution by itself : everything will depend on the nature of the data and your objective

to put it very simply (too roughly, maybe) :

- with LCSS, you do not care about the symbols which are not part of the LCSS between two strings; therefore

ABCD and AAAAAAAAAAAAAAAAAAABCD are at the same distance from ZBCD

- with ED, all the symbols are important and

ABCD is much closer to ZBCD than AAAAAAAAAAAAAAAAAAABCD

everything will depend on your data and application : if your strings may contain a lot of "garbage" and the informative part may be hidden in it (i would think of intrusion detection as a natural example ... but this may be due to my lack of knowledge in this area !), LCSS will make sense since it allows to get rid of the "garbage (uninformative) symbols" by concentrating on the common part between strings only ; on the contrary, if all the symbols are informative (think of orthographic correction), ED will make sense.

Bhadrachalam Chitturi

Dear Tran,

Unless you are working with very large sequences the meaning of the alignment takes precedence over the time complexity. In certain biological applications one develops new algorithms because the existing ones do not adequately address the priorities. See Fabrice's answer above.

The best measure depends on the permitted operations. For example, if you are working with genomes (where each symbol is a gene) where transpositions and reversals abound then of course you do not want to use substitution.

Here transpositions and reversals indicate shuffling of the order (that happens) whereas substitution indicates that a gene X suddenly becomes gene Y!!.

Best regards.

A survey on self organizing maps

Can anyone help me answer this question about disease genes prediction?

Can anyone help me about By-Reference in Ansi C?

Weibull distribution for estimating the parameters using maximum likelihood estimator (MLE)?

Can anyone help me with the mean/expectation and standard deviation in probability theory?

Anomaly detection

Methods for detecting abnormal from correlation patterns?

How do I cluster this data?

Feedback defines the constitution of an organism?

How to learn more about SPSS and its Application?

Can I base on reverse DNA sequences to perform alignment, convert to amino acids and GenBank submission?

Baseline drift in HPLC? What causes this?

Text-Communication from the M1 Hand Area using BCI—and then there is Elon Musk?

Handling Missing Data and Building a Predictive Model with Incomplete Information ?

Has anyone applied Python in the field of textile engineering for data analysis, automation, or smart textiles?

How can I use the cif data obtained from rietveld refinement extracted via gsas2, for microstructural analysis using ETEX software?

Self-Organizing Superorganisms—as envisaged by Nenad Sestan (2018)?

How are iso-frequency contours plotted?