New statistical features improve string similarity computation

By PulseAugur Editorial · Summary by gemini-2.5-flash-lite from 1 source

Researchers have proposed and studied two statistical features for computing string similarity: the co-occurrence matrix (COM) and the run-length matrix (RLM). Both are adapted from visual computing, are language-agnostic, and apply to words, phrases, and code alike. In experiments on synthetic datasets and a real text plagiarism dataset, the COM and RLM features outperformed existing state-of-the-art statistical measures, including edit distances and longest common subsequence.
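The summary does not spell out how the paper defines its features, so the following is only a minimal sketch of the general idea, assuming a character-level co-occurrence count at a fixed offset and a count of (character, run length) pairs. The function names `cooccurrence_counts`, `run_length_counts`, and `overlap_similarity`, the `offset` parameter, and the histogram-intersection score are all illustrative assumptions, not the authors' method.

```python
from collections import Counter
from itertools import groupby


def cooccurrence_counts(s, offset=1):
    """Count ordered character pairs (s[i], s[i + offset]).

    Assumption: a sparse stand-in for a co-occurrence matrix,
    keyed by character pair instead of by pixel intensity pair.
    """
    return Counter(zip(s, s[offset:]))


def run_length_counts(s):
    """Count runs of identical characters as (character, run length) pairs.

    Assumption: a sparse stand-in for a run-length matrix.
    """
    return Counter((ch, len(list(group))) for ch, group in groupby(s))


def overlap_similarity(a, b):
    """Histogram-intersection score in [0, 1] between two Counters.

    Assumption: chosen here for simplicity; the paper may compare
    features in a different way.
    """
    total = max(sum(a.values()), sum(b.values()))
    if total == 0:
        return 1.0  # both feature sets empty: treat as identical
    shared = sum((a & b).values())  # element-wise minimum of counts
    return shared / total


if __name__ == "__main__":
    x, y = "banana", "bandana"
    print(overlap_similarity(cooccurrence_counts(x), cooccurrence_counts(y)))
    print(overlap_similarity(run_length_counts(x), run_length_counts(y)))
```

Because both features are built from raw symbol counts, the sketch needs no tokenizer or dictionary, which is consistent with the language-agnostic property described above.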


IMPACT Introduces novel statistical features that could enhance natural language processing tasks requiring string comparison.

RANK_REASON Academic paper proposing new statistical features for string similarity.



COVERAGE [1]

arXiv cs.CL TIER_1 · Panos Liatsis · 2026-05-14 17:27

Proposal and study of statistical features for string similarity computation and classification

Adaptations of features commonly applied in the field of visual computing, co-occurrence matrix (COM) and run-length matrix (RLM), are proposed for the similarity computation of strings in general (words, phrases, codes and texts). The proposed features are not sensitive to langu…

