Evaluating Factors Affecting Sentence Similarity and Paraphrasing Identification Using K-Means Clustering

Abstract:

This research uses the Arabic Paraphrasing Benchmark to identify similar and paraphrased sentences with k-means clustering. The benchmark is constructed from Arabic sentence transformation rules and provides labels for similar/not similar and paraphrased/not paraphrased sentences. K-means clustering is applied to partition the dataset into clusters of similar sentence pairs. Three factors that affect the distribution of similar and paraphrased sentences are tested in several experiments with k-means clustering. Analysis of the resulting clusters shows that paraphrased sentences achieve a recall of 0.81 with pre-trained embeddings and a recall of 0.78 when word weights are introduced, while labeling by majority yields better recall than labeling with a similarity-score threshold.
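
As an illustration of the two cluster-labeling strategies mentioned above, the following is a minimal sketch, not the procedure used in this research. It assumes that cluster assignments have already been produced by k-means, that "labeling by majority" gives each cluster the most frequent gold label among its members, and that "labeling with a threshold" marks a cluster as positive when the mean similarity score of its members reaches a chosen cutoff. The function names, the toy data, and the cutoff of 0.75 are all hypothetical.

```python
from collections import Counter, defaultdict


def label_by_majority(cluster_ids, gold_labels):
    """Assumed rule: a cluster takes the most common gold label of its members."""
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, gold_labels):
        members[cid].append(label)
    return {cid: Counter(labels).most_common(1)[0][0]
            for cid, labels in members.items()}


def label_by_threshold(cluster_ids, scores, threshold):
    """Assumed rule: a cluster is positive (1) if its mean score >= threshold."""
    members = defaultdict(list)
    for cid, score in zip(cluster_ids, scores):
        members[cid].append(score)
    return {cid: int(sum(vals) / len(vals) >= threshold)
            for cid, vals in members.items()}


def recall(cluster_ids, cluster_labels, gold_labels, positive=1):
    """Fraction of gold-positive pairs that fall in a cluster labeled positive."""
    hits = total = 0
    for cid, gold in zip(cluster_ids, gold_labels):
        if gold == positive:
            total += 1
            hits += cluster_labels[cid] == positive
    return hits / total if total else 0.0


# Toy data (hypothetical): six sentence pairs assigned to two clusters.
cluster_ids = [0, 0, 0, 1, 1, 1]
gold_labels = [1, 1, 0, 0, 0, 1]          # 1 = paraphrased, 0 = not paraphrased
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.55]  # per-pair similarity scores

majority_labels = label_by_majority(cluster_ids, gold_labels)
threshold_labels = label_by_threshold(cluster_ids, scores, threshold=0.75)

majority_recall = recall(cluster_ids, majority_labels, gold_labels)
threshold_recall = recall(cluster_ids, threshold_labels, gold_labels)
print(majority_recall, threshold_recall)
```

The toy numbers only show how the two rules can assign different labels to the same clusters; they say nothing about the recall values reported for the benchmark.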