Efficient Similarity Measures for Texts Matching

Adio Akinwale; Adam Niewiadomski

doi:10.34658/jacs.2015.23.2.7-28

Vol. 23 No. 1 (2015), Articles

Published October 31, 2015

Adio Akinwale

Federal University of Agriculture, Department of Computer Science

Adam Niewiadomski

Politechnika Łódzka

Keywords

similarity measures
fuzzy relations
n-gram
word list
set theory
subjective examination

How to Cite

Akinwale, A., & Niewiadomski, A. (2015). Efficient Similarity Measures for Texts Matching. Journal of Applied Computer Science, 23(1), 7-28. https://doi.org/10.34658/jacs.2015.23.2.7-28

Abstract

Calculating similarity measures for exact text matching is a critical task in pattern matching that deserves close attention. Many similarity measures exist in the literature, but no single method is best for measuring the closeness of two strings. The objective of this paper is to explore the grammatical properties and features of the generalized n-gram matching technique for similarity measures in order to find exact text in electronic computer applications. Three new similarity measures are proposed to improve the performance of the generalized n-gram method. The new methods yield high values of similarity and of performance to price, with low running times. Experiments with the new methods demonstrate that they are universal and very useful for finding words that can be derived from a word list as a group and for retrieving relevant medical terms from a database. One of the methods achieved the best correlation of values in the evaluation of subjective examinations.
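
To make the idea of n-gram matching concrete, the sketch below scores two strings by counting how many substrings of the first (of every length) also occur in the second. This is only one plausible reading of a "generalized n-gram" measure, assumed here for illustration; the normalisation by N(N+1)/2, the function name, and the treatment of empty strings are assumptions, and the code is not one of the three measures proposed in the paper.

```python
def generalized_ngram_similarity(s1: str, s2: str) -> float:
    """Share of the substrings of s1 (all lengths) that also occur in s2.

    Assumed formulation: the hit count is divided by N(N+1)/2, where
    N = max(len(s1), len(s2)), so that identical strings score 1.0.
    """
    n = max(len(s1), len(s2))
    if n == 0:
        return 1.0  # assumption: two empty strings count as identical

    hits = 0
    # Enumerate every n-gram of s1, for n = 1 .. len(s1).
    for length in range(1, len(s1) + 1):
        for start in range(len(s1) - length + 1):
            if s1[start:start + length] in s2:
                hits += 1

    # 2 * hits / (N^2 + N) is the same as hits / (N(N+1)/2).
    return 2.0 * hits / (n * n + n)


if __name__ == "__main__":
    print(generalized_ngram_similarity("similarity", "similarly"))
```

Under these assumptions, identical strings score 1.0 and strings with no characters in common score 0.0. The double loop is quadratic in the length of the first string, which is one reason running time matters when comparing measures of this kind.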
