Plagiarism detection is typically used to find similarities between documents, which is far easier than checking them by hand. Similarity between documents is usually measured with a similarity or distance metric such as Euclidean distance or cosine similarity. Here I use cosine similarity because it is generally a better fit for text than Euclidean distance [3]: it compares the direction of the document vectors rather than their magnitude, so differences in document length matter less.
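To illustrate why direction matters more than magnitude for text, here is a minimal sketch in plain Python. The two word-count vectors and the helper names cosine_sim and euclidean_dist are made up for illustration and are not taken from the dataset used later.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_dist(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical word counts: doc_long is doc_short repeated three times
doc_short = [1, 2, 0, 1]
doc_long = [3, 6, 0, 3]

cos = cosine_sim(doc_short, doc_long)
dist = euclidean_dist(doc_short, doc_long)
print(round(cos, 4))   # both vectors point the same way
print(round(dist, 4))  # yet they are far apart in Euclidean terms
```

The two documents have identical word proportions, so cosine similarity treats them as the same, while Euclidean distance grows with the difference in length.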

A document plagiarism detector can be built in Python with the following libraries:

Scikit-learn (an open-source library commonly used for machine learning)
glob (a standard-library module for finding files by pattern)

Dataset

Here I use three text documents with the '.txt' extension.

The first document contains this text:

Long ago, when there was no written history, these islands were the home of millions of happy birds; the resort of a hundred times more millions of fishes, sea lions, and other creatures. Here lived innumerable creatures predestined from the creation of the world to lay up a store of wealth for the British farmer, and a store of quite another sort for an immaculate Republican government.

The second document contains almost the same text, differing only at the end:

Long ago, when there was no written history, these islands were the home of millions of happy birds; the resort of a hundred times more millions of fishes, sea lions, and other creatures. Here lived innumerable creatures predestined from the creation of the world to lay up a store of wealth for the British farmer, and a store of quite another sort for an immaculate goverment

The third document is deliberately very different:

The dimension of the smartphone is 164.9 x 75.1 x 8.5 mm and it weighs 188 grams. It is powered by Qualcomm SM4250 Snapdragon 460 processor and comes in 6.52 inches IPS LCD, which is protected by Corning Gorilla Glass 3.

Code

Collecting the Documents with glob

The first step in building the plagiarism detector in Python is to import glob and use it to collect the documents prepared above.

import glob

# Collect every .txt file in the datasets folder
filedok = glob.glob('datasets/*.txt')
# Read the full text of each file
text_file = [open(file).read() for file in filedok]
print(filedok)
print('='*48)
print(text_file[0])

The output of the program above looks like this:

['datasets/siswa3.txt', 'datasets/siswa2.txt', 'datasets/siswa1.txt']
================================================
The dimension of the smartphone is 164.9 x 75.1 x 8.5 mm and it weighs 188 grams. It is powered by Qualcomm SM4250 Snapdragon 460 processor and comes in 6.52 inches IPS LCD, which is protected by Corning Gorilla Glass 3.
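Note that glob returns files in no guaranteed order (the output above lists siswa3.txt first), and open() without an encoding falls back to the platform default. A slightly more defensive sketch is shown below; the helper name load_documents is hypothetical, and it assumes the files are UTF-8.

```python
import glob

def load_documents(pattern):
    """Return (paths, texts) for files matching pattern, in sorted order."""
    paths = sorted(glob.glob(pattern))  # sorting makes the order reproducible
    texts = []
    for path in paths:
        # Assumption: the files are UTF-8; 'with' closes each file after reading
        with open(path, encoding='utf-8') as handle:
            texts.append(handle.read())
    return paths, texts

filedok, text_file = load_documents('datasets/*.txt')
print(filedok)
```

With sorted paths, the row order of any matrix built from text_file stays the same from run to run.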

Weighting Features with TF-IDF

TF-IDF compares how often a word occurs in a document with the proportion of documents in the collection that contain that word [1]. A word that occurs in many documents is not a good discriminator and should be given less weight than one that occurs in only a few documents [2]. Combining Term Frequency with Inverse Document Frequency has proven to be a robust technique for processing text data and for other purposes [2]. Put simply, TF-IDF weights each word (feature) by its frequency in a document, scaled down by how common the word is across documents. The scikit-learn code for TF-IDF looks like this:

from sklearn.feature_extraction.text import TfidfVectorizer
# use TF-IDF to weight the terms
vectorizer = TfidfVectorizer()
vec = vectorizer.fit_transform(text_file).toarray()
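To make the weighting concrete, the sketch below recomputes TF-IDF by hand in plain Python for three toy sentences. It assumes scikit-learn's documented defaults for TfidfVectorizer (smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row). The toy corpus and the names tokenize and tfidf_rows are made up for illustration, and the tokenizer is a simplification.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep word tokens of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_rows(docs):
    """Return (vocabulary, rows) with one L2-normalised TF-IDF row per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = {term: sum(1 for c in counts if term in c) for term in vocab}
    # Smoothed IDF (assumed to match scikit-learn's default formula)
    idf = {term: math.log((1 + n) / (1 + df[term])) + 1 for term in vocab}
    rows = []
    for c in counts:
        raw = [c[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

# Toy corpus, not the article's dataset
docs = ["the happy birds", "the happy fishes", "the smartphone"]
vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[0]):
    print(term, round(weight, 3))
```

In the first toy sentence, "birds" (found in one document) ends up with a higher weight than "happy" (two documents), which in turn outweighs "the" (all three).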

If we inspect the features with vectorizer.get_feature_names_out() (get_feature_names() in scikit-learn versions before 1.0), they look like this:

['164', '188', '460', '52', '75', 'ago', 'an', 'and', 'another', 'birds', 'british', 'by', 'comes', 'corning', 'creation', 'creatures', 'dimension', 'farmer', 'fishes', 'for', 'from', 'glass', 'gorilla', 'goverment', 'government', 'grams', 'happy', 'here', 'history', 'home', 'hundred', 'immaculate', 'in', 'inches', 'innumerable', 'ips', 'is', 'islands', 'it', 'lay', 'lcd', 'lions', 'lived', 'long', 'millions', 'mm', 'more', 'no', 'of', 'other', 'powered', 'predestined', 'processor', 'protected', 'qualcomm', 'quite', 'republican', 'resort', 'sea', 'sm4250', 'smartphone', 'snapdragon', 'sort', 'store', 'the', 'there', 'these', 'times', 'to', 'up', 'was', 'wealth', 'weighs', 'were', 'when', 'which', 'world', 'written']

The TF-IDF matrix has one row per document. The row for the second document (the one containing the misspelled 'goverment') looks like this:

[0.         0.         0.         0.         0.         0.09769707
 0.09769707 0.15174098 0.09769707 0.09769707 0.09769707 0.
 0.         0.         0.09769707 0.19539414 0.         0.09769707
 0.09769707 0.19539414 0.09769707 0.         0.         0.12845991
 0.         0.         0.09769707 0.09769707 0.09769707 0.09769707
 0.09769707 0.09769707 0.         0.         0.09769707 0.
 0.         0.09769707 0.         0.09769707 0.         0.09769707
 0.09769707 0.09769707 0.19539414 0.         0.09769707 0.09769707
 0.53109344 0.09769707 0.         0.09769707 0.         0.
 0.         0.09769707 0.         0.09769707 0.09769707 0.
 0.         0.         0.09769707 0.19539414 0.37935246 0.09769707
 0.09769707 0.09769707 0.09769707 0.09769707 0.09769707 0.09769707
 0.         0.09769707 0.09769707 0.         0.09769707 0.09769707]
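A raw row like this is easier to read when each weight is paired with its feature name. The sketch below does that for a handful of (term, weight) values copied from the feature list and the row shown above, then sorts them. With the real objects one would zip the feature names with a row of vec instead; the variable names here are illustrative.

```python
# A few (term, weight) pairs copied from the feature list and the row shown above
sample = [
    ('ago', 0.09769707),
    ('and', 0.15174098),
    ('goverment', 0.12845991),
    ('government', 0.0),
    ('of', 0.53109344),
    ('smartphone', 0.0),
    ('store', 0.19539414),
    ('the', 0.37935246),
]

# Keep only terms that occur in this document, heaviest first
present = sorted(
    ((term, weight) for term, weight in sample if weight > 0),
    key=lambda pair: pair[1],
    reverse=True,
)
for term, weight in present:
    print(f"{term:<12}{weight:.3f}")
```

The weights also scale with how often a term occurs in the document, which is why frequent words such as 'of' and 'the' still dominate this row even though they appear in all three documents.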

Pairing File Names with TF-IDF Vectors as Tuples

vec_list = list(zip(filedok, vec))

Main Program

This is the core of the program: we loop over the documents and compare each one with every other document.

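from sklearn.metrics.pairwise import cosine_similarity

# A set keeps each pair once, since (a, b) and (b, a) sort to the same tuple
plag = set()
for siswa, text_vector in vec_list:
    # Compare this document against every other document
    new_vec = vec_list.copy()
    indexx = new_vec.index((siswa, text_vector))
    del new_vec[indexx]
    for siswa_a, text_vector_a in new_vec:
        # cosine_similarity returns a 2x2 matrix; [0][1] is the score between the two vectors
        sim = cosine_similarity([text_vector, text_vector_a])[0][1]
        student_pair = sorted((siswa, siswa_a))
        score = (student_pair[0], student_pair[1], "{:.1f}".format(sim*100)+'%')
        plag.add(score)
# Sort so the pairs print in a stable order
for x in sorted(plag):
    print(x)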

The program prints the similarity score for each pair of documents (file names are shown here without the datasets/ prefix):

('siswa1.txt', 'siswa2.txt', '97.5%')
('siswa1.txt', 'siswa3.txt', '13.9%')
('siswa2.txt', 'siswa3.txt', '14.0%')
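The same pairwise comparison can be written more compactly with itertools.combinations, which yields each unordered pair exactly once, so neither the list copy nor the set is needed. This is an alternative sketch, not the code that produced the output above; the three short vectors and the pure-Python cosine helper are stand-ins so the example runs without the dataset.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical (file name, vector) pairs standing in for vec_list
vec_list = [
    ('siswa3.txt', [0.0, 0.1, 1.0]),
    ('siswa2.txt', [1.0, 0.9, 0.0]),
    ('siswa1.txt', [1.0, 1.0, 0.0]),
]

results = []
# Sorting by file name fixes the order; combinations() visits each pair once
ordered = sorted(vec_list, key=lambda item: item[0])
for (name_a, vec_a), (name_b, vec_b) in combinations(ordered, 2):
    sim = cosine(vec_a, vec_b)
    results.append((name_a, name_b, "{:.1f}%".format(sim * 100)))

for row in results:
    print(row)
```

For n documents this performs n(n-1)/2 comparisons instead of n(n-1), because each pair is scored only once.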

References

[1] Robertson, S.E. (2004). Understanding inverse document frequency: on theoretical arguments for IDF. Journal of Documentation, 60, 503–520.
[2] Ramos, J.E. (2003). Using TF-IDF to Determine Word Relevance in Document Queries.
[3] https://cmry.github.io/notes/euclidean-v-cosine

Building a Plagiarism Detector with Python