String Kernel Based Binary Text Source Classification
Pradeep Bihani, Master's Thesis Defense
 November 17,  at 4pm
 Pearce Hall 135
 Join online here:
In this project, we consider a simple problem in machine learning and try to solve it using an elementary method based on linear algebra. Given two sources of short texts (such as two Twitter accounts), we want to build a machine learning model that learns the difference between the two sources and can then classify further texts coming from them. We do not assume any knowledge of the features of the two sources, such as the language in which the texts are written. We approach this problem using the kernel method and the standard technique of support vector machines. This method maps the data into a high-dimensional Euclidean space and constructs a separating hyperplane that divides the two clouds of points corresponding to the two text sources. We test our method on several pairs of sources, such as (i) the tweets of Donald Trump and Joe Biden, (ii) those of Donald Trump and French president Emmanuel Macron, and (iii) the poetry of Shakespeare and the verses of the Bible. Depending on the similarity of the sources, we are able to separate them to different extents.
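
To give a rough sense of how a string kernel can separate two text sources without any language-specific features, here is a minimal sketch in Python. It is not the method of the thesis: it assumes a p-spectrum kernel (counts of shared substrings of length p), which is only one of several possible string kernels, and it swaps in a kernel perceptron as a lightweight stand-in for a support vector machine. The perceptron also finds a separating hyperplane in the feature space, but it does not maximize the margin. All function names and the four toy texts are hypothetical and chosen for illustration only.

```python
from collections import Counter


def spectrum_features(text, p=3):
    """Count every contiguous substring of length p in text."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))


def spectrum_kernel(s, t, p=3):
    """Inner product of the p-gram count vectors of s and t."""
    fs, ft = spectrum_features(s, p), spectrum_features(t, p)
    if len(fs) > len(ft):
        fs, ft = ft, fs  # iterate over the smaller set of p-grams
    return sum(count * ft[gram] for gram, count in fs.items())


def normalized_kernel(s, t, p=3):
    """Cosine-normalized kernel, so that text length matters less."""
    denom = (spectrum_kernel(s, s, p) * spectrum_kernel(t, t, p)) ** 0.5
    return spectrum_kernel(s, t, p) / denom if denom else 0.0


def train_kernel_perceptron(texts, labels, p=3, epochs=20):
    """Learn dual coefficients alpha; labels must be +1 or -1.

    Assumption: no bias term, so the hyperplane passes through the origin.
    """
    n = len(texts)
    gram = [[normalized_kernel(a, b, p) for b in texts] for a in texts]
    alpha = [0] * n
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            score = sum(alpha[j] * labels[j] * gram[j][i] for j in range(n))
            if labels[i] * score <= 0:  # misclassified or on the hyperplane
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha


def predict(text, texts, labels, alpha, p=3):
    """Return +1 or -1 depending on the side of the hyperplane."""
    score = sum(a * y * normalized_kernel(x, text, p)
                for a, y, x in zip(alpha, labels, texts) if a)
    return 1 if score >= 0 else -1


# Hypothetical toy data: two "sources" of short texts, labeled +1 and -1.
texts = ["the quick brown fox", "the lazy dog sleeps",
         "le chat noir dort", "le chien mange vite"]
labels = [1, 1, -1, -1]
alpha = train_kernel_perceptron(texts, labels)

for new_text in ["the brown dog", "le chat mange"]:
    print(new_text, "->", predict(new_text, texts, labels, alpha))
```

The classifier only ever touches the texts through the kernel, so nothing in it depends on the language or vocabulary of either source. A support vector machine would use the same Gram matrix but choose the hyperplane with the largest margin.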

The Committee for Pradeep Bihani: Dr. Debraj Chakrabarti (Chair), Dr. Jordan Watts, and Dr. Sivaram Narayan