
The model was trained on approximately 1.2 million essays, split evenly between two classes: AI-written and student-written. The text was tokenised with a character-level tokeniser with a vocabulary size of 30,000. You can visit the Kaggle notebook or the GitHub repository to see the exact datasets used for training.
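The tokenisation step can be sketched as follows. This is a minimal, hypothetical illustration of character-level encoding, not the exact tokeniser from the notebook: the real vocabulary has 30,000 entries, while this toy version builds one from a tiny corpus, and the `<pad>`/`<unk>` special tokens and `max_len` handling are assumptions.

```python
def build_char_vocab(corpus, specials=("<pad>", "<unk>")):
    """Map each distinct character to an integer id, reserving special tokens first."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for text in corpus:
        for ch in text:
            if ch not in vocab:
                vocab[ch] = len(vocab)
    return vocab

def encode(text, vocab, max_len=1000):
    """Convert text to ids, truncating or padding to max_len (the model's input length)."""
    ids = [vocab.get(ch, vocab["<unk>"]) for ch in text[:max_len]]
    ids += [vocab["<pad>"]] * (max_len - len(ids))
    return ids

vocab = build_char_vocab(["an essay", "another essay"])
ids = encode("an essay", vocab, max_len=12)  # 8 character ids followed by 4 pad ids
```

Fixing the sequence length with padding and truncation is what lets every essay be fed to the model as an equally sized batch of token ids.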
The architecture uses an LSTM layer for classification. The LSTM is a recurrent neural network that processes a sequence of 1,000 tokens and produces 1,000 outputs, one per step; only the last output is used for classification, via a linear layer and a sigmoid activation. To prevent overfitting, dropout of 0.3 was applied both within the LSTM layers and after them.
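The forward pass described above can be sketched with a toy single-unit LSTM in plain Python. This is a hedged illustration of the recurrence and the last-step classification, not the trained model: all parameter names are made up, the hidden size is 1 instead of the real model's, and dropout is omitted for clarity.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_classify(tokens, p):
    """Run a toy one-unit LSTM over the sequence, then classify from the LAST output."""
    h, c = 0.0, 0.0
    for x in tokens:
        i = sigmoid(p["wi"] * x + p["ui"] * h + p["bi"])    # input gate
        f = sigmoid(p["wf"] * x + p["uf"] * h + p["bf"])    # forget gate
        o = sigmoid(p["wo"] * x + p["uo"] * h + p["bo"])    # output gate
        g = math.tanh(p["wg"] * x + p["ug"] * h + p["bg"])  # candidate cell state
        c = f * c + i * g          # update cell state
        h = o * math.tanh(c)       # per-step output; 1000 of these in the real model
    # Final linear layer + sigmoid applied only to the last output
    return sigmoid(p["w_out"] * h + p["b_out"])

random.seed(0)
params = {k: random.uniform(-0.1, 0.1) for k in
          ["wi", "ui", "bi", "wf", "uf", "bf", "wo", "uo", "bo",
           "wg", "ug", "bg", "w_out", "b_out"]}
prob = lstm_classify([0.2, -0.1, 0.4], params)  # probability the essay is AI-written
```

The returned value lies in (0, 1), so a threshold of 0.5 turns it into an AI-written versus student-written decision.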
In order to evaluate the model, I used a test set consisting of 10% of the entire dataset. The metrics used for evaluation were accuracy and F1-score. Accuracy gives an overview of the model's performance, while the F1-score takes into account the model's performance on both classes.
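The two metrics can be computed from scratch as below. The labels and predictions here are made-up toy data, not results from the actual test set; the point is only to show how accuracy and F1 are derived from true positives, false positives, and false negatives.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy example: 1 = AI-written, 0 = student-written
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]
acc = accuracy(y_true, y_pred)   # 0.8
f1 = f1_score(y_true, y_pred)    # 0.8 (precision 1.0, recall 2/3)
```

Because F1 falls when either precision or recall is poor, it exposes a model that favours one class, which plain accuracy can hide even on a balanced test set.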