Abstract:
We evaluate two large language models and compare their ability to detect spam in a corpus of emails. We provide the models with a set of email messages, both legitimate and harmful, and check whether each message is correctly classified into one of two classes: spam or not spam. Furthermore, we examine how hidden commands embedded in a message affect the models’ classification ability. We find that a model can be deceived by a command incorporated in the email body, which causes the classification to fail.