Jakub Reš
Ph.D. student at the Security@FIT lab of FIT BUT, focused on fact injection attacks in large language models. My research also covers jailbreaking techniques and the security of synthesized code, with an emphasis on practical risks and mitigation strategies for real-world deployments.
Session
Updating a specific fact in a 70-billion-parameter model usually feels like using a sledgehammer for heart surgery. Traditional retraining is slow and expensive, often breaking unrelated behaviors in the process. ROME (Rank-One Model Editing) offers a more surgical alternative, treating model weights like a database that can be precisely updated without a full rebuild.
We move past the black box mystery to show how we can locate where a specific fact lives within a transformer's architecture. We will explore the mechanics of knowledge neurons, the risk of unintended side effects like hallucinations, and the serious security implications of malicious fact injections or stealth patching. Through live demonstrations using our open-source toolkit, we will show how ROME performs precision strikes on state-of-the-art models, highlighting both its efficiency and the scenarios where the method’s assumptions break down.
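To make the "rank-one" idea concrete, here is a minimal sketch of the kind of update involved. It is not the toolkit's code, and it rests on a simplifying assumption: the published ROME method also weights the update by an inverse covariance of keys estimated from data, which this sketch replaces with the identity matrix. The names `matvec`, `rank_one_edit`, `W`, `k`, and `v_new` are hypothetical and chosen only for illustration. A single layer's weight matrix `W` is treated as a key-value store, and one rank-one term is added so that the chosen key `k` maps to the new value `v_new`.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rank_one_edit(W, k, v_new):
    """Return W' = W + (v_new - W k) k^T / (k^T k), so that W' k equals v_new.

    Simplifying assumption: the key covariance is the identity matrix.
    """
    # How far the current output for key k is from the desired value.
    residual = [vn - vo for vn, vo in zip(v_new, matvec(W, k))]
    norm_sq = sum(ki * ki for ki in k)
    if norm_sq == 0:
        raise ValueError("key vector must be non-zero")
    # Add the outer product residual * k^T, scaled by 1 / (k^T k).
    return [[w + r * ki / norm_sq for w, ki in zip(row, k)]
            for row, r in zip(W, residual)]


# Toy "layer": 2 outputs, 3 inputs.
W = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0]]
k = [1.0, 0.0, 0.0]        # key standing in for the fact's subject
v_new = [5.0, -1.0]        # value standing in for the new fact
k_other = [0.0, 1.0, 0.0]  # an unrelated key, orthogonal to k

W_edited = rank_one_edit(W, k, v_new)
print(matvec(W_edited, k))        # output for the edited key
print(matvec(W_edited, k_other))  # output for the unrelated key
```

In this toy setting the edited key returns the new value, while a key orthogonal to it returns exactly what it did before. Real keys are rarely orthogonal, which is one way the side effects mentioned above can arise.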
This talk presents ongoing collaborative research between Red Hat and FIT BUT.