I work on AI for software engineering, with a focus on making language models more reliable and useful for code-related tasks.
My research investigates how domain knowledge can be integrated into language models through three complementary approaches: training specialized models on domain-specific data, augmenting generation with retrieval from community knowledge bases, and leveraging execution feedback to improve code quality.
Investigating whether small models trained extensively on domain-specific data can outperform much larger general-purpose models on certain tasks. The SOBert models (125M-762M parameters, each trained for under $2,000) beat substantially larger general-purpose baselines on StackOverflow code labeling tasks.
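As a rough illustration, a specialized encoder like this drops into a standard Hugging Face classification pipeline. The sketch below assumes a hypothetical checkpoint name and a binary labeling task; it is not the actual SOBert release configuration.

```python
# Minimal sketch: using a domain-specialized encoder for a
# StackOverflow labeling task via Hugging Face transformers.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Placeholder identifier, not the real SOBert checkpoint name.
checkpoint = "your-org/sobert-base"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2  # e.g., obsolete vs. current code snippet
)

post = "How do I parse JSON in Python? import json; data = json.loads(s)"
inputs = tokenizer(post, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # label probabilities for the post
```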
Developing retrieval-augmented generation systems that leverage community security discussions from StackOverflow to identify and fix vulnerabilities in LLM-generated code. SOSecure achieves fix rates of 72-97%, compared with 38-56% for GPT-4 alone, without introducing new vulnerabilities.
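The core retrieval step can be sketched simply: compare the generated code against a corpus of community security discussions, pull the most similar ones, and ground the repair prompt in them. The example below uses TF-IDF similarity over a toy corpus as a stand-in; SOSecure's actual retrieval and prompting details may differ.

```python
# Sketch of retrieval-grounded vulnerability repair over a toy corpus
# of StackOverflow-style security discussions. Illustrative only;
# the LLM call itself is left abstract.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

security_posts = [
    "Avoid yaml.load on untrusted input; use yaml.safe_load instead.",
    "String-formatted SQL enables injection; use parameterized queries.",
    "MD5 is broken for password hashing; prefer bcrypt or argon2.",
]

def retrieve(code: str, k: int = 2) -> list[str]:
    """Return the k discussions most similar to the generated code."""
    vectorizer = TfidfVectorizer().fit(security_posts + [code])
    corpus_vecs = vectorizer.transform(security_posts)
    query_vec = vectorizer.transform([code])
    scores = cosine_similarity(query_vec, corpus_vecs)[0]
    top = scores.argsort()[::-1][:k]
    return [security_posts[i] for i in top]

def build_repair_prompt(code: str) -> str:
    """Assemble a fix prompt grounded in retrieved community knowledge."""
    context = "\n".join(f"- {post}" for post in retrieve(code))
    return (
        "Relevant security discussions:\n" + context +
        "\n\nRevise the following code to fix any vulnerabilities:\n" + code
    )

print(build_repair_prompt(
    "cursor.execute('SELECT * FROM users WHERE id=%s' % uid)"
))
```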
Applying reinforcement learning to improve code quality, with a focus on decompilation: producing readable, semantically equivalent code from binary executables, using compiler and test feedback as learning signals.
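A minimal version of such a learning signal, assuming gcc is on PATH and each sample ships with input/output tests; the reward shaping here is a sketch, not the published formulation:

```python
# Execution-based reward sketch for RL on decompilation candidates.
import os
import subprocess
import tempfile

def reward(c_source: str, tests: list[tuple[str, str]]) -> float:
    """Score a candidate decompilation: 0 if it fails to compile,
    otherwise the fraction of (stdin, expected_stdout) tests passed."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "candidate.c")
        binary = os.path.join(tmp, "candidate")
        with open(src, "w") as f:
            f.write(c_source)
        compiled = subprocess.run(["gcc", src, "-o", binary],
                                  capture_output=True)
        if compiled.returncode != 0:
            return 0.0  # compiler feedback: uncompilable code earns nothing
        passed = 0
        for stdin, expected in tests:
            try:
                run = subprocess.run([binary], input=stdin, text=True,
                                     capture_output=True, timeout=5)
            except subprocess.TimeoutExpired:
                continue  # hung executions count as failures
            if run.returncode == 0 and run.stdout.strip() == expected.strip():
                passed += 1
        # execution feedback: partial credit per passing test
        return passed / len(tests)
```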
Publicly released domain-specialized language models on Hugging Face: