Manisha Mukherjee

Hello! I'm Manisha, a PhD student at S3D in the School of Computer Science at Carnegie Mellon University working with Vincent Hellendoorn.

About Me

I am an LLM4Code researcher, and I work at the intersection of machine learning, software engineering, and security to enhance AI-generated code. My passion lies in making code assistants more accurate, secure, and aligned with developer needs.

My research leverages the collective wisdom of the developer community—through domain-specific training and expert-driven retrieval systems—to bridge the gap between AI capabilities and software engineering best practices.

Research Focus

Specialized Models for Specialized Tasks: I investigate how targeted pretraining on programming knowledge repositories (like Stack Overflow) can create more effective specialized models that outperform larger generalist approaches on technical tasks, even with significantly fewer parameters.
Here are two Medium Language Models I trained exclusively on Stack Overflow data: SOBertLarge and SOBertBase.
Secure Code Generation: I develop techniques that infuse community-driven security insights into LLM-generated code, significantly reducing vulnerabilities and building greater trust in AI-assisted development workflows.

I am broadly interested in:

AI4Code
Generative AI
Software Engineering
Security
Natural Language Processing (NLP)

Before this, I worked in computer networks and computer vision for my master’s thesis, advised by Tom La Porta.

Experience

Adobe Research (Summer 2024)
Lawrence Livermore National Lab (Summer 2022, 2023)
Idaho National Lab (Summer 2020)
Fujitsu Research Labs America (Summer 2019, 2021)
Cisco Systems Inc (2014-2017)
Capgemini India Pvt Ltd (2011-2012)

Selected Publications and Patents

M. Mukherjee and V. J. Hellendoorn, "SOSecure: Safer Code Generation with RAG and StackOverflow Discussions," arXiv preprint arXiv:2306.03268, 2025.
M. Mukherjee, S. Kim, X. Chen, D. Luo, T. Yu, and T. Mai, "From Documents to Dialogue: Building KG-RAG Enhanced AI Assistants," arXiv preprint arXiv:2502.15237, 2025.
M. Mukherjee and V. J. Hellendoorn, "Stack overflowing with results: The case for domain-specific pre-training over one-size-fits-all models," arXiv preprint arXiv:2306.03268, 2023.
V. J. Hellendoorn, J. Tsay, M. Mukherjee, and M. Hirzel, "Towards automating code review at scale," in 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’21), 2021.
M. Mukherjee, M. Bahrami, and W. P. Chen, "Source code retrieval," US Patent Application 17/085,894, 2020.
M. Mukherjee, J. Edwards, H. Kwon, and T. F. La Porta, "Quality of information-aware real-time traffic flow analysis and reporting," in 2015 IEEE International Conference on Pervasive Computing and Communication Workshops (PerCom Workshops), IEEE, 2015, pp. 69–74.
M. Mukherjee, "Determination of real-time traffic flow parameters in different devices based on QoI requirements," MS thesis, 2014.