I am a final-year undergraduate student at Shiv Nadar University Chennai, where I'm majoring in Artificial Intelligence and Data Science. I am particularly interested in Computer Vision and multi-modal models, currently focusing on improving robustness, ensuring fairness, and preserving user privacy.
I worked under the guidance of Sanket Biswas and Josep Llados at the Computer Vision Center, focusing on enhancing the capabilities of vision-language models for the task of language-controlled document editing. Our work has been accepted for presentation at the WACV 2025 conference's workshop on Computer Vision Systems for Document Analysis and Recognition.
I was a visiting researcher under Dr. Karthik Nandakumar at MBZUAI, Abu Dhabi, where I worked on federated learning for extreme non-IID scenarios.
I spent Summer 2024 as a UGRIP intern at MBZUAI, Abu Dhabi, where I worked under Dr. Zhiqiang Shen on analysing hallucinations in LLM responses to principled prompts. We also collected human and model preferences for each response pair, for future study on preference-based optimization.
Previously, I did a research internship under Dr. Ravi Kiran Sarvadevabhatla at the CVIT Lab, IIIT Hyderabad, generating precise text line segmentation for complex Indic and Southeast Asian historical palm leaf manuscripts.
Please feel free to check out my resume.
You can also find me on other spaces below.
Our project, conducted under Dr. Zhiqiang Shen (Jason), focused on "Optimizing Prompts for Foundation Models" to reduce hallucination. We curated a benchmark dataset of 25k questions across ~60 topics, including law, philosophy, and history. Additionally, we developed a web application to collect human preferences and assess the correctness of responses before and after applying 26 guiding principles. This preference data is crucial for future preference-based optimization techniques, enhancing the accuracy and reliability of AI-generated responses.
Research Intern | Computer Vision Center (CVC)
Feb '24 - Present
Working on language-controlled document editing; currently analysing the potential of LLMs to generate structured commands for editing documents.
Research Intern | Center for Visual Information Technology (CVIT)
May '23 - Feb '24
Co-developed a novel method to achieve precise text line segmentation for complex Indic and Southeast Asian historical palm leaf manuscripts.
DocEdit Redefined: In-Context Learning for Multimodal Document Editing
Muhammad Waseem, Sanket Biswas, Josep Llados
VisionDocs: Workshop on Computer Vision Systems for Document Analysis and Recognition WACV 2025 [paper]
We introduce an innovative approach to structured document editing that uses Visual-Language Models (VLMs) to simplify the process by removing the need for specialized segmentation tools. Our method incorporates a cutting-edge in-context learning framework to enhance flexibility and efficiency in tasks like spatial alignment, component merging, and regional grouping. By leveraging open-world VLMs, we ensure that document edits preserve coherence and intent. To benchmark our approach, we introduce a new evaluation suite and protocol that assess both spatial and semantic accuracy, demonstrating significant advancements in structured document editing.
LineTR: Unified Text Line Segmentation for Challenging Palm Leaf Manuscripts
Vaibhav Agrawal, Niharika Vadlamudi, Muhammad Waseem, Amal Joseph, Sreenya Chitluri, Ravi Kiran Sarvadevabhatla
ICPR 2024 [paper]
We present LineTR, a novel two-stage approach for precise line segmentation in diverse and challenging handwritten historical manuscripts. LineTR's first stage uses a DETR-style network and a hybrid CNN-transformer to process image patches and generate text scribbles and an energy map. A robust, dataset-agnostic post-processing step produces document-level scribbles. In the second stage, these scribbles and the text energy map are used to generate precise polygons around text lines. We introduce three new datasets of Indic and South-East Asian manuscripts and demonstrate LineTR's superior performance and effectiveness in zero-shot inference across various datasets.
Explored nearest neighbor-based classification in federated learning, drawing inspiration from semantic drift compensation in class-incremental learning, to improve model robustness in highly non-IID settings. Achieved promising results in proof-of-concept visualizations.
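The core idea of nearest neighbor-based classification is to represent each class by a prototype (its mean embedding) and assign queries to the closest prototype. The following is a minimal illustrative sketch of that prototype classifier, not the project's actual code; the toy data and function names are my own assumptions.

```python
import numpy as np

def class_prototypes(embeddings, labels):
    # One prototype per class: the mean embedding of that class's samples.
    return {c: embeddings[labels == c].mean(axis=0) for c in np.unique(labels)}

def nearest_prototype_predict(query, prototypes):
    # Assign the label of the closest prototype (Euclidean distance).
    classes = list(prototypes)
    dists = [np.linalg.norm(query - prototypes[c]) for c in classes]
    return classes[int(np.argmin(dists))]

# Toy example: two well-separated clusters in a 4-d embedding space.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0.0, 0.1, (10, 4)), rng.normal(5.0, 0.1, (10, 4))])
lab = np.array([0] * 10 + [1] * 10)
protos = class_prototypes(emb, lab)
pred = nearest_prototype_predict(np.full(4, 5.0), protos)
```

In the federated, non-IID setting, prototypes can be aggregated across clients without sharing raw data; drift compensation then adjusts old prototypes as the shared feature extractor changes.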
Utilized a Vision-Language Model (VLM) with a custom prompt template and an augmentation pipeline to accurately extract product details from images. Built a robust post-processing pipeline to validate the measurement units of the extracted data. Improved the overall F1 score by 17%.
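A post-processing validator of this kind can be sketched as a simple parse-and-check step over the VLM's raw output. This is an illustrative sketch only; the allowed-unit list and pattern are my assumptions, not the project's actual rules.

```python
import re

# Illustrative subset of accepted units; the real project list would differ.
ALLOWED_UNITS = {"gram", "kilogram", "millilitre", "litre", "centimetre", "watt"}

# Matches strings like "250 gram" or "12.5 litre".
UNIT_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-z]+)\s*$")

def validate_measurement(text):
    """Return (value, unit) if text is a well-formed measurement, else None."""
    match = UNIT_PATTERN.match(text.lower())
    if not match:
        return None
    value, unit = float(match.group(1)), match.group(2)
    return (value, unit) if unit in ALLOWED_UNITS else None
```

Rejecting malformed or unknown units before scoring is what lets a pipeline like this lift precision, and with it the overall F1.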
Uncovering bias and uncertainty in models using Semi-Supervised VAEs
This project aims to investigate and quantify the biases present in face detection models. Identified biases include a preference for white faces over black faces, higher accuracy in detecting male faces compared to female faces, better detection of faces without glasses, and variations in accuracy based on different hair colors. The ultimate goal is to highlight these biases and suggest ways to mitigate them, promoting the development of fairer and more inclusive face detection systems.
Addressed dataset-specific challenges for Urdu text-line segmentation and evaluated pre-trained weights for domain adaptation. Integrated the model into the Indian Government’s Bhashini API during my internship at IIIT Hyderabad.
Developed a model using a PyTorch implementation of CRAFT and a Vision Transformer to determine whether two handwritten Hindi images were written by the same writer. Achieved an AUC of 0.72 and 10th place in an NCVPRIPG workshop competition.
Optimizing neural network weights using nature-inspired algorithms instead of gradient descent and backpropagation. The algorithms include Ant Colony Optimization, Particle Swarm Optimization, and Genetic Algorithms.
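As one example of the gradient-free approach, Particle Swarm Optimization treats the flattened weight vector as a particle position and moves a swarm toward the best-known losses. Below is a minimal sketch with standard PSO hyperparameters, fitting a toy one-layer linear model; it is illustrative only and not the project's implementation.

```python
import numpy as np

def pso_minimize(loss, dim, n_particles=30, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal Particle Swarm Optimization over a flat weight vector."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-1, 1, (n_particles, dim))   # particle positions (weights)
    vel = np.zeros((n_particles, dim))
    pbest = pos.copy()                              # per-particle best positions
    pbest_loss = np.array([loss(p) for p in pos])
    gbest = pbest[pbest_loss.argmin()].copy()       # swarm-wide best position
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        # Inertia + attraction to personal best + attraction to global best.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        losses = np.array([loss(p) for p in pos])
        improved = losses < pbest_loss
        pbest[improved], pbest_loss[improved] = pos[improved], losses[improved]
        gbest = pbest[pbest_loss.argmin()].copy()
    return gbest, float(pbest_loss.min())

# Toy "network": recover the weights of y = X @ [2, -1] by minimizing MSE.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = X @ np.array([2.0, -1.0])
mse = lambda wts: float(np.mean((X @ wts - y) ** 2))
w_best, best_loss = pso_minimize(mse, dim=2)
```

The same loop scales to real networks by flattening all layer weights into `dim`, though evaluation cost grows with particles × iterations, which is why these methods suit small models.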
This template is a modification of Jon Barron's website, further modified by Rishab Khincha. Find the source code for my version here. Feel free to clone it for your own use while attributing the original author, Jon Barron.