I am a Master's student at Mila and Université de Montréal. I'm primarily interested in audio-visual learning, visual scene understanding and computational photography.
Previously, I worked under the guidance of Prof. Ujjwal Bhattacharya at the CVPR Unit, Indian Statistical Institute, Kolkata. Earlier, I completed a short stint at the Bhabha Atomic Research Centre, Mumbai, where I worked on Devanagari text recognition in limited-data settings. I graduated from Amity University with a Bachelor's degree in Computer Science and Engineering, First Class with Distinction.
My prior research spans audio-visual co-segmentation, audio-visual summarization, and medical signal processing.
[Jul '24]: Meerkat accepted at ECCV 2024! Check here!
[Jul '23]: AdVerb accepted at ICCV 2023! Check here!
[Jul '23]: UnShadowNet accepted in the IEEE Access journal!
[Sep '22]: Joined Mila as a Master's student in a program supervised by Prof. Yoshua Bengio.
[Nov '21]: Presented AudViSum at BMVC 2021! [Presentation]
[Oct '21]: AudViSum accepted at BMVC 2021!
[Sep '21]: Presented Listen to the Pixels at ICIP 2021! [Presentation]
[May '21]: Listen to the Pixels accepted at ICIP 2021!
[Jan '21]: Presented CardioGAN at ICPR 2020! [Presentation]
We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of images and audio, both spatially and temporally. With a new modality-alignment module based on optimal transport and a cross-attention module that enforces audio-visual consistency, Meerkat can tackle challenging tasks such as audio-referred image grounding, image-guided audio temporal localization, and audio-visual fact-checking. Moreover, we carefully curate AVFIT-3M, a large dataset of 3M instruction-tuning samples collected from open-source datasets, and introduce MeerkatBench, which unifies five challenging audio-visual tasks.
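As a rough, illustrative sketch of the optimal-transport idea (not the paper's exact formulation; all shapes, names, and hyperparameters below are assumptions), a Sinkhorn-style soft alignment between audio and visual tokens might look like this in PyTorch:

import torch
import torch.nn.functional as F

def sinkhorn(cost, eps=0.1, n_iters=50):
    # Entropic-regularised optimal transport with uniform marginals:
    # returns a soft alignment (transport plan) between audio tokens (rows)
    # and visual tokens (columns).
    K = torch.exp(-cost / eps)                        # Gibbs kernel, (Na, Nv)
    mu = torch.full((cost.size(0),), 1.0 / cost.size(0))
    nu = torch.full((cost.size(1),), 1.0 / cost.size(1))
    u, v = torch.ones_like(mu), torch.ones_like(nu)
    for _ in range(n_iters):                          # alternating scaling updates
        u = mu / (K @ v)
        v = nu / (K.t() @ u)
    return u[:, None] * K * v[None, :]                # transport plan P

# Toy usage: cosine-distance cost between random token embeddings.
audio = F.normalize(torch.randn(16, 256), dim=-1)     # 16 audio tokens
visual = F.normalize(torch.randn(49, 256), dim=-1)    # 49 visual patches
cost = 1.0 - audio @ visual.t()
P = sinkhorn(cost)
align_loss = (P * cost).sum()  # could serve as an alignment objective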
@inproceedings{chowdhury2024meerkat,
title={Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time},
author={Chowdhury, Sanjoy and Nag, Sayan and Dasgupta, Subhrajyoti and Chen, Jun and Elhoseiny, Mohamed and Gao, Ruohan and Manocha, Dinesh},
booktitle={ECCV},
year={2024}
}
AdVerb leverages visual cues of the environment to estimate clean audio from reverberant audio. For instance, given a reverberant sound produced in a large hall, our model attempts to remove the reverb effect to predict the anechoic or clean audio.
@inproceedings{chowdhury2023adverb,
title={AdVerb: Visually Guided Audio Dereverberation},
author={Chowdhury, Sanjoy and Ghosh, Sreyan and Dasgupta, Subhrajyoti and Ratnarajah, Anton and Tyagi, Utkarsh and Manocha, Dinesh},
booktitle={ICCV},
year={2023}
}
Shadow removal is a hard task, one of its key challenges being the unavailability of paired labelled data. We propose a weakly supervised, illumination-critic-guided method that uses contrastive learning to remove shadows efficiently.
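As a generic illustration of the contrastive component, a standard InfoNCE loss in PyTorch might look as follows (embedding shapes and the pairing scheme are assumptions, not the paper's exact critic or objective):

import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.07):
    # anchors, positives: (N, D) embeddings where matching rows form positive
    # pairs; every other row in the batch acts as a negative.
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / temperature      # (N, N) similarity matrix
    targets = torch.arange(a.size(0))     # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: embeddings of shadow-removed outputs vs. well-lit references.
loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))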
@article{dasgupta2023unshadownet,
title={UnShadowNet: Illumination Critic Guided Contrastive Learning For Shadow Removal},
author={Dasgupta, Subhrajyoti and Das, Arindam and Yogamani, Senthil and Das, Sudip and Eising, Ciar{\'a}n and Bursuc, Andrei and Bhattacharya, Ujjwal},
journal={IEEE Access},
year={2023},
publisher={IEEE}
}
Generating representative and diverse audio-visual summaries by exploiting both the audio and visual modalities, unlike prior works. We also present a new dataset, built on TVSum and OVP, with both audio and visual annotations.
@inproceedings{chowdhury2021audvisum,
title={AudViSum: Self-Supervised Deep Reinforcement Learning for Diverse Audio-Visual Summary Generation},
author={Chowdhury, Sanjoy and Patra, Aditya P. and Dasgupta, Subhrajyoti and Bhattacharya, Ujjwal},
booktitle={BMVC},
year={2021}
}
Audio-visual co-segmentation and sound source separation using a novel multimodal fusion mechanism. The method also addresses separating partially occluded sound sources and co-segmentation when multiple similar-sounding sources are present.
@inproceedings{chowdhury2021listen,
title={Listen to the Pixels},
author={Chowdhury, Sanjoy and Dasgupta, Subhrajyoti and Das, Sudip and Bhattacharya, Ujjwal},
booktitle={2021 IEEE International Conference on Image Processing (ICIP)},
year={2021},
pages={2568-2572},
doi={10.1109/ICIP42928.2021.9506019}
}
Generating synthetic ECGs with an attention-based generative adversarial network, enabling easy data sharing without risk of privacy breach.
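For intuition only, a bare-bones GAN training step on fixed-length 1-D signals might look as follows in PyTorch (the actual CardioGAN uses attention and a more elaborate architecture; every module and size here is a stand-in):

import torch
import torch.nn as nn

# Stand-in generator and discriminator for 512-sample signal windows.
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 512), nn.Tanh())
D = nn.Sequential(nn.Linear(512, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.randn(32, 512)    # placeholder for a batch of real ECG windows
fake = G(torch.randn(32, 64))  # synthetic windows from latent noise

# Discriminator step: push real towards 1, fake towards 0.
loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: fool the discriminator into scoring fakes as real.
loss_g = bce(D(fake), torch.ones(32, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()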
@inproceedings{dasgupta2021cardiogan,
title={CardioGAN: An Attention-based Generative Adversarial Network for Generation of Electrocardiograms},
author={Dasgupta, Subhrajyoti and Das, Sudip and Bhattacharya, Ujjwal},
booktitle={2020 25th International Conference on Pattern Recognition (ICPR)},
year={2021},
pages={3193-3200},
doi={10.1109/ICPR48806.2021.9412905}
}
While there is a large body of literature on detecting and recognising English text in natural scenes and documents, regional languages had received comparatively little attention at the time of this study. This project, carried out at the Bhabha Atomic Research Centre, Mumbai, dealt with a severe shortage of data and the nuances of the Devanagari script. Learning strategies for constrained settings, such as few-shot learning and transfer learning, were used to develop the system. The project was implemented in Python with Keras, making extensive use of OpenCV, Matplotlib, and other scientific tools.
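A minimal Keras transfer-learning setup in the spirit of the project (the backbone, input size, and class count below are illustrative assumptions, not the project's actual configuration):

import tensorflow as tf

NUM_CLASSES = 47  # hypothetical: one class per Devanagari character

# Pretrained backbone, frozen to cope with the limited-data regime.
base = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), include_top=False, weights="imagenet")
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)  # data pipeline omitted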
The LHC produces an enormous amount of data every day, which must be processed and used efficiently for further research. This study examined how machine learning can be applied to particle identification, particle-track reconstruction, clustering of particles by similarity, and the identification of rare decays. It also surveyed the proposed SHiP experiment and the scope for machine learning within it.
Personal
I often go out with my camera to cover music festivals and to capture people and moments. Check out my work on 500px. Besides, I enjoy a wide variety of movies and music, and I keep a large collection of films, from Kubrick to Nolan and Bergman to Ray.