DevConf.IN 2026

Sneha Singh

I currently work at Deloitte as a Software Engineer in the HCSM – Infra and Cloud Management team, where I focus on Network Operations Center (NOC) services and cloud technologies. My role involves keeping infrastructure operations running smoothly, monitoring critical systems, and supporting cloud environments. Beyond that, I have a keen interest in robotics and automation.


Company or affiliation:

Deloitte USI

Job title:

Analyst/Software Engineer


Session

02-13 · 17:15 · 15 min
Quantization at the Edge: Making a 4GB Model Run on 1GB RAM
Sneha Singh

Running generative AI on edge hardware is challenging because LLMs have memory footprints far larger than affordable ARM boards can provide. Developers often give up or fall back on cloud inference, which introduces latency, privacy concerns, and connectivity issues. The problem persists because most quantization tutorials target server-class GPUs and ignore memory-constrained devices where every megabyte matters. Traditional 8-bit or 4-bit quantization still leaves models too large for sub-2GB RAM environments, and little practical guidance exists for pushing the boundaries on real edge hardware.
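As a rough back-of-the-envelope illustration (the parameter count, layer count, and hidden size below are assumptions for the sketch, not figures from the talk), the arithmetic shows why even 4-bit weights alone can consume an entire 1GB budget:

def weight_gib(n_params: float, bits: float) -> float:
    # Bytes occupied by quantized weights, converted to GiB.
    return n_params * bits / 8 / 2**30

def kv_cache_gib(n_layers: int, n_tokens: int, hidden: int,
                 bytes_per_elem: int = 2) -> float:
    # Key + value tensors per layer and token, fp16 elements;
    # assumes standard multi-head attention (KV width == hidden size).
    return 2 * n_layers * n_tokens * hidden * bytes_per_elem / 2**30

PARAMS = 2e9                      # assumed ~2B parameters (fp16 file is roughly 4 GB)
KV = kv_cache_gib(n_layers=24, n_tokens=512, hidden=2048)

for bits in (16, 8, 4, 3):
    total = weight_gib(PARAMS, bits) + KV
    print(f"{bits:>2}-bit weights + KV cache: ~{total:.2f} GiB "
          f"(before OS and runtime overhead)")

With these assumed numbers, 4-bit weights plus even a modest KV cache already land around 1 GiB before the operating system or runtime claims anything, which is what motivates sub-4-bit quantization and KV-cache trimming.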

This talk walks through a practical method for shrinking a 4GB LLM to run comfortably on a 1GB device through aggressive quantization, operator fusion, KV-cache trimming, and runtime memory pooling. The approach uses open-weight models, offline quantization, and lightweight inference runtimes optimized for ARM CPUs. A demo shows how to load and run a quantized model on a basic board while maintaining usable accuracy. This session will benefit embedded engineers, makers, AI practitioners, and cloud-edge architects exploring low-cost, privacy-friendly AI deployments.
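The abstract does not name a specific runtime, so the snippet below is only a sketch of what such a setup could look like, assuming llama.cpp's Python bindings (llama-cpp-python), a hypothetical sub-4-bit GGUF file, and illustrative settings rather than the speaker's exact method:

# Illustrative sketch: CPU-only inference with a heavily quantized model
# on an ARM board, using llama-cpp-python (an assumed runtime choice).
from llama_cpp import Llama

llm = Llama(
    model_path="models/tiny-chat-q3_k_m.gguf",  # hypothetical sub-4-bit GGUF file
    n_ctx=256,         # trimmed context window keeps the KV cache small
    n_threads=4,       # match the board's CPU core count
    n_gpu_layers=0,    # CPU-only inference on the ARM board
    use_mmap=True,     # page weights in on demand instead of copying them into RAM
)

result = llm(
    "Explain in one sentence why on-device inference improves privacy:",
    max_tokens=48,
    temperature=0.7,
)
print(result["choices"][0]["text"].strip())

The small context window bounds the KV cache to a few megabytes, and memory-mapping lets the OS page weights in on demand rather than loading the entire file into RAM, both of which matter on a 1GB device.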

AI, Data Science, and Emerging Tech
VYAS - 1 - Room#VY102