Carnegie Mellon University at NeurIPS 2024 – Machine Learning Blog | ML@CMU
Carnegie Mellon University is proud to present 194 papers at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), held from December 10-15 at the Vancouver Convention Center. Here is a quick overview of the areas our researchers are working on:
Here are some of our top collaborator institutions:
Oral Papers
Stylus: Automatic Adapter Selection for Diffusion Models
This paper explores an alternative approach to generating high-fidelity, customized images at reduced cost: using fine-tuned adapters instead of simply scaling base models with additional data or parameters. Over time, the open-source community has created a large collection of more than 100,000 adapters, small modules that fine-tune base models for specific tasks. However, many of these adapters are highly customized and lack clear descriptions, making them challenging to use effectively. To address this, the paper introduces Stylus, a system designed to match prompts with relevant adapters and automatically compose them for better image generation. Building on recent research showing the benefits of combining multiple adapters, Stylus uses a three-stage process: summarizing adapters with improved descriptions and embeddings, retrieving relevant adapters, and composing adapters based on prompt keywords to ensure a strong match. The authors also present StylusDocs, a curated dataset of 75,000 adapters with pre-computed embeddings, for evaluation. Testing Stylus on popular Stable Diffusion checkpoints shows that it achieves better CLIP/FID Pareto efficiency and is twice as preferred by human and multimodal evaluators compared to the base model.
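To make the retrieval stage concrete, here is a minimal sketch of matching a prompt embedding against pre-computed adapter description embeddings, in the spirit of StylusDocs. The names, shapes, and scoring below are illustrative assumptions, not Stylus's actual interface.

```python
import numpy as np

def retrieve_adapters(prompt_emb, adapter_embs, adapter_names, k=3):
    """Return the k adapters whose description embeddings are most
    cosine-similar to the prompt embedding (hypothetical helper)."""
    prompt_emb = prompt_emb / np.linalg.norm(prompt_emb)
    adapter_embs = adapter_embs / np.linalg.norm(adapter_embs, axis=1, keepdims=True)
    scores = adapter_embs @ prompt_emb              # cosine similarities
    top = np.argsort(scores)[::-1][:k]
    return [(adapter_names[i], float(scores[i])) for i in top]

# Toy usage with random stand-ins for pre-computed embeddings.
rng = np.random.default_rng(0)
names = [f"adapter_{i}" for i in range(1000)]
embs = rng.normal(size=(1000, 64))
prompt = rng.normal(size=64)
print(retrieve_adapters(prompt, embs, names))
```

The composition stage, which merges the retrieved adapters based on prompt keywords, is where much of the system's novelty lies and is omitted here.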
The Sample-Communication Complexity Trade-off in Federated Q-Learning
This work examines the problem of Federated Q-learning, where multiple agents collaboratively learn the optimal Q-function for an unknown infinite-horizon Markov Decision Process with finite state and action spaces. The focus is on understanding the trade-off between sample complexity (the number of data samples needed for learning) and communication complexity (the amount of information exchanged between agents) for intermittent communication algorithms, a commonly used approach in federated settings.
The authors first establish a fundamental limitation: any Federated Q-learning algorithm that achieves linear speedup in sample complexity relative to the number of agents must incur a communication cost of at least Ω(1/(1−γ)), where γ is the discount factor. They then introduce a new algorithm, Fed-DVR-Q, which is the first to achieve both optimal sample complexity and communication complexity simultaneously. Together, these results provide a comprehensive understanding of the trade-offs between sample and communication efficiency in Federated Q-learning.
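For readers unfamiliar with the setting, the sketch below shows a generic intermittent-communication scheme: agents run local Q-learning updates on their own samples and only periodically synchronize by averaging. This is a toy illustration of the problem setup, not the Fed-DVR-Q algorithm (which adds variance reduction and quantization to achieve the optimal complexities); the MDP and hyperparameters are arbitrary.

```python
import numpy as np

# Toy MDP: 3 states, 2 actions, random transitions and rewards.
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = next-state distribution
R = rng.uniform(size=(S, A))                 # deterministic rewards

def local_update(Q, lr=0.1, steps=50):
    """One agent's local phase: independent synchronous Q-learning updates
    from its own sampled transitions, with no communication."""
    for _ in range(steps):
        for s in range(S):
            for a in range(A):
                s_next = rng.choice(S, p=P[s, a])        # one sampled transition
                target = R[s, a] + gamma * Q[s_next].max()
                Q[s, a] += lr * (target - Q[s, a])
    return Q

n_agents, n_rounds = 4, 20
Qs = [np.zeros((S, A)) for _ in range(n_agents)]
for _ in range(n_rounds):                     # each round costs one communication
    Qs = [local_update(Q) for Q in Qs]
    Q_avg = np.mean(Qs, axis=0)               # server averages the local Q-tables
    Qs = [Q_avg.copy() for _ in range(n_agents)]
print(Q_avg)
```

The paper's question is how few such communication rounds suffice while still getting an n_agents-fold reduction in the samples each agent needs.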
Spotlight Papers
Aligner Encoders: Self-Attention Transformers Can Be Self-Transducers
The paper introduces a new transformer-based approach to automatic speech recognition (ASR) that simplifies the alignment process between audio input and text output. Unlike traditional models, the encoder itself aligns the audio information internally, reducing the complexity of decoding. The proposed "Aligner-Encoder" model combines efficient training techniques with a lightweight decoder, resulting in significantly faster performance while maintaining competitive accuracy. Notably, the alignment process is evident in the self-attention weights of the model, showcasing its ability to handle the task efficiently.
Approximating the Top Eigenvector in Random Order Streams
This work focuses on streaming algorithms for approximating the top eigenvector of a matrix when its rows are presented in a random order. The authors introduce a new algorithm that works well when there is a sufficient gap between the largest and second-largest eigenvalues of the matrix. Their approach uses a small amount of memory, depending on the number of "heavy rows" (rows with large norms), and produces highly accurate results. They also show that this heavy-row-based parameterization is necessary for achieving high accuracy, and they improve on prior methods by reducing the gap requirement for random-order streams, though their method assumes the rows arrive in a random order rather than an arbitrary one.
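As context for what a low-memory streaming estimator looks like, here is the classic Oja update, which approximates the top eigenvector in one pass with O(d) memory. This is a textbook baseline, not the paper's algorithm, and the data below is synthetic with a deliberately large eigengap.

```python
import numpy as np

def oja_top_eigenvector(rows, lr=0.01):
    """One-pass Oja update: an O(d)-memory estimate of the top eigenvector
    of sum_i x_i x_i^T from a stream of rows x_i."""
    it = iter(rows)
    v = np.array(next(it), dtype=float)
    v /= np.linalg.norm(v)
    for x in it:
        v += lr * x * (x @ v)       # stochastic power-iteration step
        v /= np.linalg.norm(v)
    return v

rng = np.random.default_rng(0)
u = np.array([3.0, 1.0, 0.0]); u /= np.linalg.norm(u)
# Rows whose covariance is dominated by direction u, streamed in random order.
stream = u * rng.normal(scale=3.0, size=(5000, 1)) + rng.normal(scale=0.3, size=(5000, 3))
rng.shuffle(stream)
v = oja_top_eigenvector(stream)
print(abs(v @ u))                   # near 1 when the eigengap is large
```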
Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning
Recent advances in unsupervised visual representation learning have highlighted the Joint-Embedding Predictive Architecture (JEPA) as an effective method for extracting visual features from unlabeled images using masking strategies. However, JEPA faces two key challenges: its reliance on an Exponential Moving Average (EMA) fails to prevent model collapse, and its predictions struggle to accurately capture the average representation of image patches. To address these issues, this work introduces C-JEPA, a new framework that combines JEPA with a variance-invariance-covariance regularization strategy known as VICReg. This approach improves stability, prevents collapse, and ensures better learning of consistent representations. Experiments show that C-JEPA achieves faster convergence and higher performance on standard benchmarks when pre-trained on ImageNet-1K.
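The VICReg regularizer that C-JEPA borrows is compact enough to state in code. Below is a minimal sketch of its three terms (invariance, variance, covariance) with the loss coefficients from the original VICReg paper; how C-JEPA wires these into the JEPA predictor and EMA target is simplified away here.

```python
import torch

def vicreg_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """Variance-invariance-covariance regularization on two embedding views."""
    n, d = z1.shape
    invariance = torch.nn.functional.mse_loss(z1, z2)      # match the two views
    def var_cov(z):
        z = z - z.mean(dim=0)
        std = torch.sqrt(z.var(dim=0) + eps)
        variance = torch.relu(1.0 - std).mean()            # keep every dim alive
        cov = (z.T @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        covariance = (off_diag ** 2).sum() / d             # decorrelate dims
        return variance, covariance
    v1, c1 = var_cov(z1); v2, c2 = var_cov(z2)
    return sim_w * invariance + var_w * (v1 + v2) / 2 + cov_w * (c1 + c2)

# Toy usage with random embeddings standing in for JEPA outputs.
z1, z2 = torch.randn(128, 64), torch.randn(128, 64)
print(vicreg_loss(z1, z2).item())
```

The variance term is what directly targets the collapse failure mode described above.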
CooHOI: Learning Cooperative Human-Object Interaction with Manipulated Object Dynamics
This work addresses the challenge of enabling humanoid robots to collaborate on tasks like moving large furniture, which require coordination between multiple robots. Existing methods struggle due to a lack of motion capture data for multi-humanoid collaboration and the inefficiency of training multiple agents together. To overcome this, the authors introduce Cooperative Human-Object Interaction (CooHOI), a framework that uses a two-phase learning approach: first, individual humanoids learn object interaction skills from human motion data, and then they learn to work together using multi-agent reinforcement learning. By focusing on shared object dynamics and decentralized execution, the robots achieve coordination through implicit communication. Unlike previous tracking-based methods, CooHOI is efficient, does not rely on multi-humanoid motion data, and scales easily to more participants and diverse object types.
DiffTORI: Differentiable Trajectory Optimization for Deep Reinforcement and Imitation Learning
This paper presents DiffTORI, a framework that uses differentiable trajectory optimization as a policy representation for reinforcement and imitation learning. Trajectory optimization, a standard tool in control, is parameterized by a cost and a dynamics function, and recent advances now allow gradients of the loss to be computed with respect to these parameters. This enables DiffTORI to learn cost and dynamics functions end-to-end, addressing the "objective mismatch" of previous model-based RL methods by aligning the dynamics model with task performance. Benchmarked on robotic manipulation tasks with high-dimensional sensory inputs, DiffTORI demonstrates superior performance over prior methods, including feedforward policies, energy-based models, and diffusion models, across a wide range of reinforcement and imitation learning tasks.
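The core trick, differentiating through an inner trajectory optimization, can be illustrated on a toy problem. The sketch below unrolls gradient descent on a learned quadratic cost so the imitation loss backpropagates into the cost parameters; the cost, the "expert", and the dimensions are illustrative stand-ins, not DiffTORI's actual models.

```python
import torch

torch.manual_seed(0)
d = 4
W = torch.randn(d, d, requires_grad=True)       # learnable cost parameters

def policy(state, inner_steps=25, inner_lr=0.2):
    """Action = argmin_a cost_W(state, a), computed by unrolled gradient
    descent so the outer loss can backpropagate into W."""
    a = torch.zeros(d, requires_grad=True)
    for _ in range(inner_steps):
        cost = ((a - W @ state) ** 2).sum()     # toy learnable quadratic cost
        (g,) = torch.autograd.grad(cost, a, create_graph=True)
        a = a - inner_lr * g                    # differentiable inner step
    return a

opt = torch.optim.Adam([W], lr=5e-2)
for _ in range(200):                            # imitation-learning outer loop
    state = torch.randn(d)
    expert_action = -state                      # toy "expert" to imitate
    loss = ((policy(state) - expert_action) ** 2).sum()
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final imitation loss: {loss.item():.4f}")
```

The same pattern, with a real cost and dynamics model and a proper trajectory optimizer in the inner loop, is what lets DiffTORI train its policy end-to-end.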
Don’t Look Twice: Faster Video Transformers with Run-Length Tokenization
Video transformers are notoriously slow to train due to the large number of input tokens, many of which are repeated across frames. Existing methods for removing redundant tokens often introduce significant overhead or require dataset-specific tuning, limiting their practicality. This work introduces Run-Length Tokenization (RLT), a simple and efficient method inspired by run-length encoding, which identifies and removes repeated patches in video frames before model inference. By replacing repeated patches with a single token and a positional encoding that reflects its duration, RLT reduces redundancy without requiring tuning or adding significant computational cost. It accelerates training by 30%, maintains baseline performance, and increases throughput by 35% with minimal accuracy loss, while reducing token counts by up to 80% on longer videos.
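The idea is close enough to classic run-length encoding that a few lines of code capture it. Below is a sketch that collapses temporally repeated patches into (patch, position, start, duration) runs; the tolerance threshold and data layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def run_length_tokenize(patches, tol=1e-3):
    """Collapse temporally repeated patches into (patch, pos, start, duration).
    `patches` has shape (T, N, D): T frames, N patches per frame, D dims."""
    T, N, D = patches.shape
    tokens = []
    for n in range(N):                  # each spatial position independently
        start = 0
        for t in range(1, T + 1):
            # Close the run when the patch changes (or the video ends).
            if t == T or np.abs(patches[t, n] - patches[start, n]).max() > tol:
                tokens.append((patches[start, n], n, start, t - start))
                start = t
    return tokens

# Toy video: 8 frames, 4 patches each; only patch 0 changes, halfway through.
video = np.zeros((8, 4, 16))
video[4:, 0] = 1.0
toks = run_length_tokenize(video)
print(len(toks), "tokens instead of", 8 * 4)    # 5 tokens instead of 32
```

Each surviving token's duration is what feeds the length-aware positional encoding described above.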
ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights
This work introduces In-Context Abstraction Learning (ICAL), a method that enables large-scale language and vision-language models (LLMs and VLMs) to generate high-quality task examples from imperfect demonstrations. ICAL uses a vision-language model to analyze and improve inefficient task trajectories by abstracting key elements like causal relationships, object states, and temporal goals, with iterative refinement through human feedback. These improved examples, when used as prompts, enhance decision-making and reduce reliance on human input over time, making the system more efficient. ICAL outperforms state-of-the-art models in tasks like instruction following, web navigation, and action forecasting, demonstrating its ability to improve performance without heavy manual prompt engineering.
Is Your LiDAR Placement Optimized for 3D Scene Understanding?
This work focuses on improving the reliability of driving perception systems under challenging and unexpected conditions, particularly with multi-LiDAR setups. Most existing datasets rely on single-LiDAR systems and are collected under ideal conditions, making them insufficient for real-world applications. To address this, the authors introduce Place3D, a comprehensive pipeline that optimizes LiDAR placement, generates data, and evaluates performance. Their approach includes three key contributions: a new metric called the Surrogate Metric of the Semantic Occupancy Grids (M-SOG) for assessing multi-LiDAR configurations, an optimization strategy to improve LiDAR placements based on M-SOG, and a 280,000-frame dataset capturing both clean and adverse conditions. Experiments show that their optimized placements lead to significant improvements in tasks like semantic segmentation and 3D object detection, even in challenging scenarios with harsh weather or sensor failures.
Learn To be Efficient: Build Structured Sparsity in Large Language Models
The paper explores how Large Language Models (LLMs), known for their impressive capabilities but high computational costs, can be made more efficient. It highlights that while activation sparsity, where only some model parameters are used during inference, occurs naturally, existing methods fail to maximize its potential during training. The authors propose a novel training algorithm, Learn-To-be-Efficient (LTE), that encourages LLMs to activate fewer neurons, striking a balance between efficiency and performance. Their approach, applicable to models beyond traditional ReLU-based ones, demonstrates improved results across various tasks and reduces inference latency by 25% for LLaMA2-7B at 50% sparsity.
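For intuition, here is what enforced activation sparsity looks like in a feed-forward block: keep only the top-k activations per token and skip the rest. This is a generic sketch of the inference-time effect, not the LTE training algorithm, which instead teaches the model to concentrate its activations so that this kind of skipping becomes cheap.

```python
import torch

class TopKSparseMLP(torch.nn.Module):
    """FFN block that keeps only the top-k neuron activations per token,
    a generic illustration of structured activation sparsity."""
    def __init__(self, d_model=64, d_ff=256, sparsity=0.5):
        super().__init__()
        self.up = torch.nn.Linear(d_model, d_ff)
        self.down = torch.nn.Linear(d_ff, d_model)
        self.k = int(d_ff * (1 - sparsity))     # neurons kept per token

    def forward(self, x):
        h = torch.nn.functional.gelu(self.up(x))
        # Zero all but the k largest-magnitude activations per token; at
        # inference, the skipped neurons' weights need never be touched.
        topk = h.abs().topk(self.k, dim=-1)
        mask = torch.zeros_like(h).scatter(-1, topk.indices, 1.0)
        return self.down(h * mask)

x = torch.randn(2, 10, 64)                      # (batch, tokens, d_model)
print(TopKSparseMLP()(x).shape)
```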
Learning Social Welfare Functions
This work explores whether it is possible to understand or replicate a policymaker's reasoning by analyzing their past decisions. The problem is framed as learning social welfare functions from the family of power mean functions. Two learning tasks are considered: one uses utility vectors of actions and their corresponding social welfare values, while the other uses pairwise comparisons of the welfares of different utility vectors. The authors demonstrate that power mean functions can be learned efficiently, even when the social welfare data is noisy. They also propose practical algorithms for these tasks and evaluate their effectiveness.
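The hypothesis class here is compact enough to write down directly. A power mean welfare function aggregates a utility vector as W_p(u) = ((1/n) Σ u_i^p)^(1/p), and the single parameter p interpolates between familiar welfare notions, as the sketch below shows.

```python
import numpy as np

def power_mean_welfare(u, p):
    """Power mean W_p(u) = (mean(u_i^p))^(1/p) over positive utilities u.
    p = 1 is utilitarian (average), p -> 0 is Nash welfare (geometric
    mean), and p -> -inf approaches egalitarian welfare (the minimum)."""
    u = np.asarray(u, dtype=float)
    if p == 0:                              # limit case: geometric mean
        return float(np.exp(np.mean(np.log(u))))
    return float(np.mean(u ** p) ** (1.0 / p))

u = [1.0, 4.0, 9.0]
for p in [1, 0, -5]:
    print(p, round(power_mean_welfare(u, p), 3))
```

Learning the welfare function then amounts to recovering p (and any weights) from observed values or pairwise comparisons.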
Metric Transforms and Low Rank Representations of Kernels
The authors introduce a linear-algebraic tool based on group representation theory to solve three important problems in machine learning. First, they study fast attention algorithms for large language models and prove that only low-degree polynomials can produce the low-rank matrices required for subquadratic attention, thereby showing that polynomial-based approximations are essential. Second, they extend the classification of positive definite kernels from Euclidean distances to Manhattan distances, offering a broader foundation for kernel methods. Finally, they classify all functions that transform Manhattan distances into Manhattan distances, generalizing previous work on Euclidean metrics and introducing new results about stable-rank-preserving functions with potential applications in algorithm design.
Sample-Efficient Private Learning of Mixtures of Gaussians
This work examines the problem of learning mixtures of Gaussians while guaranteeing approximate differential privacy. The authors demonstrate that it is possible to learn a mixture of k arbitrary d-dimensional Gaussians with significantly fewer samples than previous methods, achieving optimal performance when the dimensionality d is much larger than the number of components k. For univariate Gaussians, they establish the first optimal bound, showing that the sample complexity scales linearly with k, improving upon previous methods that required a quadratic dependence on k. Their approach leverages advanced techniques, including the inverse sensitivity mechanism, sample compression for distributions, and volume bounding methods, to achieve these results.
Sequoia: Scalable and Robust Speculative Decoding
As the use of large language models (LLMs) grows, serving them quickly and efficiently has become a critical challenge. Speculative decoding offers a promising solution, but existing methods struggle to scale with larger workloads or adapt to different settings. This paper introduces Sequoia, a scalable and robust algorithm for speculative decoding. Using a dynamic programming algorithm, Sequoia optimizes the tree structure of speculated tokens, improving scalability. It also introduces a novel sampling and verification method that is robust across decoding temperatures. Sequoia achieves significant speedups, improving decoding speed on Llama2-7B, Llama2-13B, and Vicuna-33B by up to 4.04x, 3.73x, and 2.27x, respectively, and reducing per-token latency for Llama3-70B-Instruct on a single GPU by 9.5x compared to DeepSpeed-Zero-Inference.
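As background, the sketch below shows plain chain speculative decoding with greedy verification: a cheap draft model proposes a few tokens and the target model keeps the longest agreeing prefix. Sequoia's contribution, an optimized tree of speculated tokens with temperature-robust sampling and verification, is not reproduced here, and the "models" are deterministic toys.

```python
import numpy as np

VOCAB = 16

def fake_logits(ctx, seed):
    """Deterministic stand-in for a language model's next-token logits."""
    h = hash((tuple(ctx), seed)) % (2 ** 32)
    return np.random.default_rng(h).normal(size=VOCAB)

draft = lambda ctx: fake_logits(ctx, seed=1)    # small, cheap model
target = lambda ctx: fake_logits(ctx, seed=2)   # large, accurate model

def speculative_step(ctx, k=4):
    """Draft proposes k greedy tokens; target keeps the longest agreeing
    prefix, fixes the first disagreement, or appends a bonus token.
    (In practice the target verifies all positions in one batched pass.)"""
    proposal, c = [], list(ctx)
    for _ in range(k):
        t = int(draft(c).argmax())
        proposal.append(t)
        c.append(t)
    accepted, c = [], list(ctx)
    for t in proposal:
        best = int(target(c).argmax())
        if best != t:                           # reject: correct and stop early
            accepted.append(best)
            return accepted
        accepted.append(t)
        c.append(t)
    accepted.append(int(target(c).argmax()))    # all accepted: free extra token
    return accepted

ctx = [0]
for _ in range(5):
    ctx += speculative_step(ctx)
print(ctx)
```

Sequoia replaces the single chain of proposals with a tree whose shape is chosen by dynamic programming to maximize expected accepted tokens per verification pass.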
Slight Corruption in Pre-training Data Makes Better Diffusion Models
Diffusion models have demonstrated impressive capabilities in generating high-quality images, audio, and videos, largely thanks to pre-training on large datasets that pair data with conditions, such as image-text or image-class pairs. However, even with careful filtering, these datasets often include corrupted pairs where the conditions do not accurately describe the data. This paper provides the first comprehensive study of how such corruption affects diffusion model training. By synthetically corrupting datasets like ImageNet-1K and CC3M, the authors show that slight corruption in pre-training data can, surprisingly, enhance image quality, diversity, and fidelity across various models. They also provide theoretical insights, demonstrating that slight condition corruption increases entropy and reduces the 2-Wasserstein distance to the ground-truth distribution. Building on these findings, the authors propose a method called condition embedding perturbations, which improves diffusion model performance during both pre-training and downstream tasks, offering new insights into the training process.
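The proposed fix is lightweight enough to sketch in a few lines: perturb the condition embeddings with small Gaussian noise during training. The noise scale and embedding setup below are illustrative placeholders rather than the paper's exact recipe.

```python
import torch

def perturb_condition(cond_emb, gamma=0.1, training=True):
    """Add small isotropic Gaussian noise to condition embeddings during
    training; `gamma` is an illustrative noise scale, not the paper's value."""
    if not training:
        return cond_emb
    return cond_emb + gamma * torch.randn_like(cond_emb)

# Toy usage: class-conditional embeddings for a batch of images.
class_emb = torch.nn.Embedding(1000, 512)       # e.g. ImageNet-1K classes
labels = torch.randint(0, 1000, (32,))
cond = perturb_condition(class_emb(labels))     # fed to the denoising network
print(cond.shape)
```

The perturbation mimics the entropy-increasing effect of slight condition corruption without altering the dataset itself.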
Unlocking Tokens as Data Points for Generalization Bounds on Larger Language Models
Large language models (LLMs) with billions of parameters are highly effective at predicting the next token in a sequence. While recent research has computed generalization bounds for these models using compression-based techniques, those bounds often fail to apply to billion-parameter models or rely on restrictive methods that produce low-quality text. Existing approaches also tie the tightness of the bounds to the number of independent documents in the training set, ignoring the much larger number of dependent tokens, which could offer better bounds. This work uses properties of martingales to derive generalization bounds that leverage the vast number of tokens in LLM training sets. Using more flexible compression techniques such as Monarch matrices, Kronecker factorizations, and post-training quantization, the authors achieve meaningful generalization bounds for large-scale models, including LLaMA2-70B, marking the first successful bounds for practical, high-quality text-generating models.