[QA] Beyond position: how rotary embeddings shape representations and memory in transformers — Arxiv Papers, 7.92K subscribers
Beyond position: how rotary embeddings shape representations and memory in autoregressive transformers
[QA] Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
[QA] Don't Transform the Code, Code the Transforms: Towards Precise Code Rewriting using LLMs
[QA] GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
[QA] Density estimation with LLMs: a geometric investigation of in-context learning trajectories
[QA] Logic-of-Thought: Injecting Logic into Contexts for Full Reasoning in Large Language Models
Rule Extrapolation in Language Models: A Study of Compositional Generalization on OOD Prompts
[QA] Rule Extrapolation in Language Models: A Study of Compositional Generalization on OOD Prompts
[QA] Re-Introducing LayerNorm: Geometric Meaning, Irreversibility and Comparative Study with RMSNorm
Re-Introducing LayerNorm: Geometric Meaning, Irreversibility and a Comparative Study with RMSNorm
[QA] PingPong: A Benchmark for Role-Playing LLMs with User Emulation and Multi-Model Evaluation
[QA] Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
[QA] SLM Meets LLM: Balancing Latency, Interpretability and Consistency in Hallucination Detection
SLM Meets LLM: Balancing Latency, Interpretability and Consistency in Hallucination Detection