FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning

Yuyang Ding †,‡

Soochow University

yyding23@stu.suda.edu.cn

Chi Zhang

ByteDance Seed

zhangchi.usc1992@bytedance.com

Juntao Li †,*

Soochow University

ljt@suda.edu.cn

Haibin Lin

ByteDance Seed

haibin.lin@bytedance.com

Xin Liu

ByteDance Seed

liuxin.ai@bytedance.com

Min Zhang

Soochow University

minzhang@suda.edu.cn

We have also implemented a more flexible and easy-to-use infra design for reward models, which has been added to the veRL repo (details in this doc) and will be merged as a key feature in a future release. Feel free to try it out!

Comparison between FAPO models and their baselines throughout RL training. FAPO enhances outcome correctness, improves process reliability, and boosts training efficiency and stability, all without increasing the token budget.

Let’s start with a recent OpenAI study on LLM hallucinations:

“Language models hallucinate because their training and evaluation processes favor confident guesses over the acknowledgment of uncertainty. Fundamentally, these hallucinations arise as errors in binary classification.”

OpenAI: Why Language Models Hallucinate

In the context of reinforcement learning, we refer to these “confident guesses” as “flawed positives”: rollouts that reach the correct answer through flawed guessing yet are treated as confident positive signals for policy optimization, thereby reinforcing unreliable reasoning patterns. In this work, we dive into this dilemma and propose Flawed-Aware Policy Optimization (FAPO), demonstrating the great potential of acknowledging and penalizing these uncertain or flawed patterns to ensure efficient and reliable reasoning.

Distribution and Impact of Flawed Positives in RL

Preliminary Results about the distribution and impact of flawed positives.

Key Observations:

Flawed Positives are Prevalent in Initial Checkpoints (Figure a): Flawed positives are prevalent across various LLMs (pre-trained, instruct, and think models), which establish the starting conditions for subsequent RL optimization, accounting for 20%–40% of correct rollouts.

Flawed Positives are Stepping Stones in Learning (Figure b): Flawed positives are most prevalent during the early learning stages but diminish significantly as per-sample confidence improves. This highlights their expected role as natural stepping stones in the learning trajectory, allowing the model to initially reach correct answers before gradually evolving the capability to produce fully correct solutions.

Flawed Positives Persist throughout RL Training (Figure c): As RL training progresses, the flawed-positive ratio remains almost constant at around 30%. This indicates that the optimization process struggles to shift from unreliable reasoning to genuine problem-solving.

Flawed Positives Exert Twofold Effects (Figure d): Assigning negative optimization signals to flawed positives yields substantial performance improvements, although the gains appear more gradually in the early training stages. These findings reveal that flawed positives exert a twofold effect: (1) flawed positives act as stepping stones, enabling the model to achieve rapid capability gains in the early stages, and (2) their improper reward assignment can trap optimization in unreliable reasoning.

Flawed-Aware Policy Optimization

Baseline RL setting

Based on GRPO, we also adopt several effective strategies, such as clip-higher and token-level loss, as our standard baseline RL setting.

$$
\begin{aligned}
\mathcal{J}(\theta) = \ &\mathbb{E}_{(q, a)\sim \mathcal{D},\,\{o_i\}_{i=1}^{G}\sim\pi_{\theta_\text{old}}(\cdot|q)} \\
&\frac{1}{\sum_{i=1}^{G}|o_i|}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}\left\{\min \left[\frac{\pi_\theta(o_t|q, o_{<t})}{\pi_{\theta_\text{old}}(o_t|q, o_{<t})}\hat{A}_{i,t},\ \text{clip}\!\left(\frac{\pi_\theta(o_t|q, o_{<t})}{\pi_{\theta_\text{old}}(o_t|q, o_{<t})},\, 1-\epsilon_{l},\, 1+\epsilon_{h}\right)\hat{A}_{i,t}\right]\right\}.
\end{aligned}
$$

where $(q, a)$ denotes a question-answer pair sampled from the data distribution $\mathcal{D}$, $\pi_{\theta_\text{old}}$ is the old policy, and $\epsilon_l$, $\epsilon_h$ control the clipping range in importance sampling for stability. The advantage $\hat{A}_{i,t}$ is estimated in a group-relative manner:

$$
\hat{A}_{i,t} = (r_i - \mu_\text{GRPO}) / \sigma_\text{GRPO}, \quad \text{where}\quad
\mu_\text{GRPO} = \text{mean}(\{R_{i}\}_{i=1}^G);\quad \sigma_\text{GRPO} = \text{std}(\{R_{i}\}_{i=1}^G);\quad
r_i = R_\text{rule}(o, a^*) = \begin{cases} 1, & \text{If}\ \ o = a^* \\ -1, & \text{Otherwise} \end{cases}
$$

where $o$ is the predicted final answer of the old policy and $a^*$ denotes the ground truth.
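
To make the baseline concrete, here is a minimal Python sketch of the rule-based reward and group-relative advantage defined above; the function names are ours, and the exact string match stands in for a full answer verifier.

```python
import numpy as np

def rule_reward(pred_answer: str, gt_answer: str) -> float:
    """R_rule: +1 if the predicted final answer matches the ground truth, else -1."""
    return 1.0 if pred_answer.strip() == gt_answer.strip() else -1.0

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantage: normalize each rollout's reward by the mean and
    std of its group (the G rollouts sampled for the same question)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 8 rollouts for one question, 5 correct and 3 incorrect.
rewards = [rule_reward(p, "42") for p in ["42", "42", "42", "42", "42", "7", "13", "0"]]
print(grpo_advantages(rewards))  # correct rollouts get positive advantages, incorrect ones negative
```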

FAPO-GenRM

We first train a generative reward model (GenRM) to detect flawed positives accurately and comprehensively, with the following RL reward reformulation:

$$
\begin{aligned}
R_\text{FAPO-GenRM} &= R_\text{Outcome} + R_\text{Process}, \\
\text{where}\quad R_\text{Outcome} &= \begin{cases} 1, & \text{If}\ \ \hat{y}_\theta = y^* \\ -1, & \text{Otherwise} \end{cases}, \qquad
R_\text{Process} = \begin{cases} -\dfrac{|\hat{t}_\theta - t^*|}{n}, & \text{If}\ \ \hat{y}_\theta = y^* = \text{FP} \\ 0, & \text{Otherwise} \end{cases}.
\end{aligned}
$$

Here, $\hat{t}_\theta$ and $t^*$ denote the predicted and ground-truth error indices, and $n$ is the total number of steps, ensuring $R_\text{Process}\in [-1, 0]$. The process penalty is distance-sensitive: predictions closer to the true error receive higher rewards, while those farther away incur stronger penalties. This design guides the model toward precise error localization and fosters genuine error-detection ability, rather than mere guessing.
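
As an illustration, a small Python sketch of this GenRM reward is given below; the function name, the "FP" label string, and the argument layout are our own assumptions rather than the released implementation.

```python
def genrm_reward(pred_label: str, gt_label: str,
                 pred_error_step: int = 0, gt_error_step: int = 0,
                 n_steps: int = 1) -> float:
    """R_Outcome (+/-1 for the GenRM verdict) plus a distance-sensitive R_Process
    penalty when both prediction and label agree the solution is a flawed positive."""
    r_outcome = 1.0 if pred_label == gt_label else -1.0
    r_process = 0.0
    if pred_label == gt_label == "FP":
        # Penalty in [-1, 0]: larger when the predicted first-error step is
        # farther from the annotated one.
        r_process = -abs(pred_error_step - gt_error_step) / n_steps
    return r_outcome + r_process

# Exact localization on a 10-step solution: 1.0; off by two steps: 0.8.
print(genrm_reward("FP", "FP", 4, 4, n_steps=10), genrm_reward("FP", "FP", 6, 4, n_steps=10))
```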

FAPO-Reasoning

With the GenRM detecting flawed positives, we then regulate their roles in the final RL optimization. We introduce a reward-penalization mechanism with a group-relative advantage estimation:

$$
\begin{aligned}
R_\text{FAPO}(o, a^*\,|\,\theta) &= R_\text{RLVR}(o, a^*) + R_\Delta(o, a^*\,|\,\theta), \\
\text{where}\quad R_\Delta(o, a^*\,|\,\theta) &= \begin{cases} -\lambda, & \text{If}\ \ \mathcal{I}(o, a^*)\ \text{and}\ \hat{y}_\theta(o, a^*)=\text{FP} \\ 0, & \text{Otherwise} \end{cases}, \\
\hat{A}_{i,t} &= \left[r_i - \text{mean}(\{R_{i}\}_{i=1}^G)\right] / \text{std}(\{R_{i}\}_{i=1}^G).
\end{aligned}
$$
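
The reward penalization above can be sketched in a few lines of Python. This is a simplified illustration under our own naming (the verifier flag stands for $\mathcal{I}(o, a^*)$ and the GenRM flag for $\hat{y}_\theta = \text{FP}$), not the released training code.

```python
import numpy as np

def fapo_reward(rlvr_reward: float, verified_correct: bool,
                genrm_flags_fp: bool, lam: float = 1.0) -> float:
    """Rule-verified reward plus R_Delta: subtract lambda when the rollout is
    verified correct but the GenRM flags it as a flawed positive."""
    penalty = -lam if (verified_correct and genrm_flags_fp) else 0.0
    return rlvr_reward + penalty

def group_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# One group of 6 rollouts: 3 clean positives, 1 flawed positive, 2 negatives (lambda = 1).
rollouts = [(1.0, True, False), (1.0, True, False), (1.0, True, False),
            (1.0, True, True), (-1.0, False, False), (-1.0, False, False)]
rewards = [fapo_reward(r, ok, fp) for r, ok, fp in rollouts]
# Here rho = alpha/beta = 2 > 2/lambda - 1 = 1, so the flawed positive's advantage
# comes out mildly negative while clean positives stay clearly positive.
print(group_advantages(rewards))
```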

Theoretical Analysis about FAPO Effectiveness

We now characterize the whole learning process and how FAPO leverages flawed positives. The advantage estimation of FAPO can be expressed in terms of that of GRPO:

$$
\begin{cases}
\mu_\text{FAPO} = \mu_\text{GRPO} - \lambda\gamma \\
\sigma_\text{FAPO}^2 = \sigma_\text{GRPO}^2 + \lambda \gamma(1-\gamma)\left(\lambda - \dfrac{4}{\alpha/\beta + 1}\right)
\end{cases}
$$

where $\alpha, \beta, \gamma$ denote the positive rate, negative rate, and flawed-positive rate, respectively, and we propose $\rho = \frac{\alpha}{\beta}$ to represent the current learning progress.

The role of flawed positives shifts as optimization progresses; specifically:

$$
\hat{A}_\text{Flawed} = \frac{1 - \lambda - \mu_\text{FAPO}}{\sigma_\text{FAPO}} < 0 \;\Leftrightarrow\; \lambda > \frac{2\beta}{\alpha + \beta} = \frac{2}{\alpha/\beta + 1} \;\Leftrightarrow\; \rho = \frac{\alpha}{\beta} > \frac{2}{\lambda} - 1
$$

The scaling factor $\sigma_\text{FAPO}^2$ changes relative to $\sigma_\text{GRPO}^2$ as follows:

$$
\begin{aligned}
\sigma_\text{FAPO}^2 - \sigma_\text{GRPO}^2 &= \lambda \gamma(1-\gamma)\left(\lambda - \frac{4}{\alpha/\beta + 1}\right) \\
\therefore\ \text{when } \frac{\alpha}{\beta} < \frac{4}{\lambda} - 1 \Rightarrow \sigma_\text{FAPO}^2 < \sigma_\text{GRPO}^2; &\qquad \text{when } \frac{\alpha}{\beta} > \frac{4}{\lambda} - 1 \Rightarrow \sigma_\text{FAPO}^2 > \sigma_\text{GRPO}^2
\end{aligned}
$$

So the optimization direction moves from warm-up to refinement once the current learning progress $\rho$ exceeds $\frac{2}{\lambda} - 1$. Furthermore, when $\rho > \frac{4}{\lambda} - 1$, the advantage estimation is downscaled to stabilize training. In terms of $\lambda$, we adopt a majority-guided strategy, which yields $\rho_\text{shift} = 1$ and thus determines $\lambda = 1$.
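
The two thresholds above are easy to read off numerically. The following sketch simply evaluates the closed-form conditions from this section (symbol names follow the text; nothing here is taken from the released code):

```python
def fapo_regime(alpha: float, beta: float, lam: float = 1.0) -> str:
    """Report which phase the closed-form analysis predicts for a given
    positive rate alpha, negative rate beta, and penalty lambda."""
    rho = alpha / beta                 # learning progress rho = alpha / beta
    shift = 2.0 / lam - 1.0            # rho beyond which A_Flawed turns negative
    scale = 4.0 / lam - 1.0            # rho beyond which sigma^2_FAPO exceeds sigma^2_GRPO
    phase = "refinement (flawed positives penalized)" if rho > shift else "warm-up (flawed positives tolerated)"
    scaling = "downscaled" if rho > scale else "not downscaled"
    return f"rho={rho:.2f}: {phase}; advantages {scaling} relative to GRPO"

print(fapo_regime(alpha=0.3, beta=0.7))  # early training: warm-up
print(fapo_regime(alpha=0.8, beta=0.2))  # late training (rho = 4 > 3): downscaled refinement
```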

Overall, FAPO provides a principled mechanism for guiding the optimization process, aligning with the ideal learning trajectory where the focus initially lies in producing correct solutions when model capability is limited, and naturally shifts toward refining reliability once correct rollouts surpass incorrect ones.

Experiments

FAPO-GenRM

Performance Comparison between FAPO-GenRM-4B and other SoTA Models.

FAPO-GenRM-4B achieves substantial improvements on both FlawedPositiveBench and ProcessBench, even outperforming the teacher model Qwen3-32B, further demonstrating the effectiveness of our approach.

We have also open-sourced the training dataset FAPO-Critic, training scripts, and the final checkpoint.

FAPO-Reasoning

Comparison between FAPO-Reasoning and the baseline.

We evaluate FAPO models on AIME24, AIME25 (Math Domain), and GPQA-Diamond (General Domain), demonstrating the great potential of FAPO in:

(1) Outcome Correctness: FAPO consistently maintains a clear advantage of accuracy over the baselines.

(2) Process Reliability: FAPO responses exhibit a substantially lower flawed-positive ratio.

(3) Training Stability: By mitigating the impact of flawed positives, training stability is enhanced.

We have also open-sourced the training scripts and the final checkpoint.

Infrastructure design

Introducing generative reward models (GenRMs) can have a considerable impact on the whole RL process, influencing both algorithmic effectiveness and infrastructure efficiency. Below, we outline the roadmap (partially implemented); a minimal routing sketch follows the list:

[1/N]: Standalone GenRM separated from Rollout.

[2/N]: GenRM router to distribute requests.

[3/N]: Mixture of colocated and standalone modes.

[4/N]: Fully Async RL Pipeline Construction.
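
As a rough illustration of items [1/N] and [2/N], the sketch below spreads GenRM reward requests across standalone inference endpoints with a simple round-robin router. Endpoint URLs, function names, and the threading setup are all hypothetical; the actual veRL integration is described in the linked doc and may differ.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

# Hypothetical standalone GenRM inference servers, deployed separately from rollout workers.
GENRM_ENDPOINTS = ["http://genrm-0:8000", "http://genrm-1:8000"]
_round_robin = itertools.cycle(GENRM_ENDPOINTS)

def score_rollout(endpoint: str, question: str, rollout: str) -> float:
    """Placeholder: send (question, rollout) to one GenRM server and parse its
    verdict into a scalar reward. A real client (HTTP/RPC) goes here."""
    raise NotImplementedError

def score_batch(samples: list[tuple[str, str]], max_workers: int = 8) -> list[float]:
    """Router: distribute reward requests across GenRM replicas so that reward
    computation runs concurrently instead of blocking on a single server."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(score_rollout, next(_round_robin), q, o) for q, o in samples]
        return [f.result() for f in futures]
```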

Infra Design

BibTeX citation

    
@article{ding2025fapo,
  title={FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning},
  author={Ding, Yuyang and Zhang, Chi and Li, Juntao and Lin, Haibin and Liu, Xin and Zhang, Min},
  journal={arXiv preprint arXiv:2510.22543},
  year={2025}
}