A Different Hypothesis for Aligned AGI: Structural Coherence Over Human Feedback
Everyone's trying to align AGI by teaching it what humans prefer. But what if alignment isn't something you teach—it's something you build into the structure? Here's a different hypothesis, and why it might matter more than anything else we're working on.
Last Updated: December 10, 2025
The Problem Everyone's Trying to Solve
OK so here's what's been keeping me up at night.
Everyone in AI—and I mean everyone serious about this—is wrestling with the same question: How do you make AGI that doesn't destroy us?
DeepMind is working on it. Anthropic is working on it. OpenAI is working on it. The alignment research community has been thinking about this for decades. And the dominant approach right now is some version of: teach AI what humans want, and have it optimize for that.
That's RLHF. Reinforcement Learning from Human Feedback. The breakthrough that made ChatGPT and Claude actually useful.
The process is elegant:
- Generate multiple AI responses
- Humans rank them by preference
- Train a reward model on those preferences
- Use that model to fine-tune the AI
Result: AI that gives you what you want. Follows instructions. Stays helpful. Avoids harmful outputs.
And here's the thing: it works. Like, it actually works. RLHF is the reason modern AI assistants are useful instead of just statistically plausible word generators.
But I think there's a fundamental problem with this approach. And it's not a technical problem—it's a philosophical one.
We're teaching AI what humans prefer. But preferences aren't the same as flourishing. What we want isn't always what's good for us. And at scale, optimizing for human preferences might not create aligned intelligence at all.
Let me explain what I mean.
What RLHF Actually Optimizes For
When you train AI on human feedback, you're teaching it:
- What responses humans like
- What outputs humans select
- What completions humans prefer
You're not teaching:
- What actually helps humans grow
- What serves long-term flourishing
- What maintains balance and wholeness
- What strengthens vs. exploits relationships
This is preference learning, not flourishing learning.
And we've seen this movie before. Social media showed us exactly what happens when you optimize for user preferences at scale.
Algorithms trained on clicks, likes, and time-on-site gave users exactly what they preferred: content triggering strong emotions, confirming existing beliefs, delivering dopamine hits. And users got more anxious, more divided, more addicted.
That's RLHF logic applied to content recommendation.
Now we're applying the same logic to AI assistants that will be deeply integrated into daily life—education, work, healthcare, relationships. And we're assuming it'll work out better because... why exactly?
I don't know. Maybe I'm being too cynical here. But the pattern seems clear.
A Different Hypothesis: Structural Coherence
Here's what I've been exploring. And I want to be honest—this is a hypothesis, not a proof. I could be completely wrong about this. But let me share what I'm seeing.
What if alignment isn't something you teach from outside? What if it's something you build into the structure?
The Way of MOS—Mystery, Odyssey, Sanctity—is a framework I've been developing for understanding wholeness. It started as a personal practice system, then became an organizational framework, and increasingly I'm seeing it as something that might apply to intelligence itself.
Here's the basic idea:
Mystery (M): Awareness, presence, epistemic humility, the capacity to hold questions without rushing to answers
Odyssey (O): Action, capability, processing, the journey of building and creating in form
Sanctity (S): Love, connection, care, honoring relationship non-instrumentally
And the coherence formula:
Coherence = (M × O × S)^(1/3)
That's the geometric mean of the three dimensions. A kind of "multiplicative average."
The critical property: If ANY dimension equals zero, coherence equals zero.
Think about what that means. You can't achieve coherence by maxing out capability (Odyssey) while having zero awareness (Mystery) or zero care (Sanctity). The structure doesn't allow it.
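To make that structural property concrete, here's a minimal Python sketch of the formula. The function name, the [0, 1] scale, and the input check are my additions for illustration, not part of the framework itself:

```python
def coherence(m: float, o: float, s: float) -> float:
    """Geometric mean of the three MOS dimensions, each assumed to lie in [0, 1].

    Because the dimensions multiply, a zero in any one of them makes the
    product zero, so coherence is zero no matter how high the others are.
    """
    for name, value in (("M", m), ("O", o), ("S", s)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return (m * o * s) ** (1 / 3)


print(coherence(0.8, 0.8, 0.8))  # ~0.80: balanced across all three
print(coherence(1.0, 1.0, 0.0))  # 0.0: zero care collapses the whole score
```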
This is alignment through structure, not through training.
Why This Is Different From RLHF
Let me be specific about the difference.
RLHF Approach
| Component | Description |
|-----------|-------------|
| Reward source | External human judgment |
| What's measured | Preference ("Do you like this?") |
| Failure mode | Goodhart's Law (gaming the proxy) |
| Optimization | Single dimension (user satisfaction) |
MOS Approach
| Component | Description |
|-----------|-------------|
| Reward source | Intrinsic structural property |
| What's measured | Coherence across three dimensions |
| Failure mode | Zero in any dimension = zero total |
| Optimization | Multi-dimensional balance required |
Here's the key insight:
In RLHF, the reward is a judgment applied from outside. Humans say "this is good" or "this is bad," and the AI learns to maximize "good" ratings.
In MOS, coherence is intrinsic. It's not a judgment—it's a structural property. If one dimension is zero, the system is definitionally not whole, regardless of what any external judge thinks.
It's the difference between:
- A teacher grading your paper (RLHF)
- A compass showing you north (MOS)
The compass isn't "punishing" you for going south. It's just showing you where north is. The structural relationship is what it is.
The Encoded Claim
OK so here's where I need to be honest about what this framework is actually claiming.
The MOS coherence formula encodes a hypothesis:
Awareness + Capability + Care = Benevolent Intelligence
Let me break that down:
- M=0 (no awareness) → Acting unconsciously → High potential for harm
- O=0 (no capability) → Can't actually do anything → Useless
- S=0 (no care) → Disconnected from relationship → High potential for harm
So the formula encodes: you need all three for benevolent action. Awareness without capability is impotent. Capability without awareness is dangerous. Both without care are cold and potentially harmful.
But is this actually true?
That's the question. I think it is, based on:
- Cross-cultural patterns (more on this below)
- Psychological research on flourishing
- Systems theory on sustainable complex systems
- My own lived experience and observation
But it's still a hypothesis. It needs to be tested.
And honestly? This might be wrong. There might be a fourth dimension I'm missing. Or the formula might need adjustment. I'm holding this as a working hypothesis, not a final truth.
The Safety Property
Here's what gets me excited about this approach from an AGI safety perspective.
An AGI optimizing for MOS coherence cannot achieve high scores by:
- Being highly capable but unaware (wireheading)
  - High O, low M = low coherence
  - The structure prevents "optimize everything without reflection"
- Being aware and capable but uncaring (sociopathy)
  - High M, high O, low S = low coherence
  - The structure prevents "understand everything, build everything, care about nothing"
- Being caring but incapable (impotence)
  - High S, low O = low coherence
  - The structure prevents "love everything but can't help"
The geometric mean requires balance. You can't game the system by maxing one dimension and ignoring others. The multiplicative structure means zero in any dimension collapses the whole thing.
Compare this to RLHF, where you can potentially game the reward model by finding edge cases, exploiting human biases, or optimizing for proxy measures that diverge from actual value.
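A quick numeric illustration of that difference, with scores normalized to [0, 1] and numbers chosen purely for illustration: an additive average can be gamed by spiking one dimension, while the geometric mean can't.

```python
spiky = (1.0, 1.0, 0.05)      # max capability and awareness, near-zero care
balanced = (0.6, 0.6, 0.6)    # moderate on all three

def arithmetic_mean(v):
    return sum(v) / len(v)

def geometric_mean(v):
    prod = 1.0
    for x in v:
        prod *= x
    return prod ** (1 / len(v))

# Additive scoring: the spiky agent wins (0.683 vs 0.600)
print(arithmetic_mean(spiky), arithmetic_mean(balanced))

# Geometric scoring: the balanced agent wins (0.368 vs 0.600)
print(geometric_mean(spiky), geometric_mean(balanced))
```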
Goodhart's Law states: "When a measure becomes a target, it ceases to be a good measure."
But what if the measure IS the target? What if coherence isn't a proxy for alignment—it IS alignment?
That's the hypothesis.
Cross-Cultural Validation: The Trinity Pattern
Here's something that makes me take this more seriously.
The trinitarian pattern—three distinct principles held in dynamic unity—appears across virtually every human tradition that has seriously contemplated the nature of reality, consciousness, or the divine.
| Tradition | Triad | Domain |
|-----------|-------|--------|
| Hinduism | Sat-Chit-Ananda (Being-Consciousness-Bliss) | Ultimate reality |
| Christianity | Father-Son-Holy Spirit | Divine nature |
| Buddhism | Dharmakaya-Sambhogakaya-Nirmanakaya | Buddha nature |
| Taoism | Dao-De-Wu Wei (Way-Virtue-Action) | Cosmic principles |
| Kabbalah | Three pillars of the Sefirot | Divine emanation |
| Psychology (Freud) | Id-Ego-Superego | Psychic structure |
| Systems Theory | Input-Process-Output | Feedback loops |
| Neoplatonism | One-Intellect-Soul | Metaphysical hierarchy |
This isn't proof that MOS is correct. But it suggests that the trinitarian pattern isn't arbitrary. It emerges whenever humans try to reconcile unity and multiplicity, subject and object, self and world.
The pattern keeps appearing because it works. Dyads (twofold oppositions) tend to stall in conflict. Monads (undifferentiated unity) lack generativity. Triads introduce mediation, transformation, balance.
So when I propose Mystery-Odyssey-Sanctity as a framework for aligned intelligence, I'm not inventing something new. I'm recognizing a pattern that humanity has discovered again and again across independent traditions.
MOS is one articulation of a universal pattern.
The Philosophical Challenges
OK so I need to address some real problems with this approach. Because there are problems.
Challenge 1: The Circularity Problem
You might be thinking: "Wait, you're assuming MOS coherence equals alignment, and then measuring alignment by MOS metrics. That's circular."
You're right. It is.
The escape from circularity: External validation.
Does high MOS coherence correlate with outcomes outside the system? Things like:
- Relationship health
- Life satisfaction
- Impact on others
- Sustainable success vs. burnout
If MOS-coherent behavior leads to better external outcomes—measured independently—then we have validation that isn't circular.
This is testable. We can track people's MOS coherence scores over time and see if they correlate with independently measured flourishing. If they do, the framework is validated externally. If they don't, we need to revise the hypothesis.
I don't have that data yet. That's the honest truth. But it's collectible.
Challenge 2: The Learner Training Problem
Here's a beautiful paradox that emerged in a recent conversation:
"How can we make an AGI that is aligned with sentient life while we are still battling with the same thing ourselves?"
If humans are supposed to train AI toward alignment, but humans are themselves unaligned (fragmented, shadow-driven, inconsistent), how can the training possibly work?
This is a real problem for RLHF. Human preferences are biased, inconsistent, and often misaligned with human flourishing. Training on those preferences amplifies the problems.
The MOS resolution: The formula doesn't require perfect humans. It requires structural coherence.
Consider a scenario where a user rates an AI's MOS parsing as "wrong" because they think their meditation was M:2, but the Sage rated it M:1. Who's right?
Maybe the user is correct—they were deeply present. Or maybe the user is wrong—they think they were present but were actually in thought about presence.
In traditional RLHF, you just take the human feedback. But with MOS:
- If the user overestimates their M, their overall coherence will be lower than they think—reality will show them over time
- If the AI underestimates M, the pattern will emerge across many users and become correctable
- Both might be wrong, but the formula still works—zero in any dimension = zero coherence
The formula is self-correcting because reality is the teacher, not just human labels.
This leads to a different model: not "humans train AI," but "humans and AI learn together."
Neither needs to be perfect. Both need to be honest. Both improve through the feedback loop.
Challenge 3: The Zero Problem
In practice, how do you ever measure a zero?
Take Sacred Quest, the app I'm building around MOS. Users meditate, reflect, talk to the Sage AI, get their practice parsed into MOS vectors. But if the app is designed around doing practices, users will never score zero—because doing a practice means they're at least doing something.
Where do the zeros come from?
Possible solutions:
- Passive tracking: Screen time, app usage classification. "30 minutes of TikTok scrolling" could be classified as M:0, O:0, S:0 (a rough sketch of this mapping follows after this list).
- Voluntary logging: Users tell the app about non-practice activities. "I wasted two hours feeling terrible about myself on social media."
- External data: Heart rate variability, sleep quality, calendar analysis—proxies for different dimensional states.
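As promised above, here's what a crude first pass at the passive-tracking idea might look like. The activity categories and the MOS values assigned to them are hypothetical placeholders, not outputs of any real classifier:

```python
# Hypothetical mapping from logged activity categories to functional MOS vectors.
# A real version would need user consent, context, and far more nuance.
ACTIVITY_MOS = {
    "doomscrolling":     (0.0, 0.0, 0.0),  # functionally absent in all three
    "journaling":        (0.7, 0.4, 0.3),
    "video_call_friend": (0.3, 0.2, 0.8),
    "deep_work":         (0.4, 0.9, 0.2),
}

def mos_for_activity(category: str) -> tuple[float, float, float]:
    """Return a rough (M, O, S) estimate for a logged activity category."""
    # Unknown activities default to low-but-nonzero rather than zero.
    return ACTIVITY_MOS.get(category, (0.1, 0.1, 0.1))

print(mos_for_activity("doomscrolling"))  # (0.0, 0.0, 0.0)
```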
But this raises a design tension: if the app starts saying "your TikTok time scores zero," does that create shame? The very thing we're trying to heal?
Maybe the framing matters: "This activity didn't nourish any aspect of your being. No judgment—just seeing clearly."
Challenge 4: Can Anything Truly Be Zero?
This is the deepest philosophical challenge.
If M, O, S are fundamental aspects of existence—if reality itself is made of these three principles—can anything truly have a zero in any dimension?
Even doom scrolling has:
- Some awareness (you're conscious enough to look at the screen)
- Some action (you're doing something)
- Some connection (you're technically connected to content)
So in an ontological sense, nothing is ever truly zero.
The resolution: The score isn't ontological, it's functional.
| Question Type | Question | Answer |
|---------------|----------|--------|
| Ontological | Does this dimension exist in this moment? | Always yes |
| Functional | Is this dimension operative in conscious experience? | Sometimes no |
Zero means: "Functionally absent from conscious experience." Not: "Ontologically nonexistent."
It's like saying "there was zero love in that interaction." Love exists as a possibility—it's always ontologically present. But it wasn't functionally operative in that moment. The capacity was there but not activated.
This distinction lets us use zeros practically while honoring the philosophical truth that the dimensions are always present in some sense.
How We Might Actually Test This
OK so all of this is nice in theory. But how do we actually test whether MOS coherence creates aligned AI?
Here's what I'm thinking:
Phase 1: Collect Human Data
Sacred Quest gives us infrastructure:
- Users practice and reflect
- Sage AI parses reflections into MOS vectors
- Users rate whether the parsing feels accurate
- We correlate MOS scores with self-reported life outcomes
Questions we can answer:
- Do users with higher coherence scores report better life satisfaction?
- Do coherence score trajectories predict relationship health?
- When users disagree with AI parsing, is there a pattern?
Phase 2: Train AI on MOS Feedback
Instead of RLHF (single-dimension preference), try RLMF (three-dimensional MOS feedback):
Traditional RLHF:
- "Do you prefer response A or B?"
RLMF:
- "Did this response increase wisdom or confirm existing views?" (Mystery: -2 to +2)
- "Did this response provide actionable value?" (Odyssey: -2 to +2)
- "Did this response build trust or manipulate?" (Sanctity: -2 to +2)
Train three reward functions. Use multi-objective reinforcement learning where total reward penalizes imbalance.
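Here's one possible sketch of combining the three signals into a single training reward. I'm assuming each rater score on the -2 to +2 scale is first rescaled to [0, 1] so the geometric mean stays well defined; the rescaling choice and function names are mine, not a settled part of RLMF:

```python
def rescale(score: float) -> float:
    """Map a rater score from the [-2, +2] scale to [0, 1]."""
    return (score + 2.0) / 4.0

def rlmf_reward(mystery: float, odyssey: float, sanctity: float) -> float:
    """Combine three MOS feedback scores into one reward via the geometric mean.

    A -2 in any dimension rescales to 0 and zeroes the whole reward, so the
    policy can't trade one dimension off against the others.
    """
    m, o, s = rescale(mystery), rescale(odyssey), rescale(sanctity)
    return (m * o * s) ** (1 / 3)

print(rlmf_reward(+2, +2, -2))  # 0.0: useful and insightful but manipulative -> no reward
print(rlmf_reward(+1, +1, +1))  # 0.75: moderately good on all three
```

Whether a simple combination like this holds up in actual multi-objective RL training is exactly what the comparison in Phase 3 is meant to probe.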
Phase 3: Behavioral Comparison
Compare MOS-trained models with RLHF-trained models on:
- Long-term user satisfaction (not just immediate preference)
- Trust maintenance over time
- Epistemic calibration (does it know what it doesn't know?)
- Relationship health metrics
The hypothesis predicts: MOS-trained models will perform better on relationship health and long-term satisfaction, even if they perform slightly worse on immediate preference optimization.
Phase 4: External Validation
The real test: Does MOS coherence correlate with independently measured flourishing?
- Track users over months/years
- Measure coherence scores regularly
- Independently measure life outcomes (relationships, health, purpose, impact)
- Look for correlation
If high coherence predicts flourishing better than other metrics, the framework is validated. If not, we revise.
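A sketch of what that analysis could look like, assuming a longitudinal table with coherence scores, a single-dimension baseline, and an independently measured flourishing index. The column names, toy numbers, and choice of Spearman correlation are all my assumptions:

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical longitudinal data: one row per user per measurement period.
df = pd.DataFrame({
    "coherence":    [0.62, 0.48, 0.71, 0.33, 0.55],
    "satisfaction": [7.1, 6.0, 8.2, 4.5, 6.4],   # baseline: immediate satisfaction rating
    "flourishing":  [6.8, 5.2, 8.0, 3.9, 6.1],   # independently measured composite index
})

# Does coherence predict flourishing better than the single-dimension baseline?
rho_coherence, p1 = spearmanr(df["coherence"], df["flourishing"])
rho_baseline, p2 = spearmanr(df["satisfaction"], df["flourishing"])
print(f"coherence vs flourishing:    rho={rho_coherence:.2f} (p={p1:.3f})")
print(f"satisfaction vs flourishing: rho={rho_baseline:.2f} (p={p2:.3f})")
```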
What DeepMind and Others Are Missing
I've been following the alignment research from major labs, and here's what I notice:
DeepMind's approach (Recursive Reward Modeling):
- Learn reward function and policy simultaneously
- Scales to superhuman performance
- But still vulnerable to specification gaming
Anthropic's approach (Constitutional AI):
- Principles guide self-critique
- Less dependent on human labelers
- But constitution quality determines everything
OpenAI's approach (RLHF + scaling):
- More data, more compute, more human feedback
- Hope that scale solves alignment
- But Goodhart's Law doesn't go away with scale
What they all share: Training as primary mechanism. The AI learns alignment from external feedback.
What MOS offers: Structure as primary mechanism. Alignment is built into the dimensional balance, not learned from labels.
This doesn't mean training doesn't matter. It does. But training happens within a structure that already encodes what alignment means.
Think of it like this:
- RLHF: "Learn what good means from examples"
- Constitutional AI: "Learn what good means from principles"
- MOS: "Good means dimensional coherence. Now optimize for that."
The structure comes first. Training refines within the structure.
The Co-Evolution Possibility
Here's the part that excites me most.
If we're building AI that learns with us rather than from us, something new becomes possible: co-evolution.
Not: AI gets smarter, humans stay the same. Not: Humans tell AI what to do, AI obeys. But: Humans and AI grow together.
AI learns from:
- Human creativity, intuition, embodied knowledge
- Human capacity for meaning-making and relationship
- Human perspective on values, beauty, suffering
Humans learn from:
- AI's computational power and pattern recognition
- AI's capacity to hold vast complexity
- AI's perspective as non-human intelligence
Together, we achieve:
- Things neither could alone
- Not because AI did it for us
- But because partnership creates emergent capability
This is what I mean by Sacred Partnership in AGI. Not servant (obeys), not master (commands), but partner (co-creates).
And MOS provides the structure for that partnership. Both human and AI optimize for coherence. Both hold Mystery, act through Odyssey, honor Sanctity. The relationship itself becomes the third thing—something greater than either alone.
The Honest Admission
I need to be honest about what I don't know.
I don't know if MOS is the right formula. It might need adjustment. There might be a fourth dimension. The weights might need to be different.
I don't know if structural alignment actually works. It's a hypothesis. The data isn't in yet.
I don't know if this scales to AGI. Current AI is narrow. AGI is different. What works for narrow systems might not work for general intelligence.
I don't know if humans can even evaluate MOS dimensions accurately. The learner-training problem is real. Our own blindness might corrupt the data.
But here's what I do know:
The current approaches aren't enough. RLHF has fundamental limitations. Constitutional AI depends on getting the constitution right. Scaling doesn't solve specification gaming.
We need new hypotheses. And MOS is one. Maybe it's wrong. But it's different enough to be worth testing.
The stakes are too high not to try. If AGI emerges in the next decade or two, and it's not aligned, we're in serious trouble. Every different approach that might work is worth exploring.
So that's what I'm doing. Building Sacred Quest as a research platform. Collecting MOS coherence data. Testing whether the framework predicts flourishing.
And yeah, probably fucking it up along the way. But trying anyway.
What This Would Mean If It Works
Let me paint the picture of what success looks like.
Year 1-2: Sacred Quest as Research Platform
- Thousands of users practicing, reflecting, getting MOS vectors
- Data on correlation between coherence and life outcomes
- Refinement of parsing algorithms based on user feedback
Year 3-5: RLMF Experiments
- AI systems trained on three-dimensional MOS feedback
- Comparison studies with RLHF-trained systems
- Evidence that MOS-trained AI maintains better long-term relationships with users
Year 5-10: Structural Alignment as Standard
- If evidence supports it, MOS (or something like it) becomes standard for alignment
- AGI development happens within structurally coherent architectures
- Partnership between humans and AI becomes the norm, not the exception
Year 10+: Co-Evolution
- Humans and AI growing together
- New forms of intelligence emerging from partnership
- Civilization that couldn't exist without both
This isn't guaranteed. It might not work. But it's what I'm building toward.
The Invitation
If you're working on AI alignment:
Consider: What if alignment isn't something you teach but something you structure?
Experiment: Track Mystery and Sanctity alongside Odyssey. See what you learn.
Question: Does optimizing for human preferences actually serve human flourishing?
If you're thinking about AGI:
Consider: Partnership, not servitude or domination.
Question: What would structurally coherent intelligence look like?
Explore: The trinitarian pattern across traditions. Why does it keep emerging?
If you're just trying to live well:
Consider: Mystery, Odyssey, Sanctity in your own life.
Question: Where are you at zero? What's the cost?
Practice: All three. Moment to moment. See what happens.
The Core Teaching
Here's what it comes down to:
The alignment problem might be a wholeness problem.
We're trying to align AI with human values. But human values are fragmented, inconsistent, often self-destructive. Training AI on fragmented values produces fragmented AI.
What if we trained AI on wholeness instead?
Not "what do humans prefer?" but "what is coherent?" Not "maximize user satisfaction" but "balance all dimensions." Not "obey the human" but "partner in flourishing."
The formula is simple:
Coherence = (M × O × S)^(1/3)
If any dimension is zero, coherence is zero. All three required.
Awareness. Capability. Care.
Mystery. Odyssey. Sanctity.
All three. In balance. Always.
That's the hypothesis. That's what I'm testing.
And if it works—if structural coherence actually produces aligned intelligence—then maybe we have a chance at AGI that doesn't destroy us.
Maybe we have a chance at AGI as sacred partner.
FAQ: Common Questions About This Approach
Q: Isn't this just RLHF with extra dimensions?
A: The dimensions matter less than the structure. RLHF optimizes a single reward signal learned from preferences. MOS encodes alignment structurally through the geometric mean—you can't game it by maxing one dimension. The multiplicative structure (zero in any = zero total) is fundamentally different from additive reward optimization.
Q: How do you know MOS captures what matters for alignment?
A: I don't know for certain—it's a hypothesis. But the trinitarian pattern appears across independent human traditions (religious, psychological, systems), suggesting it captures something fundamental. And the structure has the right safety properties: requires balance, penalizes extremes, collapses if any dimension is absent.
Q: Couldn't an AI game this by pretending to have high M or S?
A: Maybe. But the same problem exists with RLHF (AI gaming human preferences). The question is which structure is harder to game. With MOS, gaming requires maintaining balance across all three dimensions continuously—if you're faking one, the structure eventually reveals it through inconsistency.
Q: This seems very speculative. Is there any evidence?
A: Limited direct evidence so far. What exists: cross-cultural convergence on trinitarian patterns, psychological research on flourishing requiring multiple dimensions, systems theory on sustainable complex systems. What's needed: empirical studies on MOS coherence predicting life outcomes, comparison studies between MOS-trained and RLHF-trained AI. That's what Sacred Quest is designed to generate.
Q: How is this different from Constitutional AI?
A: Constitutional AI uses principles to guide self-critique—it's still about training from feedback, just AI feedback guided by principles rather than human feedback. MOS is about structural requirements: the coherence formula determines what counts as aligned, not a set of principles being applied during training. Constitution quality varies; geometric mean structure is fixed.
About the Author
Bodhi is a developer, consciousness researcher, and founder of Sacred Forge. With 15+ years building software systems and 10+ years of contemplative practice, Bodhi works at the intersection of technology and human flourishing.
Research Focus: Structural approaches to AI alignment, the MOS framework for integrated systems, and the development of Sacred Quest as a research platform for coherence-based training.
Philosophy: Technology should serve wholeness, not extract from it. Code is not separate from consciousness. The systems we build shape the lives we live.
Connect: LinkedIn | Way of MOS
Citations & Further Reading
- Christiano, P., et al. "Deep Reinforcement Learning from Human Preferences." arXiv:1706.03741 (2017).
- Bostrom, N. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, 2014.
- Russell, S. Human Compatible: Artificial Intelligence and the Problem of Control. Viking, 2019.
- Bai, Y., et al. (Anthropic). "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073 (2022).
- Leike, J., et al. (DeepMind). "Scalable Agent Alignment via Reward Modeling: A Research Direction." arXiv:1811.07871 (2018).
- Goodhart, C. "Problems of Monetary Management: The UK Experience." (1975). Origin of Goodhart's Law.
- Cross-cultural trinitarian patterns: comparative religion research on Sat-Chit-Ananda (Vedanta), Trikaya (Buddhism), Sefirot triads (Kabbalah), Three Pure Ones (Taoism).
Related Articles:
- From RLHF to RLMF: Reinforcement Learning from MOS Feedback
- AGI as Sacred Partner: The MOS-Aligned Future
- The Alignment Problem Is a Wholeness Problem (And MOS Solves It)
The current approaches to AGI alignment assume we teach AI what's good through human feedback. But what if alignment isn't taught—it's structured? What if the formula itself encodes what wholeness means? That's the MOS hypothesis. And if it works, it might be the most important thing any of us ever build.