AI video generation crossed a line recently that I don’t think got enough attention.
Not the capability itself — people have been generating short AI clips for a while now. The thing that changed is sound. Real, synchronized, contextually generated audio. Speech, ambient noise, music if the scene calls for it. Not dubbed over after the fact, not a separate TTS pass. Baked into the generation model.
And one platform went fully uncensored with it before anyone else did.
That’s the part worth talking about.
what changed technically
For most of 2024 and into 2025, AI video generation was a visual-only medium. Runway, Pika, Kling, Sora — all of them produced silent clips. Impressive in their own right, genuinely useful for a lot of creative applications. But silent. You'd generate a video and then manually add audio in post, if you bothered at all.
The new generation of models (we’re talking late 2025 into this year) started shipping with audio as a native output. Not as an afterthought feature, not as a bolt-on API call to a separate TTS service — the audio comes out of the same model that generates the video. The model learned to correlate what’s visually happening with what should be heard, at the frame level.
That's a meaningful difference in terms of coherence. Separately generated audio always sounds like separately generated audio. The timing is slightly off, the tone doesn't match the visual energy, you can feel the seam. Unified generation doesn't have that problem because both streams share the same latent understanding of the scene.
The practical result is short videos (5-10 seconds is the current sweet spot for quality) where speech sounds like the character, ambient sounds fit the environment, and if there’s music it’s actually appropriate to the mood. It’s not perfect — nothing in this space is — but it’s qualitatively better than the patched-together alternative.
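To make the "shared latent" idea concrete, here's a deliberately toy sketch of the pattern: one latent sequence feeds both a video decoder and an audio decoder, so the two streams are aligned by construction rather than synced after the fact. Every name here is hypothetical, and real models (LTX included) are vastly more complex — this only illustrates the architectural idea.

```python
# Toy sketch of unified audio-video generation (hypothetical names, not any
# real model's API): both decoders read the SAME latent sequence, so the
# streams can't drift out of sync.
import random

def encode_prompt(prompt: str, num_steps: int = 8) -> list[float]:
    """Stand-in encoder: one shared latent value per timestep."""
    rng = random.Random(prompt)  # deterministic per prompt
    return [rng.random() for _ in range(num_steps)]

def decode_video(latent: list[float]) -> list[str]:
    """Map each latent step to a frame description."""
    return [f"frame:{'bright' if z > 0.5 else 'dark'}" for z in latent]

def decode_audio(latent: list[float]) -> list[str]:
    """Map the same latent steps to audio events, so timing matches the frames."""
    return [f"audio:{'loud' if z > 0.5 else 'quiet'}" for z in latent]

latent = encode_prompt("storm over a city")
video, audio = decode_video(latent), decode_audio(latent)

# Because both decoders share the latent, loud audio always lands on the
# bright frames — coherence falls out of the architecture, not a sync pass.
assert all(f.endswith("bright") == a.endswith("loud")
           for f, a in zip(video, audio))
```

The seam you feel in patched-together audio is exactly what's missing here: there's no second model guessing at timing from the outside.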
the SFW wall that still exists everywhere
Here’s the thing about Runway, Pika, Sora, and basically every mainstream AI video platform: they’re aggressively, contractually SFW. That’s not an accident or a temporary policy — it’s a deliberate product positioning driven by enterprise clients, app store distribution, and the general desire to avoid regulatory attention.
You can generate impressively realistic violence on most of these platforms. Gore, war scenes, horror content — largely fine. Try to generate anything sexually explicit and you’ll hit a content filter before you finish the prompt. The asymmetry is notable but that’s a different conversation.
The point is: the mainstream platforms built the technical capability and then deliberately withheld it from a massive portion of the obvious use case space. NSFW video generation with synchronized audio is technically possible — has been for a while — and exactly zero major platforms offer it.
That gap existed for a reason, and it was always going to get filled.
what soulkyn actually built here
Soulkyn didn’t license some external video generation API and slap NSFW permissions on top of it. They’re running LTX-2.3, a 22-billion-parameter model, self-hosted. And — this is the part that matters — they trained their own NSFW LoRA models on top of the base.
That last sentence is doing a lot of work. Training custom LoRA models means they control the output quality characteristics, the style, the specific content types the model handles well. It’s not “here’s what the base model produces by default with content filters removed.” It’s a trained capability. There’s a real difference.
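For readers who haven't worked with LoRA: the technique freezes the base model's weights entirely and trains only a pair of small low-rank matrices whose product gets added to each adapted weight. A minimal sketch of the math, with illustrative dimensions (nothing here reflects Soulkyn's actual training setup):

```python
# Minimal LoRA sketch: the frozen base weight W is never modified; only the
# low-rank factors A and B are trained, and the effective weight becomes
# W + (alpha / r) * (B @ A). Dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 4, 8   # r << d is the "low-rank" part

W = rng.normal(size=(d_out, d_in))      # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-init
                                        # so training starts at the base model

def forward(x: np.ndarray) -> np.ndarray:
    # Base path plus scaled low-rank update; W stays untouched throughout.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# With B zero-initialized, the adapted model initially equals the base model.
assert np.allclose(forward(x), W @ x)
```

The payoff is that you train `(d_in + d_out) * r` parameters per layer instead of `d_in * d_out` — which is what makes specializing a 22B base model for particular content types and styles tractable in the first place.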
The audio generation is native to the model, so NSFW scenes produce NSFW audio — speech that matches, ambient sounds that fit, the whole thing coherent. The videos run 5-10 seconds right now, which is the practical limit for maintaining quality at this model size. That’ll expand.
What makes the context interesting is that this isn't a video generation tool sitting in isolation. It's integrated into an AI companion chat platform. The AI personas you interact with on Soulkyn have unlimited persistent memory and generate images during conversation that reflect who they are and what you've shared together. You can take any of those images and generate a video from it — the AI brings the still image to life with synchronized sound.
The source images carry all the context of the relationship. The videos inherit that. It’s not a random clip generated from a random prompt. It’s an image your companion produced, animated with coherent sound.
That integration matters more than the video generation capability itself, honestly.
why self-hosting is the non-obvious part
A lot of AI companion platforms use third-party model APIs. Makes sense — it’s cheaper, faster to build, lower operational complexity. The tradeoff is you’re dependent on what the API provider allows, and API providers change their content policies. Sometimes suddenly.
Running your own model stack — especially a 22B parameter model with custom LoRA training — is significantly more expensive and operationally demanding. You need the hardware, you need the inference infrastructure, you need the training pipeline. It’s not a weekend project.
The reason it matters for NSFW specifically: API-dependent platforms live under constant policy risk. The provider decides one day that certain content categories violate terms of service, and suddenly your product feature set changes overnight with no warning. Self-hosted means that doesn’t happen. The capability you have today is the capability you have tomorrow because you control the stack.
For a feature like NSFW video generation with sound, that stability is foundational. You can’t build a product experience around a capability that might disappear.
the privacy angle that’s actually relevant
When AI video generation lives on mainstream platforms, there’s no particular privacy sensitivity to the content — you’re generating a car commercial or whatever. When it lives in an AI companion context with persistent memory and NSFW content, the data picture changes.
The videos being generated reflect intimate preferences. The conversation context that informs those generations reflects intimate preferences. The combination of “AI that knows me well” and “AI that generates explicit video content for me” creates a data profile that’s genuinely sensitive.
This isn't a gotcha. It's worth thinking about before engaging with any platform in this space. Questions worth asking: What happens to generated video content? Is it stored? Can it be linked to account data? What does a breach scenario look like? The platforms that answer those questions clearly are the ones that deserve trust.
The mainstream SFW platforms avoid this entirely by avoiding the use case. Platforms like Soulkyn are actually navigating it — which means users should pay attention to whether they’re navigating it responsibly.
Soulkyn publishes ethics documentation at soulkyn.com/l/en-US/ethics. Worth reading if you’re going to use the platform seriously.
the competitive gap that exists right now
To be direct about where things stand: there is no other platform currently offering NSFW AI video generation with native synchronized audio integrated into an AI companion experience. That's not hype; it's a gap in the market that exists for the structural reasons we covered — training cost, operational complexity, policy risk tolerance.
Replika has 30 million users and no video generation. Character.AI has scale and no video. The dedicated AI adult platforms mostly have image generation, some have basic video (silent), and none that I’m aware of have the full stack: custom-trained NSFW video model + native audio + companion context memory + production deployment.
That gap won’t last forever. The technical components exist, the market demand is obvious, and if Soulkyn proves it works at scale, competitors will figure out their own versions. Probably 12-18 months before the capability is commoditized. Right now it’s genuinely novel.
what the pricing actually looks like
The Just Chatting tier is €11.99/month, Premium is €24.99/month with unlimited messages, Deluxe is €49.99/month, and Deluxe Plus is €99.99/month. Video generation is pay-per-use on all tiers except Deluxe Plus, which includes a 50-video monthly quota.
That per-use model makes sense given the inference cost of a 22B-parameter model. Video generation isn't computationally cheap, and pricing it per generation keeps the unit economics workable. Premium is the minimum tier I'd recommend: unlimited messages matter because responsive video generation only lands when the conversation flows freely.
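To see why per-generation pricing is the only workable shape here, a back-of-envelope calculation helps. Every number below is an illustrative assumption — GPU rental rate and render time are guesses, not published figures:

```python
# Back-of-envelope unit economics for per-use video pricing.
# All inputs are hypothetical assumptions, not Soulkyn's real costs.
gpu_cost_per_hour = 2.50    # assumed EUR/hour for a large-VRAM GPU
seconds_per_video = 120     # assumed wall-clock time to render one short clip

cost_per_video = gpu_cost_per_hour * seconds_per_video / 3600

premium_price = 24.99       # Premium tier, EUR/month (from the pricing above)
# Videos per month before inference cost alone exceeds the subscription:
break_even_videos = premium_price / cost_per_video

print(f"~€{cost_per_video:.3f} per video, "
      f"break-even around {break_even_videos:.0f} videos/month")
```

Under these assumptions a heavy user would blow past a flat subscription's inference budget within a few hundred generations — which is exactly why video is metered while text chat can be unlimited.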
The browse page shows what people are actually generating with the platform, which is useful context if you’re trying to understand what the feature set actually produces in practice. Character creation is where the companion experience starts.
whether this matters
It matters if you think digital intimacy is a real and legitimate category of human experience — which, given the market numbers and usage data, clearly a lot of people do.
Silent AI video was already a meaningful upgrade from static images. Video with coherent, synchronized sound is another step on that same axis. Not a fundamental reinvention, just the capability getting more complete.
The NSFW angle is what’s newsworthy right now specifically because the mainstream platforms created a void. They built the technical capability and refused to let certain users access it, which is their right, but it means someone else with a higher risk tolerance and a self-hosted stack was always going to fill that gap. Soulkyn is who showed up.
Whether that’s a good thing depends on what you think about the category. But the capability existing, and existing with this level of integration with companion AI — that’s a real development. Silent video felt like a tech demo. Video with sound feels like something.
Static images already felt limited once you had silent video. I suspect silent video is about to feel the same way.
