Thursday, 2024 November 21

ByteDance enters AI video race with Doubao’s PixelDance and Seaweed

ByteDance has officially entered the video generation race with the release of two new artificial intelligence models, PixelDance and Seaweed.

Unveiled during the Volcano Engine AI Innovation Tour in Shenzhen on September 24, the models were introduced under ByteDance’s Doubao brand, targeting the enterprise market. Currently, they remain in an invite-only testing phase, with limited access for a small group of creators involved in internal testing.

The launch came without prior announcement, but industry anticipation had been building for months, thanks to advances from competitors like OpenAI and Kuaishou. OpenAI’s Sora, a model that generates video from text prompts, set a high standard for multimodal AI. Meanwhile, Kuaishou’s Kling AI went viral in June, further raising expectations for ByteDance’s move into this space.

ByteDance, renowned for its dominance in short video content via TikTok and Douyin, has long been seen as a natural contender in AI-driven video production. The company is well-positioned with extensive resources, advanced chip capabilities, and a highly skilled talent pool to shape the future of video generation.

Moreover, video generation plays to ByteDance’s strengths—both ByteDance and Kuaishou have vast datasets and multiple applicable use cases that position them well in this field.

Yet, while Kuaishou launched Kling AI to great success, accumulating over 2.6 million users who have collectively generated 27 million videos and 53 million images, ByteDance had remained silent. Now, with the introduction of PixelDance and Seaweed, the question is: can ByteDance reclaim its edge in the AI video generation race?

Can PixelDance and Seaweed stand out?

Early results from the PixelDance and Seaweed models are promising. Both models show significant improvements in maintaining character consistency and diversity across scenes, overcoming a challenge that has plagued previous video generation models.

Older models struggled with complex commands, often leading to visual distortions or glitches when characters performed more than basic actions. Doubao’s AI models, however, appear to have resolved these issues. Actions such as running, walking, and looking up are now rendered fluidly, producing more natural and lifelike motion. The days of abrupt, jarring transitions—like Will Smith eating noodles turning into Donald Trump—seem to be behind us.

Video generated by Doubao model powered by PixelDance and Seaweed with the prompt: “Western man and woman with long hair riding horses”

A key strength of these models lies in their handling of multi-character interactions. Characters move and interact seamlessly, with logical and smooth movements. Camera angles also vary, utilizing a mix of wide shots, close-ups, zooming, panning, and tracking to create dynamic, engaging scenes. Character details—such as appearances, clothing, and accessories—remain consistent, even when transitioning between shots.

While PixelDance and Seaweed are still in an invite-only testing phase, the AI-generated landscapes and character scenes have been praised during internal tests for their video quality and camera work.

That said, a few bugs persist. In some cases, minor glitches—such as finger deformations—still crop up in character generation scenes.

Doubao’s AI models are built on ByteDance’s self-developed diffusion transformer (DiT) architecture, which is believed to share similarities with the architecture behind OpenAI’s Sora, a leading AI video generation model. However, video generation models still lag behind their text and image counterparts in terms of development. Much of the foundational technology is closed-source, and data is scarce, meaning companies are focused on engineering optimizations rather than innovation.

Tan Dai, president of Volcengine, explained that ByteDance has optimized the DiT structure for business applications, including Jimeng AI, making significant advancements that have lowered the cost of AI video applications. Despite these innovations, industry experts advise caution, noting that while the technology is promising, expectations should remain realistic.

AI blogger “Guicang” compared Doubao’s video generation capabilities to those of industry leaders like Runway and the emerging startup Luma AI. According to Guicang, Seaweed offers a broader range of prompts and aspect ratios than Luma, though each model has its strengths and weaknesses compared to Runway.

Video generated by Doubao model powered by PixelDance and Seaweed with the prompt: “A foreign man surfing and giving a thumbs up to the camera”

Despite these comparisons, ByteDance’s ambitions are clear. Along with PixelDance and Seaweed, the company introduced new music and simultaneous interpretation models, forming a comprehensive toolkit that spans language, speech, image, and video creation. However, the most significant development has been Doubao’s explosive business growth. Since its large model family launched, Doubao’s daily API call volume has surged, and by September, average daily token usage had surpassed 1.3 trillion, a tenfold increase since May. The platform now processes more than 50 million images and 850,000 hours of speech daily.

A recent growth chart shows Doubao’s monthly active users (MAU) rising rapidly, outpacing competitors. ByteDance’s aggressive pricing strategy has fueled this growth. Since May, major players like ByteDance, Alibaba, and Tencent, along with startups such as DeepSeek, have engaged in a fierce price war. ByteDance, in particular, slashed its cost per 1,000 tokens to mere fractions of a cent, driving prices to the floor.

The diagram tracks the monthly active users (MAU) and monthly active rate (MAR) of Chinese generative AI platforms as of August 2024. MAR is defined as the average number of days a user engages with an app per month, averaged over the past three months. Doubao stands out with a considerably higher MAU than competitors such as Ernie and Xingye AI. Graphic sources: AIcpb, 36Kr, and InfoQ.
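To make the MAR definition concrete, here is a minimal sketch of how such a figure could be computed. The function name, the input layout, and the sample numbers are all hypothetical and chosen for illustration; the chart’s sources do not publish their exact method.

```python
from statistics import mean

def monthly_active_rate(active_days_by_month):
    """Average active days per user per month, averaged across months.

    active_days_by_month: one list per month (three months here), each
    holding the number of days every sampled user opened the app.
    This layout is an assumption made for illustration.
    """
    # Step 1: average active days per user within each month.
    monthly_averages = [mean(days) for days in active_days_by_month]
    # Step 2: average those monthly figures over the window.
    return mean(monthly_averages)

# Hypothetical sample: two users tracked over three months.
sample = [[2, 4], [3, 5], [4, 6]]
mar = monthly_active_rate(sample)
print(f"MAR: {mar} active days per user per month")
```

Under this reading, a higher MAR means users return on more days each month, which is a measure of stickiness rather than reach, so it complements MAU instead of duplicating it.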

However, competition is no longer solely about pricing—it has shifted to model performance. Tan introduced a new metric, peak tokens per minute (TPM), which measures a model’s data throughput over time. While most models support only 100,000–300,000 TPM, Doubao Pro can handle up to 800,000 TPM. For instance, a research institute’s document translation application may require 360,000 TPM, while an automotive AI application could need 420,000 TPM. Doubao Pro meets all these demands.
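The TPM comparison above comes down to simple arithmetic, sketched below. The TPM figures are the ones quoted in this article; the function names and the idea of a "headroom" ratio are illustrative assumptions, not part of any Volcengine API.

```python
# Figures quoted in the article (tokens per minute).
DOUBAO_PRO_PEAK_TPM = 800_000
TYPICAL_PEAK_TPM_HIGH = 300_000  # upper end of the 100,000-300,000 range

# Workload requirements cited as examples in the article.
WORKLOADS = {
    "document translation": 360_000,
    "automotive AI": 420_000,
}

def fits(required_tpm, peak_tpm):
    """True if a model's peak TPM covers the workload's requirement."""
    return required_tpm <= peak_tpm

def headroom(required_tpm, peak_tpm):
    """Ratio of available peak throughput to required throughput."""
    return peak_tpm / required_tpm

for name, required in WORKLOADS.items():
    print(
        f"{name}: typical model fits={fits(required, TYPICAL_PEAK_TPM_HIGH)}, "
        f"Doubao Pro fits={fits(required, DOUBAO_PRO_PEAK_TPM)}, "
        f"headroom={headroom(required, DOUBAO_PRO_PEAK_TPM):.2f}x"
    )
```

Both example workloads exceed the 300,000 TPM ceiling cited for most models, which is why the article treats throughput, and not price alone, as the new point of competition.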

With the release of PixelDance and Seaweed, ByteDance has solidified its position in the AI video generation market, completing the final piece of its AI content creation puzzle. This move, alongside OpenAI’s recent advancements in voice capabilities, signals an arms race among major players in the AI space—leaving little room for smaller startups to compete.

The long battle between ByteDance and Kuaishou

ByteDance’s desire to dominate the AI landscape is evident. Its flagship video editing app, CapCut, and AI video tool, Jimeng AI, are overseen by Kelly Zhang, the former CEO of the Douyin business unit, indicating the company’s urgency in speeding up the launch of its new AI video models. This urgency is driven, in part, by an old competitor: Kuaishou.

In June 2024, Kuaishou integrated its video generation tool, Kling AI, into its video editing app Kwaiying. The launch came as the industry eagerly awaited a Chinese counterpart to OpenAI’s Sora, and Kling AI’s reception was overwhelmingly positive.

“One of the major challenges in video generation is cost and ensuring consistency across scenes,” an AI professional told 36Kr. “But Kling AI can generate two-minute videos, exceeding Sora’s 60-second limit.” In terms of camera continuity and maintaining logical relationships between scene elements, industry insiders consider Kling AI a top-tier Chinese product.

At a time when Sora remained unavailable to many users and Shengshu Technology’s Vidu AI was still gaining traction, Kuaishou seized the moment, launching an open beta and offering Kling AI for free. Compared to the heavily resourced PixelDance and Seaweed, Kling AI’s development was fast and lean, with a team of only 20 people taking just three months to go from project inception to launch. By mid-September, Kling AI had already undergone nine iterations, offering higher-resolution videos, smoother movement, and more advanced camera controls.

Kuaishou’s success with Kling AI can be attributed to its wealth of video data, a significant advantage. ByteDance, with its vast TikTok and Douyin datasets, is the most likely challenger, but it has experienced rare setbacks. Just one month before Kling AI’s launch, ByteDance rolled out Jimeng AI on CapCut, but the results were underwhelming. User feedback was lukewarm, with some criticizing Jimeng AI’s performance and pricing. “For a product with average performance, it’s outrageous to charge non-members for videos longer than three seconds,” one user review remarked, highlighting the disappointment among CapCut’s user base.

The pressure on ByteDance is building. According to a 3D video generation specialist, most AI video companies tend to showcase their best outputs—often produced after several prompt attempts. The real test for Doubao’s models will come when they are fully deployed and put to practical use. Key performance indicators such as the ability to generate long shots, maintain spatiotemporal consistency, and handle increased resolution will be critical to its success.

For CapCut, which boasts over 300 million monthly active users, the cost of integrating advanced AI video technology presents a significant challenge. Striking the right balance between managing these costs and delivering high-quality results will only become more difficult as competition in the AI video generation space intensifies.

Having a first-mover advantage is critical, and with Kling AI and Vidu AI already well-established in the market, ByteDance, as a latecomer, faces an uphill battle. The competition is fierce, and as more companies enter the fray, the fight for dominance in AI video generation has only just begun.

KrASIA Connection
KrASIA Connection features translated and adapted high-quality insights published on 36Kr.com, the largest and most influential technology portal in Chinese language with over 150 million readers across the globe.