Creators Get One More Option as Alibaba Enters AI Video Generation with WAN 2.1, Heating Up an Already Crowded Market
Close on the heels of the OmniHuman-1 AI video generation model launched by another Chinese internet giant, ByteDance, Alibaba’s release of WAN 2.1 in February 2025 marked a significant development in AI-driven video creation. This open-source tool lets users generate videos from both textual descriptions and static images, and it also incorporates video editing functionality, giving creators a complete suite with which to explore their creativity.
Notably, Alibaba’s internal evaluations suggest that WAN 2.1 surpasses the performance of OpenAI’s Sora in specific benchmarks. As an open-source project, WAN 2.1 distinguishes itself by offering its technology freely, although users still bear the cost of the computational resources needed to run it. The emergence of such a powerful open-source tool has the potential to reshape the landscape of AI video generation, offering a compelling alternative to proprietary models.
This competition between open-source and proprietary models like OpenAI’s Sora is likely to fuel rapid advancements in the field of AI video generation, ultimately providing users with a greater array of options and enhanced video quality.
What is WAN 2.1?
The field of AI video generation has witnessed remarkable progress in recent years, with sophisticated models emerging that can translate creative visions into visual realities. Alibaba’s WAN 2.1 represents the company’s foray into this dynamic AI market, positioning itself as a direct competitor to prominent models such as OpenAI’s Sora and Google’s Veo 2.
A key differentiator for WAN 2.1 is its open-source nature, a strategic decision made upon its release in February 2025, contrasting with the closed and often waitlisted access of some of its competitors. This move underscores Alibaba’s commitment to the AI domain, further evidenced by their substantial investments in AI infrastructure.
Alibaba’s decision to release WAN 2.1 as an open-source project reflects a strategic approach towards fostering community-driven development and widespread adoption. By making the source code publicly accessible, Alibaba invites a global community of developers and researchers to contribute to the model’s ongoing improvement, identify and address potential issues, and explore novel applications.
This collaborative approach can potentially accelerate the model’s evolution at a pace that might exceed what a single organization could achieve independently. Furthermore, Alibaba’s significant investment in AI likely provided the foundational resources, including computational power and expert talent, necessary to develop a competitive model like WAN 2.1.
Key Features and Functionalities:
WAN 2.1 boasts a comprehensive suite of features designed to cater to a wide range of video creation needs. At its core, the tool offers Text-to-Video (T2V) capabilities, enabling users to generate videos from textual descriptions provided in both Chinese and English.
The system includes different T2V models, notably a lightweight version with 1.3 billion parameters and a more robust 14 billion parameter model for enhanced quality. Additionally, WAN 2.1 supports Image-to-Video (I2V) transformation, allowing users to bring still images to life through animation, with dedicated I2V models available at both 720p and 480p resolutions.
Beyond generation, the tool also incorporates Video Editing Capabilities, providing functionalities for enhancing and modifying existing video content. Multi-Language Support is another significant advantage, with the T2V models specifically optimized for processing prompts in both Chinese and English.
Furthermore, WAN 2.1 offers Flexible Video Ratios, supporting various aspect ratios such as the widely used 16:9 and the mobile-friendly 9:16. For enhanced user experience, WAN 2.1 has been integrated into the Monica AI platform, providing a user-friendly interface and a set of additional tools to streamline the video creation process.
The combination of text and image-to-video generation, coupled with editing functionalities and multi-language support, positions WAN 2.1 as a highly versatile tool capable of addressing diverse video creation requirements. This comprehensive feature set makes it an attractive option for users seeking a single platform for various video-related tasks, potentially simplifying workflows and reducing the need for multiple specialized tools.
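To make that workflow concrete, here is a minimal text-to-video sketch using the lightweight 1.3B model. It assumes the Hugging Face diffusers integration of WAN 2.1; the WanPipeline class, the Wan-AI/Wan2.1-T2V-1.3B-Diffusers checkpoint name, and the settings shown come from that assumption and are not details confirmed in this article.

```python
# Minimal text-to-video sketch for WAN 2.1 (1.3B) via the diffusers integration.
# Assumes a recent diffusers release with Wan support and a CUDA-capable GPU.
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"  # assumed checkpoint name

# The Wan-VAE is commonly kept in float32 for stability; the rest runs in bfloat16.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # trades speed for lower peak VRAM on smaller GPUs

prompt = "A red fox running through fresh snow at sunrise, cinematic lighting"
frames = pipe(
    prompt=prompt,
    height=480,      # the 1.3B model primarily targets 480p
    width=832,
    num_frames=81,   # roughly a 5-second clip at ~16 fps
    guidance_scale=5.0,
).frames[0]

export_to_video(frames, "fox_snow.mp4", fps=16)
```

Swapping in the 14B checkpoint would follow the same pattern but demands considerably more VRAM.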
Under the Hood: Technology and AI Models
The impressive capabilities of WAN 2.1 are underpinned by a sophisticated technical architecture. At its core lies a Diffusion Transformer with VAE, a cutting-edge approach in generative AI that combines the strengths of diffusion models and transformer networks with variational autoencoders (VAE). Specifically, WAN 2.1 employs a “Wan-VAE” Architecture, which is a 3D causal VAE designed to optimize video generation.
This Spatio-Temporal VAE architecture is engineered to tackle the critical challenges of memory efficiency and maintaining temporal consistency throughout the generated video content. One of the significant benefits of this design is the Faster Video Reconstruction it enables, with WAN 2.1 reportedly achieving speeds 2.5 times faster than competing models.
Furthermore, the architecture contributes to Improved Scene Consistency by efficiently managing historical frame information, resulting in smoother and more coherent videos. WAN 2.1 is not a monolithic model but rather a suite of four distinct models, each tailored for specific tasks: the lightweight T2V-1.3B, the enhanced-quality T2V-14B, the 720p I2V-14B, and the 480p I2V-14B. Powering these models is an extensive training dataset comprising approximately 1.5 billion videos and 10 billion images, providing a rich foundation for understanding and generating diverse visual content.
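To illustrate how that suite of four models might be used in practice, the sketch below animates a still image with the 480p I2V variant. It again assumes the diffusers-style integration; the WanImageToVideoPipeline class, the CLIP image encoder subfolder, and the checkpoint name are assumptions, not details confirmed in this article.

```python
# Image-to-video sketch for the 480p I2V-14B variant (class and checkpoint names assumed).
import torch
from diffusers import AutoencoderKLWan, WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image
from transformers import CLIPVisionModel

model_id = "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers"  # assumed checkpoint name

# I2V conditions the diffusion transformer on a CLIP image embedding plus the Wan-VAE latents.
image_encoder = CLIPVisionModel.from_pretrained(
    model_id, subfolder="image_encoder", torch_dtype=torch.float32
)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanImageToVideoPipeline.from_pretrained(
    model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()

image = load_image("still_photo.png")  # the static image to animate (hypothetical file)
frames = pipe(
    image=image,
    prompt="The subject slowly turns toward the camera, gentle breeze",
    height=480,
    width=832,
    num_frames=81,   # roughly a 5-second clip at ~16 fps
    guidance_scale=5.0,
).frames[0]

export_to_video(frames, "animated_photo.mp4", fps=16)
```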
Output Video Quality
The output video quality of WAN 2.1 has garnered significant attention, with reports and user experiences providing insights into various aspects. In terms of Resolution, the 14 billion parameter models support both 480p and 720p outputs, while the 1.3 billion parameter model primarily operates at 480p. Some sources suggest the potential for 1080p generation in specific scenarios.
Regarding Frame Rates, WAN 2.1 is capable of generating videos at up to 30 frames per second (fps). However, some user reviews have noted a perceived lower frame rate, sometimes around 15fps, which can affect the smoothness of motion.
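For a sense of what those frame rates mean in practice, the short snippet below works through the arithmetic for a hypothetical 5-second clip; the numbers are illustrative rather than measured WAN 2.1 output.

```python
# Illustrative frame-count arithmetic for a 5-second clip (not measured output).
clip_seconds = 5

for fps in (30, 15):
    total_frames = clip_seconds * fps
    frame_interval_ms = 1000 / fps
    print(f"{fps} fps -> {total_frames} frames, a new frame every {frame_interval_ms:.1f} ms")

# 30 fps -> 150 frames, a new frame every 33.3 ms
# 15 fps -> 75 frames, a new frame every 66.7 ms
```

Halving the frame rate doubles the gap between frames, which is why clips rendered around 15fps can read as noticeably less smooth.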
Open Source and Cost Implications
A defining characteristic of WAN 2.1 is its open-source nature, having been released under an open-source license in February 2025. This allows for the free use, modification, and distribution of its source code, promoting Community-Driven Innovation through broader experimentation and collaboration among researchers and developers. In terms of Cost Implications, while the software itself is free, achieving optimal performance often requires specific Hardware Requirements.
The smaller T2V-1.3B model, for instance, necessitates at least 8.19GB of VRAM and performs best on an RTX 4090 GPU. The Cost of an RTX 4090 GPU can range from approximately $1600 to $3000 or even more, representing a significant investment. For users without such hardware, Cloud Usage options are available, with services like Novita AI offering RTX 4090 rentals at hourly rates around $0.35. Accessing WAN 2.1 through the Monica AI Subscription platform may involve fees after a free trial, with plans starting at about $8.30 per month, often including free credits.
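As a rough, illustrative comparison of those options, a simple break-even calculation using the prices quoted above looks like this (actual prices and usage patterns will vary):

```python
# Rough buy-vs-rent break-even using the figures quoted above (illustrative only).
gpu_purchase_low, gpu_purchase_high = 1600, 3000   # RTX 4090 price range (USD)
cloud_rate_per_hour = 0.35                         # example hourly RTX 4090 rental (USD)
monica_monthly = 8.30                              # entry-level subscription (USD/month)

breakeven_low = gpu_purchase_low / cloud_rate_per_hour
breakeven_high = gpu_purchase_high / cloud_rate_per_hour
print(f"Renting only exceeds a purchase after ~{breakeven_low:,.0f}-{breakeven_high:,.0f} GPU-hours")

# A year of the entry-level Monica plan, for comparison:
print(f"A year of the Monica plan costs about ${monica_monthly * 12:.2f}")
```

By this back-of-the-envelope estimate, occasional users are likely better served by cloud rentals or a subscription, while only sustained heavy use justifies buying the hardware outright.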
The open-source nature of WAN 2.1 offers the significant advantage of being free to use and modify, which can be particularly appealing to researchers, developers, and individuals on a budget. However, the practical reality of running computationally intensive AI models like WAN 2.1 often necessitates access to powerful hardware, such as high-end GPUs. Similarly, accessing WAN 2.1 through platforms like Monica AI provides a user-friendly experience but may involve subscription fees. Therefore, while WAN 2.1 itself is free as an open-source project, the overall cost of utilizing it effectively can vary depending on the chosen method of access and the availability of suitable hardware.
WAN 2.1 vs. the Competition
When compared to other leading AI video generation models like OpenAI’s Sora and Google’s Veo 2, WAN 2.1 presents a compelling set of advantages and some potential drawbacks. A significant strength of WAN 2.1 is its Open-Source Nature, which distinguishes it from the proprietary models of Sora and Veo 2, offering greater accessibility and the potential for collaborative community development.
In terms of Performance Benchmarks, WAN 2.1 has reportedly outperformed Sora in several key metrics, including scene generation quality, single-object accuracy, and spatial positioning. It also boasts a Faster Video Reconstruction speed, approximately 2.5 times quicker than its competitors. Furthermore, WAN 2.1 offers Multi-Language Support for text prompts in both Chinese and English, and its Consumer-Friendly Model, the 1.3B parameter version, is designed to run on consumer-grade GPUs such as the RTX 4090.
However, potential weaknesses include a Maximum Resolution (Reported) of 720p (with occasional instances of 1080p), which might be lower than the capabilities of some proprietary models, as well as comparatively short clips; Sora, for example, is known for generating videos up to one minute in length. User feedback has also indicated potential issues with Fluidity and Consistency in motion, with occasional jitteriness observed.
While architecturally faster, the Processing Speed for the Smaller Model is still around 4 minutes for a 5-second video on an RTX 4090, and one user test suggested Veo 2 might have significantly faster real-world rendering speeds. Finally, regarding Prompt Following and Controllability, while WAN 2.1 offers prompt enhancement features, Veo 2 is noted for potentially providing stronger control through its understanding of filmmaking principles.
| Feature | WAN 2.1 | OpenAI Sora | Google Veo 2 |
| --- | --- | --- | --- |
| Open Source? | Yes | No | No |
| Text-to-Video? | Yes (EN & CH) | Yes | Yes |
| Image-to-Video? | Yes | Yes | Yes |
| Max Resolution (Reported) | 720p (sometimes 1080p) | Not publicly specified, likely higher | Not publicly specified, likely higher |
| Max Video Length | ~6 seconds (Kapwing), longer possible? | Up to 1 minute | Not publicly specified |
| Reported Benchmarks | Outperforms Sora in some metrics | N/A | Strong performance, detail-oriented |
| Ease of Use | Intuitive, integrated with Monica, online access | Limited current access | Focus on professional users, fine-grained control |
| Hardware Requirements | RTX 4090 (for optimal performance) | Not publicly specified | Not publicly specified |
| Cost | Free (open source), potential hardware/cloud costs, Monica subscription | Subscription-based (ChatGPT Plus/Pro) | Pricing plan with varying features and speed |
| Multi-Language Support | Yes (EN & CH for T2V) | Likely, but not explicitly detailed | Likely, but not explicitly detailed |
| Video Editing | Yes | Likely limited at this stage | Focus on realistic video generation |
Alibaba’s WAN 2.1 stands out as a powerful and increasingly accessible open-source alternative in the rapidly evolving field of AI video generation. Its key features include robust text-to-video and image-to-video capabilities, multi-language support, and even integrated video editing tools. Performance benchmarks suggest it can rival and even surpass proprietary models like OpenAI’s Sora in specific aspects, and its open-source nature fosters community-driven innovation.
The emergence of open-source models like WAN 2.1 has the potential to significantly democratize advanced AI video generation technology. By making such powerful tools freely available, it lowers the barrier to entry for individuals, researchers, and smaller organizations, potentially leading to a surge in creative applications and further advancements in the field. The continuous evolution of AI models is expected to bring about improvements in output quality, resolution, processing speed, and user-friendliness, and WAN 2.1, with its open and collaborative development model, is well-positioned to be a key player in shaping the future of video creation.
You can access WAN 2.1 through: https://monica.im/en/ai-models/wan