This project implements a text-to-video generation system using deep learning techniques. It creates videos based on textual prompts by generating sequences of images that depict moving shapes, specifically circles, in various directions and transformations. The generated videos can be used for various applications, including animation, educational content, and artistic expression.
- Installation
- Usage
- Dataset Generation
- Model Architecture
- Training the Model
- Generating Videos
- Contributing
- License
To set up the project, follow these steps:
- Clone the repository:
  git clone <repository-url>
  cd <repository-directory>
- Install required packages:
  Use pip to install the necessary libraries:
  pip install numpy opencv-python pillow torch torchvision scipy
- Set up the environment:
  Ensure you have a compatible version of Python (3.6 or higher) and PyTorch installed with CUDA support if you intend to run on a GPU.
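To confirm the environment before training, a short check along the following lines may help. This is a minimal sketch: the helper name `check_environment` is hypothetical and not part of the project, and it only uses `sys.version_info` and `torch.cuda.is_available()`.

```python
import sys


def check_environment(min_version=(3, 6)):
    """Report whether Python is recent enough and whether PyTorch can see a GPU."""
    report = {"python_ok": sys.version_info >= min_version}
    try:
        import torch  # imported lazily so the check still runs without PyTorch

        report["torch_version"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
    except ImportError:
        report["torch_version"] = None
        report["cuda_available"] = False
    return report


report = check_environment()
print(report)
```

If `cuda_available` is `False`, training should still be possible on the CPU, only more slowly.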
To generate videos based on text prompts, follow these steps:
- Generate the dataset:
  The dataset consists of videos generated from predefined prompts. Run the dataset generation script:
  python generate_dataset.py
  This will create a directory named training_dataset containing the generated video frames.
- Train the model:
  After generating the dataset, train the model using the following command:
  python train_model.py
  This will train the GAN architecture on the generated dataset.
- Generate videos from text prompts:
  After training, you can generate videos by running:
  python generate_video.py "circle moving down"
  Replace "circle moving down" with any other prompt from the predefined list.
The dataset is generated by creating 10-frame videos of a circle moving in various directions based on text prompts. The dataset generation process includes:
- Creating a directory for the dataset.
- Defining the number of videos and frames.
- Generating images with moving shapes using the create_image_with_moving_shape function.
- Applying Gaussian splatting to the generated images for a smoother visual effect.
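The steps above can be illustrated with a small NumPy-only sketch. It is not the project's `create_image_with_moving_shape` function: the helper names, the 64x64 frame size, the prompt-to-direction table, and the use of a plain separable Gaussian blur as a stand-in for the smoothing step are all assumptions made for illustration.

```python
import numpy as np


def make_circle_frame(size, center, radius):
    """Return a (size, size) float32 frame containing a filled white circle."""
    ys, xs = np.mgrid[0:size, 0:size]
    cx, cy = center
    mask = (xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2
    return mask.astype(np.float32)


def gaussian_blur(frame, sigma=1.0):
    """Separable Gaussian blur, used here as a stand-in for the smoothing step."""
    half = int(3 * sigma)
    offsets = np.arange(-half, half + 1)
    kernel = np.exp(-(offsets ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()  # normalise so overall brightness is preserved

    def blur_1d(values):
        return np.convolve(values, kernel, mode="same")

    rows_blurred = np.apply_along_axis(blur_1d, 1, frame)
    return np.apply_along_axis(blur_1d, 0, rows_blurred)


def make_video(direction, num_frames=10, size=64, radius=6, step=2):
    """Return (num_frames, size, size) frames of a circle moving along (dx, dy)."""
    dx, dy = direction
    frames = []
    for t in range(num_frames):
        center = (size // 2 + dx * step * t, size // 2 + dy * step * t)
        frames.append(gaussian_blur(make_circle_frame(size, center, radius)))
    return np.stack(frames)


# Hypothetical prompt table; the project's predefined prompt list may differ.
DIRECTIONS = {"circle moving down": (0, 1), "circle moving right": (1, 0)}
video = make_video(DIRECTIONS["circle moving down"])
print(video.shape)
```

Each prompt maps to a per-frame displacement, so one prompt yields one short clip whose frames can then be written to the dataset directory.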
The project utilizes a Generative Adversarial Network (GAN) architecture comprising:
- Text Embedding Layer: Converts text prompts into embeddings.
- Generator: Generates video frames based on random noise and text embeddings.
- Discriminator: Distinguishes between real and generated frames.
- TextEmbedding: Embeds text prompts into a numerical format.
- Generator: Transforms random noise and text embeddings into video frames.
- Discriminator: Evaluates the authenticity of generated frames.
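How the three components fit together can be sketched in PyTorch as follows. The layer sizes, the mean-pooled embedding, the fully connected layers, and the 64x64 RGB frame shape are illustrative assumptions; the project's actual classes may be built differently (for example with convolutional layers).

```python
import torch
import torch.nn as nn


class TextEmbedding(nn.Module):
    """Maps token ids to one vector per prompt (assumed: mean over tokens)."""

    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)

    def forward(self, token_ids):
        # (batch, seq_len) -> (batch, embed_dim)
        return self.embedding(token_ids).mean(dim=1)


class Generator(nn.Module):
    """Turns noise concatenated with a text embedding into one RGB frame."""

    def __init__(self, noise_dim, embed_dim, frame_size=64):
        super().__init__()
        self.frame_size = frame_size
        self.net = nn.Sequential(
            nn.Linear(noise_dim + embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 3 * frame_size * frame_size),
            nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, noise, text_emb):
        x = torch.cat([noise, text_emb], dim=1)
        return self.net(x).view(-1, 3, self.frame_size, self.frame_size)


class Discriminator(nn.Module):
    """Scores a frame with the probability that it is real."""

    def __init__(self, frame_size=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * frame_size * frame_size, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, frames):
        return self.net(frames)


# Shape check with a made-up vocabulary of 20 tokens and 3-token prompts.
text_embedding = TextEmbedding(vocab_size=20, embed_dim=8)
generator = Generator(noise_dim=16, embed_dim=8)
discriminator = Discriminator()

token_ids = torch.randint(0, 20, (2, 3))
frames = generator(torch.randn(2, 16), text_embedding(token_ids))
scores = discriminator(frames)
print(frames.shape, scores.shape)
```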
The model is trained using the following steps:
- Load the dataset using a custom TextToVideoDataset class.
- Initialize the generator and discriminator networks.
- Use binary cross-entropy loss for training.
- Optimize the networks using the Adam optimizer.
- Iterate through the dataset for a specified number of epochs, updating the generator and discriminator alternately.
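The alternating update described above can be sketched as a short loop. This is not the contents of `train_model.py`: the tiny stand-in networks, the random tensors used in place of `TextToVideoDataset` batches, the learning rate, and the Adam betas are assumptions chosen only to keep the example self-contained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
noise_dim, embed_dim, frame_dim = 16, 8, 3 * 16 * 16

# Tiny stand-in networks (hypothetical; the project's own classes would go here).
generator = nn.Sequential(
    nn.Linear(noise_dim + embed_dim, 64), nn.ReLU(),
    nn.Linear(64, frame_dim), nn.Tanh(),
)
discriminator = nn.Sequential(
    nn.Linear(frame_dim, 64), nn.LeakyReLU(0.2),
    nn.Linear(64, 1), nn.Sigmoid(),
)

criterion = nn.BCELoss()  # binary cross-entropy on real/fake labels
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

# Random (frames, text embedding) pairs standing in for real dataset batches.
batches = [
    (torch.rand(4, frame_dim) * 2 - 1, torch.randn(4, embed_dim))
    for _ in range(3)
]

history = []
for epoch in range(2):
    for real_frames, text_emb in batches:
        batch_size = real_frames.size(0)
        real_labels = torch.ones(batch_size, 1)
        fake_labels = torch.zeros(batch_size, 1)

        # Discriminator step: real frames should score 1, generated frames 0.
        noise = torch.randn(batch_size, noise_dim)
        fake_frames = generator(torch.cat([noise, text_emb], dim=1))
        loss_d = (
            criterion(discriminator(real_frames), real_labels)
            + criterion(discriminator(fake_frames.detach()), fake_labels)
        )
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()

        # Generator step: try to make the discriminator score fakes as real.
        loss_g = criterion(discriminator(fake_frames), real_labels)
        opt_g.zero_grad()
        loss_g.backward()
        opt_g.step()

        history.append((loss_d.item(), loss_g.item()))

print(f"final discriminator loss {history[-1][0]:.3f}, generator loss {history[-1][1]:.3f}")
```

Detaching the generated frames in the discriminator step keeps that update from flowing back into the generator, which is then updated separately.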
To create videos from text prompts, the following functions are implemented:
- generate_video: Generates a video based on a given text prompt.
- frames_to_video: Converts generated frames into a video file.
- save_frames_to_disk: Saves generated frames to a specified directory.
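As an illustration of the frame-saving step, the sketch below converts generator-style output to 8-bit images and writes them as PNG files with Pillow. The signature shown for `save_frames_to_disk`, the helper `to_uint8_frames`, the `(N, H, W, 3)` layout, the `[-1, 1]` value range, and the file naming are assumptions; the project's own function may differ.

```python
import os
import tempfile

import numpy as np
from PIL import Image


def to_uint8_frames(frames):
    """Map float frames in [-1, 1] with shape (N, H, W, 3) to uint8 in [0, 255]."""
    scaled = (np.clip(frames, -1.0, 1.0) + 1.0) * 127.5
    return scaled.round().astype(np.uint8)


def save_frames_to_disk(frames, out_dir):
    """Write each frame as frame_000.png, frame_001.png, ... and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, frame in enumerate(to_uint8_frames(frames)):
        path = os.path.join(out_dir, f"frame_{i:03d}.png")
        Image.fromarray(frame).save(path)
        paths.append(path)
    return paths


# Random values stand in for the output of a trained generator.
fake_output = np.random.default_rng(0).uniform(-1, 1, size=(10, 64, 64, 3))
out_dir = os.path.join(tempfile.mkdtemp(), "generated_frames")
paths = save_frames_to_disk(fake_output, out_dir)
print(len(paths), "frames written to", out_dir)
```

The saved frames can then be assembled into a video file, which is the role the list above gives to `frames_to_video`.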
Contributions are welcome! To contribute to this project:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Make your changes and commit them.
- Push to your fork and create a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.
- Special thanks to the contributors and the community for their support.
- This project leverages PyTorch and other open-source libraries for deep learning and image processing.