Scalable data processing for ML workloads. Streaming execution across CPUs and GPUs; supports Parquet, CSV, JSON, and images. Integrates with Ray Train, PyTorch, and TensorFlow. Scales from a single machine to hundreds of nodes. Use for batch inference, data preprocessing, multi-modal data loading, or distributed ETL pipelines.
# Ray Data - Scalable ML Data Processing

Distributed data processing library for ML and AI workloads.

## When to use Ray Data

**Use Ray Data when:**

- Processing large datasets (>100GB) for ML training
- Need distributed data preprocessing across cluster
- Building batch inference pipelines
- Loading multi-modal data (images, audio, video)
- Scaling data processing from laptop to cluster

**Key features**:

- **Streaming execution**: Process data larger than memory
- **GPU support**: Accelerate transforms with GPUs
- **Framework integration**: PyTorch, TensorFlow, HuggingFace
- **Multi-modal**: Images, Parquet, CSV, JSON, audio, video

**Use alternatives instead**:

- **Pandas**: Small data (<1GB) on single machine
- **Dask**: Tabular data, SQL-like operations
- **Spark**: Enterprise ETL, SQL queries

## Quick start