Dataset Download Scripts
Overview
This page provides optimized scripts for downloading popular datasets. These scripts include:
- Parallel downloads for faster completion
- Automatic extraction and cleanup
- Progress tracking
- Error handling (see the sketch after this list)
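As a minimal sketch of that error-handling pattern (not part of the scripts below; DATASET_URL is a placeholder), you can prepend strict-mode settings and a resumable, retrying download:
#!/bin/bash
# Abort on the first failed command, undefined variable, or failed pipeline stage
set -euo pipefail
# Placeholder URL; -c resumes partial files, --tries retries transient failures
DATASET_URL="https://example.com/dataset.zip"
wget -c --tries=3 --timeout=60 --show-progress "$DATASET_URL"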
COCO Dataset
The Microsoft COCO (Common Objects in Context) dataset is widely used for object detection, segmentation, and captioning tasks.
Dataset Size: ~18GB (train), ~1GB (val), ~6GB (test)
#!/bin/bash
# Download COCO 2017 dataset with all splits and annotations
# Create directory structure
mkdir -p coco/images
cd coco/images
# Download images (in parallel for speed)
echo "Downloading images..."
wget -q --show-progress http://images.cocodataset.org/zips/train2017.zip &
wget -q --show-progress http://images.cocodataset.org/zips/val2017.zip &
wget -q --show-progress http://images.cocodataset.org/zips/test2017.zip &
wait
# Extract images
echo "Extracting images..."
unzip -q train2017.zip &
unzip -q val2017.zip &
unzip -q test2017.zip &
wait
# Clean up zips
rm train2017.zip val2017.zip test2017.zip
# Download annotations
cd ../
echo "Downloading annotations..."
wget -q --show-progress http://images.cocodataset.org/annotations/annotations_trainval2017.zip
wget -q --show-progress http://images.cocodataset.org/annotations/stuff_annotations_trainval2017.zip
wget -q --show-progress http://images.cocodataset.org/annotations/image_info_test2017.zip
# Extract annotations
echo "Extracting annotations..."
unzip -q annotations_trainval2017.zip
unzip -q stuff_annotations_trainval2017.zip
unzip -q image_info_test2017.zip
# Clean up
rm annotations_trainval2017.zip stuff_annotations_trainval2017.zip image_info_test2017.zip
echo "COCO dataset download complete!"
echo "Location: $(pwd)"#!/bin/bash
# Download COCO train and val only (~26GB)
mkdir -p coco/images
cd coco/images
# Download train and val only
echo "Downloading train and val images..."
wget -q --show-progress http://images.cocodataset.org/zips/train2017.zip &
wget -q --show-progress http://images.cocodataset.org/zips/val2017.zip &
wait
# Extract
echo "Extracting..."
unzip -q train2017.zip &
unzip -q val2017.zip &
wait
rm train2017.zip val2017.zip
# Download annotations
cd ../
wget -q --show-progress http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip -q annotations_trainval2017.zip
rm annotations_trainval2017.zip
echo "COCO train+val download complete!"ImageNet
ImageNet
ImageNet requires registration. Use these scripts after obtaining download credentials.
#!/bin/bash
# Requires: username and access key from ImageNet
# Set your credentials
USERNAME="your_username"
ACCESS_KEY="your_access_key"
# Download training data (138GB)
wget --user="$USERNAME" --password="$ACCESS_KEY" \
https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_train.tar
# Download validation data (6.3GB)
wget --user="$USERNAME" --password="$ACCESS_KEY" \
https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar
# Extract training data
mkdir -p train && tar -xf ILSVRC2012_img_train.tar -C train/
cd train
for f in *.tar; do
d=$(basename "$f" .tar)
mkdir -p "$d"
tar -xf "$f" -C "$d"
rm "$f"
done
cd ..
# Extract validation data
mkdir -p val && tar -xf ILSVRC2012_img_val.tar -C val/
echo "ImageNet download complete!"CIFAR-10 / CIFAR-100
CIFAR-10 / CIFAR-100
Small datasets that download quickly; they are typically handled directly by PyTorch or TensorFlow.
import torchvision
import torchvision.transforms as transforms
# CIFAR-10 (60,000 images, ~170MB)
trainset = torchvision.datasets.CIFAR10(
    root='./data',
    train=True,
    download=True,
    transform=transforms.ToTensor()
)
# CIFAR-100 (60,000 images, ~170MB)
trainset = torchvision.datasets.CIFAR100(
    root='./data',
    train=True,
    download=True,
    transform=transforms.ToTensor()
)
The same datasets can be loaded through the Keras datasets API:
import tensorflow as tf
# CIFAR-10
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
# CIFAR-100
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar100.load_data()
Generic Download Script Template
Use this template for custom datasets:
#!/bin/bash
# Template for downloading any dataset
# Configuration
DATASET_NAME="my_dataset"
DATASET_URL="https://example.com/dataset.zip"
TARGET_DIR="./datasets/${DATASET_NAME}"
# Create directory
mkdir -p "$TARGET_DIR"
cd "$TARGET_DIR"
# Download with progress bar and resume capability
wget -c --show-progress "$DATASET_URL"
# Extract based on file type
FILENAME=$(basename "$DATASET_URL")
case "$FILENAME" in
*.zip)
echo "Extracting ZIP..."
unzip -q "$FILENAME"
;;
*.tar.gz|*.tgz)
echo "Extracting TAR.GZ..."
tar -xzf "$FILENAME"
;;
*.tar)
echo "Extracting TAR..."
tar -xf "$FILENAME"
;;
esac
# Clean up archive
rm "$FILENAME"
echo "Dataset downloaded to: $TARGET_DIR"Tips for Large Downloads
Section titled “Tips for Large Downloads”Using aria2 for Faster Downloads
Section titled “Using aria2 for Faster Downloads”# Install aria2 (much faster than wget)
sudo apt install aria2
# Download with multiple connections
aria2c -x 16 -s 16 http://images.cocodataset.org/zips/train2017.zip
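aria2 can also batch several archives in one run; a small sketch using an input file (urls.txt is a file you create, one URL per line):
# -i reads URLs from a file, -j limits how many files download concurrently
aria2c -x 16 -s 16 -j 3 -i urls.txt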
Resume Interrupted Downloads
# wget resumes a partial download when given the -c flag
wget -c http://example.com/large-dataset.zip
# aria2 resumes automatically using its .aria2 control file
aria2c -x 16 http://example.com/large-dataset.zip
Check Downloaded File Integrity
# If the dataset provides checksums
md5sum downloaded_file.zip
sha256sum downloaded_file.zip
# Compare with provided checksum
echo "expected_checksum downloaded_file.zip" | md5sum -cimg2dataset: Large-Scale Image Dataset Creation
Section titled “img2dataset: Large-Scale Image Dataset Creation”img2dataset is a powerful tool for downloading large-scale image datasets from URLs. It can download, resize, and package 100M+ images efficiently.
Key Features
- Fast: Download 100M URLs in ~20 hours on one machine
- Formats: WebDataset (recommended), files, parquet, tfrecord
- Distributed: Multi-processing and multi-node support
- Resume: Incremental downloads if interrupted
- Compliant: Respects robots.txt and no-AI headers
Installation
pip install img2dataset
Basic Usage
# Create a CSV with columns: url, caption (optional)
# Format: url,caption
# https://example.com/image1.jpg,A beautiful sunset
# https://example.com/image2.jpg,Mountain landscape
img2dataset \
--url_list=urls.csv \
--output_folder=dataset \
--thread_count=64 \
--image_size=256 \
--output_format=webdataset \
--input_format=csv \
--url_col=url \
--caption_col=caption
# Download CC3M dataset (~1 hour)
wget https://storage.googleapis.com/conceptual_12m/cc3m.tsv
img2dataset \
--url_list=cc3m.tsv \
--input_format=tsv \
--url_col=url \
--caption_col=caption \
--output_folder=cc3m \
--processes_count=16 \
--thread_count=64 \
--image_size=256 \
--output_format=webdataset
# Multi-node: on each node, set a different partition
# Node 1: --distributor=multiprocessing --subjob_size=1000 --processes_count=1 --partition_id=0
# Node 2: --distributor=multiprocessing --subjob_size=1000 --processes_count=1 --partition_id=1
img2dataset \
--url_list=large_dataset.parquet \
--output_folder=output \
--processes_count=1 \
--thread_count=256 \
--image_size=384 \
--distributor=multiprocessing \
--partition_id=0 \
--partitions_number=10
Common Options
# Image processing
--image_size=256 # Resize images
--resize_mode=border # border, center_crop, keep_ratio, etc.
--resize_only_if_bigger=True # Don't upscale small images
# Performance
--processes_count=16 # Number of processes
--thread_count=64 # Threads per process
--distributor=multiprocessing
# Output format
--output_format=webdataset # webdataset, tfrecord, parquet, files
--output_folder=dataset
# Resume interrupted downloads
--incremental_mode=incremental # Resume from where it stopped
# Filtering
--min_image_size=200 # Skip images smaller than 200px
--max_image_area=262144 # Skip very large images (512*512 pixels)
--enable_wandb=True # Track progress with Weights & Biases
Example: LAION-400M Subset
# Download metadata
wget https://the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/part-00000-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet
# Download images
img2dataset \
--url_list=part-00000-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet \
--input_format=parquet \
--url_col=URL \
--caption_col=TEXT \
--output_format=webdataset \
--output_folder=laion400m \
--processes_count=16 \
--thread_count=128 \
--image_size=384 \
--resize_mode=keep_ratio \
--resize_only_if_bigger=True \
--enable_wandb=True
Loading Downloaded Dataset
import webdataset as wds
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
# Create dataset
dataset = (
    wds.WebDataset("dataset/{00000..00099}.tar")
    .decode("pil")
    .to_tuple("jpg;png", "txt")
    .map_tuple(transforms.ToTensor(), lambda x: x)
)
# Create dataloader
dataloader = DataLoader(dataset, batch_size=32, num_workers=4)
for images, captions in dataloader:
    # Train your model
    pass
The same shards can be converted into a TensorFlow dataset through a Python generator:
import tensorflow as tf
import webdataset as wds
dataset = wds.WebDataset("dataset/{00000..00099}.tar")
# Decode to uint8 numpy arrays so tf.data can convert the samples to tensors
dataset = dataset.decode("rgb8").to_tuple("jpg;png", "txt")
# Convert to TensorFlow dataset
tf_dataset = tf.data.Dataset.from_generator(
    lambda: dataset,
    output_signature=(
        tf.TensorSpec(shape=(None, None, 3), dtype=tf.uint8),
        tf.TensorSpec(shape=(), dtype=tf.string)
    )
)
Common Use Cases
1. Create Custom Dataset from URLs
# Your own list of image URLs
img2dataset --url_list=my_urls.txt --output_folder=my_dataset
2. Download Public Datasets
- CC3M: 3M image-text pairs (~1 hour)
- CC12M: 12M pairs (~5 hours)
- LAION-400M: 400M pairs (distributed)
- LAION-5B: 5B pairs (distributed cluster)
3. Web Scraping Results: Convert web scraping results into training datasets, as sketched below.
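A minimal sketch of that conversion, assuming the scraper wrote a JSON-lines file with url and title fields (the file name and field names are hypothetical):
# Build an img2dataset-compatible CSV from scraped results
echo "url,caption" > urls.csv
jq -r '[.url, .title] | @csv' scraped.jsonl >> urls.csv
img2dataset --url_list=urls.csv --input_format=csv --url_col=url --caption_col=caption --output_folder=scraped_dataset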
Monitoring Progress
# Enable Weights & Biases tracking
img2dataset --enable_wandb=True --wandb_project=my_dataset
# Or check output folder
ls -lh dataset/
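img2dataset also writes per-shard statistics files next to the output shards (the exact layout can vary by version), which record how many URLs succeeded or failed:
# Inspect per-shard download statistics (assumed *_stats.json naming)
cat dataset/*_stats.json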
Resources
Related Resources
- File Permissions - Make scripts executable
- Data Loading Optimization - Efficient data loading
- HPC Storage - Managing datasets on clusters
- GPU Monitoring - Monitor download progress