Dataset Download Scripts
Overview
This page provides optimized scripts for downloading popular datasets. These scripts include:
- Parallel downloads for faster completion
- Automatic extraction and cleanup
- Progress tracking
- Error handling (see the sketch after this list)
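As a minimal sketch of that error-handling pattern (not part of the scripts below; DATASET_URL is a placeholder), you can prepend strict-mode settings and a resumable, retrying download:
#!/bin/bash
# Abort on the first failed command, undefined variable, or failed pipeline stage
set -euo pipefail
# Placeholder URL; -c resumes partial files, --tries retries transient failures
DATASET_URL="https://example.com/dataset.zip"
wget -c --tries=3 --timeout=60 --show-progress "$DATASET_URL"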
COCO Dataset
The Microsoft COCO (Common Objects in Context) dataset is widely used for object detection, segmentation, and captioning tasks.
Dataset Size: ~18GB (train), ~1GB (val), ~6GB (test)
#!/bin/bash
# Download COCO 2017 dataset with all splits and annotations
# Create directory structure
mkdir -p coco/images
cd coco/images
# Download images (in parallel for speed)
echo "Downloading images..."
wget -q --show-progress http://images.cocodataset.org/zips/train2017.zip &
wget -q --show-progress http://images.cocodataset.org/zips/val2017.zip &
wget -q --show-progress http://images.cocodataset.org/zips/test2017.zip &
wait
# Extract images
echo "Extracting images..."
unzip -q train2017.zip &
unzip -q val2017.zip &
unzip -q test2017.zip &
wait
# Clean up zips
rm train2017.zip val2017.zip test2017.zip
# Download annotations
cd ../
echo "Downloading annotations..."
wget -q --show-progress http://images.cocodataset.org/annotations/annotations_trainval2017.zip
wget -q --show-progress http://images.cocodataset.org/annotations/stuff_annotations_trainval2017.zip
wget -q --show-progress http://images.cocodataset.org/annotations/image_info_test2017.zip
# Extract annotations
echo "Extracting annotations..."
unzip -q annotations_trainval2017.zip
unzip -q stuff_annotations_trainval2017.zip
unzip -q image_info_test2017.zip
# Clean up
rm annotations_trainval2017.zip stuff_annotations_trainval2017.zip image_info_test2017.zip
echo "COCO dataset download complete!"
echo "Location: $(pwd)"#!/bin/bash
# Download COCO train and val only (~26GB)
mkdir -p coco/images
cd coco/images
# Download train and val only
echo "Downloading train and val images..."
wget -q --show-progress http://images.cocodataset.org/zips/train2017.zip &
wget -q --show-progress http://images.cocodataset.org/zips/val2017.zip &
wait
# Extract
echo "Extracting..."
unzip -q train2017.zip &
unzip -q val2017.zip &
wait
rm train2017.zip val2017.zip
# Download annotations
cd ../
wget -q --show-progress http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip -q annotations_trainval2017.zip
rm annotations_trainval2017.zip
echo "COCO train+val download complete!"ImageNet
ImageNet
ImageNet requires registration. Use these scripts after obtaining download credentials.
#!/bin/bash
# Requires: username and access key from ImageNet
# Set your credentials
USERNAME="your_username"
ACCESS_KEY="your_access_key"
# Download training data (138GB)
wget --user="$USERNAME" --password="$ACCESS_KEY" \
https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_train.tar
# Download validation data (6.3GB)
wget --user="$USERNAME" --password="$ACCESS_KEY" \
https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar
# Extract training data
mkdir -p train && tar -xf ILSVRC2012_img_train.tar -C train/
cd train
for f in *.tar; do
d=$(basename "$f" .tar)
mkdir -p "$d"
tar -xf "$f" -C "$d"
rm "$f"
done
cd ..
# Extract validation data
mkdir -p val && tar -xf ILSVRC2012_img_val.tar -C val/
echo "ImageNet download complete!"CIFAR-10 / CIFAR-100
CIFAR-10 / CIFAR-100
Small datasets that download quickly; they are typically handled directly by PyTorch or TensorFlow.
import torchvision
import torchvision.transforms as transforms
# CIFAR-10 (60,000 images, ~170MB)
trainset = torchvision.datasets.CIFAR10(
    root='./data',
    train=True,
    download=True,
    transform=transforms.ToTensor()
)
# CIFAR-100 (60,000 images, ~170MB)
trainset = torchvision.datasets.CIFAR100(
    root='./data',
    train=True,
    download=True,
    transform=transforms.ToTensor()
)
The same datasets can be loaded through the Keras datasets API:
import tensorflow as tf
# CIFAR-10
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
# CIFAR-100
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar100.load_data()
Generic Download Script Template
Use this template for custom datasets:
#!/bin/bash
# Template for downloading any dataset
# Configuration
DATASET_NAME="my_dataset"
DATASET_URL="https://example.com/dataset.zip"
TARGET_DIR="./datasets/${DATASET_NAME}"
# Create directory
mkdir -p "$TARGET_DIR"
cd "$TARGET_DIR"
# Download with progress bar and resume capability
wget -c --show-progress "$DATASET_URL"
# Extract based on file type
FILENAME=$(basename "$DATASET_URL")
case "$FILENAME" in
*.zip)
echo "Extracting ZIP..."
unzip -q "$FILENAME"
;;
*.tar.gz|*.tgz)
echo "Extracting TAR.GZ..."
tar -xzf "$FILENAME"
;;
*.tar)
echo "Extracting TAR..."
tar -xf "$FILENAME"
;;
esac
# Clean up archive
rm "$FILENAME"
echo "Dataset downloaded to: $TARGET_DIR"Tips for Large Downloads
Section titled “Tips for Large Downloads”Using aria2 for Faster Downloads
Section titled “Using aria2 for Faster Downloads”# Install aria2 (much faster than wget)
sudo apt install aria2
# Download with multiple connections
aria2c -x 16 -s 16 http://images.cocodataset.org/zips/train2017.zip
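aria2 can also batch several archives in one run; a small sketch using an input file (urls.txt is a file you create, one URL per line):
# -i reads URLs from a file, -j limits how many files download concurrently
aria2c -x 16 -s 16 -j 3 -i urls.txt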
Resume Interrupted Downloads
# wget resumes a partial download when given the -c flag
wget -c http://example.com/large-dataset.zip
# aria2 resumes automatically using its .aria2 control file
aria2c -x 16 http://example.com/large-dataset.zip
Check Downloaded File Integrity
# If the dataset provides checksums
md5sum downloaded_file.zip
sha256sum downloaded_file.zip
# Compare with provided checksum
echo "expected_checksum downloaded_file.zip" | md5sum -cimg2dataset: Large-Scale Image Dataset Creation
Section titled “img2dataset: Large-Scale Image Dataset Creation”img2dataset is a powerful tool for downloading large-scale image datasets from URLs. It can download, resize, and package 100M+ images efficiently.
Key Features
- Fast: Download 100M URLs in ~20 hours on one machine
- Formats: WebDataset (recommended), files, parquet, tfrecord
- Distributed: Multi-processing and multi-node support
- Resume: Incremental downloads if interrupted
- Compliant: Respects robots.txt and no-AI headers
Installation
pip install img2dataset
Basic Usage
# Create a CSV with columns: url, caption (optional)
# Format: url,caption
# https://example.com/image1.jpg,A beautiful sunset
# https://example.com/image2.jpg,Mountain landscape
img2dataset \
--url_list=urls.csv \
--output_folder=dataset \
--thread_count=64 \
--image_size=256 \
--output_format=webdataset \
--input_format=csv \
--url_col=url \
--caption_col=caption
# Download CC3M dataset (~1 hour)
wget https://storage.googleapis.com/conceptual_12m/cc3m.tsv
img2dataset \
--url_list=cc3m.tsv \
--input_format=tsv \
--url_col=url \
--caption_col=caption \
--output_folder=cc3m \
--processes_count=16 \
--thread_count=64 \
--image_size=256 \
--output_format=webdataset
# Multi-node: on each node, set a different partition
# Node 1: --distributor=multiprocessing --subjob_size=1000 --processes_count=1 --partition_id=0
# Node 2: --distributor=multiprocessing --subjob_size=1000 --processes_count=1 --partition_id=1
img2dataset \
--url_list=large_dataset.parquet \
--output_folder=output \
--processes_count=1 \
--thread_count=256 \
--image_size=384 \
--distributor=multiprocessing \
--partition_id=0 \
--partitions_number=10
Common Options
# Image processing
--image_size=256 # Resize images
--resize_mode=border # border, center_crop, keep_ratio, etc.
--resize_only_if_bigger=True # Don't upscale small images
# Performance
--processes_count=16 # Number of processes
--thread_count=64 # Threads per process
--distributor=multiprocessing
# Output format
--output_format=webdataset # webdataset, tfrecord, parquet, files
--output_folder=dataset
# Resume interrupted downloads
--incremental_mode=incremental # Resume from where it stopped
# Filtering
--min_image_size=200 # Skip images smaller than 200px
--max_image_area=262144 # Skip very large images (512*512 pixels)
--enable_wandb=True # Track progress with Weights & Biases
Example: LAION-400M Subset
# Download metadata
wget https://the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/part-00000-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet
# Download images
img2dataset \
--url_list=part-00000-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet \
--input_format=parquet \
--url_col=URL \
--caption_col=TEXT \
--output_format=webdataset \
--output_folder=laion400m \
--processes_count=16 \
--thread_count=128 \
--image_size=384 \
--resize_mode=keep_ratio \
--resize_only_if_bigger=True \
--enable_wandb=True
Loading Downloaded Dataset
import webdataset as wds
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
# Create dataset
dataset = (
    wds.WebDataset("dataset/{00000..00099}.tar")
    .decode("pil")
    .to_tuple("jpg;png", "txt")
    .map_tuple(transforms.ToTensor(), lambda x: x)
)
# Create dataloader
dataloader = DataLoader(dataset, batch_size=32, num_workers=4)
for images, captions in dataloader:
    # Train your model
    pass
The same shards can be converted into a TensorFlow dataset through a Python generator:
import tensorflow as tf
import webdataset as wds
dataset = wds.WebDataset("dataset/{00000..00099}.tar")
# Decode to uint8 numpy arrays so tf.data can convert the samples to tensors
dataset = dataset.decode("rgb8").to_tuple("jpg;png", "txt")
# Convert to TensorFlow dataset
tf_dataset = tf.data.Dataset.from_generator(
    lambda: dataset,
    output_signature=(
        tf.TensorSpec(shape=(None, None, 3), dtype=tf.uint8),
        tf.TensorSpec(shape=(), dtype=tf.string)
    )
)
Common Use Cases
1. Create Custom Dataset from URLs
# Your own list of image URLs
img2dataset --url_list=my_urls.txt --output_folder=my_dataset
2. Download Public Datasets
- CC3M: 3M image-text pairs (~1 hour)
- CC12M: 12M pairs (~5 hours)
- LAION-400M: 400M pairs (distributed)
- LAION-5B: 5B pairs (distributed cluster)
3. Web Scraping Results: Convert web scraping results into training datasets, as sketched below.
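A minimal sketch of that conversion, assuming the scraper wrote a JSON-lines file with url and title fields (the file name and field names are hypothetical):
# Build an img2dataset-compatible CSV from scraped results
echo "url,caption" > urls.csv
jq -r '[.url, .title] | @csv' scraped.jsonl >> urls.csv
img2dataset --url_list=urls.csv --input_format=csv --url_col=url --caption_col=caption --output_folder=scraped_dataset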
Monitoring Progress
# Enable Weights & Biases tracking
img2dataset --enable_wandb=True --wandb_project=my_dataset
# Or check output folder
ls -lh dataset/
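img2dataset also writes per-shard statistics files next to the output shards (the exact layout can vary by version), which record how many URLs succeeded or failed:
# Inspect per-shard download statistics (assumed *_stats.json naming)
cat dataset/*_stats.json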
Resources
Related Resources
- File Permissions - Make scripts executable
- Data Loading Optimization - Efficient data loading
- HPC Storage - Managing datasets on clusters
- GPU Monitoring - Monitor download progress