Dataset Download Scripts

This page provides optimized scripts for downloading popular datasets. These scripts include:

  • Parallel downloads for faster completion
  • Automatic extraction and cleanup
  • Progress tracking
  • Error handling

COCO Dataset

The Microsoft COCO (Common Objects in Context) dataset is widely used for object detection, segmentation, and captioning tasks.

Dataset Size: ~18GB (train), ~1GB (val), ~6GB (test)

Download complete COCO 2017 dataset
#!/bin/bash
# Download COCO 2017 dataset with all splits and annotations
set -e  # stop on the first error

# Create directory structure
mkdir -p coco/images
cd coco/images

# Download images (in parallel for speed)
echo "Downloading images..."
wget -q --show-progress http://images.cocodataset.org/zips/train2017.zip &
wget -q --show-progress http://images.cocodataset.org/zips/val2017.zip &
wget -q --show-progress http://images.cocodataset.org/zips/test2017.zip &
wait

# Extract images
echo "Extracting images..."
unzip -q train2017.zip &
unzip -q val2017.zip &
unzip -q test2017.zip &
wait

# Clean up zips
rm train2017.zip val2017.zip test2017.zip

# Download annotations
cd ../
echo "Downloading annotations..."
wget -q --show-progress http://images.cocodataset.org/annotations/annotations_trainval2017.zip
wget -q --show-progress http://images.cocodataset.org/annotations/stuff_annotations_trainval2017.zip
wget -q --show-progress http://images.cocodataset.org/annotations/image_info_test2017.zip

# Extract annotations
echo "Extracting annotations..."
unzip -q annotations_trainval2017.zip
unzip -q stuff_annotations_trainval2017.zip
unzip -q image_info_test2017.zip

# Clean up
rm annotations_trainval2017.zip stuff_annotations_trainval2017.zip image_info_test2017.zip

echo "COCO dataset download complete!"
echo "Location: $(pwd)"

ImageNet

ImageNet requires registration. Use this script after obtaining download credentials.

Download ImageNet with credentials
#!/bin/bash
# Requires: username and access key from ImageNet
set -e  # stop on the first error

# Set your credentials
USERNAME="your_username"
ACCESS_KEY="your_access_key"

# Download training data (138GB)
wget --user="$USERNAME" --password="$ACCESS_KEY" \
  https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_train.tar

# Download validation data (6.3GB)
wget --user="$USERNAME" --password="$ACCESS_KEY" \
  https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar

# Extract training data
mkdir -p train && tar -xf ILSVRC2012_img_train.tar -C train/
cd train
for f in *.tar; do
  d=$(basename "$f" .tar)
  mkdir -p "$d"
  tar -xf "$f" -C "$d"
  rm "$f"
done
cd ..

# Extract validation data
mkdir -p val && tar -xf ILSVRC2012_img_val.tar -C val/

echo "ImageNet download complete!"

CIFAR-10 and CIFAR-100

These small datasets download quickly and are typically handled directly by PyTorch or TensorFlow.

Download CIFAR with PyTorch
import torchvision
import torchvision.transforms as transforms

# CIFAR-10 (60,000 images, ~170MB)
trainset = torchvision.datasets.CIFAR10(
    root='./data',
    train=True,
    download=True,
    transform=transforms.ToTensor()
)

# CIFAR-100 (60,000 images, ~170MB)
trainset = torchvision.datasets.CIFAR100(
    root='./data',
    train=True,
    download=True,
    transform=transforms.ToTensor()
)
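Once downloaded, the dataset plugs straight into a standard DataLoader; a minimal iteration skeleton:

from torch.utils.data import DataLoader

# Wrap the dataset in a DataLoader for batched, shuffled iteration
trainloader = DataLoader(trainset, batch_size=64, shuffle=True, num_workers=2)

for images, labels in trainloader:
    # images: [B, 3, 32, 32] float tensors, labels: [B] class indices
    pass  # training step goes here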

Custom Datasets

Use this template for custom datasets:

Generic dataset download template
#!/bin/bash
# Template for downloading any dataset
set -e  # stop on the first error

# Configuration
DATASET_NAME="my_dataset"
DATASET_URL="https://example.com/dataset.zip"
TARGET_DIR="./datasets/${DATASET_NAME}"

# Create directory
mkdir -p "$TARGET_DIR"
cd "$TARGET_DIR"

# Download with progress bar and resume capability
wget -c --show-progress "$DATASET_URL"

# Extract based on file type
FILENAME=$(basename "$DATASET_URL")
case "$FILENAME" in
  *.zip)
    echo "Extracting ZIP..."
    unzip -q "$FILENAME"
    ;;
  *.tar.gz|*.tgz)
    echo "Extracting TAR.GZ..."
    tar -xzf "$FILENAME"
    ;;
  *.tar)
    echo "Extracting TAR..."
    tar -xf "$FILENAME"
    ;;
  *)
    echo "Unknown archive format: $FILENAME" >&2
    exit 1
    ;;
esac

# Clean up archive
rm "$FILENAME"

echo "Dataset downloaded to: $TARGET_DIR"

Download Tips

# Install aria2 (much faster than wget)
sudo apt install aria2

# Download with multiple connections
aria2c -x 16 -s 16 http://images.cocodataset.org/zips/train2017.zip

# wget automatically resumes with -c flag
wget -c http://example.com/large-dataset.zip

# aria2 resumes automatically
aria2c -x 16 http://example.com/large-dataset.zip

# If the dataset provides checksums, verify them
md5sum downloaded_file.zip
sha256sum downloaded_file.zip

# Compare with provided checksum
echo "expected_checksum  downloaded_file.zip" | md5sum -c

img2dataset: Large-Scale Image Dataset Creation

img2dataset is a powerful tool for downloading large-scale image datasets from URLs. It can download, resize, and package 100M+ images efficiently.

  • Fast: Download 100M URLs in ~20 hours on one machine
  • Formats: WebDataset (recommended), files, parquet, tfrecord
  • Distributed: Multi-processing and multi-node support
  • Resume: Incremental downloads if interrupted
  • Compliant: Respects robots.txt and no-AI headers
Install img2dataset with pip:

pip install img2dataset

Download from URL list
# Create a CSV with columns: url, caption (optional)
# Format: url,caption
# https://example.com/image1.jpg,A beautiful sunset
# https://example.com/image2.jpg,Mountain landscape

img2dataset \
  --url_list=urls.csv \
  --output_folder=dataset \
  --thread_count=64 \
  --image_size=256 \
  --output_format=webdataset \
  --input_format=csv \
  --url_col=url \
  --caption_col=caption
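If the URL list comes from your own scraping or a database, a short Python snippet can write urls.csv in the format shown above; the rows here are placeholders.

# Write a url,caption CSV in the format img2dataset expects (csv is stdlib)
import csv

rows = [  # placeholder data: replace with your own (url, caption) pairs
    ("https://example.com/image1.jpg", "A beautiful sunset"),
    ("https://example.com/image2.jpg", "Mountain landscape"),
]

with open("urls.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "caption"])  # header matches --url_col / --caption_col
    writer.writerows(rows)
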
Common options
# Image processing
--image_size=256              # Resize images
--resize_mode=border          # border, center_crop, keep_ratio, etc.
--resize_only_if_bigger=True  # Don't upscale small images

# Performance
--processes_count=16          # Number of processes
--thread_count=64             # Threads per process
--distributor=multiprocessing  # or pyspark for distributed runs

# Output format
--output_format=webdataset    # webdataset, tfrecord, parquet, files
--output_folder=dataset

# Resume interrupted downloads
--incremental_mode=incremental  # Resume from where it stopped

# Filtering
--min_image_size=200          # Skip images smaller than 200px
--max_image_area=262144       # Skip images larger than 512x512 pixels
--enable_wandb=True           # Track progress with Weights & Biases
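img2dataset can also be called from Python rather than the CLI; the sketch below mirrors the command-line options above (parameter names follow the img2dataset Python API, so double-check them against the version you have installed).

# Same options as the CLI examples, driven from Python
# (verify parameter names against your installed img2dataset version)
from img2dataset import download

download(
    url_list="urls.csv",
    input_format="csv",
    url_col="url",
    caption_col="caption",
    output_folder="dataset",
    output_format="webdataset",
    image_size=256,
    resize_mode="border",
    processes_count=16,
    thread_count=64,
    distributor="multiprocessing",
)
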
Download 1M samples from LAION-400M
# Download metadata
wget https://the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/part-00000-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet

# Download images
img2dataset \
  --url_list=part-00000-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet \
  --input_format=parquet \
  --url_col=URL \
  --caption_col=TEXT \
  --output_format=webdataset \
  --output_folder=laion400m \
  --processes_count=16 \
  --thread_count=128 \
  --image_size=384 \
  --resize_mode=keep_ratio \
  --resize_only_if_bigger=True \
  --enable_wandb=True
Load WebDataset in PyTorch
import webdataset as wds
from torch.utils.data import DataLoader
from torchvision import transforms

# Create dataset
dataset = (
    wds.WebDataset("dataset/{00000..00099}.tar")
    .decode("pil")
    .to_tuple("jpg;png", "txt")
    .map_tuple(transforms.ToTensor(), lambda x: x)
)

# Create dataloader
dataloader = DataLoader(dataset, batch_size=32, num_workers=4)

for images, captions in dataloader:
    # Train your model
    pass

Common Use Cases

1. Create Custom Dataset from URLs

# Your own list of image URLs
img2dataset --url_list=my_urls.txt --output_folder=my_dataset

2. Download Public Datasets

  • CC3M: 3M image-text pairs (~1 hour)
  • CC12M: 12M pairs (~5 hours)
  • LAION-400M: 400M pairs (distributed)
  • LAION-5B: 5B pairs (distributed cluster)

3. Web Scraping Results: Convert web scraping results into training datasets.

# Enable Weights & Biases tracking
img2dataset --enable_wandb=True --wandb_project=my_dataset

# Or check output folder
ls -lh dataset/
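For a quick programmatic view of progress, counting the shards and total size in the output folder is usually enough; the folder name matches the examples above.

# Summarize the shards written to the output folder so far
from pathlib import Path

out = Path("dataset")
shards = sorted(out.glob("*.tar"))
total_gb = sum(s.stat().st_size for s in shards) / 1e9

print(f"{len(shards)} shards, {total_gb:.1f} GB total")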