Fine-tuning vision-language foundation models has emerged as a powerful approach to leveraging internet-scale data for generalization in downstream applications. A particularly promising source of representations already used in supervised learning …