Fréchet inception distance

The Fréchet inception distance (FID) is a metric used to assess the quality of images created by a generative model, like a generative adversarial network (GAN)[1] or a diffusion model.[2][3]

The FID compares the distribution of generated images with the distribution of a set of real images (a "ground truth" set). Rather than comparing individual images, mean and covariance statistics of many images generated by the model are compared with the same statistics generated from images in the ground truth or reference set. A convolutional neural network such as an inception architecture is used to produce higher-level features describing the images, thus leading to the name Fréchet inception distance.

The FID is inspired by the earlier inception score (IS) metric, which evaluates only the distribution of generated images.[1] The FID metric does not replace the IS metric; models that achieve the best (lowest) FID score tend to have greater sample variety, while models achieving the best (highest) IS score tend to have better quality within individual images.[2]

The FID metric was introduced in 2017,[1] and is the current standard metric for assessing the quality of models that generate synthetic images as of 2024. It has been used to measure the quality of many recent models including the high-resolution StyleGAN1[4] and StyleGAN2[5] networks, and diffusion models.[2][3]

The FID attempts to compare images visually through deep layers of an inception network. More recent works take this further by instead comparing CLIP embeddings of the images.[6][7]

Overview


The purpose of the FID score is to compare the diversity of images created by a generative model with that of a reference dataset. The reference dataset could be ImageNet or COCO-2014.[3][8] Using a large dataset as a reference is important, as the reference image set should represent the full diversity of images which the model attempts to create.

Generative models such as diffusion models produce novel images that have features from the reference set, but are themselves quite different from any image in the training set. So the quality of these models cannot be assessed by simply comparing each image to an image in the training set pixel-by-pixel, as done, for example, with the L2 norm.
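To see why pixel-by-pixel comparison is a poor quality measure, note that shifting an image by a single pixel leaves its content essentially unchanged but produces a large pixel-wise L2 distance. A minimal sketch with a synthetic random image:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((32, 32))          # a synthetic 32x32 "image"
shifted = np.roll(img, 1, axis=1)   # same content, shifted by one pixel

# The pixel-wise L2 distance is large even though the two images are
# visually near-identical, so it cannot serve as a perceptual metric.
l2 = np.linalg.norm(img - shifted)
```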

Instead, the FID models the two sets of images as if they were drawn from two multidimensional Gaussian distributions $\mathcal{N}(\mu, \Sigma)$ and $\mathcal{N}(\mu', \Sigma')$. The distance between the two distributions is calculated as the earth mover's distance, or Wasserstein distance, between the two Gaussian distributions.
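For intuition, in one dimension the 2-Wasserstein distance between two Gaussians has a particularly simple closed form, $\sqrt{(m_1 - m_2)^2 + (s_1 - s_2)^2}$; a minimal sketch:

```python
import numpy as np

def w2_gaussian_1d(m1, s1, m2, s2):
    """2-Wasserstein distance between 1-D Gaussians N(m1, s1^2) and N(m2, s2^2):
    the distance grows with both the gap in means and the gap in spreads."""
    return np.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)
```

For example, N(0, 1) and N(3, 25) are at distance 5 (a 3-4-5 triangle in mean/spread space). The multidimensional closed form used by the FID generalizes this expression.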

Rather than directly comparing images pixel by pixel (for example, as done by the L2 norm), the FID compares the mean and covariance of the activations of the deepest layer in Inception v3 (the 2048-dimensional activation vector of its last pooling layer). These layers are closer to the output nodes that correspond to real-world objects such as a specific breed of dog or an airplane, and further from the shallow layers near the input image. So the FID compares how often the same high-level features are found within the two sets of images. After every image has been processed through the inception architecture, the means and covariances of the activations of the last layer on the two datasets are compared with the distance

$\text{FID} = \|\mu - \mu'\|_2^2 + \operatorname{tr}\left(\Sigma + \Sigma' - 2(\Sigma \Sigma')^{1/2}\right).$

Higher distances indicate a poorer generative model; a score of 0 indicates a perfect model.

Formal definition


For any two probability distributions $\mu, \nu$ over $\mathbb{R}^n$ having finite means and variances, their earth mover's distance or Fréchet distance is[9]

$d_F(\mu, \nu) := \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\mathbb{R}^n \times \mathbb{R}^n} \|x - y\|_2^2 \, d\gamma(x, y) \right)^{1/2},$

where $\Gamma(\mu, \nu)$ is the set of all measures on $\mathbb{R}^n \times \mathbb{R}^n$ with marginals $\mu$ and $\nu$ on the first and second factors respectively. (The set $\Gamma(\mu, \nu)$ is also called the set of all couplings of $\mu$ and $\nu$.)

For two multidimensional Gaussian distributions $\mathcal{N}(\mu, \Sigma)$ and $\mathcal{N}(\mu', \Sigma')$, it is expressed in closed form as[10]

$d_F(\mathcal{N}(\mu, \Sigma), \mathcal{N}(\mu', \Sigma'))^2 = \|\mu - \mu'\|_2^2 + \operatorname{tr}\left(\Sigma + \Sigma' - 2(\Sigma \Sigma')^{1/2}\right).$

This allows us to define the FID in pseudocode form:

INPUT a function $f : \Omega \to \mathbb{R}^n$.

INPUT two datasets $S, S' \subset \Omega$.

Compute $f(S), f(S')$.

Fit two Gaussian distributions $\mathcal{N}(\mu, \Sigma), \mathcal{N}(\mu', \Sigma')$, respectively for $f(S)$ and $f(S')$.

RETURN $d_F(\mathcal{N}(\mu, \Sigma), \mathcal{N}(\mu', \Sigma'))^2$.
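The pseudocode above can be sketched in NumPy/SciPy. This is an illustrative sketch, not a reference implementation: it assumes the feature function $f$ has already mapped both datasets to arrays of activation vectors of shape (num_images, feature_dim):

```python
import numpy as np
from scipy.linalg import sqrtm  # matrix square root

def fid(feats1, feats2):
    """Fréchet distance between Gaussians fitted to two feature arrays.

    feats1, feats2: arrays of shape (num_images, feature_dim) holding the
    network activations for each dataset.
    """
    # Fit a Gaussian to each dataset: empirical mean and covariance.
    mu1, mu2 = feats1.mean(axis=0), feats2.mean(axis=0)
    sigma1 = np.cov(feats1, rowvar=False)
    sigma2 = np.cov(feats2, rowvar=False)

    # Closed-form squared Fréchet distance between the two Gaussians.
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        # Numerical error can introduce a tiny imaginary component.
        covmean = covmean.real
    diff = mu1 - mu2
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)
```

Identical feature sets give a score near 0, and the score grows as the two fitted Gaussians diverge.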

In most practical uses of the FID, $\Omega$ is the space of images, and $f$ is an Inception v3 model trained on ImageNet, but without its final classification layer. Technically, $f$ returns the 2048-dimensional activation vector of the model's last pooling layer. Of the two datasets $S, S'$, one is a reference dataset, which could be ImageNet itself, and the other is a set of images generated by a generative model, such as a GAN or a diffusion model.[1]

Variants


Specialized variants of FID have been suggested as evaluation metrics for music enhancement algorithms as the Fréchet Audio Distance (FAD),[11] for generative models of video as the Fréchet Video Distance (FVD),[12] and for AI-generated molecules as the Fréchet ChemNet Distance (FCD).[13]

Limitations


Chong and Forsyth[14] showed FID to be statistically biased, in the sense that its expected value over a finite sample is not its true value. Also, because FID measures the Wasserstein distance towards the ground-truth distribution, it is inadequate for evaluating the quality of generators in domain adaptation setups or in zero-shot generation. Finally, while FID is more consistent with human judgment than the previously used inception score, there are cases where FID is inconsistent with human judgment (e.g., Figures 3 and 5 in Liu et al.).[15]


References

  1. ^ a b c d Heusel, Martin; Ramsauer, Hubert; Unterthiner, Thomas; Nessler, Bernhard; Hochreiter, Sepp (2017). "GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium". Advances in Neural Information Processing Systems. 30. arXiv:1706.08500.
  2. ^ a b c Ho, Jonathan; Salimans, Tim (2022). "Classifier-Free Diffusion Guidance". arXiv:2207.12598 [cs.LG].
  3. ^ a b c Esser, Patrick; Kulal, Sumith; Blattmann, Andreas; Entezari, Rahim; Müller, Jonas; Saini, Harry; Levi, Yam; Lorenz, Dominik; Sauer, Axel (2024-03-05). "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis". arXiv:2403.03206 [cs.CV].
  4. ^ Karras, Tero; Laine, Samuli; Aila, Timo (2020). "A Style-Based Generator Architecture for Generative Adversarial Networks". IEEE Transactions on Pattern Analysis and Machine Intelligence. PP (12): 4217–4228. arXiv:1812.04948. doi:10.1109/TPAMI.2020.2970919. PMID 32012000. S2CID 211022860.
  5. ^ Karras, Tero; Laine, Samuli; Aittala, Miika; Hellsten, Janne; Lehtinen, Jaakko; Aila, Timo (23 March 2020). "Analyzing and Improving the Image Quality of StyleGAN". arXiv:1912.04958 [cs.CV].
  6. ^ Jayasumana, Sadeep; Ramalingam, Srikumar; Veit, Andreas; Glasner, Daniel; Chakrabarti, Ayan; Kumar, Sanjiv (2024). "Rethinking FID: Towards a Better Evaluation Metric for Image Generation": 9307–9315.
  7. ^ Hessel, Jack; Holtzman, Ari; Forbes, Maxwell; Ronan Le Bras; Choi, Yejin (2021). "CLIPScore: A Reference-free Evaluation Metric for Image Captioning". arXiv:2104.08718 [cs.CV].
  8. ^ Lin, Tsung-Yi; Maire, Michael; Belongie, Serge; Bourdev, Lubomir; Girshick, Ross; Hays, James; Perona, Pietro; Ramanan, Deva; Zitnick, C. Lawrence (2015-02-20). "Microsoft COCO: Common Objects in Context". arXiv:1405.0312 [cs.CV].
  9. ^ Fréchet, M. (1957). "Sur la distance de deux lois de probabilité. ("On the distance between two probability laws")". C. R. Acad. Sci. Paris. 244: 689–692. Translated abstract: The author indicates an explicit expression of the distance of two probability laws, according to the first definition of Paul Lévy. He also indicates a convenient modification of this definition.
  10. ^ Dowson, D. C; Landau, B. V (1 September 1982). "The Fréchet distance between multivariate normal distributions". Journal of Multivariate Analysis. 12 (3): 450–455. doi:10.1016/0047-259X(82)90077-X. ISSN 0047-259X.
  11. ^ Kilgour, Kevin; Zuluaga, Mauricio; Roblek, Dominik; Sharifi, Matthew (2019-09-15). "Fréchet Audio Distance: A Reference-Free Metric for Evaluating Music Enhancement Algorithms". Interspeech 2019: 2350–2354. doi:10.21437/Interspeech.2019-2219. S2CID 202725406.
  12. ^ Unterthiner, Thomas; Steenkiste, Sjoerd van; Kurach, Karol; Marinier, Raphaël; Michalski, Marcin; Gelly, Sylvain (2019-03-27). "FVD: A new Metric for Video Generation". Open Review.
  13. ^ Preuer, Kristina; Renz, Philipp; Unterthiner, Thomas; Hochreiter, Sepp; Klambauer, Günter (2018-09-24). "Fréchet ChemNet Distance: A Metric for Generative Models for Molecules in Drug Discovery". Journal of Chemical Information and Modeling. 58 (9): 1736–1741. arXiv:1803.09518. doi:10.1021/acs.jcim.8b00234. PMID 30118593. S2CID 51892387.
  14. ^ Chong, Min Jin; Forsyth, David (2020-06-15). "Effectively Unbiased FID and Inception Score and where to find them". arXiv:1911.07023 [cs.CV].
  15. ^ Liu, Shaohui; Wei, Yi; Lu, Jiwen; Zhou, Jie (2018-07-19). "An Improved Evaluation Framework for Generative Adversarial Networks". arXiv:1803.07474 [cs.CV].