AI-generated artwork detection using self-distilled transformers with global–local feature learning and Grad-CAM interpretability | Scientific Reports

Scientific Reports volume 16, Article number: 497 (2026)
This study presents a robust framework for detecting artificial intelligence-generated artwork using digital imaging and deep learning-based transformer models, helping the art community distinguish authentic human-created art from highly convincing machine-generated content. Art reflects deep cultural, historical, and social significance; however, rapid advances in artificial intelligence, particularly in generative adversarial networks and diffusion models, have enabled the production of visually compelling artworks that blur the boundary between originality and synthesis. Traditional methods that rely on conventional features and statistical analysis perform increasingly poorly against such advanced generation techniques, highlighting the need for more advanced detection mechanisms. To address this challenge, the proposed approach employs Distillation with No Labels (DINO) v2, a self-distilled transformer model that excels at extracting discriminative features by capturing both global structures and fine-grained visual cues. The model was trained and evaluated on a balanced dataset of real and AI-generated art images, with results benchmarked against strong baselines. Experimental findings demonstrate that the proposed framework achieves 99.01% accuracy, 95.29% precision, 94.58% recall, 94.93% F1-score, and an AUC of 99%, outperforming all baselines with superior generalization. Furthermore, interpretability methods, together with statistical validation based on the log of p-values, confirmed that predictions are both dependable and transparent.
Art has always reflected human culture, creativity, and identity, from ancient cave paintings to modern digital illustrations. It captures not only aesthetic beauty but also the values, emotions, and narratives of societies across centuries¹. With its ability to transcend geographical and temporal boundaries, art serves as a universal language that communicates human experience in ways words often cannot². In today’s digital era, however, the landscape of artistic creation has been gradually transformed, challenging the very notions of authenticity, originality, and authorship³.
The emergence of artificial intelligence (AI) has brought new dimensions to the creation of art. Sophisticated models such as Generative Adversarial Networks (GANs) and diffusion-based models can now generate artworks that are highly realistic and span a variety of styles⁴. These innovations provide new creative opportunities; on the other hand, they pose serious problems for art authentication, cultural preservation, and intellectual property⁵. Conventional authentication methods, which were once effective at detecting manual forgery, can no longer detect the irregular patterns built into AI-generated images⁶. AI-generated art is becoming more common by the day as numerous AI tools for creating images are introduced. This widening gap underlines the urgent need for smart, automatic detection systems that can distinguish real human-made artworks from machine-generated fakes⁷.
Previous approaches relied heavily on traditional feature extraction⁸ and the statistical analysis of image textures, color histograms, and brushstroke patterns⁹. Although these methods offered useful information, their limited flexibility in handling more complicated generative processes and their poor generalization across datasets greatly weakened their performance¹⁰. As AI image generation technologies grow more sophisticated, more advanced systems are needed that can effectively address the changing nature of digital art forgery¹¹.
This paper presents a model for classifying real and AI-generated artwork with the help of a transformer-based model, DINOv2. The model can pick up global artistic patterns and fine-grained anomalies by combining deep feature extraction, self-distillation, and transformer-based representation learning. The framework not only provides state-of-the-art performance but also incorporates interpretability techniques into its decision-making, which supports its applicability in digital art forensics and cultural preservation. The research contributions are:
Proposing a DINOv2-based framework for detecting AI-generated versus authentic artworks, achieving superior performance with 99.01% accuracy and setting a new benchmark in AI-artwork detection.
Comprehensive benchmarking against baseline models including SAM, ConvNeXt, and Swin Transformers.
Incorporation of interpretability methods, including Grad-CAM and LIME, to enhance model transparency and trust.
Statistical validation of model predictions to ensure reliability and consistency across diverse image types.
This study is structured as follows. Section 2 reviews related work, highlighting prior research on AI-generated image detection in digital art. Section 3 describes the proposed methodology, covering the dataset, preprocessing, the baseline models, and the DINOv2 framework. Section 4 reports the experimental results with a detailed evaluation and an interpretability analysis. Section 5 discusses findings and implications, while Sect. 6 concludes the study and outlines directions for future research.
Early work on the detection of AI-generated images showed that even high-quality GAN images commonly contain subtle artifacts. A classifier trained on one GAN’s outputs could generalize and detect images from many other GAN and transformer models¹². This implied that CNN-based generators produce images with systematic flaws that classifiers can identify¹³. The demand for strong forensic tools against AI fakes then grew rapidly; data-driven deep learning approaches were also explored, but they still struggled with evolving fakes that improved by addressing earlier shortcomings¹⁴.
Many early methods examined artifacts in either the pixel or the frequency domain using classical digital forensics characteristics, including Photo Response Non-Uniformity (sensor noise) and Error Level Analysis, and fed them to a CNN, obtaining over 95% accuracy in classifying AI-generated images versus real pictures¹⁵. Likewise, the Fourier spectrum can be used directly to discern signatures: an analysis of StyleGAN outputs revealed that fake samples differ in their high frequencies, so in principle a DFT-based approach can flag fakes by their spectral abnormalities¹⁶. GAN upsampling does not replicate natural image spectra, which provides an essential frequency-domain cue. Detectors should nevertheless not put all their eggs in the spectral-artifact basket, because some spectral patterns will weaken as generative models continue to improve¹⁷. Detectors that combine spatial-domain and frequency-domain cues are thus more robust.
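The spectral cue described above can be illustrated with a small sketch. It is not taken from any of the cited detectors: the function name `high_frequency_ratio` and the `cutoff` value are assumptions chosen for illustration, and a real detector would feed such spectral statistics to a trained classifier rather than inspect them directly.

```python
import numpy as np

def high_frequency_ratio(image: np.ndarray, cutoff: float = 0.25) -> float:
    """Share of spectral energy above `cutoff` (in cycles per pixel)
    for a 2-D grayscale image. `cutoff` is an illustrative choice."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    power = np.abs(spectrum) ** 2
    h, w = image.shape
    # Frequency coordinates along each axis, shifted to match `spectrum`.
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    total = power.sum()
    if total == 0:
        return 0.0
    return float(power[radius > cutoff].sum() / total)

# Toy inputs: a flat image has no high-frequency energy, while a
# checkerboard concentrates half of its energy at the highest frequency.
flat = np.full((8, 8), 0.5)
checker = np.indices((8, 8)).sum(axis=0) % 2
flat_ratio = high_frequency_ratio(flat)
checker_ratio = high_frequency_ratio(checker)
```

Comparing the distribution of such a ratio between real and generated images is one simple way to see whether a generator leaves a spectral trace.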
Another line of research uses deep neural networks to investigate the intrinsic fingerprints left by generative models. T-GD, a Transferable GAN Detection method, applies teacher–student training on data from one GAN in order to transfer to new GANs; this self-training increased generalization to unseen generators¹⁸. In the same vein, a patch-based CNN searches local neighborhoods of the probe image for AI artefacts and showed that a small patch size is enough to expose GAN glitches that reveal an image’s synthetic origin¹⁹. In addition, attention mechanisms were studied to model non-local dependencies: attention layers capture long-range dependencies between pixels through semantically gated deep features, modeling holistic anomalies in AI images that cannot be detected with local filters alone²⁰.
One study focused on noise patterns and designed a dedicated block to recover latent noise from images. The authors noticed that real images carry natural sensor noise, whereas GAN images often have their own atypical noise sources; by training on these discrepancies, their simple noise-based classifier worked well across a range of GAN and flow models²¹. Another work used unsupervised domain adaptation to address the domain gap. It proposed aligning the feature distributions of the real and fake domains, without labels and using new generators, to make the detector more robust against AI images from new sources, while using augmentation techniques to prevent overfitting on known fakes²². Furthermore, a multi-head system was introduced to integrate global image context and local patch-level information. The method collected global-scale statistics and local-level features from numerous image patches, which were then integrated through an attention-based channel pooling strategy²³. Similarly, ensembles of CNNs over diverse image patches vote on authenticity. Different sub-detectors can recognize color aberration or shape distortion, and such an ensemble architecture converges on better decisions. The downside is a higher computational cost, which some works mitigate with lightweight models²⁴. Through contrastive learning, attention-based branches were encouraged to concentrate on complementary clues, reaching good generalization with moderate complexity²⁵.
Diffusion models such as DALL-E 2 and Stable Diffusion brought new challenges. Diffusion Reconstruction Error (DIRE) was introduced to address them: an image is fed into a pre-trained diffusion model, and the method measures how well it can be recovered by diffusion sampling. The key insight is that diffusion models reconstruct their own generated images noticeably better than real images²⁶. Vision transformers with attention have also been used to process diffusion outputs. One hybrid model was built from an attention-guided CNN branch and a ViT-based branch, covering detailed texture as well as global context. This model achieved over 86% accuracy at detecting AI-generated interior design images and artwork on test sets, comparing favorably with vanilla CNNs²⁷.
A ViT with adapter modules for continual learning tackles a more diverse collection of generative models by learning new “domains” of fakes incrementally without forgetting old ones. A content-agnostic adapter on the ViT learns universal forensic cues with the help of a token shuffling technique, which alleviates overfitting to local patterns in image semantics²⁸. Beyond raw performance, other works targeted efficiency and practicality. One presented a ResNet-SE hybrid model that incorporates Squeeze-and-Excitation attention modules into ResNets. The model was 96.1% accurate at recognizing fakes in a public AI-generated image dataset (CIFAKE) and more computationally efficient than transformer-based models. SE attention pushes the CNN to highlight subtle differences at minor computational cost²⁹. Most recent research agrees, however, that multi-scale and attention-based approaches are necessary for robustness. Most top-performing methods today combine frequency features, spatial cues, and deep features from pre-trained models³⁰.
In summary, the state of the art (SOTA) in AI art detection changes fast. Early CNN classifiers that exploited easy-to-learn GAN weaknesses have evolved into sophisticated multi-scale and transformer-based models that combine frequency analysis, attention mechanisms, and even incremental learning elements. Accuracy in seen-generator scenarios currently exceeds 92% for hybrid models and tailored transformers³¹. Signals of authenticity are also attracting interest as supplementary approaches, and recent research shows that metadata can itself be an effective feature in some cases.
This section presents a systematic approach to detecting AI-generated art using state-of-the-art neural networks. The procedure runs from dataset preparation to the model evaluations on which detection capabilities are benchmarked; the framework methodology is shown in Fig. 1. This strategy yields a clean, robust, and scalable pipeline for separating real from AI artworks.
Methodological diagram showing the complete flow of the applied process, where each figure concept is defined and explained throughout the manuscript.
Before training the deep learning classifier to detect AI-generated art, the dataset passes through several preprocessing steps aimed at improving quality, regularity, and diversity. Image enhancement methods \(\mathcal{E}\) modulate brightness, contrast, or sharpness so that subtle generative artefacts become visible to the model³². Resizing \(\mathcal{R}\) gives every image the same size, which batch processing during modeling requires. Noise removal \(\mathcal{N}\) reduces undesired distortions and compression artefacts that could bias learning, leaving more informative details to be learnt³³. The dataset is enlarged, and overfitting is reduced, by flipping \(\mathcal{F}\), cropping \(\mathcal{C}\), rotation \(\mathcal{T}_{\theta}\), etc. Both horizontal and vertical flipping produce mirrored versions of images, cropping exposes the model to more kinds of partial regions, and rotation exposes the network to local orientation variations³⁴.
Together, these preprocessing techniques raise data quality, improve generalization, and stabilize model behavior in discriminating real from AI-synthesized images; results after preprocessing are shown in Fig. 2.
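The flipping, cropping, and rotation steps can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function name `augment`, the 50% flip probabilities, the `crop_frac` value, and the restriction of rotation to multiples of 90 degrees are all assumptions made to keep the example short.

```python
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator,
            crop_frac: float = 0.9) -> np.ndarray:
    """Return a randomly flipped, rotated and cropped view of an H x W x C image.
    Probabilities and crop_frac are illustrative, not values from the paper."""
    out = image
    if rng.random() < 0.5:
        out = np.fliplr(out)                      # horizontal flip
    if rng.random() < 0.5:
        out = np.flipud(out)                      # vertical flip
    # Rotation restricted to k * 90 degrees (a simplification of T_theta).
    out = np.rot90(out, k=int(rng.integers(0, 4)))
    h, w = out.shape[:2]
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    top = int(rng.integers(0, h - ch + 1))        # random crop origin
    left = int(rng.integers(0, w - cw + 1))
    return out[top:top + ch, left:left + cw]

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)  # stand-in image
augmented = augment(image, rng)
```

A cropped view would normally be resized back to the network's input size before batching.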
Distinguishing between AI-created and real artworks requires a strong representation that can learn local artifacts as well as global semantics. To this end, the approach uses DINO v2, a self-supervised vision transformer framework that has demonstrated success in learning transferable features without explicit annotations³⁵. The DINOv2 pipeline relies on two stages: (i) feature extraction, which learns deep hierarchical embeddings through self-distillation; and (ii) classification, where these embeddings are projected to a discriminative space to distinguish real from AI-generated images.
Sample images from the preprocessing phase, showing the outcomes of the applied preprocessing steps.
DINOv2 first divides each image into non-overlapping patches and projects them into a high-dimensional embedding space. These embeddings are then fed into transformer encoder layers with multi-head self-attention mechanisms that can model local and long-range dependencies. A notable component of DINOv2 is the teacher–student distillation mechanism, in which the student network learns to match the soft distributions produced by the teacher; this is expected to yield stable and transferable representations from the patch-level feature extraction.
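The patch-splitting and projection step can be sketched with plain array operations. The patch size of 16, the embedding width of 384, and the random projection matrix `W` are assumptions for illustration only; in the actual model the projection is learned and the patch size is fixed by the pretrained checkpoint.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an H x W x C image into non-overlapping patches and flatten each
    patch into one row. `patch` is an illustrative size."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image sides must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)                # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))                 # stand-in for a resized artwork
tokens = patchify(image)                          # one row per patch
# Hypothetical linear projection; a trained model learns these weights.
W = rng.normal(scale=0.02, size=(tokens.shape[1], 384))
embeddings = tokens @ W                           # patch embeddings for the encoder
```

For a 256 × 256 input this gives 256 patch tokens, to which a transformer would add a [CLS] token and positional information before the encoder layers.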
The attention mechanism computes a weighted aggregation of values \(v_{j,h}\) across all patches, with weights determined by the scaled products of queries \(q_{i,h}^{\mathrm{T}}\) and keys \(k_{j,h}\); this extraction is shown in Fig. 3. The additional nonlinear residual term \(\beta \cdot \sigma(U x_{i})\) enhances sensitivity to subtle AI-generated distortions such as texture irregularities or edge inconsistencies.
Image feature extraction analysis using the proposed model.
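The equation this paragraph describes is not reproduced in the present text. One form consistent with the description, offered as an assumption rather than the authors' exact formula, is given here; the output symbol \(z_i\), the head count \(H\), the patch count \(N\), the per-head dimension \(d_h\), and the per-head output projection \(W_h\) are notation introduced for this sketch.

```latex
z_i = \sum_{h=1}^{H} W_h \sum_{j=1}^{N}
      \operatorname{softmax}_j\!\left( \frac{q_{i,h}^{\mathrm{T}} k_{j,h}}{\sqrt{d_h}} \right) v_{j,h}
      \;+\; \beta \cdot \sigma\!\left( U x_i \right)
```

The first term is standard multi-head self-attention written as a sum over heads; the second is the nonlinear residual term named in the text.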
The [CLS] token of the last transformer block provides a global summary of the image once features are extracted. DINOv2 projects this embedding into a discriminative space, and the representations are optimized with a self-distillation loss that forces the student’s predictions on different augmentations of the same image to match the teacher’s³⁶. This alignment provides robustness against variations such as cropping, flipping, and noise when computing the classification probability.
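A minimal sketch of a DINO-style self-distillation objective follows. It assumes the usual formulation in which the teacher output is centred and sharpened with a low temperature and the teacher weights track the student through an exponential moving average; the temperatures, the momentum, and all function names are illustrative assumptions, not settings reported in this study.

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float) -> np.ndarray:
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)         # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, center,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the centred, sharpened teacher distribution
    and the student distribution, averaged over the batch."""
    teacher = softmax(teacher_logits - center, teacher_temp)
    student = softmax(student_logits, student_temp)
    return float(-(teacher * np.log(student + 1e-12)).sum(axis=-1).mean())

def ema_update(teacher_w, student_w, momentum=0.996):
    """The teacher is not trained by gradients; it follows the student slowly."""
    return momentum * teacher_w + (1.0 - momentum) * student_w

# Toy check: a student that agrees with the teacher gets a lower loss.
teacher_out = np.array([[2.0, 0.0, -1.0]])
center = np.zeros(3)
loss_agree = distillation_loss(teacher_out.copy(), teacher_out, center)
loss_disagree = distillation_loss(-teacher_out, teacher_out, center)
```

In training, the two logits would come from different augmented views of the same image, which is what makes the learned features stable under cropping and flipping.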
This equation generalizes the standard SoftMax by adding a quadratic embedding regularizer, weighted by \(\gamma\), on \(\|F_{CLS}\|_{2}^{2}\), which discourages overconfident embeddings³⁷. This is particularly important in the real-vs-AI context, where the generation process can produce synthetic images that are convincingly realistic; the classification flow is shown in Fig. 4. The model becomes more robust against adversarially generated or well-crafted generative outputs when its confidence is calibrated.
Working of the classification layer with SoftMax on a sample image prediction.
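The equation referred to here is not reproduced in the present text. One plausible reading, stated as an assumption, places the quadratic term in the training objective, since a penalty shared by every class logit would cancel inside the SoftMax. The classifier weights \(w_c\), biases \(b_c\), and the loss symbol \(\mathcal{L}\) are notation introduced for this sketch.

```latex
P(y = c \mid x) =
  \frac{\exp\!\left( w_c^{\mathrm{T}} F_{CLS} + b_c \right)}
       {\sum_{c'} \exp\!\left( w_{c'}^{\mathrm{T}} F_{CLS} + b_{c'} \right)},
\qquad
\mathcal{L} = -\log P(y \mid x) \;+\; \gamma \, \| F_{CLS} \|_{2}^{2}
```

Under this reading, a larger \(\gamma\) keeps the norm of the [CLS] embedding small, which limits how extreme the logits, and therefore the predicted confidences, can become.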
The experimental setup requires a powerful computing system: a modern graphics card, sufficient compute, memory, and SSD storage to handle large-scale image processing. The models are implemented in Python in a deep learning environment installed on Ubuntu or Windows with CUDA-enabled acceleration, as listed in Table 1. This arrangement supports efficient training, evaluation, and reproduction of results. The baseline models used in this work are SAM, ConvNeXt, and Swin Transformers, as they complement each other in extracting features that can separate AI-generated from real art. S-CAM (Segment to Classify Anything Model) is a vision foundation model built on strong prompt-based segmentation, producing finely structured, boundary-level features capable of spotting artifacts in synthetic images³⁸. ConvNeXt, a modernized convolutional network design inspired by Transformers, retains the efficiency of CNNs but uses large kernels together with depthwise convolutions to capture both local textures and wider spatial dependencies, which are key for detecting generative irregularities³⁹. In contrast, Swin Transformers employ shifted-window multi-head self-attention (SW-MSA) to model image patches hierarchically, capturing local details and global context at a lower computational complexity than full self-attention⁴⁰. Together, these models serve as a complementary suite of feature extractors, with SAM concentrating on structural cues, ConvNeXt emphasizing texture and spatial hierarchies, and Swin Transformers capturing multi-scale contextual relationships, creating a solid base for AI-artwork identification.
The FauxFinder dataset is a balanced set of 21,642 images for a binary real vs. AI classification task and is publicly available on Kaggle⁴¹. All images are resized to 256 × 256 pixels for consistent processing; sample images from each class are shown in Fig. 5.
Sample images from the dataset for each class.
For the experiments, the dataset is split in a 70:30 ratio, with the distribution shown in Table 2, which makes it suitable for model development while providing benchmark performance on AI-generated artwork detection with computer vision models.
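A 70:30 split that preserves the class balance can be sketched as below. Whether the authors stratified their split is not stated, so stratification, the seed, the function name `stratified_split`, and the toy file names are all assumptions.

```python
import random
from collections import defaultdict

def stratified_split(samples, train_frac=0.7, seed=42):
    """Split (item, label) pairs into train and test lists while keeping
    each class at roughly the same proportion in both parts."""
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append((item, label))
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)                         # shuffle within the class
        cut = round(len(group) * train_frac)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Toy balanced set with hypothetical file names (not the real dataset).
samples = [(f"img_{i}.jpg", "real" if i % 2 == 0 else "ai") for i in range(1000)]
train, test = stratified_split(samples)
real_in_train = sum(1 for _, label in train if label == "real")
```

Splitting per class keeps the test set balanced, so accuracy on it is not skewed toward either class.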
This section presents the outcomes achieved by the proposed DINOv2 framework alongside the baseline models for classifying real and AI-generated artworks. The performance of the baseline and proposed models is compared using standard classification metrics (accuracy, precision, recall, and F1-score). Accuracy is the overall proportion of correctly classified images, but it can be misleading on imbalanced datasets. Precision is the proportion of images flagged as AI-generated that truly are AI-generated. Recall is the proportion of all AI-generated images that the model identifies as AI-generated; it measures how well the model detects AI-generated images with few false negatives. Because precision and recall often come at each other's expense, the F1-score combines them into a single harmonic mean, providing a more balanced way to evaluate detection performance. Taken together, these measures give a complete assessment of how well the models separate AI-generated artworks from real images.
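The four metrics can be computed directly from confusion-matrix counts, treating "AI-generated" as the positive class. The counts in the example are hypothetical and are not results from this study.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts,
    with AI-generated as the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts: 90 AI images caught, 5 missed,
# 10 real images wrongly flagged, 95 real images passed.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
```

Here precision is 90/100 and recall is 90/95, and the F1-score sits between the two because it is their harmonic mean.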
The baseline model evaluation starts with the accuracy results. Table 3 shows that SAM achieves the best performance (97.28%), followed by ConvNeXt (94.96%) and Swin Transformers (91.37%). These findings suggest that SAM is the most consistent of the baselines in separating real artworks from AI-generated ones. The training and validation accuracy curves support this: SAM's curves track each other closely with the smallest gap, which means it generalizes well. ConvNeXt shows robust learning with close accuracy trends and slight overfitting. Swin Transformers, by contrast, suffer from evident fluctuations in validation accuracy (i.e., training instability). These findings agree with the loss convergence analysis shown in Fig. 6: SAM achieves smooth and fast convergence for both training and validation losses, ConvNeXt converges well with mild variance, and Swin shows greater instability in validation loss. Overall, the results demonstrate that SAM strikes the best trade-off between accuracy and stability of predictions, ConvNeXt yields competitive yet less stable solutions, and Swin struggles to reach the generalization strength exhibited by the other models.
Analysis of training and validation accuracy & loss (a) SAM (b) ConvNeXt (c) Swin transformers.
The analysis of the baseline models draws further attention to a clear gap in their performance when discriminating real from AI-generated artwork. SAM obtained the best performance, with accuracy, precision, recall, and F1-score of 97.28%, 94.28%, 94.59%, and 94.43%. The corresponding confusion matrix is consistent with high reliability: 97.4% of fake and 97.1% of real images were accurately classified, with extremely low misclassification rates (2.6% and 2.9%, respectively). This is evidence of the effectiveness of SAM in capturing fine-grained features and reducing false positives and negatives. ConvNeXt had accuracy, precision, and recall of 94.96%, 91.49%, and 93.29%, respectively, while its F1-score was 92.38%. The confusion matrix also supports these results: 94.9% and 95.2% accuracy on real and fake images, respectively. Despite slightly lower performance than SAM, ConvNeXt still shows strong classification with moderate misclassifications (5.1% and 4.8%). This result indicates that ConvNeXt strikes a good balance between texture and spatial learning, while it is less accurate at resolving boundary-level artefacts.
Swin Transformers performed the worst but remained effective, with 91.37% accuracy, 86.39% precision, 88.29% recall, and an F1-score of 87.32%. More obvious misclassifications appear in the confusion matrix (Fig. 7): 90.8% of fake and 91.4% of real content were correctly classified, corresponding to error rates of 9.2% and 8.6%. This suggests Swin Transformers failed to adapt to subtle texture changes and were less dependable at capturing fine content details than the other two models. In total, the experimental results demonstrate that SAM consistently leads the baselines with better accuracy and well-balanced precision–recall trade-offs, as shown in both the quantitative results and the confusion matrix analysis. ConvNeXt remains competitive with strong results but generalizes slightly worse than SAM. Swin Transformers also lag in the generalization analysis, which is attributed to their hierarchical attention mechanism.
Confusion matrix analysis (a) SAM (b) ConvNeXt (c) Swin transformers.
The developed DINOv2 model significantly outperformed the baseline models, obtaining an accuracy of 99.01%, precision of 95.29%, recall of 94.58%, and F1-score of 94.93%. These findings show not only that the model can classify real and AI-generated artworks with high accuracy, but also that it does so with a balanced trade-off between precision and recall.
The high F1-score confirms that the model is effective at reducing false positives and false negatives, which supports sound deployment in practical classification tasks. The training and validation curves also confirm these results. The accuracy curves show rapid convergence, and the validation accuracy runs almost parallel to the training curve, which is a positive indication of strong generalization without significant overfitting. The validation loss fluctuates, but its overall direction is downward, indicating that the model converges and generalizes over the epochs, as shown in Fig. 8. This shows the capability of DINOv2 to pick up finer visual clues and complex generative artifacts that traditional architectures normally miss.
One advantage of the model lies in its self-distillation mechanism, through which it acquires transferable and discriminative representations without explicit supervision. With its transformer backbone, DINOv2 captures both global dependencies and local fine-grained information within images. This dual capability enables it to find concealed spatial inconsistencies, such as differences in textures, brushstrokes, and compositing patterns, that distinguish AI-generated images from genuine artworks. In conclusion, the results validate that the proposed DINOv2 model significantly improves upon the baseline models, with more robustness and adaptability as well as better classification capabilities in this challenging AI-generated artwork detection domain.
Proposed model accuracy and loss analysis based on training and validation process.
The proposed model was configured with well-tuned hyperparameters and achieves robust learning. The hyperparameters cover input representation, transformer depth, attention types, and optimization choices, as displayed in Table 4. The following regularization measures were applied to avoid overfitting and to stabilize convergence: dropout, weight decay, drop-connect, and a learning rate schedule. These settings helped the model learn both global structure and fine-grained features, which led to high accuracy and strong generalization for AI-generated artwork detection. The interpretability of the DINOv2 model was also investigated through Grad-CAM and LIME explanations, which shed more light on its decision-making process. Grad-CAM visualisations show the parts of the image that most influenced classification. For real works the model attends to fine texture patterns and brushstroke layout, whereas for generated content it focuses on irregular structural areas that betray the generative process. This supports the model’s capacity to identify minute spatial cues and slight stylistic aberrations that are hardly visible to the human eye.
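The core Grad-CAM computation can be sketched once the feature maps and their gradients are available. This sketch assumes both have already been extracted from the network (for a transformer, patch tokens reshaped to a spatial grid); the function name `grad_cam` and the toy arrays are hypothetical.

```python
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Grad-CAM heatmap from feature maps and the gradients of the class
    score with respect to them, both shaped (K, H, W)."""
    weights = gradients.mean(axis=(1, 2))          # one importance weight per channel
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum over channels
    cam = np.maximum(cam, 0.0)                     # keep only positive evidence
    peak = cam.max()
    return cam / peak if peak > 0 else cam         # scale to [0, 1] for display

# Toy feature maps: channel 0 supports the class, channel 1 opposes it.
activations = np.zeros((2, 4, 4))
activations[0, 1, 1] = 1.0
activations[1, 2, 2] = 1.0
gradients = np.stack([np.ones((4, 4)), -np.ones((4, 4))])
heatmap = grad_cam(activations, gradients)
```

The resulting map is usually upsampled to the input resolution and overlaid on the image to show which regions drove the prediction.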
LIME explanations are a good complement to Grad-CAM, giving local feature-level interpretations. The boundaries visualized in the LIME maps indicate how the model connects certain pixel regions (and related structural edges) to classification decisions. As an illustration, for real paintings the model relies heavily on natural boundaries and artistic features, whereas for counterfeit pictures it favors unnatural edges and contours, as shown in Fig. 9. Combined, these interpretability techniques reinforce the point that the proposed model not only reaches high performance but also provides clear reasoning in support of its output, which demonstrates its strength in detecting AI-generated artworks. Through global attention patterns (via Grad-CAM) and local feature importance weights (via LIME), DINOv2 offers a sound balance between performance and explainability, which is crucial for applications in art authentication and digital forensics.
The proposed DINOv2 model was statistically validated through Chi-Square, ANOVA, T-Test, and Z-Test over textural details, frequency artifacts, pixel intensity, edge severity, color histograms, and metadata integrity of the images. The observed trends in p-values confirm that the model is predicting statistically significant differences between real and AI-generated images. For authentic samples, the p-values for texture and metadata integrity are higher, indicating more consistent distributions. In contrast, for GAN-generated, deepfake, and AI-altered images, p-values are persistently lower across both spectral artifacts and pixel intensity (Fig. 10), confirming irregularities in synthetic image structures.
Interpretation analysis of the proposed model using Grad-CAM and LIME.
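The idea behind LIME can be sketched as a weighted linear fit over on/off masks of image segments. This is a simplified illustration, not the LIME library's implementation: it uses plain weighted least squares and a simple distance measure, and `predict`, `toy_predict`, and the kernel width are hypothetical. In practice `predict` would grey out the switched-off segments of an image and return the model's probability for the class of interest.

```python
import numpy as np

def lime_segment_weights(predict, n_segments, n_samples=500,
                         kernel_width=0.25, seed=0):
    """Fit a locally weighted linear surrogate over random segment masks
    and return one importance weight per segment."""
    rng = np.random.default_rng(seed)
    masks = rng.integers(0, 2, size=(n_samples, n_segments))
    masks[0] = 1                                   # keep the unperturbed image
    preds = np.array([predict(mask) for mask in masks], dtype=float)
    # Perturbations that hide fewer segments count as closer to the original.
    distance = 1.0 - masks.mean(axis=1)
    sample_weight = np.exp(-(distance ** 2) / kernel_width ** 2)
    design = np.hstack([masks, np.ones((n_samples, 1))])  # add intercept column
    root_w = np.sqrt(sample_weight)
    coef, *_ = np.linalg.lstsq(design * root_w[:, None],
                               preds * root_w, rcond=None)
    return coef[:-1]                               # drop the intercept

def toy_predict(mask):
    # Hypothetical model: segment 2 matters most, segment 0 slightly.
    return 0.2 + 0.6 * mask[2] + 0.1 * mask[0]

segment_weights = lime_segment_weights(toy_predict, n_segments=4)
```

Segments with the largest weights are the ones a LIME map would outline as most responsible for the decision.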
Crucially, all p-values fall well below the 0.01 significance level, confirming that the classifications made by the model are non-random and therefore statistically significant. These results re-emphasize the strength of the proposed model and suggest that it does in fact separate the image types (real vs. AI-altered), with predictions rooted in statistically verifiable patterns.
Overall, the results show that the DINOv2 model significantly outperforms all previous models for detecting AI-generated art images. These findings indicate that DINOv2 provides the best stand-alone framework for separating real images from those generated by AI models and serves as a clear benchmark above previous baseline methods.
Statistical validation analysis based on several tests of image features, supporting the proposed model's ability to determine whether an image is AI-generated.
Comparison with related works shows the advantage of the proposed DINOv2 model in detecting AI-generated artwork images. The performance of conventional CNN models was limited (89% accuracy) by their inability to retain complex generative features, while GAN-based detection methods reached 93% but remained constrained by their reliance on unique artifact patterns, as displayed in Table 5. Transformer-based ViT models generalized even better, with 94% accuracy, benefiting from global attention mechanisms, but lacked the ability to capture fine-grained properties.
In contrast, the proposed DINOv2 model reached 99% accuracy, a substantially better performance, owing to its ability to combine global structure-awareness with local feature extraction through self-distillation. This significant performance boost demonstrates the maturity of the model and its readiness to counter increasingly sophisticated generative techniques, and it establishes a new milestone for art authentication and AI-generated image detection.
The findings of this study demonstrate that the proposed DINOv2-based framework identifies AI-generated artwork with exceptional precision and reliability. By integrating the model’s self-distillation and hierarchical feature representation capabilities, it captured both global artistic structures and subtle generative deviations that distinguish real art from AI-created content. The model achieved outstanding results, with 99.01% accuracy and an AUC of 99.29%, surpassing baseline models such as SAM, ConvNeXt, and Swin Transformers. These results confirm the strong generalization ability and adaptability of DINO v2 across diverse artistic styles. Overall, the findings validate the model’s robustness for digital art authentication and forgery detection, offering a significant advancement in AI-driven cultural preservation and visual integrity assurance. For future work, this study aims to expand the dataset to include a broader variety of artistic styles and media forms, integrating multimodal features such as textual descriptions, brushstroke metadata, and visual semantics. Additionally, efforts will focus on enhancing model interpretability and fairness, ensuring transparent, explainable, and ethically aligned decision-making in real-world art authentication and digital forensics applications.
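One of the tests named above, a two-sample Z-test on a single image feature, can be sketched with the standard library. The feature values below are made up for illustration and are not measurements from this study; the function name `two_sample_z_test` is likewise hypothetical.

```python
import math
from statistics import mean, variance

def two_sample_z_test(a, b):
    """Two-sided Z-test for a difference in means between two samples,
    using sample variances (reasonable for large samples)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))     # two-sided tail of N(0, 1)
    return z, p_value

# Made-up edge-strength scores for real and AI-generated images.
real_scores = [0.50 + 0.01 * (i % 5) for i in range(50)]
ai_scores = [0.60 + 0.01 * (i % 5) for i in range(50)]
z_stat, p_value = two_sample_z_test(real_scores, ai_scores)
significant = p_value < 0.01
```

Because such p-values can be extremely small, reporting their logarithm, as the study does, makes them easier to compare across features.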
The dataset is available at: https://www.kaggle.com/datasets/doctorstrange420/real-and-fake-ai-generated-art-images-dataset
(2023HZ1802) Special Project for Philosophy and Social Science Research of Shaanxi Province: “Research on the Digital Collaborative Innovation and Development of Cultural Creativity in Shaanxi Museums”. (2024JK0163) General Special Project of the 2024 Scientific Research Program of the Shaanxi Provincial Department of Education: “Design Strategy and Application Research of Shaanxi Museum Digital Cultural Creative Products Based on the JTBD Model”. (2023HZ1763) Special Project for Philosophy and Social Science Research of Shaanxi Province: “Digital Preservation and Innovative Revitalization of Fengxiang Woodblock New Year Paintings”. (2024MDN13) Mingde Innovation Fund Project: “Research on the Innovation and Development Strategy of Museum Digital Cultural Creativity Based on the JTBD Model”. (SGH25Y3363) Research on the 5C Creativity Cultivation Model for Art and Design Majors under the Background of Digital-Intelligence Empowerment, a 2025 Project of the Shaanxi Province “14th Five-Year Plan” for Educational Science.
School of Art and Design, Xi’an Mingde Institute of Technology, Xi’an, 710124, Shaanxi, China
Wang Yinghua, Li Linyan & Ma Wenjuan
Department of Cross-Media Arts, Xi’an Academy of Fine Arts, Xi’an, 710065, Shaanxi, China
Zhang Yunzhe
Wang Yinghua, Li Linyan, Ma Wenjuan, and Zhang Yunzhe contributed equally to the research design, data analysis, and manuscript writing. All authors reviewed and approved the final version of the manuscript.
Correspondence to Wang Yinghua.
The authors declare no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Yinghua, W., Linyan, L., Wenjuan, M. et al. AI-generated artwork detection using self-distilled transformers with global–local feature learning and Grad-CAM interpretability. Sci Rep 16, 497 (2026). https://doi.org/10.1038/s41598-025-29229-2
Received: 30 September 2025
Accepted: 14 November 2025
Published: 06 January 2026
Version of record: 06 January 2026
DOI: https://doi.org/10.1038/s41598-025-29229-2
Scientific Reports (Sci Rep)
ISSN 2045-2322 (online)
© 2026 Springer Nature Limited