Deep Network Representations as Reliable Indicators of Synthetic Content in Audiovisual and Clinical Contexts

Abstract:

This study introduces an interpretable framework for detecting synthetic audiovisual content using deep neural representations, applied to the DeepFake RealWorld (DFRW) dataset (46 371 clips; 77% with audio). Visual, acoustic, and cross-modal embeddings from ResNet, Vision Transformer, SlowFast, Wav2Vec2, and ECAPA-TDNN were evaluated with frequency-based metrics (Δp ≥ 0.15, PR ≥ 1.5). The strongest indicators were facial embedding variance (Δp = 0.29, PR = 3.4), Mahalanobis distance (Δp = 0.25), and audiovisual coherence (Δp = 0.23), all of which remained stable to within 15% under compression and re-capture. In teledentistry and telemedicine, such explainable AI markers enhance the authenticity verification of digital evidence and strengthen medico-legal reliability.
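The frequency-based screening can be illustrated with a minimal sketch. The interpretation below is an assumption, not taken from the paper: Δp is read as the difference in how often a binarized marker fires on synthetic versus real clips, and PR as the corresponding prevalence ratio; the function name, threshold, and toy data are all hypothetical.

```python
import numpy as np

def prevalence_metrics(feature_real, feature_fake, threshold):
    """Hypothetical frequency-based screen: binarize a scalar marker
    (e.g. facial embedding variance) at `threshold`, then compare how
    often it fires on fake vs. real clips."""
    p_fake = np.mean(np.asarray(feature_fake) > threshold)  # prevalence on synthetic clips
    p_real = np.mean(np.asarray(feature_real) > threshold)  # prevalence on authentic clips
    delta_p = p_fake - p_real                               # prevalence difference (Δp)
    pr = p_fake / p_real if p_real > 0 else float("inf")    # prevalence ratio (PR)
    return delta_p, pr

# Toy illustration with simulated marker values (not DFRW data)
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 1000)  # marker values on real clips
fake = rng.normal(1.0, 1.0, 1000)  # marker values on fake clips
dp, pr = prevalence_metrics(real, fake, threshold=0.5)
```

Under this reading, a candidate marker would pass the abstract's screen when dp ≥ 0.15 and pr ≥ 1.5.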