Abstract:
This study presents a reproducible multimodal deepfake detection framework that integrates visual, acoustic, and cross-modal coherence features for dental applications. Using the DeepFake RealWorld dataset (46,371 clips; 77% with audio), forty-seven interpretable descriptors were extracted across the visual and bioacoustic domains. Cross-modal metrics proved the most discriminative: lip–audio synchronization measures (LSE-D/LSE-C and Δt₍AV₎) achieved Δp = 0.21, face–voice coherence Δp = 0.19, and scene–audio consistency Δp = 0.18. Acoustic markers such as RT₆₀ and DRR reached Δp = 0.16 with less than 15% performance degradation under compression. In teledentistry, the framework supports verification of teleconsultations and detection of altered audiovisual records. Its interpretable, XAI-compliant design enables reliable authenticity assessment and strengthens medico-legal trust in remote clinical recordings.
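
To make the cross-modal criterion concrete, the sketch below estimates an audio–visual offset Δt₍AV₎ as the lag maximizing the normalized cross-correlation between a per-frame mouth-opening signal and the audio energy envelope. This is an illustrative assumption, not the paper's implementation: the function `estimate_av_offset` and both input signals are hypothetical, and the two series are assumed to be pre-extracted and resampled to the video frame rate.

```python
# Illustrative sketch (not the study's code): estimate Δt_AV as the lag
# that maximizes normalized cross-correlation between a hypothetical
# per-frame mouth-opening signal and the audio RMS envelope, both assumed
# to be sampled at the video frame rate.
import numpy as np

def estimate_av_offset(mouth_open: np.ndarray,
                       audio_rms: np.ndarray,
                       fps: float = 25.0,
                       max_lag_s: float = 1.0) -> float:
    """Return estimated Δt_AV in seconds (positive = video lags the audio)."""
    # Standardize both signals so the correlation scores are comparable.
    v = (mouth_open - mouth_open.mean()) / (mouth_open.std() + 1e-8)
    a = (audio_rms - audio_rms.mean()) / (audio_rms.std() + 1e-8)
    max_lag = int(max_lag_s * fps)
    scores = []
    for lag in range(-max_lag, max_lag + 1):
        # A positive lag tests the hypothesis that the video stream is
        # delayed by `lag` frames relative to the audio stream.
        if lag >= 0:
            x, y = v[lag:], a[:len(a) - lag]
        else:
            x, y = v[:len(v) + lag], a[-lag:]
        n = min(len(x), len(y))
        scores.append(float(np.dot(x[:n], y[:n]) / n))
    best = int(np.argmax(scores))
    return (best - max_lag) / fps  # convert frames to seconds

# Usage: a genuine clip should yield Δt_AV near zero, while re-dubbed or
# manipulated footage tends to show a larger, unstable offset.
rng = np.random.default_rng(0)
signal = rng.random(250)
print(estimate_av_offset(np.roll(signal, 3), signal))  # ≈ +0.12 s at 25 fps
```

A larger |Δt₍AV₎|, or an offset that drifts within a clip, would feed the cross-modal coherence features alongside LSE-D/LSE-C rather than serve as a standalone detector.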
