Abstract
The number of audiovisual documents available on the web is growing exponentially as more videos are produced every day. Recent progress in the audiovisual field has popularized the exchange of these documents across many domains, and interest in their indexing potential has grown significantly in various disciplines, such as films and sports events. Within this framework, several research studies have implemented indexing by segmenting the audiovisual document into fragments, each accompanied by an appropriate description. Although the indexing process is essential, current ways of exploiting and searching audiovisual documents remain unsatisfactory. Indeed, annotations based on generic descriptors (title, creator, publisher, etc.) are insufficient to describe the content of audiovisual documents. Given the proliferation of audiovisual documents and these indexing limits, the question to be answered is: “What is the relevant information in audiovisual content?”. In this paper, we present a survey characterizing the description and modeling of audiovisual documents. We classify existing description methods into three categories: low-level description, documentary description, and semantic description. The main objective of this study is to propose an approach that helps describe and organize the content of an audiovisual document so as to enable more effective querying.
Cite this article
Fourati, M., Jedidi, A. & Gargouri, F. A survey on description and modeling of audiovisual documents. Multimed Tools Appl 79, 33519–33546 (2020). https://doi.org/10.1007/s11042-020-09589-9