September 11, 18
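Slide overview
These are the slides presented at CollabTech2018.
Meiji University, School of Interdisciplinary Mathematical Sciences, Department of Frontier Media Science, Satoshi Nakamura Laboratory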
Can Social Comments Contribute to Estimate Impression of Music Video Clips? Shunki Tsuchiya (Meiji University) Naoki Ono, Satoshi Nakamura, and Takehiro Yamamoto
Outline of our study • We want to be able to search music video clips (MV) by impressions (e.g., “exciting” music video clips). • To do so, we must estimate the impressions of MV using social comments. Cute? Exciting? Happy? Cheerful? Cool?
Contributions 1. We generated an impression dataset for the chorus parts of 500 MV in three media types: • music only • video picture only • combined 2. We found that it is better to use the proper parts-of-speech in social comments depending on each media/impression type.
Background
Background 1 The number of MV on the Web has increased dramatically. • Anyone can publish their own MV thanks to the spread of consumer-generated media websites and DTM (desktop music) software.
Background 2 A standard method of searching for MV is to input keywords (e.g., “Beatles Let It Be”). With this search method, a user needs to know information about the MV in advance. If a user doesn’t know that information, it is not easy to find the target MV.
When you listen to music… How do you search for that BGM? By the name of the song? By the name of the artist?
When you listen to music… You receive an impression such as “fierce” or “cheerful”.
The impression-based search We research ambiguous searches based on the user’s subjective impressions, e.g., exciting, cheerful, cute, sorrowful. “I feel like crying, so I want to watch and listen to a sad MV.” “Something happy happened and I feel happy, so I want to watch and listen to a cheerful MV.”
Impression information Impression information needs to be attached to each MV, but few impression tags are attached. Percentage of impression tags: 5% [Yamamoto2009], 14% [Hu2007]. It is too difficult for people to evaluate the impressions of all MV.
Impression estimation We have to evaluate and provide subjective impressions of individual MV in advance. Exciting? Happy? Cute? Cheerful? Cool? If impressions can be estimated and made available through an API, they can be used to create new services using MV. We expect impressions to be used not only for searching but also for various other purposes.
Related work 1 Studies on estimating impressions of MV • Understanding Affective Content of music video through learned representations. [Acar 2014] • Multimedia content analysis for emotional characterization of music video clips. [Ashkan 2013] These studies estimated impressions of music video clips by using features of the music and the movies. Such features are mechanical; no human emotions are reflected in them.
Related work 2 Studies on estimates of impressions of… • music using lyrics • Lyric text mining in music mood classification. [Hu 2009] • movies using viewer’s expressions • Video indexing and recommendation based on affective analysis of viewers. [Sicheng 2009] The estimation accuracy is improved by using subjective features. We also use subjective features.
Social comments 1 Nico Nico Douga and BiliBili have a function that lets users post comments on a video in real time. 35 million users 72 million users http://www.nicovideo.jp/watch/sm13252011
Social comments 2 • Users can use this function to communicate with others. • We consider that these comments express users’ impressions directly and in real time. We estimate impressions of MV from the comments used for this communication.
Study of Yamamoto [2013] • Yamamoto’s study used only adjectives in comments to estimate impressions. → We also target other parts-of-speech. • Yamamoto’s study used the whole of an MV for estimation, but the impression can change within one MV (sections: Start, Verse, 1st Chorus, Bridge, 2nd Chorus, End; impressions: Exciting, Happy, Painful, Happy). → We estimate the impressions of only the chorus part.
Media types We considered media types that are music only, video picture only, and combined. The impression received from music only, video picture only and combined was different.
Purpose We examine the degree of impression estimation accuracy of MV using social comments. • Estimate impressions by using the part-of-speech in comments • Estimate impressions of the chorus part • Consider differences by media type
Impression dataset
Generating the dataset • Target: 500 “VOCALOID” MV, 30 s of the chorus part (RefraiD [Goto2006]) • Each MV is divided into 3 media types (music only, video picture only, combined)
Impression types
Impression name    Adjectives representing the impression
C1 (exciting)      Exciting, bustling, proudly, & dignified
C2 (cheerful)      Cheerful, happy, hilarious, & comfortable
C3 (painful)       Painful, gloomy, bittersweet, & sorrowful
C4 (fierce)        Fierce, aggressive, emotional, & active
C5 (humorous)      Humorous, funny, strange, & capricious
C6 (cute)          Cute, lovely, awesome, & tiny
Valence            Bright feelings, fun & dark feelings, sad
Arousal            Aggressive, bullish & passive, bearish
• Five of the impressions are those used in MIREX [Hu 2008].
• Many “cute” tags are attached on Nico Nico Douga [Yamamoto 2013].
• Valence and Arousal are the two impressions of the valence-arousal space proposed by Russell [Russell 1980].
Method of impression evaluation • Each subject was presented with one of the media for 30 s • The presentation was random regardless of media type • Subjects answered each impression on a 5-point Likert scale (-2, -1, 0, +1, +2) • 3 subjects evaluated each of the media.
Impression dataset We used the average value of the 3 subjects as the impression evaluation value for each media/impression type. Example for one MV:
                     C1    C2    C3    C4    C5    C6    V     A
Combined            -1.3  -2.0  -0.3   0.0   1.7  -2.0  -0.7  -0.7
Music only          -1.7  -2.0   2.0   0.0  -1.7  -2.0   0.3  -1.7
Video picture only   0.3   1.3  -0.3  -0.7  -0.7   1.7  -0.3   1.7
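The averaging step can be sketched as follows. This is a minimal illustration, not the authors' code: the individual ratings are hypothetical, and rounding to one decimal place is an assumption based on how the values appear on the slide.

```python
from statistics import mean

# Hypothetical ratings: three subjects rate one clip on the 5-point scale (-2..+2).
ratings = {
    "C1": [-1, -2, -1],
    "C2": [-2, -2, -2],
    "C3": [0, -1, 0],
}

def impression_values(ratings_by_type):
    """Average the subjects' ratings per impression type, rounded to one decimal."""
    return {name: round(mean(scores), 1) for name, scores in ratings_by_type.items()}

values = impression_values(ratings)
print(values)
```

With three raters, the averages can only take values in steps of one third, which is why the dataset contains values such as -1.3 or 1.7.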
Evaluation experiment
Outline of evaluation experiment Using SVMs, we tested whether an impression with an evaluation above a certain value could be estimated mechanically. Details 1. Method of impression estimation 2. Number of MV in each evaluation group 3. Collecting and extracting social comments 4. Generation of bag-of-words for MV 5. Methods of bag-of-words generation
Method of impression estimation 3 media types × 8 impressions = 24 patterns (diagram: low evaluation group and high evaluation group, split into training data and test data). We verified whether the high evaluation group could be estimated mechanically by using SVMs.
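To make the classification step concrete, here is a minimal sketch of a linear SVM trained by subgradient descent on the hinge loss. The slides only state that SVMs were used; the kernel, library, hyperparameters, and feature weighting are not given, so everything below (the toy word-count vectors, `train_linear_svm`, `predict`, and the learning settings) is an assumption for illustration.

```python
def train_linear_svm(samples, labels, epochs=200, lr=0.1, reg=0.01):
    """Fit weights w and bias b by subgradient descent on the regularised hinge loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):  # y is +1 (high group) or -1 (low group)
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1 - lr * reg) for wi in w]  # regularisation shrinks w each step
            if margin < 1:  # sample is inside the margin: push the boundary away
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    """Return +1 (high group) or -1 (low group) for one bag-of-words vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Toy bag-of-words vectors over the hypothetical vocabulary [cute, cool, sad].
train_x = [[2, 0, 0], [3, 0, 1], [2, 1, 0], [0, 2, 0], [0, 0, 2], [0, 1, 1]]
train_y = [1, 1, 1, -1, -1, -1]  # +1: high group for "cute", -1: low group

w, b = train_linear_svm(train_x, train_y)
print(predict(w, b, [3, 0, 0]), predict(w, b, [0, 2, 2]))
```

In this setup, one such classifier would be trained for each of the 24 media/impression patterns.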
Number of MV in each group
High group            C1   C2   C3   C4   C5   C6   V    A
Combined              76   105  87   54   83   104  101  150
Music only            133  127  46   69   49   73   124  178
Video picture only    21   50   142  49   81   78   57   111
Low group             C1   C2   C3   C4   C5   C6   V    A
Combined              105  169  191  209  178  215  62   94
Music only            65   92   232  195  180  209  61   43
Video picture only    252  272  165  247  207  234  96   155
Number of MV in each group (high group and low group set to the same size)
High and low group    C1   C2   C3   C4   C5   C6   V    A
Combined              76   105  87   54   83   104  62   94
Music only            65   92   46   69   49   73   61   43
Video picture only    21   50   142  49   81   78   57   111
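In every cell, the equalized count is the smaller of the original high and low counts, which is what downsampling the larger group would produce. The slides do not say how the retained MV were chosen, so the random sampling, the fixed seed, and the clip IDs in this sketch are assumptions.

```python
import random

def balance_groups(high, low, seed=0):
    """Downsample the larger group at random so both groups have equal size."""
    rng = random.Random(seed)
    n = min(len(high), len(low))
    return rng.sample(high, n), rng.sample(low, n)

# Hypothetical clip IDs sized like the Combined / Valence cell: 101 high, 62 low.
high = [f"high_{i}" for i in range(101)]
low = [f"low_{i}" for i in range(62)]

bal_high, bal_low = balance_groups(high, low)
print(len(bal_high), len(bal_low))
```

Equal group sizes mean a classifier that always answers "high" or always answers "low" cannot score better than chance.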
Collecting and extracting social comments • We collected all comments (860,455) for the 500 MV using the Nico Nico Douga API. • We extracted the comments (132,036) posted in the chorus part. (Diagram of one MV: Start, Verse, 1st Chorus, Bridge, 2nd Chorus, End)
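Extracting the comments posted during the chorus amounts to filtering by playback position. The sketch below assumes each comment carries the playback time at which it was posted; the `Comment` class, its field names, and the chorus boundaries are hypothetical and do not reproduce the Nico Nico Douga API.

```python
from dataclasses import dataclass

@dataclass
class Comment:
    text: str
    position_sec: float  # playback position at which the comment was posted

def comments_in_chorus(comments, chorus_start, chorus_end):
    """Keep comments whose playback position falls inside [chorus_start, chorus_end)."""
    return [c for c in comments if chorus_start <= c.position_sec < chorus_end]

comments = [
    Comment("intro is calm", 5.0),
    Comment("Miku is cute", 62.5),
    Comment("the melody is cool", 80.0),
    Comment("great ending", 200.0),
]

# Hypothetical 30 s chorus window from 60 s to 90 s.
chorus = comments_in_chorus(comments, chorus_start=60.0, chorus_end=90.0)
print([c.text for c in chorus])
```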
Generation of bag-of-words for MV
Comments of an MV: “Miku is cute.” “Miku is good” “The melody is cool.”
Split into words: “Miku / cute” “Miku / good” “Melody / cool”
All parts-of-speech: Miku 2, Cute 1, Melody 1, Cool 1, Good 1
Nouns: Miku 2, Melody 1
Adjectives: Cute 1, Cool 1, Good 1
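The counting in this example can be sketched as below. The comments are assumed to be already split into (word, part-of-speech) pairs; the morphological analysis of the original Japanese comments is not shown on the slides, and the tag names used here are hypothetical.

```python
from collections import Counter

# Comments already split into (word, part-of-speech) pairs; the tagging step is assumed.
tagged_comments = [
    [("Miku", "noun"), ("cute", "adj")],
    [("Miku", "noun"), ("good", "adj")],
    [("melody", "noun"), ("cool", "adj")],
]

def bag_of_words(comments, allowed_pos=None):
    """Count words over all comments, optionally keeping only some parts of speech."""
    counts = Counter()
    for tokens in comments:
        for word, pos in tokens:
            if allowed_pos is None or pos in allowed_pos:
                counts[word] += 1
    return counts

all_pos = bag_of_words(tagged_comments)
nouns = bag_of_words(tagged_comments, {"noun"})
adjectives = bag_of_words(tagged_comments, {"adj"})
print(all_pos, nouns, adjectives)
```

Each MV then becomes one count vector per choice of parts of speech.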
Methods of bag-of-words generation
Method name         Parts-of-speech used
All method          All parts-of-speech
All2 method         Nouns, Verbs, Adjectives, Adverbs
Noun method         Nouns
Verb method         Verbs
Adj method          Adjectives
Adv method          Adverbs
Noun-Verb method    Nouns, Verbs
Noun-Adj method     Nouns, Adjectives
Noun-Adv method     Nouns, Adverbs
Verb-Adj method     Verbs, Adjectives
Verb-Adv method     Verbs, Adverbs
Adj-Adv method      Adjectives, Adverbs
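The twelve methods are the two "all" variants, the four single parts of speech, and the six pairs. A short sketch that enumerates them follows; `build_methods` and the use of `None` to mean "keep every part of speech" are illustrative choices, not something taken from the slides.

```python
from itertools import combinations

POS = ["Noun", "Verb", "Adj", "Adv"]

def build_methods():
    """Map each method name to the set of parts of speech it keeps (None = keep all)."""
    methods = {"All": None, "All2": set(POS)}
    for pos in POS:
        methods[pos] = {pos}
    for first, second in combinations(POS, 2):
        methods[f"{first}-{second}"] = {first, second}
    return methods

methods = build_methods()
print(len(methods))
print(sorted(name for name in methods if "-" in name))
```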
Results
Noun vs Verb vs Adj vs Adv
Noun     C1     C2     C3     C4     C5     C6     V      A
Comb     0.575  0.720  0.644  0.653  0.704  0.680  0.646  0.652
Music    0.698  0.606  0.528  0.621  0.721  0.661  0.708  0.650
Video    0.700  0.640  0.608  0.600  0.620  0.688  0.552  0.641
Verb     C1     C2     C3     C4     C5     C6     V      A
Comb     0.667  0.627  0.440  0.544  0.642  0.714  0.575  0.574
Music    0.615  0.622  0.133  0.658  0.587  0.500  0.600  0.551
Video    0.588  0.549  0.606  0.517  0.584  0.573  0.508  0.654
Adj      C1     C2     C3     C4     C5     C6     V      A
Comb     0.733  0.869  0.710  0.750  0.667  0.838  0.650  0.842
Music    0.667  0.635  0.595  0.667  0.581  0.775  0.706  0.733
Video    0.714  0.736  0.733  0.759  0.536  0.829  0.603  0.850
Adv      C1     C2     C3     C4     C5     C6     V      A
Comb     0.618  0.586  0.522  0.576  0.520  0.481  0.556  0.603
Music    0.679  0.600  0.580  0.537  0.545  0.481  0.642  0.538
Video    0.879  0.759  0.211  0.632  0.519  0.451  0.777  0.805
Noun vs Verb vs Adj vs Adv Adjective > Noun, Verb, Adverb • Users most often expressed impressions using adjectives. • We consider that adjectives carry features for impressions, while nouns, verbs, and adverbs do not.
Methods not including Adj
Noun-Verb  C1     C2     C3     C4     C5     C6     V      A
Comb       0.687  0.699  0.648  0.620  0.681  0.714  0.661  0.636
Music      0.683  0.580  0.489  0.642  0.689  0.672  0.729  0.658
Video      0.881  0.760  0.308  0.614  0.595  0.639  0.805  0.859
Noun-Adv   C1     C2     C3     C4     C5     C6     V      A
Comb       0.592  0.714  0.644  0.654  0.722  0.673  0.656  0.649
Music      0.672  0.589  0.538  0.621  0.711  0.661  0.694  0.632
Video      0.879  0.763  0.372  0.636  0.622  0.683  0.805  0.852
Verb-Adv   C1     C2     C3     C4     C5     C6     V      A
Comb       0.667  0.568  0.535  0.531  0.657  0.630  0.600  0.660
Music      0.667  0.560  0.458  0.566  0.587  0.513  0.589  0.581
Video      0.882  0.729  0.250  0.622  0.488  0.529  0.724  0.814
Methods including Adj
Noun-Adj   C1     C2     C3     C4     C5     C6     V      A
Comb       0.662  0.854  0.690  0.780  0.750  0.778  0.694  0.80
Music      0.754  0.644  0.612  0.750  0.707  0.772  0.740  0.806
Video      0.888  0.792  0.409  0.706  0.657  0.768  0.821  0.874
Verb-Adj   C1     C2     C3     C4     C5     C6     V      A
Comb       0.781  0.811  0.711  0.684  0.667  0.856  0.652  0.784
Music      0.692  0.627  0.520  0.714  0.682  0.740  0.673  0.707
Video      0.921  0.734  0.400  0.734  0.511  0.764  0.779  0.871
Adj-Adv    C1     C2     C3     C4     C5     C6     V      A
Comb       0.700  0.837  0.679  0.690  0.681  0.848  0.695  0.844
Music      0.733  0.646  0.581  0.634  0.683  0.743  0.667  0.718
Video      0.911  0.765  0.477  0.653  0.622  0.757  0.840  0.884
Method of combining Methods combining parts-of-speech > Methods using only one part-of-speech. The combination of parts-of-speech improved the accuracy of estimation. (Tables compared on these slides: Noun-Verb against Noun and Verb; Noun-Adj and Adj-Adv against Adj.)
The highest value and its method
         C1     C2     C3     C4     C5     C6     V      A      Ave
Comb     V-Adj  Adj    All    N-Adj  N-Adj  V-Adj  All    Aj-Av
         0.781  0.869  0.713  0.780  0.750  0.856  0.783  0.844  0.797
Music    N-Adj  All    N-Adj  N-Adj  All2   All2   N-Adj  N-Adj
         0.754  0.671  0.612  0.750  0.725  0.787  0.740  0.806  0.730
Video    V-Adj  N-Adj  All    Adj    N-Adj  Adj    Aj-Av  Aj-Av
         0.921  0.792  0.752  0.759  0.657  0.829  0.840  0.884  0.804
Ave      0.819  0.777  0.692  0.763  0.711  0.824  0.788  0.845  0.777
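As a worked check of this summary, the sketch below picks the best method for one cell and recomputes one row average. The scores are copied from the result slides; the All and All2 scores for this cell are not shown there, so they are left out, and `pick_best` is a hypothetical helper.

```python
# Combined / C1 scores of the ten methods whose full tables appear on the slides.
comb_c1 = {
    "Noun": 0.575, "Verb": 0.667, "Adj": 0.733, "Adv": 0.618,
    "Noun-Verb": 0.687, "Noun-Adv": 0.592, "Verb-Adv": 0.667,
    "Noun-Adj": 0.662, "Verb-Adj": 0.781, "Adj-Adv": 0.700,
}

def pick_best(scores_by_method):
    """Return the (method, score) pair with the highest score."""
    method = max(scores_by_method, key=scores_by_method.get)
    return method, scores_by_method[method]

# Highest value per impression type in the "Comb" row of the summary slide.
best_comb = {"C1": 0.781, "C2": 0.869, "C3": 0.713, "C4": 0.780,
             "C5": 0.750, "C6": 0.856, "V": 0.783, "A": 0.844}

best_method, best_score = pick_best(comb_c1)
row_average = round(sum(best_comb.values()) / len(best_comb), 3)
print(best_method, best_score, row_average)
```

The selected method (Verb-Adj) and the recomputed average agree with the "Comb" row of the summary.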
It is better to use the proper parts-of-speech depending on each media/impression type.
Estimation from social comments is effective for C1, C6, and Arousal.
Example of C6 (cute) MV C6 MV (impression value = 2) The SVM was able to learn well because the words expressing the impression are similar.
C3 and C5 are hard to estimate from social comments.
Example of C3 (painful) MV C3 MV (impression value = 1) C3 highest values: Comb All 0.713, Music N-Adj 0.612, Video All 0.752, Ave 0.692. Parts-of-speech other than the four we used (nouns, verbs, adjectives, adverbs) may carry the features.
Example of C5 (humorous) MV C5 MV (impression value = 2, impression value = 1) The SVM could not learn well because “humorous” is perceived in a wide range of ways.
Impressions of video picture only can be estimated from social comments.
Summary • We analyzed the accuracy of estimating impressions of music video clips from social comments. • We generated an impression dataset for the chorus part. • We revealed that it is better to use the proper parts-of-speech in social comments depending on each media/impression type. • Accuracy is high for methods that include adjectives. • C1, C6, and Arousal > C3 and C5 • Video picture only > Music only