October 12, 23
Slide overview
Invited Lecture Talk at Tours University (IUT Blois).
http://d-kitamura.net/links_en.html
6th Oct. 2023 Talk at Tours University Audio source separation based on spectral and spatial models Daichi Kitamura National Institute of Technology, Kagawa College, Japan
2 Self introduction • Daichi Kitamura, Ph.D. in Informatics • National Institute of Technology (KOSEN), Kagawa College, Japan • Research interests – Audio source separation – Array signal processing – Machine learning – Music signal processing – Biosignal processing (Map labels: Tokyo, Kagawa)
3 Contents • About audio source separation – Applications – Problem – Solution: spectral and spatial models • Approaches based on spectral models – Modeling of time-frequency structure • Nonnegative matrix factorization • Recent trend (of course, deep learning) • Approaches based on spatial models – Using microphone array (multiple microphones) • Beamforming • Hybrid methods • Conclusion
What is audio source separation? • Estimate specific sources from an observation – Analogy: separate café au lait (observation) into its sources (estimation) • For audio signals: separate a mixture (observation) into its sources (estimation) 5
What is audio source separation? • Audio source separation – Speech, vocals, musical instruments, noise, etc. – Simulates the cocktail-party effect • Selective listening ability of humans • Similar research topics – Noise reduction eliminates background noise • Stationary and non-stationary noise cancellers – Speech enhancement extracts only speech signals • Front-end process for almost all speech processing – Source separation estimates sources for various signals • Brain signals, biological signals, radar signals, images, videos, etc. 6
Applications of audio source separation • Noise reduction (speech enhancement): observed signal → estimated signal – Applications • Hearing-aid systems – to provide clearer speech • Automatic speech recognition – Car navigation systems and smart speakers • Acoustic event detection for robots – UAV with microphones 7
Applications of audio source separation • Separation of speech sources – Automatic transcription of meetings • must recognize speaker-wise signals – On-site and hybrid meetings • On-site speakers must be enhanced and sent to the online side • Ex.: Voice-Lift systems by SHURE – A ceiling microphone array picks up speaker-wise signals (even from moving speakers) and enhances them – Network loudspeakers provide clear sound to all attendees https://www.shure.com/ja-JP/conferencing-meetings/applications/voice-lift-and-sound-reinforcement 8
Applications of audio source separation • Separation of music signals – Automatic transcription of music scores • Mixture → separation → sources → monophonic transcription → sourcewise scores → merge – Remix of live-performed music • Live performance is once in a lifetime – All sources are instantaneously mixed – Even the players cannot listen to their own solo-played sounds • Music separation enables us to re-edit and remix pre-recorded (already mixed) music – This can be used for, e.g., musical instrument education 9
10 Applications of audio source separation • Front-end processing for almost all audio processing – No microphone can avoid noise – Undesired sounds degrade the quality • Pipeline: microphones → observed signal → front-end process (audio source separation) → downstream applications (various processes) → output – Ideally, apply separation first • Example applications: spot recording, music recording, ultrasonography, speech recognition, non-destructive acoustic inspection, radio communications
11 Problem and difficulty in solution • Audio source separation – separates café au lait into coffee and milk • But, how? (Given: the café au lait; unknown, to be estimated: the coffee and the milk) • Well-posed and ill-posed problems – Well-posed (forward) problem • Make café au lait from coffee and milk (given the ingredients, we can know the result) • (1) It has a solution. • (2) The solution is unique. • (3) The solution depends continuously on data and parameters. – Ill-posed (inverse) problem • Many “coffee-and-milk combinations” exist that provide “the same café au lait” (multiple solutions satisfy the same equation) • Audio source separation – is a typical ill-posed problem
Problem and difficulty in solution 12 • In the audio source separation problem, – many solutions that explain the observed signal exist: observed signal = ? + ? + ... + ? • To solve this ill-posed problem, – we need some clues (models) to limit the solution space: spectral and spatial models • Inappropriate clues (models) cause poor performance
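The non-uniqueness stated above can be checked numerically. The following is a minimal sketch with made-up random signals (the names `s1`, `s2`, and `e` are illustrative, not from the talk): any signal `e` moved from one source to the other yields a different pair that explains the same observation.

```python
import numpy as np

rng = np.random.default_rng(0)
s1, s2 = rng.standard_normal(8), rng.standard_normal(8)  # one possible pair of sources
x = s1 + s2                                              # observed mixture

e = rng.standard_normal(8)        # an arbitrary signal
s1_alt, s2_alt = s1 + e, s2 - e   # a different pair of "sources"

# Both pairs reproduce the observation exactly, so the mixture alone
# cannot tell them apart without an additional clue (model).
same_mixture = np.allclose(s1_alt + s2_alt, x)
```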
13 What clues can we use? • Spectral models – Prior knowledge or training data of sources – Type of sources: speech, guitar, drums, noise, ... – Statistics of sources: stationary/non-stationary, training data, ... – Time-frequency structures of sources: harmonicity, sparseness, low-rankness, ... • Spatial models – Observed by a microphone array – Number of microphones: monaural, stereo, more channels – Microphone array geometry: linear, circular, spherical mic arrays – Source locations: distance and direction from mic array (Figure labels: drums; source 1, source 2, # of mics, direction, geometry)
15 Time-frequency representation of audio signals • Audio waveform in time domain (speech)
16 Time-frequency representation of audio signals • Time-varying frequency structure – Short-time Fourier transform (STFT) • The time-domain waveform is cut into short frames with a window function (parameters: window function, shift length, DFT length) • The discrete Fourier transform is applied to each frame, giving the time-frequency domain • Spectrogram: complex-valued matrix (frequency × time) • Power spectrogram: entry-wise absolute value and power of the spectrogram, a nonnegative real-valued matrix
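The STFT steps above can be sketched in a few lines of NumPy. This is an illustrative implementation only: the Hann window, the window length of 1024, the shift length of 256, and the 1 kHz test tone are assumptions chosen for the example, not values from the talk.

```python
import numpy as np

def stft(x, win_len=1024, hop=256):
    """Complex spectrogram: rows are frequency bins, columns are time frames."""
    window = np.hanning(win_len)                      # window function (assumed Hann)
    n_frames = 1 + (len(x) - win_len) // hop          # frames that fit entirely in x
    frames = np.stack(
        [x[t * hop : t * hop + win_len] * window for t in range(n_frames)], axis=1
    )
    return np.fft.rfft(frames, axis=0)                # DFT of every frame

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)   # 1 s of a 1 kHz tone as a stand-in waveform
S = stft(x)                        # complex-valued matrix
P = np.abs(S) ** 2                 # power spectrogram: nonnegative real-valued matrix
peak_hz = P.mean(axis=1).argmax() * fs / 1024   # frequency bin with the most energy
```

The DFT length equals the window length here; zero-padding each frame would be the usual way to make them differ.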
Power spectrogram of speech 17
Power spectrogram of music 18
What can we say from the spectrogram? Drums Guitar Vocals Speech 19
20 What can we say from the spectrogram? • Time-frequency (TF) structure depends on the source – Conversation speech: co-occurrence of all frequencies – Music spectrograms • Guitar: horizontal lines (harmonicity) • Drums: vertical lines • Spectral model (clue) – Assumption of TF structure for each specific source – Limits the solution space of the ill-posed problem • Sparsity: L1 norm regularizer • Group sparsity: mixed norm regularizer • Low-rankness: nuclear norm regularizer • ...
Low-rank property of TF matrix • Low-rankness of power spectrogram – corresponds to a simplicity measure of TF structure – can be measured by the cumulative singular value (CSV) • Figure: number of bases when the CSV reaches the 95% line (spectrogram size is 1025x1883); plotted values: 7, 29, around 90 – Music spectrograms can be well modeled by few patterns • because they include many repetitions along time 21
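The CSV measure can be sketched as follows, assuming it is the running sum of singular values normalized by their total (the function name `n_bases_at_csv` and the synthetic matrices are hypothetical; the talk's spectrograms are not reproduced here).

```python
import numpy as np

def n_bases_at_csv(P, threshold=0.95):
    """Number of singular values needed for the cumulative sum to reach `threshold`."""
    s = np.linalg.svd(P, compute_uv=False)   # singular values, sorted in descending order
    csv = np.cumsum(s) / np.sum(s)           # cumulative singular value in [0, 1]
    return int(np.searchsorted(csv, threshold) + 1)

rng = np.random.default_rng(0)
# Toy "spectrogram" built from 3 repeating patterns (exactly rank 3)
low_rank = rng.random((200, 3)) @ rng.random((3, 300))
# Toy matrix with no repeated structure
unstructured = rng.random((200, 300))

n_low = n_bases_at_csv(low_rank)
n_unstructured = n_bases_at_csv(unstructured)
```

A matrix made of a few repeated patterns reaches the 95% line with very few bases, while an unstructured one needs many more.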
Modeling technique of low-rank TF structures • Nonnegative matrix factorization (NMF) [Lee, 1999] – is a low-rank approximation of a nonnegative matrix • Basis vectors and their coefficients must be nonnegative – is often utilized to model audio power spectrograms • Spectral patterns (typical timbres) and their time-varying gains • Nonnegative matrix (power spectrogram; frequency × time) ≈ basis matrix (spectral patterns; frequency × basis) × activation matrix (time-varying gains; basis × time) • Sizes are set by the # of frequency bins, the # of time frames, and the # of bases 22
Modeling technique of low-rank TF structures • Optimization in NMF – Minimize a dissimilarity measure between the nonnegative matrix and the product of the basis and activation matrices – An arbitrary measure can be used • Squared Euclidean distance, etc. – There is no closed-form solution for the basis and activation matrices – An iterative calculation was developed to minimize the measure • Multiplicative update rules [Lee, 2000] (for the case of squared Euclidean distance) 23
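A minimal sketch of the multiplicative update rules for the squared Euclidean distance is given below. The symbols `X` (nonnegative matrix), `W` (basis matrix), and `H` (activation matrix) are naming assumptions for this example, since the slide's own symbols are not available; the random initialization and the small constant `eps` that guards the divisions are also assumptions.

```python
import numpy as np

def nmf(X, n_bases, n_iter=200, eps=1e-12, seed=0):
    """Approximate X (freq x time) by W (freq x bases) @ H (bases x time), all nonnegative."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], n_bases)) + eps
    H = rng.random((n_bases, X.shape[1])) + eps
    cost = []
    for _ in range(n_iter):
        # Each factor is multiplied by a nonnegative ratio, so it stays nonnegative
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        cost.append(np.sum((X - W @ H) ** 2))   # squared Euclidean distance
    return W, H, cost

rng = np.random.default_rng(1)
X = rng.random((64, 2)) @ rng.random((2, 100))   # toy rank-2 "power spectrogram"
W, H, cost = nmf(X, n_bases=2)
```

Because the updates only multiply, entries initialized to positive values never become negative, which is why no projection step is needed.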
Modeling technique of low-rank TF structures 24 • Example: piano (Pf.) and clarinet (Cl.) – Superposition of two rank-1 spectrograms
Modeling technique of low-rank TF structures • Example: Pf. and Cl. – Pf. and Cl. are separated! – Audio source separation based on NMF • is a clustering problem of the extracted spectral bases in the basis matrix – But how? 25
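Supervised audio source separation with NMF 26 • If sourcewise (few-shot) training data is available, supervised NMF can be used [Smaragdis, 2007], [Kitamura1, 2014] – Training stage: a spectral dictionary of Pf. is learned from its training sound – Separation stage: the mixture is modeled by the given (fixed) Pf. dictionary plus other bases; only the Pf. activations, the other bases, and their activations are optimized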
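The two stages can be sketched as below. This is a plain squared-Euclidean variant written for illustration and is not necessarily the formulation of the cited methods; the names `F` (trained dictionary), `G` (its activations), `H` (other bases), and `U` (their activations), as well as the toy data, are assumptions.

```python
import numpy as np

EPS = 1e-12

def train_dictionary(X_train, n_bases, n_iter=200, seed=0):
    """Training stage: learn spectral bases F from a solo recording of the target."""
    rng = np.random.default_rng(seed)
    F = rng.random((X_train.shape[0], n_bases)) + EPS
    G = rng.random((n_bases, X_train.shape[1])) + EPS
    for _ in range(n_iter):
        F *= (X_train @ G.T) / (F @ G @ G.T + EPS)
        G *= (F.T @ X_train) / (F.T @ F @ G + EPS)
    return F / (F.sum(axis=0, keepdims=True) + EPS)   # normalized dictionary

def separate(X_mix, F, n_other, n_iter=200, seed=1):
    """Separation stage: F is given and fixed; only G, H, and U are updated."""
    rng = np.random.default_rng(seed)
    G = rng.random((F.shape[1], X_mix.shape[1])) + EPS
    H = rng.random((X_mix.shape[0], n_other)) + EPS
    U = rng.random((n_other, X_mix.shape[1])) + EPS
    for _ in range(n_iter):
        model = F @ G + H @ U + EPS
        G *= (F.T @ X_mix) / (F.T @ model + EPS)
        model = F @ G + H @ U + EPS
        H *= (X_mix @ U.T) / (model @ U.T + EPS)
        model = F @ G + H @ U + EPS
        U *= (H.T @ X_mix) / (H.T @ model + EPS)
    return F @ G, H @ U   # estimated target part and everything else

rng = np.random.default_rng(0)
f_target, f_other = rng.random(64), rng.random(64)          # toy spectral patterns
X_train = np.outer(f_target, rng.random(40))                 # solo training sound
X_mix = np.outer(f_target, rng.random(50)) + np.outer(f_other, rng.random(50))
F = train_dictionary(X_train, n_bases=1)
F_before = F.copy()
target, other = separate(X_mix, F, n_other=1)
```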
Supervised audio source separation with NMF • Demonstration – Stereo music separation with supervised NMF [Kitamura, 2015] Training sound of Pf. Original song Separated sound (Pf.) Training sound of Ba. Separated sound (Ba.) 27
28 Recent supervised approaches • Data-driven spectral models trained by DNN – Hand-crafted spectral models: sparsity (L1 norm regularizer), group sparsity (mixed norm regularizer), low-rankness (nuclear norm regularizer and NMF), ... – DNN-based spectral models: guitar DNN model (trained using a solo-guitar database), drums DNN model (trained using a solo-drums database), ... – Ex.: DNN that always enhances only guitar components in the mixture (trained by using a huge dataset) • Main focus to achieve “state-of-the-art” performance – Input-output design: spectrogram vs waveform – Loss function: mean squared error, sound distortion, etc. – Network architecture: CNN, RNN, U-Net, transformer, etc.
What can we know by multiple microphones? 30 • Microphone array – enables us to capture sounds at multiple locations • Array geometry: linear, circular, spherical, etc. • Number of microphones (channels): 2, 3, 4, 8, 16, 32, ... – We have it! Almost every laptop PC has a two-channel microphone array – All the microphones are connected to the same A-D converter • “Synchronized” audio recording • No mismatches of sampling frequency and recording time between mics. (Image sources: https://www.rtri.or.jp/rd/maibara-wt/open04.html https://www.sifi.co.jp/product/microphone-array/ https://www.imperial.ac.uk/speechaudio-processing/projects/sphericalmicrophone-arrays/)
What can we know by multiple microphones? 31 • Spatial model (clue) – Spatial information can also be observed by a mic. array • Recording with one mic. (one observed signal) – A monaural signal has poor spatial clues (reverberation may contain the room size and the distance between source and mic.) – Single-channel audio source separation relies only on the spectral clues • Recording with two mics. (two observed signals) – Spatial clues can be obtained as relative features among microphones: 1. Volume difference 2. Time difference of arrival – Both spectral and spatial models can be employed for multichannel separation
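The two relative features can be measured from a pair of synchronized recordings. The sketch below simulates the second microphone as a delayed, attenuated copy of the first (a 5-sample delay and a gain of 0.5, both made up for the example, with no reverberation) and recovers the volume difference from the RMS ratio and the time difference from the peak of the cross-correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
true_delay, true_gain = 5, 0.5            # assumed values for the simulation
x1 = rng.standard_normal(2000)            # signal at mic 1
x2 = true_gain * np.concatenate([np.zeros(true_delay), x1[:-true_delay]])  # mic 2

# 1. Volume difference: ratio of RMS levels in dB
rms = lambda x: np.sqrt(np.mean(x ** 2))
level_diff_db = 20 * np.log10(rms(x2) / rms(x1))

# 2. Time difference of arrival: lag that maximizes the cross-correlation
xcorr = np.correlate(x2, x1, mode="full")
est_delay = int(np.argmax(xcorr)) - (len(x1) - 1)   # in samples; positive = mic 2 is later
```

With a single microphone neither quantity exists, which is the sense in which a monaural signal has poor spatial clues.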
Spatial-model-based source separation 32 • Beamforming technique (a.k.a. beamformer) – A beamformer is a spatial band-pass filter: not for the frequency but for the space (direction), e.g., a filter for the left source and a filter for the right source – Fixed beamformer • calculates the “spatial band-pass filter” based on a physical spatial model • Ex.: delay-and-sum beamformer, null beamformer – Adaptive beamformer • estimates the optimal “spatial band-pass filter” in a specific sense • Ex.: minimum-variance distortionless response (MVDR) BF, frequency-domain independent component analysis (ICA)
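Fixed beamformer with physical spatial model • Time difference of arrival (TDOA) – Physical model: plane wave with constant sound speed (figure: plane wave of sound arriving at the microphones; the TDOA depends on the direction measured from 0° and on the distance between microphones) – Observed signal • Time domain: the source signal delayed by the TDOA, expressed with Dirac’s delta function • Freq. domain: the corresponding phase shift 33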
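A small sketch of the plane-wave model follows. The formula TDOA = distance × sin(direction) / sound speed assumes that 0° is the direction perpendicular to the line joining the two microphones, and the sound speed of 340 m/s, the spacing of 0.34 m, and the 16 kHz sampling rate are assumptions for the example. The second half checks that a delay in the time domain equals a multiplication by a phase term in the frequency domain (a circular delay is used so that the DFT relation is exact).

```python
import numpy as np

C = 340.0  # assumed speed of sound [m/s]

def tdoa(distance_m, theta_deg):
    """Arrival-time difference between two mics for a plane wave from theta_deg."""
    return distance_m * np.sin(np.deg2rad(theta_deg)) / C

tau = tdoa(0.34, 30.0)                       # seconds
fs = 16000
delay_samples = int(round(tau * fs))         # TDOA expressed in samples

rng = np.random.default_rng(0)
s = rng.standard_normal(1024)                # source signal

# Time domain: convolution with a shifted delta, i.e. a (circular) delay
delayed_time = np.roll(s, delay_samples)

# Frequency domain: multiplication by exp(-j 2 pi f tau)
freqs = np.fft.fftfreq(len(s), d=1 / fs)
phase = np.exp(-2j * np.pi * freqs * delay_samples / fs)
delayed_freq = np.fft.ifft(np.fft.fft(s) * phase).real
```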
Fixed beamformer with physical spatial model • Delay-and-sum beamformer – Phase alignment using delays: a delay filter is applied to each microphone signal before summation • Waves coming from the target direction: enhanced • Waves coming from the other directions: reduced – Filter design • TDOA in the observed signal: phase vector, a.k.a. steering vector • Filter to modify the TDOA 34
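The filter design can be sketched for a linear array under the plane-wave model. The array (11 microphones at 10 cm spacing), the frequency of 1 kHz, the sound speed of 340 m/s, and the convention that 0° is the front of the array are assumptions for this example; the function names are hypothetical.

```python
import numpy as np

C = 340.0  # assumed speed of sound [m/s]

def steering_vector(mic_x, theta_deg, freq):
    """Phase vector of a plane wave from theta_deg at mic positions mic_x [m]."""
    tau = mic_x * np.sin(np.deg2rad(theta_deg)) / C   # TDOA of each mic
    return np.exp(-2j * np.pi * freq * tau)

def das_weights(mic_x, look_deg, freq):
    """Delay-and-sum filter: align the phases of the look direction, then average."""
    return steering_vector(mic_x, look_deg, freq) / len(mic_x)

def gain_db(w, mic_x, theta_deg, freq):
    """Beamformer gain for a plane wave from theta_deg (output is w^H a)."""
    a = steering_vector(mic_x, theta_deg, freq)
    return 20 * np.log10(np.abs(np.vdot(w, a)) + 1e-12)

mic_x = np.arange(11) * 0.10          # 11 mics, 10 cm spacing
freq = 1000.0
w = das_weights(mic_x, look_deg=0.0, freq=freq)
gain_target = gain_db(w, mic_x, 0.0, freq)    # look direction: phases add coherently
gain_side = gain_db(w, mic_x, 60.0, freq)     # other direction: phases partly cancel
```

Evaluating `gain_db` over a grid of directions and frequencies gives a directivity pattern of the kind plotted on the next slide.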
Directivity of delay-and-sum beamformer 35 • Plots of gain [dB] versus direction [deg.] (red line: 0.5 kHz, blue line: 1 kHz, green line: 2 kHz) for four arrays – # of mics.: 11, array size: 1m, mic spacing: 10cm – # of mics.: 21, array size: 2m, mic spacing: 10cm – # of mics.: 5, array size: 1m, mic spacing: 25cm – # of mics.: 5, array size: 1m, octave geometry (spacings of 37.5cm and 12.5cm)
36 Adaptive beamformer • MVDR (minimum variance distortionless response) beamformer – Restrict filter gain for the target direction to unity (distortionless) – Minimize signal power (variance) for the other directions • Minimize volume of the beamformer output while keeping the unity gain for the target direction • Directivity of MVDR beamformer (figure): unity gain in the target direction; minimizing the output power automatically sets a null direction (-∞ dB) toward the other source
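A minimal sketch of the MVDR solution at one frequency is shown below, using the standard closed form w = R⁻¹a / (aᴴR⁻¹a), where R is the spatial covariance matrix of the observation and a is the steering vector of the target. The array, the directions (target at 0°, interferer at 40°), the noise level, and the idealized covariance built from known steering vectors are all assumptions for the example; in practice R is estimated from the observed signals.

```python
import numpy as np

C = 340.0  # assumed speed of sound [m/s]

def steering_vector(mic_x, theta_deg, freq):
    tau = mic_x * np.sin(np.deg2rad(theta_deg)) / C
    return np.exp(-2j * np.pi * freq * tau)

def mvdr_weights(R, a):
    """Minimize w^H R w subject to w^H a = 1 (unity gain toward the target)."""
    Ri_a = np.linalg.solve(R, a)          # R^{-1} a without forming the inverse
    return Ri_a / np.vdot(a, Ri_a)        # divide by a^H R^{-1} a

mic_x = np.arange(11) * 0.10              # 11 mics, 10 cm spacing
freq = 1000.0
a_target = steering_vector(mic_x, 0.0, freq)
a_interf = steering_vector(mic_x, 40.0, freq)

# Idealized covariance: target + interferer + weak sensor noise
R = (np.outer(a_target, a_target.conj())
     + np.outer(a_interf, a_interf.conj())
     + 1e-3 * np.eye(len(mic_x)))

w = mvdr_weights(R, a_target)
gain_target = np.abs(np.vdot(w, a_target))                          # constrained to 1
gain_interf_db = 20 * np.log10(np.abs(np.vdot(w, a_interf)) + 1e-12)
```

Nothing in the code specifies the interferer direction to the filter; the deep attenuation toward 40° comes only from minimizing the output power.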
Hybrid methods of spectral and spatial models 37 • Independent low-rank matrix analysis (ILRMA) [Kitamura+, 2016] – assumes each source has a low-rank TF structure (figure: the mixture is not low rank; each source is low rank) – is a unification of • adaptive beamformer based on independent component analysis • low-rank TF modeling of each source by NMF – Processing flow: observed signal → STFT → adaptive beamformer → estimated signal → low-rank approximation by NMF – Update the beamformer so that the estimated signals 1. are mutually independent (ICA) 2. have low-rank TF structures (NMF)
Hybrid methods of spectral and spatial models • Adaptive beamformer with DNN-based spectral models [Mogami&Kitamura, 2018], [Makishima&Kitamura, 2021] – Processing flow: observed signal → STFT → adaptive beamformer → estimated signal → specific source enhancement (DNN) – Update the beamformer so that the estimated signals 1. are mutually independent (ICA) 2. have pretrained TF structures (DNN) • Adaptive beamformer with various TF structures in a plug-and-play manner [Yatabe&Kitamura, 2018], [Yatabe&Kitamura, 2021] – Low-rank, sparse, or group-sparse models are plugged into the adaptive beamformer 38
Conclusion • Audio source separation – is a technique to estimate specific sources – can be used as front-end processing for all audio apps. – is an ill-posed problem that requires some clues • Spectral models – Modeling of time-frequency structures of each source • Nonnegative matrix factorization (NMF) • Deep learning • Spatial models – Using microphone array to observe spatial information • Beamforming • Hybrid methods with both spectral and spatial models Thank you for your attention! 40