Variance Versus Invariance
We contribute V3 (variance-versus-invariance), an unsupervised method that learns disentangled content and style representations from sequences of observations. Unlike most methods, which rely on domain-specific labels or knowledge, our method is based on a domain-general statistical difference between content and style: content varies more among fragments within a sample but maintains an invariant vocabulary across samples, whereas style remains relatively invariant within a sample but varies more significantly across samples. V3 outperforms existing unsupervised methods in disentanglement and surpasses supervised models in out-of-distribution generalization under few-shot adaptation. Moreover, the learned content codebook exhibits symbolic-level interpretability, aligning machine representations closely with human knowledge.
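To make the criterion concrete, here is a minimal sketch of how within-sample and across-sample variance statistics could be contrasted, assuming fragment-level content and style codes of shape (samples, fragments, dim); the function name, shapes, margins, and the mean-based proxy for "invariant vocabulary" are all illustrative assumptions, not the paper's actual formulation.

```python
# Illustrative sketch only: contrast within-sample and across-sample variance
# of content vs. style codes. Names, shapes, and margins are assumptions.
import torch

def variance_vs_invariance_loss(z_content, z_style, margin=1.0):
    # z_content, z_style: (batch, fragments, dim) codes from some encoder.
    # Within-sample variance: spread across the fragments of one sample.
    within_c = z_content.var(dim=1).mean()
    within_s = z_style.var(dim=1).mean()
    # Across-sample variance: spread of per-sample means across the batch
    # (a crude proxy for how much a code changes from sample to sample).
    across_c = z_content.mean(dim=1).var(dim=0).mean()
    across_s = z_style.mean(dim=1).var(dim=0).mean()
    # Style: invariant within a sample, variable across samples.
    # Content: variable within a sample, shared vocabulary across samples.
    return (within_s + torch.relu(margin - within_c)
            + across_c + torch.relu(margin - across_s))

loss = variance_vs_invariance_loss(torch.randn(8, 16, 64),
                                   torch.randn(8, 16, 64))
```

The hinge terms keep the "should vary" statistics from being pushed toward infinity, which a naive difference of variances would encourage.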
Motif-Centric Music Representation Learning
The formation of music structure relies heavily on the repetition and variation of musical motifs. Understanding how these motifs manifest and behave is crucial for effective music structure analysis and high-quality automatic music composition. However, the implicit nature of motifs makes them difficult to capture. In this study, we use deep learning to develop an effective method for learning robust representations of music motifs.
SingStyle111: A Multilingual Singing Dataset with Style Transfer
Singing voice research has long lacked publicly accessible data, particularly data with diverse languages and styles. We introduce SingStyle111, a studio-quality singing dataset featuring 111 songs by eight professional singers in English, Chinese, and Italian, spanning 12.8 hours. It covers bel canto opera, Chinese folk, pop, jazz, and children's singing, with 80 songs performed in multiple styles by the same singer. All recordings are clean, dry mono tracks (44.1 kHz) from professional studios, segmented into phrases with lyrics, MIDI, scores, and phoneme alignments. Acoustic features such as Mel-spectrograms, F0 contours, and loudness curves are also provided. SingStyle111 supports various MIR tasks, including Singing Voice Synthesis, Singing Transcription, Score Following, and Singing Style Transfer.
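As a sketch of the kinds of acoustic features the dataset ships with, the snippet below computes a log-Mel spectrogram, an F0 contour, and a simple loudness proxy from a 44.1 kHz mono phrase using librosa; the file path and all analysis parameters are hypothetical and may differ from the dataset's own precomputed features.

```python
# Sketch: computing Mel-spectrogram, F0, and loudness features of the kind
# SingStyle111 provides. "phrase_001.wav" and the parameters are assumptions.
import librosa
import numpy as np

y, sr = librosa.load("phrase_001.wav", sr=44100, mono=True)

# Log-Mel spectrogram.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel)

# F0 contour via probabilistic YIN (NaN where unvoiced).
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"),
    sr=sr, hop_length=256)

# A simple loudness proxy: frame-wise RMS energy in dB.
rms = librosa.feature.rms(y=y, hop_length=256)[0]
loudness_db = 20 * np.log10(np.maximum(rms, 1e-5))
```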
Timbre Transfer with Flexible Timbre Control
Timbre style transfer has been an intriguing but mysterious sub-topic of music style transfer. We use a concise autoencoder with one-hot instrument representations as the condition, together with a DiffWave model trained specifically for music synthesis. The results show that our method produces one-to-one style-transfer outputs comparable to the existing GAN-based method, and that it can transfer among multiple timbres with a single model.
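To illustrate the described setup, below is a minimal sketch of an autoencoder whose decoder is conditioned on a one-hot instrument label; the layer sizes, the use of Mel-spectrogram frames as input, and the class name are assumptions, not the exact model from the work.

```python
# Minimal sketch of a conditional autoencoder for timbre transfer.
# Architecture details are assumptions; only the one-hot conditioning
# idea comes from the abstract.
import torch
import torch.nn as nn

class ConditionalTimbreAE(nn.Module):
    def __init__(self, n_mels=80, latent=64, n_instruments=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_mels, 256), nn.ReLU(), nn.Linear(256, latent))
        # The decoder sees the latent code plus the one-hot target
        # instrument, so swapping the one-hot at inference transfers timbre.
        self.decoder = nn.Sequential(
            nn.Linear(latent + n_instruments, 256), nn.ReLU(),
            nn.Linear(256, n_mels))

    def forward(self, x, instrument_onehot):
        z = self.encoder(x)
        return self.decoder(torch.cat([z, instrument_onehot], dim=-1))

# Usage: re-synthesize frames as instrument 2; the decoded spectrogram
# would then be rendered to audio by the music-trained DiffWave vocoder.
frames = torch.randn(16, 80)
target = torch.nn.functional.one_hot(torch.full((16,), 2), 4).float()
recon = ConditionalTimbreAE()(frames, target)
```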
A to I
AI models now achieve impressive performance on tasks that were once considered exclusive to humans. As AI models grow more powerful, their role in co-creation expands. However, their rise also raises ethical concerns: will AI models replace humans, or harm them? In this song, we address these open questions by exploring the roles AI models could play in the co-creation process, and we respond to the ethical concerns from the perspective of the AI itself. The AI models in this song act not only as tools but also as collaborators, song and lyric writers, performers, storytellers, and even as mentor and first-person narrator.
Speech Anonymization with Pseudo Voice Conversion
The widespread adoption of speech-based online services raises security and privacy concerns about the data these services use and share. If the data were compromised, attackers could exploit user speech to bypass speaker verification systems or even impersonate users. To mitigate this, we propose DeID-VC, a speaker de-identification system that converts a real speaker's voice into that of pseudo speakers, removing or obfuscating the speaker-dependent attributes of the speech.
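One way to realize the pseudo-speaker idea, shown only as an assumption-laden sketch: draw a random speaker embedding sufficiently far from the source speaker's and hand it to a voice-conversion decoder. The `vc_decoder` call, the embedding size, and the distance threshold are hypothetical stand-ins, not DeID-VC's actual API.

```python
# Hypothetical sketch of pseudo-speaker sampling for de-identification.
# `vc_decoder` stands in for any VC model taking (content, speaker) inputs.
import numpy as np

rng = np.random.default_rng(0)

def sample_pseudo_speaker(src_emb: np.ndarray, min_dist: float = 1.0):
    """Draw unit-norm pseudo embeddings until one is far from the source."""
    while True:
        pseudo = rng.standard_normal(src_emb.shape)
        pseudo /= np.linalg.norm(pseudo)
        if np.linalg.norm(pseudo - src_emb) >= min_dist:
            return pseudo

src = rng.standard_normal(256)
src /= np.linalg.norm(src)
pseudo = sample_pseudo_speaker(src)
# anonymized = vc_decoder(content_features, pseudo)  # hypothetical call
```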
Project Ming
Project Ming is a Cycling '74 Max/MSP program that simulates the sound ambience of ancient Chinese cities. By moving the mouse over the map and pressing different keys, you can explore a colorful and realistic soundscape of ancient China. The project was completed during my time at the Berkeley summer school in 2019 and employs a variety of synthesis techniques.