InstructME: An Instruction Guided Music Edit Framework with Latent Diffusion Models
Abstract
Music editing primarily entails modifying instrument tracks or remixing a piece as a whole, offering a novel reinterpretation of the original through a series of operations. These music processing methods hold great potential across various applications but demand substantial expertise. Prior methods, although effective for image and audio modification, falter when applied directly to music. This is attributed to the distinctive nature of music data: such methods can inadvertently compromise the intrinsic harmony and coherence of the music. In this paper, we develop InstructME, an Instruction guided Music Editing and remixing framework based on latent diffusion models. Our framework fortifies the U-Net with multi-scale aggregation in order to maintain consistency before and after editing. In addition, we introduce a chord progression matrix as condition information and incorporate it in the semantic space to improve melodic harmony while editing. To accommodate extended musical pieces, InstructME employs a chunk transformer, enabling it to discern long-term temporal dependencies within music sequences. We evaluated InstructME on instrument editing, remixing, and multi-round editing. Both subjective and objective evaluations indicate that our proposed method significantly surpasses preceding systems in music quality, text relevance and harmony.
Overview of InstructME
Left: Overview of the InstructME diffusion process for music editing. The audio signal is processed by a VAE (encoder `\mathcal{E}` and decoder `\mathcal{D}`). Meanwhile, extractor `\mathcal{C}` extracts the chord matrix of the source music, which serves as condition information together with the text embedding extracted by `\mathcal{T}`. The latent embeddings `z_{s}` and `z_{t}` are fused by multi-scale aggregation and transformed by the chunk transformer to produce the final edited music. Right: Architecture of the chunk transformer (C-T) blocks, which selectively incorporate the chord or text embedding depending on their position in the U-Net; `z_{s}` is input only when the chunk transformer is in the downsampler.
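The caption does not spell out how the chunk transformer processes a long latent sequence. As a rough illustration only, the sketch below shows one common way to restrict attention to fixed-size chunks so that cost grows with the chunk length rather than with the full sequence length. Everything here is an assumption made for illustration: the function name `chunked_self_attention`, the `chunk_len` parameter, the non-overlapping chunk layout, and the use of identity query/key/value projections in place of learned ones. It is not the InstructME implementation.

```python
import numpy as np

def chunked_self_attention(z, chunk_len):
    """Self-attention computed independently inside non-overlapping chunks.

    z: array of shape (T, d), a latent sequence with T frames and d channels.
    chunk_len: hypothetical chunk size; the sequence is zero-padded so that
        its length becomes a multiple of chunk_len.
    Returns an array of shape (T, d).
    """
    T, d = z.shape
    pad = (-T) % chunk_len                      # frames needed to fill the last chunk
    z_pad = np.pad(z, ((0, pad), (0, 0)))
    chunks = z_pad.reshape(-1, chunk_len, d)    # (n_chunks, chunk_len, d)

    # Scaled dot-product scores within each chunk (identity projections).
    scores = chunks @ chunks.transpose(0, 2, 1) / np.sqrt(d)

    # Padded frames must not be attended to, so mask them out as keys.
    valid = (np.arange(T + pad) < T).reshape(-1, 1, chunk_len)
    scores = np.where(valid, scores, -np.inf)

    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    out = weights @ chunks                      # (n_chunks, chunk_len, d)
    return out.reshape(-1, d)[:T]               # drop the padded frames
```

In this simplified form, frames in different chunks do not interact at all. A real model would need some additional mechanism (for example stacking with downsampling, as in a U-Net) to carry information across chunks.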
Samples for atomic editing operations
Our InstructME supports atomic editing operations on music, including adding, removing, extracting, and replacing instruments.
Text Prompt: The command used for editing.
Source: The music before editing.
Target: The ground truth after music editing.
InstructME and AUDIT: The edited music generated by our proposed InstructME and baseline AUDIT respectively.
Atomic operations - Add
Text Prompt |
Source |
InstructME |
AUDIT |
Target |
Atomic operations - Remove
Text Prompt |
Source |
InstructME |
AUDIT |
Target |
Atomic operations - Extract
Atomic operations - Replace
Text Prompt |
Source |
InstructME |
AUDIT |
Target |
Samples for Remix operations
Remixing can be understood as an advanced form of music editing that combines several atomic operations while taking style and genre into account.
Accompaniment is the instrumental accompaniment only; original music is the accompaniment together with vocals. The harmony of the vocals makes it easier to evaluate the chord consistency of the edited music.
Remix
Text Prompt |
Source-accompaniment |
Source-original music |
InstructME-accompaniment |
InstructME-original music |
AUDIT-accompaniment |
AUDIT-original music |
Target-accompaniment |
Target-original music |
Remix with genres
Text Prompt |
Source-accompaniment |
Source-original music |
InstructME-accompaniment |
InstructME-original music |
AUDIT-accompaniment |
AUDIT-original music |
Target-accompaniment |
Target-original music |
Remix and Guided to soft mood music
Use guidance to control the mood of edited music. Guidance prompt: "a soft music".
Text Prompt |
Source-accompaniment |
Source-original music |
InstructME-accompaniment |
InstructME-original music |
AUDIT-accompaniment |
AUDIT-original music |
Target-accompaniment |
Target-original music |
Remix and Guided to happy mood music
Use guidance to control the mood of edited music. Guidance prompt: "a happy music".
Text Prompt |
Source-accompaniment |
Source-original music |
InstructME-accompaniment |
InstructME-original music |
AUDIT-accompaniment |
AUDIT-original music |
Target-accompaniment |
Target-original music |
Diversity and Stability
Different editing operations require different modeling capabilities.
For creativity-oriented tasks, including remixing, adding and replacing, InstructME can generate diverse edited results.
For tasks requiring precision, including extracting and removing, InstructME consistently generates results congruent with the ground truth.
Diversity in atomic editing tasks
Text Prompt |
Source |
InstructME-1 |
InstructME-2 |
InstructME-3 |
Target |
Diversity in remix tasks
Text Prompt |
Source-accompaniment |
Source-original music |
InstructME-accompaniment-1 |
InstructME-original music-1 |
InstructME-accompaniment-2 |
InstructME-original music-2 |
InstructME-accompaniment-3 |
InstructME-original music-3 |
Target-accompaniment |
Target-original music |
Stability in atomic editing tasks
Text Prompt |
Source |
InstructME-1 |
InstructME-2 |
InstructME-3 |
Target |
Real Data
We provide some examples of editing real music data.
Real song editing
Text Prompt |
Song Title |
Source |
InstructME |
Real song editing with vocal
Text Prompt |
Song Title |
Source |
InstructME |
Multi-round Editing
Because the model preserves consistency and harmony across edits, it also supports multiple rounds of editing, where each round takes the previous round's output as its source.
Method |
Source |
Command 1: Add acoustic guitar |
Command 2: Add drum kit |
Command 3: Replace acoustic guitar with piano |
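Multi-round editing amounts to chaining single edits, with each command applied to the result of the previous one. The sketch below illustrates that control flow only. The names `multi_round_edit` and `edit_fn` are hypothetical, and `edit_fn` stands in for whatever single-step editing model is used; no InstructME API is assumed.

```python
from typing import Callable, List, Sequence, TypeVar

Audio = TypeVar("Audio")  # placeholder for whatever audio representation is used

def multi_round_edit(
    source: Audio,
    commands: Sequence[str],
    edit_fn: Callable[[Audio, str], Audio],
) -> List[Audio]:
    """Apply text commands one after another, feeding each output forward.

    edit_fn is a hypothetical single-step editor: (audio, command) -> audio.
    Returns the intermediate result after every round, in order.
    """
    results = []
    current = source
    for command in commands:
        current = edit_fn(current, command)   # round k edits the output of round k-1
        results.append(current)
    return results

# The three commands shown in the table above, applied in order.
commands = [
    "Add acoustic guitar",
    "Add drum kit",
    "Replace acoustic guitar with piano",
]
```

Because errors made in one round are inherited by every later round, this kind of chaining is only useful when each individual edit leaves the untouched parts of the music intact.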
Long Music Editing
InstructME supports long music editing.
Text Prompt |
Source |
Duration |
InstructME |
AUDIT |