DDP error during multi-GPU fine-tuning with speech encoder parameters unfrozen

#63
by rumourscape - opened

I receive the following error when fine-tuning on multiple GPUs with DDP. I have also set ddp_find_unused_parameters=True:

[rank0]: RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
[rank0]: Parameter at index 875 with name model.embed_tokens_extend.audio_embed.encoder.encoders.23._checkpoint_wrapped_module.layer_norm.weight has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration.
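
For reference, the workaround the message itself points at is DDP's static-graph mode: if the autograd graph (including the checkpoint-wrapped encoder layers) really is identical on every iteration, DDP can be told so and will tolerate the extra autograd hooks. This is only a minimal sketch outside of Trainer, assuming a recent PyTorch, that torch.distributed is already initialized, and that model and local_rank are defined elsewhere:

import torch
from torch.nn.parallel import DistributedDataParallel

# Assumes torch.distributed.init_process_group(...) has already been called
# and that `model` and `local_rank` exist. static_graph=True declares that
# the same set of parameters participates in every iteration; it is the
# constructor-level equivalent of calling _set_static_graph() afterwards.
ddp_model = DistributedDataParallel(
    model.to(local_rank),
    device_ids=[local_rank],
    static_graph=True,
)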

I experienced a similar error. This is my setup:

import os
import multiprocessing as mp

from accelerate import Accelerator
from transformers import Trainer, TrainingArguments

# parse_args, load_data, load_model_processor, unfreeze_speech_components,
# DatasetProcessor, collate_fn, the callbacks, and logger are defined
# elsewhere in the script.


def main():
    os.environ["TOKENIZERS_PARALLELISM"] = "false"
    args = parse_args()
    accelerator = Accelerator()
    logger.info("Loading datasets")
    dataset = load_data(datasets_paths=args.datasets_paths, sample_count=None)
    with accelerator.local_main_process_first():
        logger.info("Loading model and processor")
        model, processor = load_model_processor(args.model)
        model = unfreeze_speech_components(model)
        # Verify unfrozen parameters
        trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
        logger.info(f"Trainable parameters: {trainable_params:,}")
        logger.info("Unfrozen components:")


        # After unfreezing
        encoder_params = list(model.model.embed_tokens_extend.audio_embed.encoder.parameters())
        proj_params = list(model.model.embed_tokens_extend.audio_embed.audio_projection.parameters())

        assert any(p.requires_grad for p in encoder_params), "Encoder params frozen!"
        assert any(p.requires_grad for p in proj_params), "Projection params frozen!"
        logger.info("Components properly unfrozen")

    logger.info("Processing dataset")
    train_processed_dataset = DatasetProcessor(
        split="train",
        dataset=dataset["train"],
        processor=processor,
    )

    validation_processed_dataset = DatasetProcessor(
        split="validation",
        dataset=dataset["validation"],
        processor=processor,
    )
    
    num_cpus = mp.cpu_count()

    try:
        training_args = TrainingArguments(
            ddp_find_unused_parameters=True,
            num_train_epochs=args.epochs,
            per_device_train_batch_size=args.train_batch_size,
            per_device_eval_batch_size=args.eval_batch_size,
            gradient_checkpointing=True,
            gradient_checkpointing_kwargs={'use_reentrant': False},
            gradient_accumulation_steps=args.gradient_accumulation_steps,
            optim='adamw_torch',
            adam_beta1=0.9,
            adam_beta2=0.95,
            adam_epsilon=1e-7,
            learning_rate=4.0e-5,
            weight_decay=args.weight_decay,
            max_grad_norm=1.0,
            lr_scheduler_type='linear',
            warmup_steps=args.num_warmup_steps,
            logging_steps=50,
            output_dir=os.path.join(args.output_dir, 'checkpoints'),
            save_total_limit=10,
            save_only_model=True,
            remove_unused_columns=False,
            report_to='none',
            deepspeed=None,
            dataloader_num_workers=num_cpus-4,
            save_strategy="epoch",
        )

        trainer = Trainer(
            model=model,
            args=training_args,
            data_collator=collate_fn,
            train_dataset=train_processed_dataset
        )

        save_processor_callback = SaveProcessorCallback(processor, accelerator, trainer)
        logging_callback = LoggingCallback()
        evaluation_callback = EvaluationCallback(model, processor, validation_processed_dataset, training_args)

        trainer.add_callback(save_processor_callback)
        trainer.add_callback(logging_callback)
        trainer.add_callback(evaluation_callback)
        logger.info("Starting training...")
        trainer.train()
        logger.info("Training completed successfully")

    except Exception as e:
        logger.error(f"Training failed: {e}")
        raise e
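
If you are going through Trainer as above, there is no TrainingArguments switch for static graph that I know of, so one option is to flip it on the DDP wrapper once it exists. Below is a sketch of that idea as a callback; it leans on Trainer internals (trainer.model_wrapped holding the DDP-wrapped module by the time on_train_begin fires), so treat it as a starting point rather than a guaranteed fix:

from torch.nn.parallel import DistributedDataParallel
from transformers import TrainerCallback

class StaticGraphCallback(TrainerCallback):
    """Enable DDP static-graph mode right before the first training step."""

    def __init__(self, trainer):
        self.trainer = trainer

    def on_train_begin(self, args, state, control, **kwargs):
        wrapped = self.trainer.model_wrapped
        # Only touch the private API if DDP wrapping actually happened.
        if isinstance(wrapped, DistributedDataParallel):
            wrapped._set_static_graph()

# after building the trainer in main():
# trainer.add_callback(StaticGraphCallback(trainer))

If that still trips the "marked ready twice" check, it may also be worth checking whether gradient_checkpointing=True in TrainingArguments is doubling up on checkpointing the model already applies internally; the _checkpoint_wrapped_module in the failing parameter name suggests the audio encoder layers are already checkpoint-wrapped inside the model.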

I had the same problem. Switching to the Docker training setup at https://github.com/anastasiosyal/phi4-multimodal-instruct-server/blob/main/dockerfile helped.
