Help for deepmd for MD simulation #1684

Open

Luckaswww opened this issue Dec 5, 2024 · 2 comments

@Luckaswww

Summary

I'm doing a dynamics simulation with a frozen model. The workflow is NEB calculations and drawing the resulting energy curves. Say my initial state is A, my intermediate state is B, and my final state is C. I performed NEB calculations for A-->B and A-->C on cluster M. Because my permissions on that cluster expired, I moved everything to cluster N and, to check for discrepancies, repeated the A-->B and A-->C NEB calculations there. The results show a slight difference in the energy curves, with an energy difference of 0.2 eV at the transition state. Is this caused by the cluster? On cluster M, deepmd is v2.2.6-1-g174f204a; on cluster N, deepmd is v2.2.6-1-g174f204a as well.
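
For reference, here is a minimal sketch of a single-point check that could be run on both clusters to see whether the frozen model itself evaluates differently; the graph file name, coordinates, and atom types below are placeholders, not taken from my actual system.

```python
# Minimal sketch (placeholder inputs): evaluate one configuration with the
# same frozen graph on cluster M and cluster N and compare the energies.
import numpy as np
from deepmd.infer import DeepPot  # deepmd-kit v2.x Python interface

dp = DeepPot("frozen_model.pb")                    # placeholder graph file name

coords = np.array([[0.0, 0.0, 0.0,                 # one frame, shape (nframes, natoms*3), Angstrom
                    0.0, 0.0, 1.5]])
cell = np.diag([10.0, 10.0, 10.0]).reshape(1, 9)   # 10 A cubic box, shape (nframes, 9)
atom_types = [0, 1]                                # indices into the model's type map

energy, force, virial = dp.eval(coords, cell, atom_types)
print("single-point energy (eV):", energy)         # compare this value between the two clusters
```

If the single-point energies agree, the discrepancy is more likely to come from the NEB run itself than from the model evaluation.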

In addition, I ran into MPI errors when I tried to do the B-->C calculation on cluster N. I consulted the person who set up the cluster, and his tests suggest the problem may be in deepmd itself. I really don't know what is causing it or how to fix it, so I am opening this issue to ask for help. The error output is below, and an example is attached.

DP-GEN Version

dpgen == 0.12.1

Platform, Python Version, etc

python == 3.12.4
dpdata == 0.2.19
dpdispatcher == 0.6.5
deepmd-kit == 2.2.6

Details

Error (1):

WARNING: There was an error initializing an OpenFabrics device.

Local host: cn9
Local device: hfi1_0


MPI_ABORT was invoked on rank 12 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

[cn9:47344] 15 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[cn9:47344] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[cn9:47344] 3 more processes have sent help message help-mpi-api.txt / mpi-abort

Error (2):

2024-12-05 18:51:06.726189: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2024-12-05 18:51:06.726267: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2024-12-05 18:51:06.725967: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 6
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5

Error (3):

2024-12-05 15:11:25.911162: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2024-12-05 15:11:25.912789: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
Fatal error in PMPI_Allgather: Other MPI error, error stack:
PMPI_Allgather(421).........................: MPI_Allgather(sbuf=0x7ffd3910cfd0, scount=4, MPI_DOUBLE, rbuf=0x55dc221066c0, rcount=4, datatype=MPI_DOUBLE, comm=comm=0x84000006) failed
MPIR_Allgather_impl(239)....................:
MPIR_Allgather_intra_auto(145)..............: Failure during collective
MPIR_Allgather_intra_auto(141)..............:
MPIR_Allgather_intra_recursive_doubling(216): Failure during collective
Fatal error in PMPI_Allgather: Message truncated, error stack:
PMPI_Allgather(421).........................: MPI_Allgather(sbuf=0x7ffcb0654ed0, scount=1, MPI_DOUBLE, rbuf=0x56528cc45a80, rcount=1, datatype=MPI_DOUBLE, comm=comm=0x84000006) failed
MPIR_Allgather_impl(239)....................:
MPIR_Allgather_intra_auto(145)..............: Failure during collective
MPIR_Allgather_intra_auto(141)..............:
MPIR_Allgather_intra_recursive_doubling(108):
MPIC_Sendrecv(340)..........................:
MPIDI_CH3U_Request_unpack_uebuf(516)........: Message truncated; 128 bytes received but buffer size is 32
MPIR_Allgather_intra_recursive_doubling(108):
MPIDI_CH3U_Receive_data_found(131)..........: Message from rank 2 and tag 7 truncated; 256 bytes received but buffer size is 128

@njzjz
Member

njzjz commented Dec 5, 2024

I don't know the details of your simulation, but if you use any random method, there is no guarantee that the same random seed reproduces the same result on different machines.
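
As a small illustration (not tied to this issue) of why bit-for-bit agreement across machines or MPI layouts is not expected: the same numbers summed in a different order give a slightly different floating-point total, and a long MD or NEB run can amplify such differences over many steps.

```python
import numpy as np

# Illustration only: identical seed and identical numbers, but a different
# summation order (as can happen with a different MPI rank count or layout)
# gives a slightly different floating-point result.
rng = np.random.default_rng(2024)
values = rng.standard_normal(1_000_000)

total_forward = values.sum()          # one reduction order
total_reverse = values[::-1].sum()    # another reduction order
print(total_forward - total_reverse)  # usually a tiny but nonzero difference
```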

@Luckaswww
Author

I don't know the details of your simulation, but if you use any random method, there is no guarantee that the same random seed reproduces the same result on different machines.

OK. Regarding the three MPI errors above, which occur when I calculate NEB via LAMMPS, do you have any suggestions for solving them?
