Help for deepmd for MD simulation #1684

Open

Luckaswww opened this issue Dec 5, 2024 · 2 comments

@Luckaswww

Summary

I'm doing a dynamics simulation with a frozen model. The workflow is NEB calculations and drawing the resulting energy curves. Say my initial state is A, my intermediate state is B, and my final state is C. I performed NEB calculations for A-->B and A-->C on cluster M. Because my permissions on that cluster expired, I moved everything to cluster N and, to check for discrepancies, repeated the A-->B and A-->C NEB calculations there. The results show a slight difference in the energy curves, with an energy difference of 0.2 eV at the transition state. Is this caused by the cluster? On cluster M, deepmd is v2.2.6-1-g174f204a; on cluster N, deepmd is v2.2.6-1-g174f204a as well.
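
For reference, here is a minimal sketch of a single-point check that could be run on both clusters to see whether the frozen model itself evaluates differently; the graph file name, coordinates, and atom types below are placeholders, not taken from my actual system.

```python
# Minimal sketch (placeholder inputs): evaluate one configuration with the
# same frozen graph on cluster M and cluster N and compare the energies.
import numpy as np
from deepmd.infer import DeepPot  # deepmd-kit v2.x Python interface

dp = DeepPot("frozen_model.pb")                    # placeholder graph file name

coords = np.array([[0.0, 0.0, 0.0,                 # one frame, shape (nframes, natoms*3), Angstrom
                    0.0, 0.0, 1.5]])
cell = np.diag([10.0, 10.0, 10.0]).reshape(1, 9)   # 10 A cubic box, shape (nframes, 9)
atom_types = [0, 1]                                # indices into the model's type map

energy, force, virial = dp.eval(coords, cell, atom_types)
print("single-point energy (eV):", energy)         # compare this value between the two clusters
```

If the single-point energies agree, the discrepancy is more likely to come from the NEB run itself than from the model evaluation.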

In addition, I ran into MPI errors when I tried to do the B-->C calculation on cluster N. I consulted the person who set up the cluster, and his tests suggest the problem may be in deepmd itself. I really don't know what is causing it or how to fix it, so I am opening this issue to ask for help. The error output is below, and an example is attached.

DP-GEN Version

dpgen == 0.12.1

Platform, Python Version, etc

python == 3.12.4
dpdata == 0.2.19
dpdispatcher == 0.6.5
deepmd-kit == 2.2.6

Details

Error (1):

WARNING: There was an error initializing an OpenFabrics device.

Local host: cn9
Local device: hfi1_0


MPI_ABORT was invoked on rank 12 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

[cn9:47344] 15 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[cn9:47344] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[cn9:47344] 3 more processes have sent help message help-mpi-api.txt / mpi-abort

Error (2):

2024-12-05 18:51:06.726189: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2024-12-05 18:51:06.726267: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2024-12-05 18:51:06.725967: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 6
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5

Error (3):

2024-12-05 15:11:25.911162: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2024-12-05 15:11:25.912789: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
Fatal error in PMPI_Allgather: Other MPI error, error stack:
PMPI_Allgather(421).........................: MPI_Allgather(sbuf=0x7ffd3910cfd0, scount=4, MPI_DOUBLE, rbuf=0x55dc221066c0, rcount=4, datatype=MPI_DOUBLE, comm=comm=0x84000006) failed
MPIR_Allgather_impl(239)....................:
MPIR_Allgather_intra_auto(145)..............: Failure during collective
MPIR_Allgather_intra_auto(141)..............:
MPIR_Allgather_intra_recursive_doubling(216): Failure during collective
Fatal error in PMPI_Allgather: Message truncated, error stack:
PMPI_Allgather(421).........................: MPI_Allgather(sbuf=0x7ffcb0654ed0, scount=1, MPI_DOUBLE, rbuf=0x56528cc45a80, rcount=1, datatype=MPI_DOUBLE, comm=comm=0x84000006) failed
MPIR_Allgather_impl(239)....................:
MPIR_Allgather_intra_auto(145)..............: Failure during collective
MPIR_Allgather_intra_auto(141)..............:
MPIR_Allgather_intra_recursive_doubling(108):
MPIC_Sendrecv(340)..........................:
MPIDI_CH3U_Request_unpack_uebuf(516)........: Message truncated; 128 bytes received but buffer size is 32
MPIR_Allgather_intra_recursive_doubling(108):
MPIDI_CH3U_Receive_data_found(131)..........: Message from rank 2 and tag 7 truncated; 256 bytes received but buffer size is 128

@njzjz
Member

njzjz commented Dec 5, 2024

I don't know the details of your simulation, but if you use any random method, there is no guarantee that the same random seed reproduces the same result on different machines.
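
As a small illustration (not tied to this issue) of why bit-for-bit agreement across machines or MPI layouts is not expected: the same numbers summed in a different order give a slightly different floating-point total, and a long MD or NEB run can amplify such differences over many steps.

```python
import numpy as np

# Illustration only: identical seed and identical numbers, but a different
# summation order (as can happen with a different MPI rank count or layout)
# gives a slightly different floating-point result.
rng = np.random.default_rng(2024)
values = rng.standard_normal(1_000_000)

total_forward = values.sum()          # one reduction order
total_reverse = values[::-1].sum()    # another reduction order
print(total_forward - total_reverse)  # usually a tiny but nonzero difference
```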

@Luckaswww
Author

I don't know the details of your simulation, but if you use any random method, there is no guarantee that the same random seed reproduces the same result on different machines.

OK. Regarding the three MPI errors above, which occur when I calculate NEB via LAMMPS, do you have any suggestions for solving them?
