In our production environment, after deploying a Marathon application, marathon-lb kept logging:
marathon_lb: server id collision for xxx was already assigned, retrying with xxx
As a result, the haproxy configuration file was not updated, so none of the new tasks could be reached.
To restore production access, we deleted the tasks of the old version; marathon-lb then generated a new haproxy configuration file successfully and the error logging stopped.
Unfortunately, we have not been able to reproduce the situation.
Environment details:
- mesos version: 1.8.2
- marathon version: 1.6.549
- marathon-lb version: 1.14.0
- haproxy version: 2.0.3
- marathon-lb mode: sse
- running mode: docker container
- before recovery, longest new server name: 5417 bytes after recursing 84 times
- before the "marathon_lb: server id collision" messages appeared, the stdout file already contained repeated "marathon_lb: backend server xxx on yyyy" lines
Our analysis:
For some reason we could not identify, backends were duplicated within one app or across apps, while the server id in haproxy is a global value.
The method calculate_server_id is recursive: it keeps calling itself until it finds a value that has not been assigned yet, and in the meantime the haproxy configuration file is not updated.
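To illustrate the analysis above, here is a minimal sketch of a hash-and-retry id allocator. It is not marathon-lb's actual calculate_server_id: the function toy_server_id, the constant MAX_SERVER_ID, the sample server name, and the choice to append a SHA-256 hex digest on every retry are all assumptions made for this example. It only shows that a duplicated backend always collides with its own first occurrence, and that every retry makes the server name longer.

```python
import hashlib

MAX_SERVER_ID = 32767  # assumed size of the id space, for this sketch only


def toy_server_id(server_name, taken_ids, depth=0):
    """Hypothetical stand-in for calculate_server_id.

    Returns (server_id, number_of_retries, final_name_length).
    """
    digest = hashlib.sha256(server_name.encode("utf-8")).hexdigest()
    server_id = int(digest, 16) % MAX_SERVER_ID + 1
    if server_id not in taken_ids:
        taken_ids.add(server_id)
        return server_id, depth, len(server_name)
    # Collision: retry with a longer name (assumed retry strategy).
    return toy_server_id(server_name + digest, taken_ids, depth + 1)


taken = set()  # server ids are global across all backends
name = "10_0_0_1_31000"  # hypothetical server name

first = toy_server_id(name, taken)
duplicate = toy_server_id(name, taken)  # same backend listed a second time

print("first:    ", first)
print("duplicate:", duplicate)
```

In this sketch the second call can never succeed on its first attempt, because the id derived from the unchanged name is already taken. Under the assumed strategy each retry adds 64 characters, so 84 retries would add 5376 bytes; that is in the same range as the 5417-byte name we observed, but it does not prove this is what happened in production.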
Suggested solution:
The backends property of MarathonService is a set, so the MarathonBackend object should define a hash method, which would let the set discard duplicate backends.
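A minimal sketch of that suggestion follows. The class Backend here is a hypothetical stand-in for MarathonBackend, and the attributes host, ip, port and draining, as well as the decision to identify a backend by (host, ip, port), are assumptions for illustration. Note that in Python a hash method alone is not enough: a set only treats two objects as duplicates if they also compare equal, so an equality method is needed as well.

```python
class Backend:
    """Hypothetical stand-in for MarathonBackend (attribute names assumed)."""

    def __init__(self, host, ip, port, draining=False):
        self.host = host
        self.ip = ip
        self.port = port
        self.draining = draining

    def _key(self):
        # Assumed identity of a backend: where it can be reached.
        return (self.host, self.ip, self.port)

    def __eq__(self, other):
        if not isinstance(other, Backend):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Must be consistent with __eq__: equal objects, equal hashes.
        return hash(self._key())


backends = set()
backends.add(Backend("agent-1", "10.0.0.1", 31000))
backends.add(Backend("agent-1", "10.0.0.1", 31000))  # duplicate report
backends.add(Backend("agent-2", "10.0.0.2", 31000))

print(len(backends))
```

With both methods defined, the duplicate entry is dropped when it is added to the set, so each distinct backend would be passed to the server id calculation only once.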