


  1. #1

    Heartbeat stopping by itself

    Hello everyone.
    I'm facing a strange problem here. DRBD + Heartbeat is working fine: when the primary goes down, the secondary takes over without trouble. But as the days go by, without the primary server actually going down, Heartbeat simply stops working and the primary server takes over. So I had a look at the log, and on the primary server it looks like this:

    Jul 3 05:31:58 cpd020 kernel: martian source 192.168.1.27 from 192.168.1.27, on dev eth1
    Jul 3 05:31:58 cpd020 kernel: ll header: ff:ff:ff:ff:ff:ff:00:08:54:1a:ef:3c:08:06
    Jul 3 05:31:58 cpd020 heartbeat[20857]: WARN: node cpd020.agrovale: is dead
    Jul 3 05:31:58 cpd020 heartbeat[20857]: ERROR: No local heartbeat. Forcing shutdown.
    Jul 3 05:31:58 cpd020 heartbeat[20857]: WARN: Late heartbeat: Node cpd021.agrovale: interval 4650 ms
    Jul 3 05:31:58 cpd020 heartbeat[20857]: info: hb_signal_giveup_resources(): current status: active
    Jul 3 05:31:58 cpd020 heartbeat[20857]: info: Heartbeat shutdown in progress. (20857)
    Jul 3 05:31:58 cpd020 heartbeat[20857]: WARN: node cpd020.agrovale: is dead
    Jul 3 05:31:58 cpd020 heartbeat[20857]: ERROR: No local heartbeat. Forcing shutdown.
    Jul 3 05:31:58 cpd020 heartbeat[24948]: info: Giving up all HA resources.
    Jul 3 05:32:00 cpd020 kernel: martian source 192.168.1.27 from 192.168.1.27, on dev eth1
    Jul 3 05:31:59 cpd020 heartbeat: info: Releasing resource group: cpd020.agrovale 192.168.1.27 datadisk smb postgresql
    Jul 3 05:32:03 cpd020 heartbeat[20853]: info: heartbeat: version 1.0.3
    Jul 3 05:32:01 cpd020 heartbeat[20857]: WARN: node cpd020.agrovale: is dead
    Jul 3 05:32:03 cpd020 heartbeat[20857]: ERROR: No local heartbeat. Forcing shutdown.
    Jul 3 05:32:03 cpd020 heartbeat[20857]: WARN: node cpd020.agrovale: is dead
    Jul 3 05:32:03 cpd020 heartbeat[20857]: ERROR: No local heartbeat. Forcing shutdown.
    Jul 3 05:32:03 cpd020 heartbeat[20857]: WARN: node cpd021.agrovale: is dead
    Jul 3 05:32:03 cpd020 kernel: ll header: ff:ff:ff:ff:ff:ff:00:08:54:1a:ef:3c:08:06
    Jul 3 05:32:03 cpd020 kernel: martian source 192.168.1.27 from 192.168.1.27, on dev eth1
    Jul 3 05:32:03 cpd020 kernel: ll header: ff:ff:ff:ff:ff:ff:00:08:54:1a:ef:3c:08:06
    Jul 3 05:32:03 cpd020 heartbeat[20857]: WARN: No STONITH device configured.
    Jul 3 05:32:03 cpd020 heartbeat[20857]: WARN: Shared disks are not protected.
    Jul 3 05:32:03 cpd020 heartbeat[20857]: info: Resource takeover cancelled - shutdown in progress.
    Jul 3 05:32:03 cpd020 heartbeat[20857]: info: Link cpd021.agrovale:eth0 dead.
    Jul 3 05:32:03 cpd020 heartbeat[20857]: WARN: Cluster node cpd021.agrovalereturning after partition.
    Jul 3 05:32:03 cpd020 heartbeat[20857]: WARN: Deadtime value may be too small.
    Jul 3 05:32:03 cpd020 heartbeat[20857]: info: See documentation for information on tuning deadtime.
    Jul 3 05:32:03 cpd020 heartbeat: info: Running /etc/init.d/postgresql stop
    Jul 3 05:32:03 cpd020 heartbeat[20857]: info: Link cpd021.agrovale:eth0 up.
    Jul 3 05:32:03 cpd020 heartbeat[20857]: WARN: Late heartbeat: Node cpd021.agrovale: interval 5160 ms
    Would anyone know what might be happening?
    Thanks in advance.

  2. #2


    Gilmar, did you set up a dedicated network for Heartbeat and DRBD?

    Because it looks like the other node is not receiving the keepalives configured in ha.cf...

    Could it be a problem in your network infrastructure?

    Regards,

    Marcos Pitanga
    System Engineer
    Gplus Energy Division
    High Performance Computing Expert
    www.gplus.com.br

  3. #3


    Pitanga, thanks for all the help you've given me so far.
    As for the network, yes, I did set one up: I put a network card in each server, connected with a crossover cable. The interesting thing is that the log I sent you was from the 3rd; here is the one from the 4th and 5th to give you an idea:
    Could it be a problem with the crossover cable or with the network cards?

    Jul 4 05:36:33 cpd020 heartbeat[25185]: info: Resource shutdown completed. Restart triggered.
    Jul 4 05:36:33 cpd020 heartbeat[25181]: info: control process Received SIGQUIT
    Jul 4 05:36:33 cpd020 heartbeat[25181]: info: Core process 25185 exited. 3 remaining
    Jul 4 05:36:33 cpd020 heartbeat[25181]: info: Core process 25184 exited. 2 remaining
    Jul 4 05:36:33 cpd020 heartbeat[25181]: info: Core process 25183 exited. 1 remaining
    Jul 4 05:36:33 cpd020 heartbeat[25181]: info: Heartbeat shutdown complete.
    Jul 4 05:36:33 cpd020 heartbeat[25181]: info: Restarting heartbeat.
    Jul 4 05:36:33 cpd020 heartbeat[25181]: info: Killing process 25183 with signal 15
    Jul 4 05:36:33 cpd020 heartbeat[25181]: info: Killing process 25184 with signal 15
    Jul 4 05:36:33 cpd020 heartbeat[25181]: info: Killing process 25185 with signal 15
    Jul 4 05:36:33 cpd020 heartbeat[25181]: info: Done killing processes for restart.
    Jul 4 05:36:33 cpd020 heartbeat[25181]: info: Performing heartbeat restart exec.
    Jul 4 05:36:33 cpd020 heartbeat[25181]: info: Closing files first...
    Jul 4 05:36:39 cpd020 heartbeat[25181]: WARN: WARNING: directive 'udp' replaced by 'bcast'
    Jul 4 05:36:39 cpd020 heartbeat[25181]: info: **************************
    Jul 4 05:36:39 cpd020 heartbeat[25181]: info: Configuration validated. Starting heartbeat 1.0.3
    Jul 4 05:36:39 cpd020 heartbeat[26509]: info: heartbeat: version 1.0.3
    Jul 4 05:36:40 cpd020 heartbeat[26509]: info: Heartbeat generation: 76
    Jul 4 05:36:40 cpd020 heartbeat[26509]: info: UDP Broadcast heartbeat started on port 694 (694) interface eth0
    Jul 4 05:36:40 cpd020 heartbeat[26511]: info: pid 26511 locked in memory.
    Jul 4 05:36:40 cpd020 heartbeat[26512]: info: pid 26512 locked in memory.
    Jul 4 05:36:40 cpd020 heartbeat[26513]: info: pid 26513 locked in memory.
    Jul 4 05:36:40 cpd020 heartbeat[26513]: info: Local status now set to: 'up'
    Jul 4 05:36:40 cpd020 heartbeat[26509]: info: pid 26509 locked in memory.
    Jul 4 05:36:40 cpd020 heartbeat[26513]: info: Link cpd020.agrovale:eth0 up.
    Jul 4 05:36:40 cpd020 heartbeat[26513]: info: Local status now set to: 'active'
    Jul 4 05:36:40 cpd020 heartbeat[26513]: info: Link cpd021.agrovale:eth0 up.
    Jul 4 05:36:40 cpd020 heartbeat[26513]: info: Status update for node cpd021.agrovale: status active
    Jul 4 05:36:40 cpd020 heartbeat: info: Running /etc/ha.d/rc.d/status status
    Jul 4 05:36:40 cpd020 heartbeat[26515]: info: Resource acquisition completed.
    Jul 4 05:36:40 cpd020 heartbeat: info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
    Jul 4 05:36:40 cpd020 heartbeat: received ip-request-resp 192.168.1.27 OK no
    Jul 4 05:36:41 cpd020 heartbeat: info: Acquiring resource group: cpd020.agrovale 192.168.1.27 datadisk smb postgresql
    Jul 4 05:36:41 cpd020 heartbeat: info: Running /etc/ha.d/resource.d/IPaddr 192.168.1.27 start
    Jul 4 05:36:41 cpd020 heartbeat: info: /sbin/ifconfig eth1:0 192.168.1.27 netmask 255.255.255.0^Ibroadcast 192.168.1.255
    Jul 4 05:36:41 cpd020 heartbeat: info: Sending Gratuitous Arp for 192.168.1.27 on eth1:0 [eth1]
    Jul 4 05:36:41 cpd020 heartbeat: /usr/lib/heartbeat/send_arp eth1 192.168.1.27 0008542325B1 192.168.1.27 ffffffffffff
    Jul 4 05:36:41 cpd020 heartbeat: info: Running /etc/ha.d/resource.d/datadisk start
    Jul 4 05:36:42 cpd020 kernel: drbd0: blksize=1024 B
    Jul 4 05:36:42 cpd020 kernel: drbd0: blksize=4096 B
    Jul 4 05:36:42 cpd020 kernel: kjournald starting. Commit interval 5 seconds
    Jul 4 05:36:42 cpd020 kernel: EXT3 FS 2.4-0.9.19, 19 August 2002 on drbd(43,0), internal journal
    Jul 4 05:36:42 cpd020 kernel: EXT3-fs: mounted filesystem with ordered data mode.
    Jul 4 05:36:42 cpd020 heartbeat: info: Running /etc/init.d/smb start
    Jul 4 05:36:42 cpd020 smb: smbd startup succeeded
    Jul 4 05:36:42 cpd020 smb: nmbd startup succeeded
    Jul 4 05:36:42 cpd020 heartbeat: info: Running /etc/init.d/postgresql start
    Jul 5 05:36:40 cpd020 heartbeat[26513]: info: Daily informational memory statistics
    Jul 5 05:36:40 cpd020 heartbeat[26513]: info: MSG stats: 100/85628 age 1 [pid26509/CONTROL]
    Jul 5 05:36:40 cpd020 heartbeat[26513]: info: ha_malloc stats: 2300/2226326 89600/49000 [pid26509/CONTROL]
    Jul 5 05:36:40 cpd020 heartbeat[26513]: info: RealMalloc stats: 91008 total malloc bytes. pid [26509/CONTROL]
    Jul 5 05:36:40 cpd020 heartbeat[26513]: info: Current arena value: 170816
    Jul 5 05:36:40 cpd020 heartbeat[26513]: info: MSG stats: 0/85628 age 1 [pid26511/HBWRITE]
    Jul 5 05:36:40 cpd020 heartbeat[26513]: info: ha_malloc stats: 0/2055070 0/0 [pid26511/HBWRITE]
    Jul 5 05:36:40 cpd020 heartbeat[26513]: info: RealMalloc stats: 1152 total malloc bytes. pid [26511/HBWRITE]
    Jul 5 05:36:40 cpd020 heartbeat[26513]: info: Current arena value: 56128
    Jul 5 05:36:40 cpd020 heartbeat[26513]: info: MSG stats: 0/171257 age 1 [pid26512/HBREAD]
    Jul 5 05:36:40 cpd020 heartbeat[26513]: info: ha_malloc stats: 0/4110170 0/0 [pid26512/HBREAD]
    Jul 5 05:36:40 cpd020 heartbeat[26513]: info: RealMalloc stats: 1248 total malloc bytes. pid [26512/HBREAD]
    Jul 5 05:36:40 cpd020 heartbeat[26513]: info: Current arena value: 56128
    Jul 5 05:36:40 cpd020 heartbeat[26513]: info: MSG stats: 0/342514 age 1 [pid26513/MST_STATUS]
    Jul 5 05:36:40 cpd020 heartbeat[26513]: info: ha_malloc stats: 0/6764649 0/0 [pid26513/MST_STATUS]
    Jul 5 05:36:40 cpd020 heartbeat[26513]: info: RealMalloc stats: 1600 total malloc bytes. pid [26513/MST_STATUS]
    Jul 5 05:36:40 cpd020 heartbeat[26513]: info: Current arena value: 88896
    Jul 5 05:36:40 cpd020 heartbeat[26513]: info: These are nothing to worry about.
    Jul 5 06:10:28 cpd020 heartbeat[26513]: WARN: Late heartbeat: Node cpd021.agrovale: interval 3330 ms

  4. #4


    After many attempts I found the problem: DRBD 0.6.3 was losing its connection with the primary server. Here is the tip for anyone facing the same problem:
    the problem was the timeout, which was set to 5; I had to raise it to 7, and now it no longer loses the connection with the secondary server.
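
    For anyone checking their own logs for the same symptom, here is a minimal Python sketch that pulls the "Late heartbeat" warnings out of syslog lines like the ones posted above, so you can see how late the peer's heartbeats are actually arriving. The function name `late_heartbeats` and the `sample` list are just illustrative; the regular expression assumes the message format shown in this thread (heartbeat 1.0.3) and may need adjusting for other versions.

    ```python
    import re

    # Matches heartbeat's "Late heartbeat" warning as it appears in the logs above
    # and captures the node name and the reported interval in milliseconds.
    LATE_RE = re.compile(r"WARN: Late heartbeat: Node (\S+): interval (\d+) ms")


    def late_heartbeats(lines):
        """Return a list of (node, interval_ms) for each 'Late heartbeat' warning."""
        found = []
        for line in lines:
            match = LATE_RE.search(line)
            if match:
                found.append((match.group(1), int(match.group(2))))
        return found


    # Sample lines copied from the logs in this thread.
    sample = [
        "Jul 3 05:31:58 cpd020 heartbeat[20857]: WARN: Late heartbeat: Node cpd021.agrovale: interval 4650 ms",
        "Jul 3 05:32:03 cpd020 heartbeat[20857]: info: Link cpd021.agrovale:eth0 up.",
        "Jul 3 05:32:03 cpd020 heartbeat[20857]: WARN: Late heartbeat: Node cpd021.agrovale: interval 5160 ms",
        "Jul 5 06:10:28 cpd020 heartbeat[26513]: WARN: Late heartbeat: Node cpd021.agrovale: interval 3330 ms",
    ]

    for node, interval_ms in late_heartbeats(sample):
        print(node, interval_ms)
    ```

    To run it against a real log, pass an open file instead of the sample list, for example `late_heartbeats(open("/var/log/messages"))` (the path is an assumption; use wherever your syslog writes heartbeat messages). Comparing the largest interval with the timeouts you have configured may help show whether the link is slow or the values are simply too tight.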