[Openais] problem running ocfs2/o2cb with openais/pacemaker

Andrew Beekhof andrew at beekhof.net
Mon Apr 12 05:25:55 PDT 2010


What versions of openais (corosync?) and pacemaker are you using?

On Mon, Apr 12, 2010 at 2:00 PM, Jürgen Herrmann
<Juergen.Herrmann at xlhost.de> wrote:
>
> Hi!
>
> I'm on Debian Lenny and trying to run OCFS2 on a dual-primary
> DRBD device. The DRBD device is already set up as msDRBD0.
>
> To get dlm_controld.pcmk, I installed it from source (from
> cluster-suite-3.0.10).
> Then I configured a resource "resDLM" as a clone with 2 instances:
>  primitive resDLM ocf:pacemaker:controld op monitor interval="120s"
>  clone cloneDLM resDLM meta globally-unique="false" interleave="true"
>  colocation colDLM_DRBD0 inf: cloneDLM msDRBD0:Master
>  order ordDRBD0_DLM inf: msDRBD0:promote cloneDLM:start
> -> This seems to work.
>
>
> To get ocfs2_controld.pcmk, I installed ocfs2-tools-1.4.3 from source.
> After adding the resource:
>  primitive resO2CB ocf:pacemaker:o2cb op monitor interval="120s"
>  clone cloneO2CB resO2CB meta globally-unique="false" interleave="true"
>  colocation colO2CB_DLM inf: cloneO2CB cloneDLM
>  order ordDLM_O2CB inf: cloneDLM cloneO2CB
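The two `order` constraints form a chain: msDRBD0 must be promoted before cloneDLM starts, and cloneDLM must start before cloneO2CB. The toy Python model below (using the standard-library `graphlib.TopologicalSorter`, Python 3.9+) just makes that resulting start sequence explicit. It assumes these two constraints are the only ones that matter (the colocations are ignored), and it is an illustration only, not how Pacemaker's policy engine actually computes a transition. The names `ORDER_CONSTRAINTS` and `start_sequence` are hypothetical.

```python
from graphlib import TopologicalSorter

# Each key may only run after every action in its set of predecessors.
# The edges mirror the two mandatory (inf:) order constraints in the config:
#   order ordDRBD0_DLM inf: msDRBD0:promote cloneDLM:start
#   order ordDLM_O2CB  inf: cloneDLM cloneO2CB   (no actions given, so start -> start)
ORDER_CONSTRAINTS = {
    "cloneDLM:start": {"msDRBD0:promote"},
    "cloneO2CB:start": {"cloneDLM:start"},
}

def start_sequence(constraints):
    """Return one ordering of the actions that satisfies every constraint."""
    return list(TopologicalSorter(constraints).static_order())

if __name__ == "__main__":
    for step, action in enumerate(start_sequence(ORDER_CONSTRAINTS), 1):
        print(step, action)
```

Because the graph is a simple chain, only one valid order exists. Any cycle introduced into the constraints would make `static_order()` raise `graphlib.CycleError`.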
>
> I get the following errors in crm_mon:
> ======================================
> Failed actions:
>    resO2CB:0_start_0 (node=app1b.xlhost.de, call=28, rc=1, status=complete): unknown error
>    resO2CB:0_start_0 (node=app1a.xlhost.de, call=38, rc=1, status=complete): unknown error
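Each failed-action line combines an operation key with an OCF return code. The Python sketch below splits such a line into its parts. It assumes Pacemaker's `<resource>_<action>_<interval-ms>` key layout (here `resO2CB:0` is clone instance 0, `start` the action and `0` the interval). It then maps `rc` through the standard OCF resource-agent exit codes, where 1 is `OCF_ERR_GENERIC`, the code crm_mon reports as "unknown error". The regular expression only targets the line shape quoted above. `parse_failed_action` and `OCF_RC` are hypothetical helper names.

```python
import re

# Standard OCF resource-agent exit codes as used by Pacemaker.
OCF_RC = {
    0: "OCF_SUCCESS",
    1: "OCF_ERR_GENERIC",
    2: "OCF_ERR_ARGS",
    3: "OCF_ERR_UNIMPLEMENTED",
    4: "OCF_ERR_PERM",
    5: "OCF_ERR_INSTALLED",
    6: "OCF_ERR_CONFIGURED",
    7: "OCF_NOT_RUNNING",
    8: "OCF_RUNNING_MASTER",
    9: "OCF_FAILED_MASTER",
}

# Matches the crm_mon "Failed actions" line shape quoted in the post.
FAILED_ACTION = re.compile(
    r"(?P<key>\S+) \(node=(?P<node>[^,]+), call=(?P<call>\d+), "
    r"rc=(?P<rc>\d+), status=(?P<status>[^)]+)\): (?P<reason>.+)"
)

def parse_failed_action(line):
    """Return the fields of one failed-action line, or None if it doesn't match."""
    m = FAILED_ACTION.search(line)
    if m is None:
        return None
    # Split from the right so underscores in the resource name survive.
    # Assumes the action itself has no underscore (true for start/stop/monitor).
    resource, action, interval = m["key"].rsplit("_", 2)
    rc = int(m["rc"])
    return {
        "resource": resource,
        "action": action,
        "interval_ms": int(interval),
        "node": m["node"],
        "call": int(m["call"]),
        "rc": rc,
        "rc_name": OCF_RC.get(rc, "unknown"),
        "reason": m["reason"],
    }

if __name__ == "__main__":
    line = ("resO2CB:0_start_0 (node=app1b.xlhost.de, call=28, rc=1, "
            "status=complete): unknown error")
    print(parse_failed_action(line))
```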
>
>
> The relevant syslog entries:
> ============================
> Apr 12 13:15:18 app1a corosync[4638]:   [pcmk  ] info: pcmk_notify: Enabling node notifications for child 8311 (0xd83090)
> Apr 12 13:15:18 app1a ocfs2_controld.pcmk: Error opening control device: Unable to access cluster service
>
>
>
> If I start "ocfs2_controld.pcmk -D", I get:
> ==========================================
> ocfs2_controld[18489]: 2010/04/12_13:40:39 info: init_ais_connection: Creating connection to our AIS plugin
> ocfs2_controld[18489]: 2010/04/12_13:40:39 info: init_ais_connection: AIS connection established
> ocfs2_controld[18489]: 2010/04/12_13:40:39 info: get_ais_nodeid: Server details: id=569559765 uname=app1a.xlhost.de cname=pcmk
> ocfs2_controld[18489]: 2010/04/12_13:40:39 info: crm_new_peer: Node app1a.xlhost.de now has id: 569559765
> ocfs2_controld[18489]: 2010/04/12_13:40:39 info: crm_new_peer: Node 569559765 is now known as app1a.xlhost.de
> 1271072439 setup_stack at 168: Cluster connection established.  Local node id: 569559765
> 1271072439 setup_stack at 172: Added Pacemaker as client 1 with fd 5
> 1271072439 setup_ckpt at 609: Initializing CKPT service (try 1)
> 1271072439 setup_ckpt at 615: Connected to CKPT service with handle 0x327b23c600000000
> 1271072439 call_ckpt_open at 160: Opening checkpoint "ocfs2:controld:21f2cad5" (try 1)
> 1271072439 call_ckpt_open at 170: Opened checkpoint "ocfs2:controld:21f2cad5" with handle 0x6633487300000000
> 1271072439 call_section_write at 340: Writing to section "daemon_max_protocol" on checkpoint "ocfs2:controld:21f2cad5" (try 1)
> 1271072439 call_section_create at 292: Creating section "daemon_max_protocol" on checkpoint "ocfs2:controld:21f2cad5" (try 1)
> 1271072439 call_section_create at 300: Created section "daemon_max_protocol" on checkpoint "ocfs2:controld:21f2cad5"
> 1271072439 call_section_write at 340: Writing to section "ocfs2_max_protocol" on checkpoint "ocfs2:controld:21f2cad5" (try 1)
> 1271072439 call_section_create at 292: Creating section "ocfs2_max_protocol" on checkpoint "ocfs2:controld:21f2cad5" (try 1)
> 1271072439 call_section_create at 300: Created section "ocfs2_max_protocol" on checkpoint "ocfs2:controld:21f2cad5"
> 1271072439 start_join at 588: Starting join for group "ocfs2:controld"
> 1271072439 start_join at 592: cpg_join succeeded
> 1271072439 loop at 975: setup done
> ocfs2_controld[18489]: 2010/04/12_13:40:39 notice: ais_dispatch: Membership 156: quorum acquired
> ocfs2_controld[18489]: 2010/04/12_13:40:39 info: crm_update_peer: Node app1a.xlhost.de: id=569559765 state=member (new) addr=r(0) ip(213.202.242.161)  (new) votes=1 (new) born=156 seen=156 proc=00000000000000000000000000013312 (new)
> ocfs2_controld[18489]: 2010/04/12_13:40:39 info: crm_new_peer: Node app1b.xlhost.de now has id: 586336981
> ocfs2_controld[18489]: 2010/04/12_13:40:39 info: crm_new_peer: Node 586336981 is now known as app1b.xlhost.de
> ocfs2_controld[18489]: 2010/04/12_13:40:39 info: crm_update_peer: Node app1b.xlhost.de: id=586336981 state=member (new) addr=r(0) ip(213.202.242.162)  votes=1 born=148 seen=156 proc=00000000000000000000000000013312
> 1271072439 confchg_cb at 495: confchg called
> 1271072439 daemon_change at 398: ocfs2_controld (group "ocfs2:controld") confchg: members 1, left 0, joined 1
> 1271072439 cpg_joined at 909: CPG is live, we are the first daemon
> 1271072439 call_ckpt_open at 160: Opening checkpoint "ocfs2:controld" (try 1)
> 1271072439 call_ckpt_open at 170: Opened checkpoint "ocfs2:controld" with handle 0x2ae8944a00000001
> 1271072439 call_section_write at 340: Writing to section "daemon_protocol" on checkpoint "ocfs2:controld" (try 1)
> 1271072439 call_section_create at 292: Creating section "daemon_protocol" on checkpoint "ocfs2:controld" (try 1)
> 1271072439 call_section_create at 300: Created section "daemon_protocol" on checkpoint "ocfs2:controld"
> 1271072439 call_section_write at 340: Writing to section "ocfs2_protocol" on checkpoint "ocfs2:controld" (try 1)
> 1271072439 call_section_create at 292: Creating section "ocfs2_protocol" on checkpoint "ocfs2:controld" (try 1)
> 1271072439 call_section_create at 300: Created section "ocfs2_protocol" on checkpoint "ocfs2:controld"
> 1271072439 cpg_joined at 923: Daemon protocol is 1.0
> 1271072439 cpg_joined at 925: fs protocol is 1.0
> 1271072439 cpg_joined at 927: Connecting to dlm_controld
>>>>>>>>>>>>>>>>>>>>>>>>> here's the error <<<<<<<<<<<<<<<<<<<<<<
> 1271072439 cpg_joined at 934: Opening control device
> 1271072439 cpg_joined at 938: Error opening control device: Unable to access cluster service
> 1271072439 exit_dlmcontrol at 363: Closing dlm_controld connection
> 1271072439 start_leave at 613: leaving group "ocfs2:controld"
> 1271072439 start_leave at 626: cpg_leave succeeded
> 1271072439 exit_cpg at 760: closing cpg connection
> 1271072439 call_ckpt_close at 240: Closing checkpoint "ocfs2:controld:21f2cad5" (try 1)
> 1271072439 call_ckpt_close at 246: Closed checkpoint "ocfs2:controld:21f2cad5"
> 1271072439 exit_ckpt at 643: Disconnecting from CKPT service (try 1)
> 1271072439 exit_ckpt at 647: Disconnected from CKPT service
> 1271072439 exit_stack at 144: closing pacemaker connection
> ocfs2_controld[18489]: 2010/04/12_13:40:39 notice: terminate_ais_connection: Disconnected from AIS
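The `-D` debug lines from ocfs2_controld.pcmk have a regular shape: a Unix timestamp, a function name, `at`, a number, and the message. The Python sketch below splits them and converts the timestamp so the daemon's lines can be lined up with the syslog and crm_mon times. The field meanings are inferred from the output above rather than taken from ocfs2-tools documentation, and treating the number after `at` as a source line number is an assumption. `parse_debug_line` and `errors` are hypothetical helper names.

```python
import re
from datetime import datetime, timezone

# "<epoch> <function> at <n>: <message>", as seen in the ocfs2_controld -D output.
# Reading <n> as a source line number is an assumption, not documented behaviour.
DEBUG_LINE = re.compile(r"^(\d+) (\w+) at (\d+): (.*)$")

def parse_debug_line(line):
    """Split one debug line into its fields, or return None if it doesn't match."""
    m = DEBUG_LINE.match(line.lstrip("> ").rstrip())  # tolerate mail quoting
    if m is None:
        return None
    epoch, func, src_line, message = m.groups()
    return {
        "time": datetime.fromtimestamp(int(epoch), tz=timezone.utc),
        "function": func,
        "source_line": int(src_line),
        "message": message,
    }

def errors(lines):
    """Yield the parsed lines whose message starts with 'Error'."""
    for line in lines:
        rec = parse_debug_line(line)
        if rec is not None and rec["message"].startswith("Error"):
            yield rec

if __name__ == "__main__":
    sample = [
        "1271072439 cpg_joined at 934: Opening control device",
        "1271072439 cpg_joined at 938: Error opening control device: "
        "Unable to access cluster service",
    ]
    for rec in errors(sample):
        print(rec["time"].isoformat(), rec["function"], rec["source_line"], rec["message"])
```

For this run, epoch 1271072439 decodes to 2010-04-12 11:40:39 UTC. That matches the 13:40:39 timestamps on the other lines if the hosts run on CEST (UTC+2).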
>
>
> Obviously ocfs2_controld.pcmk can connect to the openais CKPT service
> and to dlm_controld.pcmk, after which the connection is terminated.
> Here's the output from "dlm_controld.pcmk -q 0 -D"
> (the last 6 lines show 3 connection attempts from ocfs2_controld.pcmk!):
> =======================================================================
> 1271072755 dlm_controld 3.0.10 started
> cluster-dlm[20608]: 2010/04/12_13:45:55 info: init_ais_connection: Creating connection to our AIS plugin
> cluster-dlm[20608]: 2010/04/12_13:45:55 info: init_ais_connection: AIS connection established
> cluster-dlm[20608]: 2010/04/12_13:45:55 info: get_ais_nodeid: Server details: id=569559765 uname=app1a.xlhost.de cname=pcmk
> cluster-dlm[20608]: 2010/04/12_13:45:55 info: crm_new_peer: Node app1a.xlhost.de now has id: 569559765
> cluster-dlm[20608]: 2010/04/12_13:45:55 info: crm_new_peer: Node 569559765 is now known as app1a.xlhost.de
> 1271072755 found /dev/misc/dlm-control minor 58
> 1271072755 found /dev/misc/dlm-monitor minor 57
> 1271072755 found /dev/misc/dlm_plock minor 56
> 1271072755 /dev/misc/dlm-monitor fd 9
> 1271072755 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
> 1271072755 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
> 1271072755 confdb_key_get error 11
> 1271072755 group_mode 3 compat 0
> 1271072755 setup_cpg_daemon 11
> 1271072755 dlm:controld conf 2 1 0 memb 569559765 586336981 join 569559765 left
> 1271072755 run protocol from nodeid 586336981
> 1271072755 daemon run 1.1.1 max 1.1.1 kernel run 1.1.1 max 1.1.1
> 1271072755 plocks 13
> 1271072755 plock cpg message size: 104 bytes
> cluster-dlm[20608]: 2010/04/12_13:45:55 notice: ais_dispatch: Membership 156: quorum acquired
> cluster-dlm[20608]: 2010/04/12_13:45:55 info: crm_update_peer: Node app1a.xlhost.de: id=569559765 state=member (new) addr=r(0) ip(213.202.242.161)  (new) votes=1 (new) born=156 seen=156 proc=00000000000000000000000000013312 (new)
> cluster-dlm[20608]: 2010/04/12_13:45:55 info: crm_new_peer: Node app1b.xlhost.de now has id: 586336981
> cluster-dlm[20608]: 2010/04/12_13:45:55 info: crm_new_peer: Node 586336981 is now known as app1b.xlhost.de
> cluster-dlm[20608]: 2010/04/12_13:45:55 info: crm_update_peer: Node app1b.xlhost.de: id=586336981 state=member (new) addr=r(0) ip(213.202.242.162)  votes=1 born=148 seen=156 proc=00000000000000000000000000013312
> 1271072755 Processing membership 156
> 1271072755 Adding address ip(213.202.242.161) to configfs for node 569559765
> 1271072755 set_configfs_node 569559765 213.202.242.161 local 1
> 1271072755 Added active node 569559765: born-on=156, last-seen=156, this-event=156, last-event=0
> 1271072755 Adding address ip(213.202.242.162) to configfs for node 586336981
> 1271072755 set_configfs_node 586336981 213.202.242.162 local 0
> 1271072755 Added active node 586336981: born-on=148, last-seen=156, this-event=156, last-event=0
> 1271072763 client connection 5 fd 14
> 1271072763 connection 5 read error -1
> 1271072776 client connection 5 fd 14
> 1271072776 connection 5 read error -1
> 1271072779 client connection 5 fd 14
> 1271072779 connection 5 read error -1
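The observation that the last six lines are three connection attempts can be checked mechanically. Each attempt shows up as a `client connection N fd M` line followed by a `connection N read error -1` line for the same connection number. The Python sketch below pairs them up. The two line patterns are copied from the dlm_controld output above and may not cover other dlm_controld versions. `failed_attempts` is a hypothetical helper name.

```python
import re

# Line shapes copied from the dlm_controld -D output quoted in the post.
OPEN = re.compile(r"^(\d+) client connection (\d+) fd (\d+)$")
READ_ERROR = re.compile(r"^\d+ connection (\d+) read error (-?\d+)$")

def failed_attempts(lines):
    """Pair each 'client connection' line with the read error that follows it.

    Returns a list of (epoch, connection, fd) tuples, one per failed attempt.
    """
    pending = {}  # connection number -> (epoch, fd) of the most recent open
    failures = []
    for raw in lines:
        line = raw.lstrip("> ").rstrip()  # drop mail quoting and trailing space
        m = OPEN.match(line)
        if m:
            epoch, conn, fd = (int(g) for g in m.groups())
            pending[conn] = (epoch, fd)
            continue
        m = READ_ERROR.match(line)
        if m:
            conn = int(m.group(1))
            if conn in pending:
                epoch, fd = pending.pop(conn)
                failures.append((epoch, conn, fd))
    return failures

if __name__ == "__main__":
    tail = """\
1271072763 client connection 5 fd 14
1271072763 connection 5 read error -1
1271072776 client connection 5 fd 14
1271072776 connection 5 read error -1
1271072779 client connection 5 fd 14
1271072779 connection 5 read error -1"""
    for epoch, conn, fd in failed_attempts(tail.splitlines()):
        print(f"{epoch}: connection {conn} (fd {fd}) read error")
```

Run against the quoted tail, it reports three failed attempts at 1271072763, 1271072776 and 1271072779, all on connection 5 and fd 14.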
>
>
>
> I'm pretty lost at the moment, as I can't find anything via Google
> about the core problem:
> 1271072439 cpg_joined at 934: Opening control device
> 1271072439 cpg_joined at 938: Error opening control device: Unable to access cluster service
>
>
> Any help would be greatly appreciated.
>
> Best regards,
> Jürgen Herrmann
> _______________________________________________
> Openais mailing list
> Openais at lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais

