Saturday, November 30, 2013

Using Nagios++ Agent for Monitoring Windows Systems (Part 2, the Windows Agent and its installation)

In the previous post I explained how monitoring is generally designed nowadays. From there, the next step is getting the framework up and running.
In this series of posts I will not elaborate on the Nagios Server; for an idea of how to install the Nagios Server, please refer to this post.

Design of the Windows Nagios++ Agent

The Nagios Agent for Windows is available via the following link. The design of the Nagios++ Agent is shown in the next image:
[Image: Nagios engine]
The Agent consists of the following components:
  • An engine that is capable of running checks on Windows systems
  • A configuration .ini file that defines what the engine checks and how those checks are executed
  • A collection of classes (modules) that are needed to execute the checks
  • An inbox, in which the Agent collects tasks from the Nagios monitoring server
  • An outbox, which is needed for the transfer of results to the Nagios monitoring server
The Nagios Server periodically tells the Agent, via the inbox, what to check. After a check has been executed, the results are stored in the outbox until the Nagios Server collects them on its next run.
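The inbox and outbox flow described above can be pictured with a small model. This is only a sketch in Python under my own assumptions: the class name Agent, the check names and the result strings are all made up for illustration and are not how the real agent is implemented.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List, Tuple


@dataclass
class Agent:
    """Toy model of the agent: checks, an inbox of tasks, an outbox of results."""
    checks: Dict[str, Callable[[], str]]
    inbox: Deque[str] = field(default_factory=deque)
    outbox: List[Tuple[str, str]] = field(default_factory=list)

    def run_pending(self) -> None:
        """Execute every task waiting in the inbox and store the result in the outbox."""
        while self.inbox:
            name = self.inbox.popleft()
            check = self.checks.get(name)
            result = check() if check else "UNKNOWN: no such check"
            self.outbox.append((name, result))

    def collect(self) -> List[Tuple[str, str]]:
        """Hand all stored results to the server and empty the outbox."""
        results, self.outbox = self.outbox, []
        return results
```

In this model the server would append check names to inbox, and on its next run call collect() to pick up whatever the agent has finished in the meantime.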

The actual installation

The Nagios Agent for Windows is available in two forms for installation: an installer file (.MSI) and a Zip file. Both are sufficient to get the agent onto a Windows system. The preferred way to install a Windows application is via .MSI, because that mechanism offers a lot of supporting features for installations on Windows systems that the Zip file does not have.
To get a Windows node fully running in Nagios, the installation of the agent needs 3 components:
  • .ini configuration file
  • Nagios MSI installer
  • Supporting classes and modules
[Image: Windows Agent installation]
All three components must be installed on the Windows node, in a folder on one of its drives (preferably the default Nagios installation folder C:\Program Files\NSCP).
These components can be distributed via a software distribution system, by a script installation or by hand. The preferred way is to use a software distribution system.
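For the script installation route, a silent MSI install could be prepared roughly like this. It is a minimal sketch in Python that only builds the command line and does not run it. It assumes the standard Windows Installer switches /i (install), /qn (no user interface), /norestart and /l*v (verbose log); the function name and the file names in the usage note are hypothetical.

```python
from typing import List


def build_msi_install_command(msi_path: str, log_path: str) -> List[str]:
    """Return a msiexec command line for an unattended install with a verbose log."""
    return [
        "msiexec",
        "/i", msi_path,      # install the given package
        "/qn",               # no user interface
        "/norestart",        # do not reboot automatically
        "/l*v", log_path,    # write a verbose installation log
    ]
```

A script could pass the result of build_msi_install_command("agent.msi", "agent-install.log") to subprocess.run on the Windows node and then copy the .ini file and the modules into the installation folder.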

Network requirements

For a Nagios Agent to work, two ports must be opened locally: TCP port 5666 (for incoming Server requests to the inbox) and TCP port 5667 (for outgoing communication to the Monitoring server).
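Whether a port is actually reachable can be tested with a simple TCP connection attempt. The sketch below uses only the Python standard library; the function name tcp_port_open and the host name in the usage note are my own placeholders, and a successful connect only shows that something is listening, not that it is the agent.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable or unresolvable: treat as closed.
        return False
```

Run from the monitoring server, tcp_port_open("windows-node", 5666) would test the incoming port on a node with that (hypothetical) name; run from the node against the server, the same call with port 5667 tests the outgoing direction.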
That's it for this part. In the next post on this subject I will elaborate on the management of the Nagios Agent for Windows.

Friday, November 29, 2013

Using Nagios++ Agent for Monitoring Windows Systems (Part 1, introduction)

Hello again. In this series of posts I will elaborate on the subject of monitoring Windows systems with Nagios++ and subsequent systems based on Nagios.

A short introduction to monitoring computer systems:

Computer system monitoring is the practice of collecting representative counters from the computer systems whose status you want to know.
With monitoring there are two endpoints ‘talking’ to each other: Server and Client. The Server’s primary role is to collect counters of Clients. The Client presents a set of counters to the Server system.

Broadly speaking, monitoring knows two methods of collecting these counters: Client push and Server pull.
[Images: basic monitoring 1, basic monitoring 2]

Ways of communication

There are 3 communication methods known to monitoring components:
  1. Real time monitoring
  2. Scheduled monitoring
  3. Triggered monitoring
With a client push system, the client pushes counter data to the Server; this can be a real-time push or a scheduled push of data. The other way around, a server system can poll a client in ‘real time’ for certain counters or do timed checks on a client. A third way of monitoring is based on creating a baseline of a monitored system and subsequently registering only deviations from this baseline.
The main problem with real-time monitoring is the communication between the systems: with this methodology the client must always have access to the server system, and it must also have a way to deliver its data to the ‘right destination desk’ on the Server system. With a limited number of systems this can work without any additional overhead, but when the number of client systems increases, the problem of communication and counter delivery becomes prominent. This way of working only works well when the communication is governed by a very strict management system, and even then it becomes unmanageable when too many systems are plugged in.
With scheduled communication as the means of monitoring, the aforementioned problems become less significant. The initiating component of the monitoring system fetches the data of a system at fixed times; it already knows the data will come from system xyz, and therefore stores it in the place where it holds all data of system xyz.
With the 3rd communication method, data transfer between monitoring components is kept to the very minimum: the agent reports to the server only when it detects a deviation from a known baseline. The exception is when the agent is unable to communicate; in that scenario an action is triggered on the monitoring server.
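To make the baseline idea concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that the baseline is the mean and sample standard deviation of earlier counter values and that a deviation is anything more than a few standard deviations away; real monitoring products may define baselines quite differently.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def build_baseline(samples: Sequence[float]) -> Tuple[float, float]:
    """Summarise historical counter values as (average, spread)."""
    return mean(samples), stdev(samples)


def deviates(value: float, baseline: Tuple[float, float], tolerance: float = 3.0) -> bool:
    """Return True when value lies more than `tolerance` spreads from the average."""
    average, spread = baseline
    return abs(value - average) > tolerance * spread
```

An agent working this way would stay silent as long as deviates() returns False and only contact the server when it returns True.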

Monitoring Agents

The general way to get monitoring like this working is to use agents on client devices. Agents have one primary function: to execute a task presented by the server component of the Monitoring system.
Monitoring agents come in two categories: the passive agent and the active agent.
The main difference between the two is the way the agent collects its counters and presents them to the Server.

Passive Agent

Generally, passive agents are used by monitoring systems that are based on scheduled device polling. It works as follows:
  • The agent is installed via software distribution or locally by hand
  • During installation, or later from an external source, the agent gets its configuration pushed
  • In its configuration, roles are defined that determine to whom the agent may talk and what the agent can monitor
  • At a certain point in time the agent receives an external action request
  • The first step is a check whether the agent is allowed to talk to the external source; if not, the request is dropped
  • When the request is valid the agent tries to execute it; if it cannot, the request is dropped
  • The agent reports its status back to the external source
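The request handling in the steps above can be sketched as follows. This is an illustration in Python, not the behaviour of any particular agent: the function name, the use of None for a dropped request and the shape of the configuration (a set of allowed hosts and a dictionary of checks) are assumptions of mine.

```python
from typing import Callable, Dict, Optional, Set


def handle_request(
    source: str,
    check_name: str,
    allowed_hosts: Set[str],
    checks: Dict[str, Callable[[], str]],
) -> Optional[str]:
    """Return the check result, or None when the request is dropped."""
    # Step 1: is the agent allowed to talk to this source?
    if source not in allowed_hosts:
        return None
    # Step 2: can the agent execute the requested check at all?
    check = checks.get(check_name)
    if check is None:
        return None
    # Step 3: execute the check; a failing check also drops the request.
    try:
        return check()
    except Exception:
        return None
```

Anything other than None would then be reported back to the external source as the agent's status.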

Active Agent

The active agent differs from the passive agent in the way it is configured and the way it communicates with the server. Agents like this generally tend to use the 3rd communication method. It works like this:
  • Generally, the installation of active agents begins with a discovery of systems by a monitoring server
  • When discovered, the system is matched against criteria specified on the server; if they match, an agent installation is started
  • After its installation, the agent tries to connect to a monitoring server by itself (generally the server that presented the agent)
  • When a server is found the agent requests configuration data; the Server sends configuration data of all available modules to the client
  • The agent downloads the configuration data, compares it to the local configuration and requests the modules and classes for the configuration it found
  • The server sends the configuration and classes to the agent
  • The agent applies the logic, runs checks and sends this data to the server
  • Apart from the heartbeat, no more communication between agent and server is started, except when the agent detects a deviation from a monitoring rule or a change in configuration on the client (like the installation of a new application, service etc.)
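The comparison step, in which the agent works out which modules it still needs, could look something like the sketch below. It is written in Python under the assumption that both configurations can be represented as a mapping from module name to version string; that representation and the function name are hypothetical.

```python
from typing import Dict, List


def modules_to_request(server_config: Dict[str, str], local_config: Dict[str, str]) -> List[str]:
    """Return the names of modules that are missing locally or differ in version."""
    return sorted(
        name
        for name, version in server_config.items()
        if local_config.get(name) != version
    )
```

The agent would then ask the server only for the modules in the returned list instead of downloading everything again.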

‘Alive checking’

Whether a system is alive is checked differently by the two types of agent. With passive agents, servers tend to check for life by doing regular pings. With active agents, the agent itself maintains a ‘heartbeat’ to the monitoring server; when the heartbeat stops, an action is triggered after a predefined number of missed heartbeats.
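On the server side, counting missed heartbeats comes down to a little arithmetic on timestamps. The class below is a minimal sketch in Python; the names, the use of plain seconds as timestamps and the rule "down after max_missed whole intervals without a beat" are assumptions for the sake of the example.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat of one agent and counts missed intervals."""

    def __init__(self, interval: float, max_missed: int, started_at: float = 0.0) -> None:
        self.interval = interval      # expected seconds between heartbeats
        self.max_missed = max_missed  # missed heartbeats before an action is triggered
        self.last_seen = started_at

    def beat(self, now: float) -> None:
        """Record a heartbeat received at time `now`."""
        self.last_seen = now

    def missed(self, now: float) -> int:
        """Number of whole intervals that have passed since the last heartbeat."""
        return int((now - self.last_seen) // self.interval)

    def is_down(self, now: float) -> bool:
        """True once the predefined number of heartbeats has been missed."""
        return self.missed(now) >= self.max_missed
```

With an interval of 60 seconds and max_missed set to 3, an agent last heard from at second 100 would be flagged from second 280 onwards.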

So how does Nagios fit into this picture?

Nagios is an open source monitoring system that is based on a passive agent. To apply Nagios to a Windows-based platform, a few ‘roadblocks’ will have to be cleared. In the next article on this subject I will elaborate on the installation of Nagios agents on Windows-based machines.

Wednesday, November 13, 2013

Reinstall SCCM 2012 CAS server

This week I have been busy installing System Center Configuration Manager 2012 SP1 on a big new delivery site my company is building.

Long story short: installing SCCM 2012 in a new environment is not that difficult, BUT when you experience an entire system going down during an upgrade of Configuration Manager 2012 from SP1 to R2, the recovery of such a hierarchy is quite a challenge.
This is the situation (I use CM12 as shorthand for System Center Configuration Manager 2012):

  • CM12 CAS based on SP1 named CasServer installed with CM12 SP1 CU3
  • CM12 Primary site as child of the previous CAS installed with CM12 SP1 CU3
  • The CAS has a separate (physical) SQL Server 2012 machine with a name like SQLServer
  • The Primary, named PriServer, has SQL Server 2012 installed locally as its own database server
[Image: sccmsitecas-pri]

  • The CAS was in the middle of the upgrade from CM12 SP1 to R2
The entire cluster went down at the moment the setup was configuring the CAS database. After the recovery of the cluster my CM12 configuration was just hopeless: none of the consoles worked and the services were complaining about everything.

I tried the following recovery scenarios:
  • Restore of the CAS DB on SQLServer: SUCCESS
  • Recovery of the CAS with the option: "recover site from manually restored database" : SUCCESS
  • Check of full functionality in CM12 manager: FAIL - Database replication from CAS to PRI failed with error message:
  • Next scenario: Upgrade CAS to CM12 R2: FAIL - replication will not start because of inconsistent content in the queue of the Primary site server
  • Next scenario: Uninstall PRI and CAS and reinstall CAS, recover old CAS DB and restart CM12: SUCCESS, though it has some catches that I will explain here.
How to recover a CAS while the current CAS DB is in an unknown state:
  • Have a good CAS database backup ready and online to use
  • Uninstall all connected primary sites from the CAS (how you do that is another subject I will try to explore later)
  • Uninstall the CAS and remove the database while uninstalling (setup offers this as an option)
  • Open the registry on SQLServer. Back up and then delete the registry key tree: HKLM\SOFTWARE\Microsoft\SMS
  • Install the CAS as a brand new server with a brand new database
  • Stop all SMS services on the newly installed CAS
  • Go to the SQL manager and connect to the SQLServer database engine
  • Under the Databases entry, right-click the newly created DB and select the task Detach
    • Check Drop Connections
  • Now right-click Databases and select the task Restore Database; use the option Source - Device
  • Select Backup media type: File and click the Add button; use the ... button to browse to the .BAK file of the database backup you saved previously
  • SQL will check the BAK file and, if it checks out, display its contents. If they look OK to you, hit the OK button and the restore will start
  • If all of this went well, the CAS server should restart as soon as you open the CM12 management interface
  • Now install all Cumulative Updates, up to the level you had applied on the previous CM12 configuration (the one before the crash or restore)
At this point your CAS should be fully functional again.
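The registry step in the list above (back up, then delete HKLM\SOFTWARE\Microsoft\SMS) can also be done from the command line with the built-in reg tool. The sketch below, in Python, only builds the two commands so they can be reviewed before anything is run; the function name and the backup file name are hypothetical, and the delete is irreversible without the exported file.

```python
from typing import List

SMS_KEY = r"HKLM\SOFTWARE\Microsoft\SMS"


def registry_cleanup_commands(backup_file: str, key: str = SMS_KEY) -> List[List[str]]:
    """Return the reg commands for a backup followed by a delete, in that order."""
    return [
        ["reg", "export", key, backup_file, "/y"],  # export the key tree, overwrite an old backup
        ["reg", "delete", key, "/f"],               # delete the key tree without prompting
    ]
```

Running the export first and checking that the .reg file exists before running the delete keeps a way back if the reinstall does not go as planned.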

My next post will focus on the restore of the Primary sites.