Tuesday, January 20, 2015

Find Restarted VMs After a VMware HA Event

On several occasions I've had the privilege of finding a VMware ESXi host hung with a Purple Screen of Death (PSOD).  Assuming you have your VMware cluster set up properly, VMware HA will automatically restart the VMs that were running on the crashed host on other hosts in the cluster.



In my experience, it's fairly easy to identify the host with the PSOD in vCenter.  It's trickier to identify which VMs were restarted by VMware HA (in other words, to find the victims).

I've found that PowerCLI can help with this task.

Below is a quick PowerCLI one-liner to find the VMs that were restarted.

get-cluster "Cluster Name" | get-vm | Get-VIEvent | where {$_.FullFormattedMessage -match "vSphere HA restarted virtual machine"} | select ObjectName, CreatedTime, FullFormattedMessage

You can easily add an export-csv to the end if you want to export the list of VMs into something where you can edit the data.
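If you do export the results, a small script can summarize them.  Below is a minimal Python sketch, assuming the CSV has the three columns selected by the one-liner above (ObjectName, CreatedTime, FullFormattedMessage).  The sample rows, VM names, and host names are made up for illustration, and the exact message wording in your environment may differ.

```python
import csv
from collections import defaultdict

HA_RESTART_TEXT = "vSphere HA restarted virtual machine"


def restarted_vms(csv_text):
    """Map each restarted VM name to the list of its restart timestamps."""
    # Older PowerShell versions prepend a "#TYPE ..." line unless
    # -NoTypeInformation is passed to Export-Csv; skip it if present.
    lines = [ln for ln in csv_text.splitlines() if not ln.startswith("#TYPE")]
    restarts = defaultdict(list)
    for row in csv.DictReader(lines):
        if HA_RESTART_TEXT in row["FullFormattedMessage"]:
            restarts[row["ObjectName"]].append(row["CreatedTime"])
    return dict(restarts)


# Hypothetical sample data shaped like the exported CSV.
SAMPLE = '''"ObjectName","CreatedTime","FullFormattedMessage"
"web01","1/20/2015 9:15:02 AM","vSphere HA restarted virtual machine web01 on host esx02 in cluster Cluster Name"
"web01","1/20/2015 9:10:41 AM","Task: Power On virtual machine"
"db01","1/20/2015 9:15:40 AM","vSphere HA restarted virtual machine db01 on host esx03 in cluster Cluster Name"
'''

if __name__ == "__main__":
    for name, times in sorted(restarted_vms(SAMPLE).items()):
        print(name, times)
```

Keeping the timestamps as plain strings avoids guessing the date format, which depends on the locale of the machine that ran the export.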

Saturday, March 9, 2013

“Identify pre-boot environment agent” failure during Cisco UCS firmware upgrade

Overview


We were upgrading the firmware in our Cisco UCS environment from version 1.4(1j) to 2.1(1a) and ran into an issue when applying the firmware to the servers.  We had previously upgraded two other UCS domains to 2.1 and did not see the error while upgrading them.

Problem


We were getting failures on the FSM tab in UCS Manager after applying a 2.1 firmware policy to a service profile.  It would fail when it got to a step labeled "Associate Pnu OS Ident" with a failure description of "Identify pre-boot environment agent."  You can see the failure in the screenshot below:


We tried several things, like rebooting the server, disassociating the service profile, and re-acknowledging the server, all of which resulted in the same error.  At this point it was time to call Cisco TAC.

Solution


After working with Cisco TAC for a while, we determined the problem was a BIOS setting on the server.  According to Cisco TAC, in versions 1.x of UCS Manager the servers would PXE boot from the fabric interconnects (FIs) to enter the Pnu OS, which is what the blades boot into to apply their service profile customization.  Starting in version 2.x, the servers instead mount a USB device presented by the FIs to enter the Pnu OS.  With Legacy USB Support disabled in the BIOS, the server could not boot from that device, which is why the step failed.

There are two fixes for this problem:
  • Individually update the BIOS on a server to enable "Legacy USB Support"
  • Apply or update a BIOS policy to enable "Legacy USB Support"
Below are some screenshots that show how to update the setting in the server BIOS:

Launch the KVM viewer, reset the server, and press F2 to enter the BIOS when you see the prompt


Go to Advanced, USB Configuration


You will then see the Legacy USB Support is Disabled


Change Legacy USB Support to Enabled and hit F10 to Save and Exit


The server will then reboot and be able to enter the Pnu OS to get the firmware updates applied.

Finally, if you want to ensure this is applied to all of your servers, you can create or modify a BIOS policy that is applied to your service profiles and/or templates in UCS Manager.

Below is a screenshot of where that setting resides.









Sunday, October 23, 2011

vCenter 5 Upgrade Error - SQL Server Agent Not Running

When upgrading our vCenter server from 4.1 to 5.0 I would get a SQL Server Agent error when going through the vCenter 5.0 installer.  The error said:

"Please make sure SQL Server Agent service is running on the database server"


Our vCenter environment consists of a dedicated VM for vCenter and a separate VM for SQL.
Obviously the first thing I checked was that the SQL Server Agent service was running on our SQL server.  It was, and I even restarted it for good measure.  But that did not make the error go away.
Then I did a quick search of some VMware KBs and found KB1036518, which matched my symptoms but did not resolve my issue.  It basically said to make sure the SQL Agent is running and that you’ve got the correct media downloaded.  Helpful!  [sarcasm]
Next I turned to the install log:
VMware VirtualCenter-build-455964: 10/12/11 22:25:52 SqlState: <42000>, NativeError: <14262>, msg: <[Microsoft][SQL Server Native Client 10.0][SQL Server]The specified @job_name ('Past Day stats rollup') does not exist.>, msgLen: <119>
VMware VirtualCenter-build-455964: 10/12/11 22:25:52 ODBC Error: [Microsoft][SQL Server Native Client 10.0][SQL Server]The specified @job_name ('Past Day stats rollup') does not exist.
VMware VirtualCenter-build-455964: 10/12/11 22:25:52 SQL Server Agent is not running or query error.
VMware VirtualCenter-build-455964: 10/12/11 22:25:52 Getting Property DB_DSN_SERVER_REMOTE = 1
VMware VirtualCenter-build-455964: 10/12/11 22:25:52 errorMessage is <Please make sure SQL Server Agent service is running on the database server.

I noticed the installer was complaining about the job “Past Day stats rollup” saying it doesn’t exist.
That reminded me that we had to rebuild those jobs (VMware KB 1004382).  We were having trouble with some of our stats rollup jobs failing in the early days of vCenter 4.1.
I checked with our DBA and the account the DSN uses (VMVCenter_App in screenshot above) did not have rights to that job.  He then updated the jobs to give the user in the DSN owner rights on the stats jobs.  The installer was then able to continue with the upgrade.
Long story short, make sure the account used in your DSN has rights to the SQL Agent jobs.
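
The log check described above can be scripted if you have a long install log to dig through.  This is a minimal Python sketch, assuming the installer always words the error the way it appears in my log; the function name is just something I made up for illustration.

```python
import re

# Matches the SQL Server error text seen in the vCenter installer log.
MISSING_JOB = re.compile(r"The specified @job_name \('([^']+)'\) does not exist")


def missing_agent_jobs(log_text):
    """Return the distinct SQL Agent job names the installer could not find."""
    seen = []
    for name in MISSING_JOB.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


# Two lines copied from the install log shown earlier in this post.
LOG = """VMware VirtualCenter-build-455964: 10/12/11 22:25:52 SqlState: <42000>, NativeError: <14262>, msg: <[Microsoft][SQL Server Native Client 10.0][SQL Server]The specified @job_name ('Past Day stats rollup') does not exist.>, msgLen: <119>
VMware VirtualCenter-build-455964: 10/12/11 22:25:52 ODBC Error: [Microsoft][SQL Server Native Client 10.0][SQL Server]The specified @job_name ('Past Day stats rollup') does not exist.
"""

if __name__ == "__main__":
    for job in missing_agent_jobs(LOG):
        print("Check DSN account rights on job:", job)
```

A job reported as missing may actually exist but be invisible to the DSN account, which is what happened in our case, so treat the list as jobs to check permissions on rather than jobs to recreate.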

Thursday, October 20, 2011

Wheel of blame

I saw this image today and it cracked me up.  Judging by the watermark it originally came from Quest Software, but I got it via email.

Working in a fairly good-sized IT department with different teams (server/storage/dba/app/etc.), I see this a lot.

Monday, April 18, 2011

NtFrs Error 13568 - JRNL_WRAP_ERROR

Problem

A colleague of mine stopped by and reported users at one of our offices were not receiving settings from a recently updated GPO.  Upon further review, we noticed the gpt.ini file for the updated GPO did not update on the domain controller at that office.

After logging into the remote DC I went to look at the File Replication Service logs to see if I could tell why the files were not replicating.

That's when I noticed Error 13568 under NtFrs.  The contents of the event were:

 The File Replication Service has detected that the replica set "DOMAIN SYSTEM VOLUME (SYSVOL SHARE)" is in JRNL_WRAP_ERROR.

 Replica set name is    : "DOMAIN SYSTEM VOLUME (SYSVOL SHARE)"
 Replica root path is   : "c:\Windows\sysvol\domain"
 Replica root volume is : "\\.\C:"
 A Replica set hits JRNL_WRAP_ERROR when the record that it is trying to read from the NTFS USN journal is not found.  This can occur because of one of the following reasons.

 [1] Volume "\\.\D:" has been formatted.
 [2] The NTFS USN journal on volume "\\.\C:" has been deleted.
 [3] The NTFS USN journal on volume "\\.\C:" has been truncated. Chkdsk can truncate the journal if it finds corrupt entries at the end of the journal.
 [4] File Replication Service was not running on this computer for a long time.
 [5] File Replication Service could not keep up with the rate of Disk IO activity on "\\.\C:".
 Setting the "Enable Journal Wrap Automatic Restore" registry parameter to 1 will cause the following recovery steps to be taken to automatically recover from this error state.
 [1] At the first poll, which will occur in 5 minutes, this computer will be deleted from the replica set. If you do not want to wait 5 minutes, then run "net stop ntfrs" followed by "net start ntfrs" to restart the File Replication Service.
 [2] At the poll following the deletion this computer will be re-added to the replica set. The re-addition will trigger a full tree sync for the replica set.

WARNING: During the recovery process data in the replica tree may be unavailable. You should reset the registry parameter described above to 0 to prevent automatic recovery from making the data unexpectedly unavailable if this error condition occurs again.

To change this registry parameter, run regedit.

Click on Start, Run and type regedit.

Expand HKEY_LOCAL_MACHINE.
Click down the key path:
   "System\CurrentControlSet\Services\NtFrs\Parameters"
Double click on the value name
   "Enable Journal Wrap Automatic Restore"
and update the value.

If the value name is not present you may add it with the New->DWORD Value function under the Edit Menu item. Type the value name exactly as shown above.

Resolution

After doing some research, the resolution was to rebuild the SYSVOL folder on the domain controller.  To do that, perform the following steps:


1.  Stop FRS.

2.  Start Registry Editor (Regedt32.exe).

3.  Locate and click the following key in the registry:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters\Backup/Restore\Process at Startup

4.  On the Edit menu, click Add Value , and then add the following registry value:

Value name: BurFlags
Data type: REG_DWORD
Radix: Hexadecimal
Value data: D2

5.  Quit Registry Editor.

6.  Restart FRS.
You will then see events under the File Replication Service source showing the SYSVOL being rebuilt.  It will take down the SYSVOL and NETLOGON shares until the replication has finished.  You can run the net share command to verify.
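
With this many domain controllers, checking the shares by eye gets old quickly.  Here is a minimal Python sketch that takes the text printed by net share and reports which of the two shares are missing.  The sample output below is illustrative only (the paths and domain name are made up), and the parsing assumes the share name is the first word on its line.

```python
def missing_shares(net_share_output, required=("SYSVOL", "NETLOGON")):
    """Return the required share names that do not appear in `net share` output."""
    present = set()
    for line in net_share_output.splitlines():
        parts = line.split()
        if parts:
            # The share name is the first column of each share line.
            present.add(parts[0].upper())
    return [name for name in required if name.upper() not in present]


# Illustrative output; real paths and remarks will differ per server.
SAMPLE = r"""
Share name   Resource                        Remark

-------------------------------------------------------------------------------
C$           C:\                             Default share
IPC$                                         Remote IPC
ADMIN$       C:\Windows                      Remote Admin
NETLOGON     C:\Windows\SYSVOL\sysvol\domain.dom\SCRIPTS
                                             Logon server share
SYSVOL       C:\Windows\SYSVOL\sysvol        Logon server share
The command completed successfully.
"""

if __name__ == "__main__":
    missing = missing_shares(SAMPLE)
    print("Missing shares:", missing if missing else "none")
```

While the rebuild is still in progress the function should report both shares as missing; once replication has finished, the list should come back empty.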

I ended up finding this error on a couple of other domain controllers in our domain after combing the event logs for it.  (We have about 165 domain controllers in our environment.)

Rebuilding the SYSVOL folder fixed the issue on each of them.  We've seen this on both Windows 2003 and 2008 based servers.

Monday, April 11, 2011

Deploy BgInfo with Group Policy

I’ve always been a fan of using BgInfo to apply a standard background on servers I manage.  BgInfo allows you to put relevant information like Name, IP, Boot Time, Disk Space, etc on your desktop.  For more information on how to use BgInfo, please visit their site.

In this article I’m going to focus on how you can deploy BgInfo to machines via Group Policy.

BgInfo requires 3 basic files, plus one optional file:
  • Bginfo.exe – The main executable
  • Custom.bgi – Configuration file saved in Bginfo.exe.  This is where you set up what you want the desktop background to look like.  You can name the file whatever you’d like.
  • StartBgInfo.cmd – batch file used to start Bginfo.exe and apply the info to the desktop.
  • Eula.txt (Optional) – used so the first time it doesn’t display a EULA
Since this application does not need to be installed, we can use Group Policy to deploy the files to machines we target in the Group Policy Object (GPO).  We will use the Files setting under Computer Configuration, Preferences, Windows Settings.  This GPO requires Group Policy Preferences, which were introduced in Windows Vista/2008.  Preferences require some client-side extensions that are available by default on Vista/2008 and later machines and can be installed on XP/2003 systems.

You can put the source of the files wherever you’d like, as long as it is accessible to the machine that needs the files.  I like the idea of placing them in the logon scripts folder in SYSVOL (for example, \\domain.dom\NETLOGON\BgInfo\) so that they will be pulled from the domain controller closest to the user (assuming AD Sites and Services is set up correctly).

The batch file that loads BgInfo on logon (StartBgInfo.cmd) is the only file that has a hard requirement on where it needs to go.  It needs to be put in the All Users Startup folder under the Start Menu.  The other files can go wherever you’d like to put them.  I’ve chosen to put them under “C:\Program Files\BgInfo” using the system variable %ProgramFiles%.

Also, here's what I use for the contents of StartBgInfo.cmd.  For the other command line switches available for BgInfo.exe, see their documentation.

@echo off
"%ProgramFiles%\BGInfo\BGInfo.exe" "%ProgramFiles%\BGInfo\BGInfo.bgi" /NOLICPROMPT /timer:0
exit

Here’s what the GPO settings look like when editing the GPO

Here's a more zoomed in version to see the source and target locations


You’ll see that StartBgInfo.cmd is in there twice.  This is because of the different location of the Startup folder in XP/2003 systems vs Vista/2008 and later systems.  I used item level targeting on those files to have them only apply to the correct systems. 
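
To make the two Startup locations concrete, here is a minimal Python sketch of the decision that the item level targeting makes.  It assumes the standard All Users Startup folder locations for each OS family and uses the Windows major version number (5 for XP/2003, 6 and up for Vista/2008 and later).  The function names are hypothetical; the file names are the ones listed earlier in this article.

```python
def startup_folder(os_major_version):
    """Return the All Users Startup folder for a Windows major version."""
    if os_major_version >= 6:
        # Vista / Server 2008 and later
        return r"%ProgramData%\Microsoft\Windows\Start Menu\Programs\Startup"
    # XP / Server 2003
    return r"%ALLUSERSPROFILE%\Start Menu\Programs\Startup"


def target_paths(os_major_version):
    """Map each deployed file to its destination path on the client."""
    app_dir = r"%ProgramFiles%\BgInfo"
    return {
        # Only the batch file has to live in the Startup folder.
        "StartBgInfo.cmd": startup_folder(os_major_version) + "\\StartBgInfo.cmd",
        "Bginfo.exe": app_dir + "\\Bginfo.exe",
        "Custom.bgi": app_dir + "\\Custom.bgi",
        "Eula.txt": app_dir + "\\Eula.txt",
    }


if __name__ == "__main__":
    for version in (5, 6):
        print("Windows major version", version)
        for name, path in target_paths(version).items():
            print("  ", name, "->", path)
```

The environment variables are left unexpanded on purpose, since the client resolves them when the policy is applied.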

When you double click on the file in the group policy editor, you'll see the following:


Deploying BgInfo via Group Policy makes it really nice to push a consistent background image out to all machines that you target in the GPO.  It also makes it much easier to do mass updates if you need to push out a new version or update the configuration file.


Tuesday, April 5, 2011

“The volume you have selected may not be extended" Error in DiskPart.exe on Windows 2003 Server SP2

I got an email from a colleague saying that he was having trouble increasing the size of a disk on a Windows 2003 SP2 server.  This server is a VMware virtual machine and he had already changed the size of the disk in VMware from 100 GB to 150 GB.


Problem

When trying to extend a partition on a Windows 2003 Server using diskpart.exe he got the error:

“The volume you have selected may not be extended.
Please select another volume and try again.”



Troubleshooting

I remembered seeing this error back in the VMware 3.0 days when resizing disks and then extending the partitions.  So without doing much other research I quickly found Microsoft KB 841650 and VMware KB 1007266 which both reference a bug in DiskPart.exe and a hotfix that is needed.

Well… It turns out that the Microsoft hotfix was only for pre-SP2 for Windows 2003 systems.  Since this server was running Windows 2003 SP2 the hotfix did not apply.

After looking closely at the screenshot I was sent, I noticed that the “Extended Partition” and “Free Space” sections did not look correct.  I wasn’t used to seeing those green colors and tried to figure out why it was showing like that.  I’m used to seeing the new space show as unallocated in Disk Manager.


So I took a look at a test Windows 2003 virtual machine to see if I could reproduce the issue.  When I initially resized the disk in VMware and rescanned my disks in Disk Manager, I noticed the new space was showing "Unallocated" and not like the screenshot I was sent.


I then needed to try and get the disk to look like the screenshot that he sent me.  I was able to do this by right-clicking on unallocated space, selecting New Partition, Extended partition, and then selecting all of the new space.





Then my test machine was appearing like the screenshot I was sent.  I was then able to confirm that I got the same error when I’d try to run diskpart.exe extend on that volume.


Resolution

To fix it, I had him simply delete the new “Free Space” partition so that the space was showing unallocated.  At that point we were able to successfully extend the volume on that disk.



Once the space was showing "Unallocated" I was able to successfully extend the volume with DiskPart.exe and now the partition is using all of the new space.



The lesson learned was that the space needs to show "Unallocated" if you want to extend a volume (partition) on a basic disk in Windows.
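
If you extend volumes often, the DiskPart commands can be put in a script file and run with diskpart /s.  Below is a minimal Python sketch that builds such a script.  The volume number is a placeholder you would look up with "list volume" first, and the function name and file name are made up for illustration.  It only helps once the new space is already showing as "Unallocated".

```python
def diskpart_extend_script(volume_number):
    """Build a DiskPart script that extends the given volume into free space."""
    if isinstance(volume_number, bool) or not isinstance(volume_number, int):
        raise ValueError("volume_number must be an integer")
    if volume_number < 0:
        raise ValueError("volume_number must not be negative")
    commands = [
        "rescan",                          # pick up the resized disk
        "list volume",                     # record the layout before the change
        f"select volume {volume_number}",
        "extend",                          # grow into the adjacent unallocated space
        "list volume",                     # show the result
    ]
    return "\n".join(commands) + "\n"


if __name__ == "__main__":
    # Hypothetical volume number; confirm it with "list volume" first.
    print(diskpart_extend_script(2))
```

Save the printed text to a file such as extend.txt and run diskpart /s extend.txt on the server.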

I passed this information back to my colleague and he was able to successfully delete the free space partition and extend his volume.