Restarting a stalled workflow where the delayActivity has stopped responding

I released a number of changes to my projects workflows, making sure I had released them as a new dll version in the GAC.  The workflows use a DelayActivity to “sleep” for about four hours, before waking and checking if the due dates of the tasks involved are out of date.  If so, then an email is sent to the task owner.  After I released the new workflow version, new workflows instances would start fine, but after the four hour delay, an ominous error was found in the Workflow History and now email escalations occurred:

“Workflow Failed to run”.

Checking the ULS log, there was also the following error:

“OWSTIMER.EXE (0x07EC) 0x0E58 Windows SharePoint Services Workflow Infrastructure 98d8 Unexpected System.NullReferenceException: Object reference not set to an instance of an object. at Microsoft.SharePoint.Workflow.SPWorkflowManager.RunWorkflowElev(SPWorkflow originalWorkflow, SPWorkflow workflow, Collection`1 events, SPRunWorkflowOptions runOptions)”

A quick Google search brought up this old Microsoft Blog post where a number of users where experiencing the same problems.  Although the bug has been fixed in later hotfixes and service packs (which my system already had), one of the causes stood out as the reason for my problem.  That is; when releasing new versions of code into the GAC, the timer service is still using old versions which are non existent and so it falls over.

One recommendation to prevent this problem is to restart the Timer service, which is fine, but it doesn’t fix the workflows already running which are experiencing the problem.  However, reading through the blog someone mentioned being able to view the status of the timer jobs in the content database using the following SQL command:

SELECT * FROM ScheduledWorkItems Where id = ’20EF2AC4-64D1-47E4-91C1-4E5F4BFC7CD4′ — where this is the problem workflow ID.

When you run this command, you will notice the DeliveryDate.  This is the time the system will wake up the workflow. There is also the column “InternalState” which for problem workflows seemed to always be set to “9” and for ok workflows set to “8”.

Using a broken test workflow, I did some hacking around, and changed the internal state value of the problem workflow to 8, and set the DeliveryDate value to sometime shortly in the future.  I also, noted that the time originally in this field was 1 hour behind what I expected, and assume this is because I am running under GMT +1 timezone.  Therefore I set the DeliveryDate to be one hour behind the time in the future, or else I would be waiting another hour longer than I expected! Hope that makes sense!

I then restarted the Timer service on the SharePoint server and low and behold, the problem workflow shortly “woke up” and did its job as expected when the DeliveryDate time had passed.

I changed all the other problem workflows in the same way and they all came back to life with no problems.

I have now made sure I have amended my deployment procedures of new workflow versions to restart the timer service so this problem shouldn’t happen again.


SharePoint 2007 Workflow wont autostart when using IP address

I have had a lot of trouble recently with one user who was using a SharePoint 2007 site configured with a custom workflow, that should autostart when a new item is added to a list.  The WF wouldn’t start just for this user and so I would start it for him, assigning the list item to him.  The WF would then create a task for him, but on the user submitting the task, the workflow wouldn’t progress.   No error, nothing.  Just stop.  I initially thought it was some bad code in the custom WF I had developed as I kept seeing the tp_workflowversion value in the userdata table of the SharePoint Content database being set to 512.  I would set it back to 1, then submit the task again, but to no avail.

I then read this blog post form Corey Roth:

And realised the user in question, (being a network admin), was accessing the site via its IP address instead of hostname.  A quick test later and I could replicate the problem perfectly!  Asking the user to access the site via the hostname allowed the workflows to start automatically this time and progress as normal.

Now all that is left is to work out why this is such a problem……