AOS crash investigation

Hi all,

Lately my client has been having problems with the AOS: it crashes quite often, and after investigating for about two weeks I’ve decided to seek help here…

The company has three instances: development (AX 2009 SP1 with RU5), test (AX 2009 SP1 with RU5) and production (AX 2009 SP1; RU5 will be installed soon). All the AOSes are listed on the server configuration list in the production environment. They are all marked as batch servers, but only the production AOS actually executes batch jobs (I checked that while browsing through the batch groups). The development and test instances can execute 8 threads; the production instance can execute 3 threads. (I have a question right here: how does adding AOSes to a server configuration work? How do they cooperate? Does each one of them execute its own code on production data? If so, could different app versions or differences in code on each instance cause these problems? Could someone explain that to me in more detail?)

The application event log regularly shows the message “Object Server 01: RPC error: RPC exception 1702 occurred in session x”. I’ve found out that this might come from executing client code in a batch job, but I’ve gone through all of the client’s customizations and didn’t find a piece of code that fits the pattern. Right before the crash, the server event log shows the error “A timeout (30000 milliseconds) was reached while waiting for a transaction response from the AOS50$01 service”. It keeps popping up for about a minute and then the AOS crashes. Earlier the crashes happened once every 9-11 days; right now they happen once every 2-4 days.

I believe that the crashes are caused by running AIF jobs and not by clients (crashes sometimes happen at night or early in the morning). The inbound AIF job runs every two minutes and the outbound AIF job runs every hour. The actions include posting purchase packing slips and purchase invoices, picking lists and collection letters. There are also some other batch jobs, including direct delivery, master planning and purchase order posting. I’ve checked the code: there are some minor modifications and posting customizations, but they are high level and, as far as I can tell, should not cause any problems. The weird thing is that when we’ve run the AIF jobs to do some processing, sometimes it takes only a moment and sometimes it takes ages (even though we’re processing the same documents). My colleague managed to provoke an AOS crash on the test instance while trying to read a huge block of XML files, but when I tried to do the same, the jobs executed just fine (it took a while, but the AOS didn’t crash). Also, I’m sure that the RPC errors are caused by AIF jobs, because their appearances in the event log match the job execution times, but I couldn’t find the code that would be responsible for them.
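For anyone who wants to repeat the matching of execution times against event log entries less manually, here is a minimal sketch in Python. It assumes the error timestamps and job start times have already been exported into two lists; the sample data, the function name `errors_near_runs` and the 60-second window are all made up for illustration and are not taken from the real system.

```python
from datetime import datetime, timedelta


def errors_near_runs(error_times, run_times, window=timedelta(seconds=60)):
    """Return the error timestamps that fall within `window` after any job start."""
    matched = []
    for err in error_times:
        # An error counts as a match if some job started at most `window` before it.
        if any(timedelta(0) <= err - run <= window for run in run_times):
            matched.append(err)
    return matched


# Hypothetical sample data: a job starting every two minutes from 02:00.
runs = [datetime(2010, 11, 3, 2, 0) + timedelta(minutes=2 * i) for i in range(5)]
errors = [
    datetime(2010, 11, 3, 2, 4, 20),  # 20 seconds after the 02:04 start
    datetime(2010, 11, 3, 3, 15, 0),  # long after the last start
]

matched = errors_near_runs(errors, runs)
print(f"{len(matched)} of {len(errors)} errors occurred within 60 s after a job start")
```

One caveat: with a job that starts every two minutes, a random timestamp already has roughly a 50% chance of landing inside a 60-second window, so the match only means something with a narrower window, with real execution durations, or when checked against the hourly outbound job.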

My main suspects are (from most to least probable): AX 2009 RU5 not being installed on the production instance, overlapping tasks run by batch causing deadlocks or transaction errors, faulty X++ code, and differences in app versions and code between instances… But these are only my ideas and I would like to hear your opinion on the subject.

If anybody has faced similar problems or has an idea what might be going on, please help. I’ve been investigating this for over two weeks and I’m running out of ideas. If you have any questions or need details that I didn’t provide, I’ll try to answer.

Regards,

Lucas

We have had similar problems that seem to be caused by memory leaks. If you are in a 32-bit environment, the AOS will crash when getting close to 2 GB of memory (memory fragmentation can cause crashes even earlier).

We have not been able to find the cause of these memory leaks, but things got somewhat better after installing kernel version 5.0.1500.2355. I think that version is included in RU5 (not sure, though).

One thing that could at least make things a bit better is to upgrade to a 64-bit server that does not have the 2GB per process limitation. It does not solve the problem, but at least the system will run longer between the crashes (assuming you have a memory problem…)

/Jonas

Thanks for writing, Jonas!

I once saw the issue you’ve mentioned on AX 4.0. Unfortunately, I doubt that memory leaks are the problem right now. In some cases the AOS restarted after just two hours of uptime and in other cases it ran perfectly fine for days; there is no time pattern, which I would expect to see if a memory leak were the reason for the crashes. I will take a closer look at the memory usage though. Thank you for contributing.

Regards,

Lucas

Hi Lucas,

There is a white paper available on CustomerSource / PartnerSource on improving the stability of the AOS server. Though it is for AX 4.0, it contains some useful info which might be of use.

Another reason I can think of is a mismatch of kernels (for example, the AOS is on SP1 RU5 while one of the clients is on SP1 RU3), which can also trigger an AOS crash. But looking at your error, I suppose it might have been caused by deadlocks. For more info on this issue, please refer to this article - http://blogs.msdn.com/b/daxis/archive/2009/01/16/troubleshooting-blocked-spids-in-aos.aspx

Hope this helps,

Hello Harish,

I got familiar with the whitepaper some time ago and the production environment meets all the suggestions. One thing I’m concerned about is the necessity to run the AIF inbound jobs every two minutes, including during normal working hours. As you’ve suggested, that might cause deadlocks, but I am almost sure that the primary cause is not client activity, because the crashes also took place at night and in the early morning. This leads me to the conclusion that the jobs themselves might be suffering from deadlocks.

All the clients have the same kernel version as the AOS, so the kernel mismatch can also be ruled out.

Yesterday I followed Jonas’ suggestion and observed memory usage on the production instance. There was a weird moment when the memory usage of ax32serv.exe jumped from about 600 MB to about 1.2 GB, then decreased to about 730 MB, and it has been at this level since yesterday. Since the memory was freed, I doubt that this is a memory leak issue, though I also doubt that such high memory usage is normal. Does anybody know what could cause it? My first guess was a huge container or a number of blob-like objects, but I couldn’t find code that might have caused this.

Harish, I followed your link and the info is very interesting, but I don’t really know how to use it to improve AOS stability. From what I understood (maybe I’m mistaken), the troubleshooting there is done live, while observing high numbers of open, unused connections. I’m not 100% sure, but I believe that the AOS crashes are not related to client activity.

Well, I’ll keep investigating… I’ll try to provide info on the solution if I find one.

Regards,

Lucas

Hi,

Well, it turns out that it is a memory leak issue, so Jonas was right… Since yesterday I’ve noticed that the AOS process uses more and more memory, freeing only a small part of it. I ran the Financial statement report several times and each run consumed about 50 - 200 MB of RAM in the ax32serv.exe process, which wasn’t freed after the report was done. I also ran the report on the development and test instances (which have RU5 installed) and the memory leak didn’t occur, so it seems that RU5 fixes the problem.
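To tell steady growth apart from a one-off spike, it helps to log the process memory at regular intervals and fit a straight line through the samples. Below is a minimal Python sketch of that idea. The sample values are hypothetical, the function names are made up, and the 2048 MB limit is simply the 32-bit figure Jonas mentioned; real usage is rarely this linear, so treat the result as a rough estimate only.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_until_limit(hours, mem_mb, limit_mb=2048.0):
    """Estimate hours left after the last sample until the fitted line reaches limit_mb.

    Returns None when memory is not growing (slope <= 0).
    """
    slope, intercept = linear_fit(hours, mem_mb)
    if slope <= 0:
        return None
    return (limit_mb - intercept) / slope - hours[-1]


# Hypothetical samples: hours since AOS start and memory of the AOS process in MB.
sample_hours = [0, 1, 2, 3, 4]
sample_mem_mb = [600, 650, 700, 750, 800]

remaining = hours_until_limit(sample_hours, sample_mem_mb)
print(f"Estimated hours until the limit: {remaining:.1f}")
```

A growth rate that changes a lot between runs (for example, large jumps only when a particular report is executed) points at a specific operation rather than a slow background leak.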

I still wonder how it was possible for my colleague to provoke an AOS crash on the test instance while running AIF inbound processes even though RU5 was installed there. I’ve noticed that the maximum buffer size in the AX server configuration utility is set to 1024. Isn’t that too high? I suppose that the large value was set so that XML files would be processed correctly in the AIF jobs (I read that it’s a common problem if this parameter is too low), but still… From what I know, the default value is 24 and it should be increased gradually. I also read somewhere that if the maximum buffer size is too high, it might cause high memory consumption. Is that correct?

Regards,

Lucas

The financial statement report uses a lot of memory and CPU, especially when you are using multiple dimensions.

We have discussed this with MS, and the solution we ended up with was to activate the old Ax 3.0 version of Financial Statement that actually still exists in Ax2009. For our customers that was ok, since they had recently been converted from Ax3 and could use their existing configuration that had been converted.

I do not remember that we had any problems with memory not being released, but we upgraded to the RU5 kernel rather early and we were not notified of any performance issues with the Financial Statement until afterwards.

During the testing of our crashes I also noticed one other thing. If you want to exceed 10 MB in a variable, e.g. when reading large XML files, you must change a registry value to allow higher memory allocation. If this maxbuffersize registry value (note: this is NOT the same Maximum Buffer Size that you are referring to, see link below) is exceeded, then there is an exception AND this exception causes a memory leak with the same size as the message. This has not been fixed by MS, as far as I know.
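A quick way to see whether any inbound documents come near that limit is to list the files above a size threshold before they are processed. This is only a sketch: the function name `oversized_files` is made up, the 10 MB figure is taken here as 10 * 1024 * 1024 bytes (an assumption), and the size on disk is only a proxy, since the in-memory size of the same message may differ.

```python
import tempfile
from pathlib import Path

# Assumption: "10 MB" is interpreted as 10 * 1024 * 1024 bytes.
DEFAULT_LIMIT_BYTES = 10 * 1024 * 1024


def oversized_files(directory, limit_bytes=DEFAULT_LIMIT_BYTES, pattern="*.xml"):
    """Return the sorted names of files in `directory` larger than `limit_bytes`."""
    return sorted(
        path.name
        for path in Path(directory).glob(pattern)
        if path.is_file() and path.stat().st_size > limit_bytes
    )


# Small self-contained demo with a lowered limit so no large files are needed.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "small.xml").write_bytes(b"<a/>")
    Path(tmp, "large.xml").write_bytes(b"x" * 2048)
    flagged = oversized_files(tmp, limit_bytes=1024)

print(flagged)
```

Against a real inbound folder the call would be `oversized_files(r"<path to the AIF inbound directory>")`, with the path being whatever is configured for the file adapter.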

http://blogs.msdn.com/b/emeadaxsupport/archive/2009/06/15/error-executing-code-insufficient-memory-to-run-script.aspx

/Jonas

Thank you for the info, Jonas. I think my client will not mind the report running slowly, as long as it doesn’t eat up memory. I believe that RU5 will fix the problem.

As for the maximum buffer size, right before you wrote the post I had done some digging and read the article you just sent the link to. Right now there aren’t any of the errors you’ve mentioned (“Insufficient memory to run script”), which is too bad, because if there were, that would fit the pattern perfectly (AIF causing memory leaks). :wink: The maxbuffersize registry key is not set, which would mean it’s still at the default, but the high Maximum Buffer Size parameter in the AX server configuration utility might be the one causing the problems. Probably someone else also got these parameters mixed up while configuring the system.

Thank you very much for your help.

Best regards,

Lucas