Performance Experience with SATA for NAV on SQL DB (was: Navision DB)

Hi

IHAC (I have a customer) who has deployed Navision for retail stores. The deployment architecture is described below:

Intel Blade with 1 Xeon E5320 Quad Core 1.86 GHz

16 GB memory

4x 146GB SAS Drives

The App sits on the C Drive of the Blade

The DB sits on an external storage connected via FC HBA

The external storage has a 4Gbps backplane. Uses 6 x 500 GB SATA Disks @ 7200 RPM in RAID 1+0 Configuration

There are 2 RAID controllers, both being used for the volume

The deployment is facing huge performance issues. We observe that the disk I/O wait queue is extremely high (pegged at 100% consistently).
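For reference, the per-file latency can be quantified with SQL Server 2005's DMVs (a minimal sketch; average stalls in the tens of milliseconds or worse point at the disk subsystem):

```sql
-- Average I/O stall (ms) per read/write for every database file (SQL 2005).
SELECT DB_NAME(vfs.database_id)                              AS database_name,
       mf.physical_name,
       vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)   AS avg_read_stall_ms,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0)  AS avg_write_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id;
```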

Any suggestions on this?

Rgds

Hi Mitesh,

First, welcome to The Dynamics User Group. If you get time, please go to the introductions section and tell us more about yourself.

Anyway, I know that at first glance it looks like a lot of information in your post, but to answer you we really need to know more. Firstly, the quad core makes no sense, since Navision uses only one thread; NAV also uses a max of 1 GB, so the extra RAM is wasted. Next, the SAN setup is not clear, since it's configured wrong, AND you haven't said if it's dedicated to NAV or shared.

Before starting, we need to know how many concurrent users there are, how big the database is, how many transactions per day, how many store locations you have, and more. Please give this information so we can help.

Hi

Many thanks for the response. Here is the data you asked for… if you need further clarification, please let me know.

Q. How many concurrent users are there?

90 logins on average… though the database is accessed serializably (meaning no two users can modify a particular table at the same time; this is done to maintain data integrity), it happens fast if the resources are in place.

Q. How big is the database?
Right now it's around 160 GB.

| Filegroup Name | Logical File Name | Physical File Name | Space Reserved | Space Used |
|---|---|---|---|---|
| Data Filegroup 1 | HO_LIVE_1_Data | F:\DATAFILE_1\HO_LIVE_1_Data.ndf | 93.74 GB | 25.51 GB |
| Data Filegroup 1 | HO_LIVE_2_Data | G:\DATAFILE_2\HO_LIVE_2_Data.ndf | 93.74 GB | 29.48 GB |
| Data Filegroup 1 | HO_LIVE_3_Data | F:\DATAFILE_1\HO_LIVE_3_Data.ndf | 80.00 GB | 16.26 GB |
| Data Filegroup 1 | HO_LIVE_4_Data | G:\DATAFILE_2\HO_LIVE_4_Data.ndf | 80.00 GB | 16.26 GB |
| Data Filegroup 1 | HO_LIVE_5_Data | F:\DATAFILE_3\HO_LIVE_5_Data.ndf | 80.00 GB | 16.26 GB |
| Data Filegroup 1 | HO_LIVE_6_Data | G:\DATAFILE_4\HO_LIVE_6_Data.ndf | 80.00 GB | 16.28 GB |
| PRIMARY | HO_LIVE_Data | D:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\HO_LIVE_Data.mdf | 1000.00 MB | 48.00 MB |
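(For the record, this listing can be produced with a query along these lines, run inside the HO_LIVE database; a sketch, not the exact tool output:)

```sql
-- Reserved vs. used space per file of the current database (SQL 2005).
-- size and FILEPROPERTY(...,'SpaceUsed') are in 8 KB pages; /128 gives MB.
SELECT ISNULL(fg.name, 'LOG')                    AS [Filegroup Name],
       df.name                                   AS [Logical File Name],
       df.physical_name                          AS [Physical File Name],
       df.size / 128                             AS [Space Reserved (MB)],
       FILEPROPERTY(df.name, 'SpaceUsed') / 128  AS [Space Used (MB)]
FROM sys.database_files AS df
LEFT JOIN sys.filegroups AS fg
  ON fg.data_space_id = df.data_space_id;
```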

Q. How many transactions per day?
More than 5,000 transactions per day… running 24x7.

Q. How many store locations do you have supporting this?
Right now only 40 stores, and growing soon…

Q. Exact RAID Levels?
RAID levels implemented: RAID 1+0 (on 6 disks) for the data files and RAID 1 (on 2 disks) for the transaction logs.

Q. Are you sharing this SAN for any other application at the moment?

Not at the moment

Hi.

90 users and 160GB DB on that server can cause nothing but trouble.

Based on the information You gave, You should go for something like this:

  • Windows Ent. Ed 64-bit

  • SQL 2005 Ent. Ed. 64 bit

  • NAV 5.0 executables with platform update 6

  • 4GB RAM

  • 4x3GHz CPU (not 1 as You have today!)

  • 12pcs 15Krpm disks in RAID 0+1 for data

  • 6 pcs 15Krpm disks in RAID 0+1 for log

You’ll find the above recommendations and some more if you read the “Dynamics HW Sizing guide v3.pdf”.

Minor mistake

It’s Nav 4.0 executables with platform update 6.

Also why RAID 0+1 instead of RAID 1+0?

I think this is SQL DB, not Navision DB. Correct?

Yes, SQL. In one of his posts he has the .ndf files listed.

Sorry, I should have been more clear. I meant the message to Mitesh, to confirm before I moved this to the SQL forums and changed the subject from “Navision DB” to “SQL DB”.

But yes, it's pretty obvious, so I will just move it.

Moved to SQL forum.

I agree.

Lars, any reason you prefer 0+1?

It’s Your own choice to run RAID 10/1+0 or 0+1. Microsoft suggests 0+1 in their sizing guide. RAID 10 is more fault tolerant, but RAID 0+1 is faster.

The key difference from RAID 1+0 is that RAID 0+1 creates a second striped set to mirror the primary striped set. You could say it’s a RAID 0 with a mirror of the complete RAID 0 set. The array continues to operate with one or more drives failed on the same side of the mirror, but if drives fail on both sides of the mirroring, the data on the RAID system is lost.

RAID 1+0 is mirrored sets in a striped set. The key difference from RAID 0+1 is that RAID 1+0 creates a striped set from a series of mirrored pairs. The array can sustain multiple drive losses as long as no two lost drives make up a single mirrored pair.

So if You want performance but still some redundancy, choose 0+1. But if You lose disks on both sides of the mirror, You are lost. If You choose 1+0, You can lose more disks and still be up and running. RAID 10 is more redundant, but You pay with a bit lower performance.

You can read more here http://www.acnc.com/04_00.html (nice graphics) and here http://en.wikipedia.org/wiki/RAID

Rgds

Lars

Thanks Lars, that makes very good sense. I have always erred on the side of caution, and thus suggested RAID 10, but of course the likelihood of two drives failing at the same time on separate arrays is so slim that it does make sense to go RAID 0+1.

0+1: striping mirrored sets

1+0: mirroring striped arrays

The way it was explained to me: 0+1 is faster as well as more fault tolerant. You can theoretically lose one disk in each mirrored set and still run, losing one disk only affects the disk itself, since there is still a mirrored disk available. With 1+0, if you lose one disk, the whole striped array that this disk belongs to becomes useless.

Actually, it's the other way around.

If you have 10 drives, then in 0+1 you take a RAID 0 of 5 drives, i.e. a stripe, and then you take two of those and create a RAID 1 from the two stripes. In 1+0 you take 2 drives and RAID 1 them, and then combine the 5 RAID 1s as a stripe.

Really it all comes down to statistics, and it's interesting to hear other people's experience of what is better. In my opinion, the performance advantage of RAID 0+1 is only in read, since statistically a read only requires a response from the fastest of the stripes, whereas in RAID 1+0 you need to wait for the slowest response of each of the pairs. But in write, you still need to write to 10 drives, so theoretically there is no write difference. And write is what generally kills us, so I go for RAID 10.

As to reliability, both RAID 1+0 and RAID 0+1 can continue to operate with 5 failed drives out of 10, but in RAID 0+1 this only applies if all the failed drives are in the same stripe, and in RAID 1+0 it works only if one drive of each mirrored pair fails. Statistically, the chance of a second drive failure killing RAID 0+1 is definitely higher than the chance of one killing RAID 1+0. So I would always go for RAID 10, but thinking more about what Lars says, the odds are so slim that really it's probably better to get that slightly better read performance. Do the math and work out the odds, but either way the MTBF in both cases is more than our lifetimes, so…
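For what it's worth, here is the back-of-the-envelope version for ten drives. Assume one drive has already failed and a second failure is equally likely to hit any of the remaining nine. In RAID 1+0 only the dead drive's mirror partner is fatal; in RAID 0+1 the first failure has already taken out one five-drive stripe, so any of the five drives in the surviving stripe is fatal:

$$P(\text{fatal 2nd failure} \mid \text{RAID 1+0}) = \frac{1}{9} \approx 11\% \qquad P(\text{fatal 2nd failure} \mid \text{RAID 0+1}) = \frac{5}{9} \approx 56\%$$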

Now if we look at the log file, our concern is Write, so logically RAID 0+1 has no advantage there; the log we should put on RAID 1+0.

At this point I need to go and find some of my old statistics books (which I have no idea if I even have anymore) to try and work out which is best. Ah, Mr. Pascal, all your time spent in casinos was not wasted after all. Just think where we would all be without Pascal.

I really think it comes down to practical experience and what works best, which is why it's great that we can have discussions like this.

Now this seems difficult :slight_smile:

You can think of 1+0 as ordinary RAID 1 mirrored pairs, striped together into one big RAID 0 set on top.

0+1 is then an ordinary RAID 0, but you double it so you have two mirrored RAID 0s.

0+1 is faster, but since it is two mirrored RAID 0 stripes you can't lose disks on both sides of the mirror. You need at least one fully functioning stripe.

1+0 is slower than 0+1 but more redundant, since it consists of mirrored pairs put into a stripe. Losing a disk only degrades its own pair, so the array survives more combinations of multiple failures; that is what makes RAID 1+0 more fault tolerant.

Normally we think RAID 1 is redundancy good enough. Since RAID 0+1 offers similar single-failure redundancy to RAID 1 but better performance, I would recommend RAID 0+1 (and so does Microsoft).

I hope that made sense. Nested RAID isn't always easy to understand.

Well, technically RAID really only covers levels 1, 2, 3, 4 and 5. All the rest are made up anyway. So I think we can all be forgiven for not being able to make sense out of the rest, since they really do live in fairy-tale land. At least things like 1+0, 0+1, 5+0 etc. sort of make sense. It's when people talk nonsense like RAID 6 that I really lose it [:P]

Now if we really want to start confusing things, let's talk RAID 4 DP. Navision really flies on that, but there is no way I can explain why.

Yes. Navision (both Native and SQL) should fly on RAID 4, since it's essentially RAID 0 with a dedicated parity disk. The parity gives You single-disk redundancy, but that parity disk has to be hit on every write, so it easily becomes a bottleneck. So I suggest we just forget about that RAID level :slight_smile: //Lars

Here’s an update on RAID-levels:

In the HW Sizing guide for NAV 5.0 Microsoft has changed their recommendation from RAID0+1 to RAID10. Surely it’s because RAID10 offers higher redundancy than 0+1.

A difference to consider between the two RAID levels is performance when the system has lost one or more drives but has not yet lost the combination of drives that would make it fail completely.

In a RAID 0+1 configuration, the loss of any drive in a striped set causes the failure of that entire set, and the set is removed from the RAID system. Then You have a plain RAID 0 made up of the remaining set of disks. In a RAID 1+0 system, you would see the same effect only on the mirror that loses a drive, not on the whole system. That means RAID 10 will perform better when one disk is down.

Maybe more important is the speed at which the RAID system recovers once the failed disk is replaced. RAID 1+0 only has to re-mirror one drive, whereas RAID 0+1 has to re-mirror the entire failed set. So RAID 1+0 will recover significantly faster.

So I must reconsider and change what I said earlier. RAID10 must be a better choice than 0+1.

The HW guys also like RAID 10 better, since it's easier and faster to set up.

BTW: Here is an excellent picture of the different layouts of RAID10 and 0+1 http://oracle-unix.blogspot.com/2007/05/difference-between-raid-01-and-raid-10.html

/Lars

Hi,

This concerns the same setup Mitesh described; it is installed at my site, and I maintain the operations here. We have changed the scenario a bit. Firstly, what I found out is that Navision does not work well on SATA disks, which is true, as I was the one (along with Mitesh) running the tests on it.

The blade itself has 4 x 146 GB SAS disks, which I have configured as 2 sets of RAID 1. Here I have installed the OS, the SQL binaries and Navision 4.0 with SP1. On the SAN we have changed the 8 SATA disks to 8 FC (Fibre Channel) disks. I have now configured 4 RAID 1 sets of 146 GB FC disks. The data (the 6 .ndf files you see in Mitesh's second post) lies on 3 of the sets. The last set contains the transaction log and tempdb. Most of you would say I should have kept all the data in one big RAID 1+0 volume of 6 FC disks, but believe me, I tried that too and concluded that the above config is the best way.

Going ahead, the change I made to the SQL database was to alter the primary key from NONCLUSTERED to CLUSTERED, which gave a good performance boost. But as we go ahead with more stores opening, it is degrading day by day. We have introduced batch posting too, a feature in LS Retail, to avoid locking issues, but this too is getting slower and slower day by day; i.e., by the time 1 post is processed there are 3 posts dropped into Waiting. Our daily processes are also taking too long to complete.
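In case anyone wants to reproduce that key change, on the SQL side it amounts to something like this (table, constraint and column names here are placeholders, not our real object names):

```sql
-- Rebuild the primary key as CLUSTERED instead of NONCLUSTERED.
-- All names below are illustrative only.
ALTER TABLE [CRONUS$Item Ledger Entry]
  DROP CONSTRAINT [Item Ledger Entry$PK];

ALTER TABLE [CRONUS$Item Ledger Entry]
  ADD CONSTRAINT [Item Ledger Entry$PK]
      PRIMARY KEY CLUSTERED ([Entry No_]);
```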

One of my vendors suggested keeping SQL Server 2005 separate from Navision, and then having a farm of Navision application servers clustered together connecting to the database; that way there would be a great boost in performance. He was telling me that Navision uses CIFS (Common Internet File System) to communicate with SQL Server, which is why there is a huge bottleneck on the server. Will this work?

Another one suggested that since I already have clustered indexes on the tables, I could disable MaintainSIFTIndex on the Navision side; this might help speed up performance. I am confused… please help!

Will increasing the memory from 16 GB to 32 GB help? Will adding another processor help? Do we go to Microsoft to get the code written by its partner reviewed? This is very critical for us and any input would be of great help. I would be glad to provide any further information.

Best Regards,
Prashant

Hi Prashant!

Welcome to DynamicsUsers.net [<:o)]

Oh, oh … putting the TLog and tempdb on the same disk set is a big mistake! The Transaction Log of the NAV database MUST be stored exclusively on a dedicated physical drive - meaning nothing else is stored there! And especially not tempdb! The TLog is written sequentially, thus when having it on a dedicated volume - alone - the read/write head(s) of the disk(s) are virtually always in the optimal position, and the mechanical (slow!) work of the spindles is reduced to a minimum. As all transactions depend on the physical writes to the TLog, it is most important to have it as fast as possible!

tempdb is randomly accessed. Hence, if tempdb (or anything else) is stored together with the TLog, the spindles have to perform plenty of mechanical work, thus slowing down the system.

tempdb should also be placed on a dedicated physical drive, RAID 1 or 0 if possible.
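Relocating tempdb to its own drive is a one-time change; a sketch (the paths are placeholders, and the move only takes effect after a SQL Server restart):

```sql
-- Point tempdb's data and log files at dedicated drives (placeholder paths).
-- tempdev/templog are the default logical file names of tempdb.
ALTER DATABASE tempdb
  MODIFY FILE (NAME = tempdev, FILENAME = 'T:\TEMPDB\tempdb.mdf');
ALTER DATABASE tempdb
  MODIFY FILE (NAME = templog, FILENAME = 'U:\TEMPDBLOG\templog.ldf');
```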

Regarding the change from a NONCLUSTERED to a CLUSTERED primary key: usually NAV does this itself, even if the “Clustered” property is not set. In 4.00 before SP1 there was a bug that did not create a Clustered Index for the PK, but this should have been fixed ages ago! Every table needs a Clustered Index (the right one, of course)!

What does this mean? How should this work? NAV is not a “component” of SQL Server, the Clients - and NAS - are just using it.

Regarding disabling MaintainSIFTIndex: this is crap. The Clustered Index has nothing to do with whether you need SIFT or not. That depends on several other issues, e.g. the number of records in the source table, the way the sums are queried, etc. … There are options to replace SIFT with “Indexes with INCLUDED columns” on the source table (see my BLOG), but this is something one has to analyze thoroughly!
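To illustrate the idea (a sketch with made-up company/table/column names; the real analysis is in the BLOG post): instead of maintaining a separate SIFT structure, SQL Server can answer the sum entirely from a covering index on the source table.

```sql
-- Covering index: the SUM below is answered from the index alone,
-- without a SIFT table. All object names are illustrative only.
CREATE INDEX [IX_GLEntry_Account_Date]
  ON [CRONUS$G_L Entry] ([G_L Account No_], [Posting Date])
  INCLUDE ([Amount]);

SELECT SUM([Amount])
FROM [CRONUS$G_L Entry]
WHERE [G_L Account No_] = '2910'
  AND [Posting Date] BETWEEN '20080101' AND '20080131';
```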

What you could try is this:

SIFT Optimization: Reduce the number of buckets in all “Ledger Entry” tables. Without a detailed analysis a rule of thumb could be: “Only the one-before-the-last-one remains”. Example: If you have a SIFT Index with 10 buckets (0 - 9) only number 8 should be enabled, all others could be disabled. If you encounter problems, then add bucket number 7, etc…

Dimensions: The major problem when posting documents is usually the Dimensions. You could speed up things by changing the Clustered Index of T357 “Document Dimension”:

Key: Table ID,Document Type,Document No.,Line No.,Dimension Code

Set property “SQL Index” to: Document No.,Document Type,Line No.,Table ID,Dimension Code

Set “Clustered” to TRUE

You could proceed similarly with “Posted Doc. Dimension”, “Ledger Entry Dimension”, etc.
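On the SQL Server side this roughly results in an index like the following (just a sketch; NAV generates its own index names, “.” in field names becomes “_”, and the table designer replaces the existing clustered index for you):

```sql
-- Approximate shape of the new clustered index on T357 "Document Dimension".
-- Index name and company prefix are illustrative only.
CREATE UNIQUE CLUSTERED INDEX [Document Dimension$0]
  ON [CRONUS$Document Dimension]
     ([Document No_], [Document Type], [Line No_], [Table ID], [Dimension Code]);
```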

Will more RAM help? Yes - the more RAM the better.

Will another processor help? Probably yes - it depends on the current “Processor Queue Length”. Quad cores are not really recommended for a SQL Server; better to use 2 dual cores instead.
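The “Processor Queue Length” counter lives in Performance Monitor; from inside SQL Server 2005 a comparable check is (a sketch; persistently non-zero runnable counts suggest CPU pressure):

```sql
-- Tasks waiting for CPU per scheduler (SQL 2005).
SELECT scheduler_id, current_tasks_count, runnable_tasks_count
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255;  -- exclude hidden/internal schedulers
```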

As for going to Microsoft about the partner's code: I doubt that MS will fix this …

Best regards,

Jörg