Teknologia

Optimizing Entity Framework (3): It’s all SQL at the end of the day

Now that our integration tests are ready and we can freely experiment with and optimize our code, we can attach a profiler such as dotTrace to the tests and make informed decisions about our bottlenecks instead of educated guesses.

In the test fixture's setup, you might have noticed a commented line, EntityFrameworkProfiler.Initialize(). This tells EF Profiler to start monitoring your EF usage, namely the DbContext instances being created, the SQL being generated and sent to the server, and the data returned. It is important to comment that line out when doing execution profiling – with dotTrace or any other profiler – otherwise it will mess up your stats: it adds overhead to some operations, making them show up as bottlenecks in the trace when they are not.
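
Since it's easy to forget to toggle that comment, one option is to guard the call with a conditional compilation symbol instead. A minimal sketch (the EF_PROFILING symbol name is just an illustration, not something EF Profiler defines for you):

// Only wire up EF Profiler in builds that explicitly define the
// EF_PROFILING symbol (an illustrative name), so that execution
// profiling with dotTrace is never skewed by EF Profiler's own overhead.
#if EF_PROFILING
EntityFrameworkProfiler.Initialize();
#endif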

Before we start, forget about DbContext

Depending on your architecture, you will probably find yourself creating many DbContext instances when initializing your Repository or UoW. These could be created in a Repository constructor, or in the simplest form, you will have some variation of this code repeated as many times as the database operations you have:

using(DbContext context = new SomeDbContext())
{
    // Do something with DbContext
}

At first glance, without measurement, this feels like an obvious bottleneck in your application, and you might feel tempted to cache your DbContext or share it somehow … don’t! Unless you really, really know what you’re doing, this will create more problems than it solves.

Quoting an MSDN blog post:

Some people are, quite understandably, concerned about the cost of recreating the ObjectContext again and again. The reality is this cost is actually pretty low, because mostly it simply involves copying, by reference, metadata from a global cache into the new ObjectContext. Generally I don’t think this cost is worth worrying about, but as always, there will be exceptions to that rule.

DbContext is just a wrapper around ObjectContext, so the same statement applies to it. Even after reading such statements, I was still suspicious about the cost of creating these contexts, but after profiling and measuring my application, it turned out that the cost is actually negligible, both in absolute and relative terms.
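
If you'd rather measure than take the quote's word for it, the check is cheap to write. Here is a minimal sketch (reusing the SomeDbContext from the snippet above; the iteration count is arbitrary):

// Warm up first: the very first instantiation pays the one-time
// metadata-loading and view-generation cost, which would skew the average.
using (var warmup = new SomeDbContext()) { }

var stopWatch = Stopwatch.StartNew();
const int iterations = 1000;
for (var i = 0; i < iterations; i++)
{
    // Create and dispose a context, as the repeated using blocks do.
    using (var context = new SomeDbContext()) { }
}
stopWatch.Stop();

Debug.WriteLine("Average cost per context: {0} ms",
    (double)stopWatch.ElapsedMilliseconds / iterations);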

EF generates SQL .. DuH!

Everything that you’re doing with EF will eventually be transformed into SQL. As naive as this observation might sound, it’s easy to lose track of that fact and think that you’re simply working on objects. Most of your optimization will fall into the category of optimizing this generated SQL, and I found that EF can really surprise you when it comes to the SQL it generates.

Here is a very contrived example of a very simple model comprising Artists, Songs and ProductionCompanies. We want to write a query to return all songs that were produced by a company based in “Sudan”. I know that I have three songs produced by a Sudan-based company; here are three variations of how to get this information:

[Test]
public void ShouldReturnThreeSongs_Query1()
{
    using (var context = new MusicContext())
    {
        var sudanBasedCompanies = context.Companies
            .Where(company => company.Country == "Sudan")
            .Select(company => company.Id).ToList();

        var result = context.Songs
            .Where(song => sudanBasedCompanies.Contains(song.ProductionCompany_Id));

        Debug.WriteLine(result.ToString());

        Assert.AreEqual(3, result.Count());
    }
}
        
[Test]
public void ShouldReturnThreeSongs_Query2()
{
    using (var context = new MusicContext())
    {
        var sudanBasedCompanies = context.Companies
             .Where(company => company.Country == "Sudan");

        var result = context.Songs
             .Where(song => sudanBasedCompanies.Contains(song.ProductionCompany));

        Debug.WriteLine(result.ToString());

        Assert.AreEqual(3, result.Count());
    }
}

[Test]
public void ShouldReturnThreeSongs_Query3()
{
    using (var context = new MusicContext())
    {
        var result = context.Songs
            .Where(song => song.ProductionCompany.Country == "Sudan");

        Debug.WriteLine(result.ToString());

        Assert.AreEqual(3, result.Count());
    }
}

All three methods return 3 rows as expected, but let’s look at the SQL they generate.

Query1 uses the IN keyword:

SELECT 
[Extent1].[Id] AS [Id], 
[Extent1].[Name] AS [Name], 
[Extent1].[IsClassic] AS [IsClassic], 
[Extent1].[ProductionCompany_Id] AS [ProductionCompany_Id], 
[Extent1].[Artist_Id] AS [Artist_Id]
FROM [dbo].[Songs] AS [Extent1]
WHERE [Extent1].[ProductionCompany_Id] IN (1,2)

while Query2 uses WHERE EXISTS:

SELECT 
[Extent1].[Id] AS [Id], 
[Extent1].[Name] AS [Name], 
[Extent1].[IsClassic] AS [IsClassic], 
[Extent1].[ProductionCompany_Id] AS [ProductionCompany_Id], 
[Extent1].[Artist_Id] AS [Artist_Id]
FROM [dbo].[Songs] AS [Extent1]
WHERE  EXISTS (SELECT 
	1 AS [C1]
	FROM [dbo].[ProductionCompanies] AS [Extent2]
	WHERE ([Extent1].[ProductionCompany_Id] = [Extent2].[Id]) AND (N'Sudan' = [Extent2].[Country])
)

And Query3 uses an INNER JOIN:

SELECT 
[Extent1].[Id] AS [Id], 
[Extent1].[Name] AS [Name], 
[Extent1].[IsClassic] AS [IsClassic], 
[Extent1].[ProductionCompany_Id] AS [ProductionCompany_Id], 
[Extent1].[Artist_Id] AS [Artist_Id]
FROM  [dbo].[Songs] AS [Extent1]
INNER JOIN [dbo].[ProductionCompanies] AS [Extent2] ON [Extent1].[ProductionCompany_Id] = [Extent2].[Id]
WHERE N'Sudan' = [Extent2].[Country]

This is just a simplified example to show how difficult it is to understand what EF is going to do unless you peek into the generated SQL. In my real-life case, I had a variation of Query1, except that the list of IDs I was matching against contained 15,000 IDs, not just 2. This meant that a huge SQL statement was sent to the server every time this seemingly innocent query was called. Bad, embarrassing, and it gives your DBA more reasons to be mean (not that they need any!). In this real-life project, changing the code so that EF used a JOIN produced the sensible SQL statement I had in mind in the first place, and saved a huge amount of data being sent over the network.
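
In sketch form, the fix for this model is what Query3 already does; and even keeping the Query1 shape, simply not materializing the ID list avoids the giant IN clause:

// Navigate the relationship instead of matching against a client-side
// list of IDs: EF generates the INNER JOIN shown in Query3.
var result = context.Songs
    .Where(song => song.ProductionCompany.Country == "Sudan");

// Alternatively, keep the ID lookup but drop the .ToList(): the IDs
// stay an IQueryable<int> and EF composes one statement server-side,
// instead of inlining 15,000 literal values into an IN clause.
var sudanBasedCompanies = context.Companies
    .Where(company => company.Country == "Sudan")
    .Select(company => company.Id);
var sameResult = context.Songs
    .Where(song => sudanBasedCompanies.Contains(song.ProductionCompany_Id));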

This is just one example. Other small changes will have a similarly big impact on the generated SQL: where you call .Single() (after .Where() as opposed to replacing it), calling .First() – or any other extension method – after materializing your list as opposed to applying it to your IQueryable, and many other similar tweaks.
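
To make the materialization point concrete, here is a minimal sketch of the .First() case (using the same Songs model as above; IsClassic is the flag visible in the generated SQL):

// Applied to the IQueryable: EF translates this into a SELECT TOP (1)
// query, and only one row comes back from the server.
var classic = context.Songs.First(song => song.IsClassic);

// Applied after materializing: .ToList() pulls the entire Songs table
// into memory, and .First() then runs as LINQ-to-Objects on the client.
var alsoClassic = context.Songs.ToList().First(song => song.IsClassic);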

So the moral of the story is: turn on EF tracing (by using EF Profiler, adding IQueryable.ToString() to your debug code or by using SQL Server Profiler) and monitor how EF interprets your code. You’ll be surprised.

Next: More optimizations


Optimizing Entity Framework (2): Writing integration tests

Performance optimization is a highly iterative exercise: you will need to go through many cycles of “run the application, measure, optimize and re-run” until you reach a conclusion or get bored. To avoid the latter (getting bored), write integration tests.

Where to test?

Write your tests to invoke the code from the highest level outside the UI.

In an ASP.NET MVC application, that will be the controllers. Writing integration tests at the UI level – although it can be automated – is unnecessarily expensive, slow and will make repeating your tests less likely. Choosing to do it at a low level, for example by invoking your Repository methods (or your EF code) directly, can be extremely deceiving. I fell into that trap when writing my tests: I wrote them against the repository, forgetting that my Controllers were actually calling a Service that called multiple repository methods through its lifetime, so I was not close enough to the real-life situation I was trying to tackle.

So, with the help of NUnit and Moq (my favorite testing and mocking frameworks), here starts the journey:

[TestFixture]
public class ContactsTest : TestsBase
{
    /// <summary>
    /// The first call is very expensive and can skew our measurements,
    /// so let us warm up EF by making a call to the repository; this
    /// generates the views before our tests run.
    /// </summary>
    [TestFixtureSetUp]
    public void WarmupCodeFirstModel()
    {
        // EntityFrameworkProfiler.Initialize();
        var contactsRepository = new ContactsRepository();
        contactsRepository.ListContacts();
    }

    [Test]
    public void ShouldReturnContactsInLessThan100Ms()
    {
        // 1. Arrange
        var controller = GetController<HomeController>();
        var stopWatch = Stopwatch.StartNew();

        // 2. Act: Invoke the Contact action in HomeController
        ViewResult result = controller.Contact();
        var model = (List<ContactModel>) result.Model;
        stopWatch.Stop();

        //Assert
        Assert.IsNotNull(model); //just a sanity check
        Assert.Less(stopWatch.ElapsedMilliseconds, 100);

        Debug.WriteLine("Integration test finished in {0} ms", stopWatch.ElapsedMilliseconds);
    }
}

The cost of the first call

The first call to EF is the most expensive. It is caused partially by loading the metadata, but mainly by the cost of generating the views. Quoting MSDN:

Before the Entity Framework can execute a query against a conceptual model or save changes to the data source, it must generate a set of local query views to access the database. Because of the high cost of generating these views, you can pre-generate the views and add them to the project at design-time …

Avoiding this one-time cost is not the objective of this post, but you can have a look at this MSDN article for information on how to tackle it: How to Pre-Generate Views to Improve Query Performance.

So for our integration tests, to avoid this first-time cost ruining our measurements, we make a first call in the test fixture's setup to warm up our EF context, so that subsequent calls don’t include this extra overhead that we don’t care about.

After that comes our test. The test calls the controller, which calls the Repository, which uses EF to call the database. This means we exercise the whole stack of our application (except the UI). As a good practice and a common pattern for MVC applications, controllers should have two constructors: a parameterless constructor that will be called by the real application and will create the real dependencies (in this case an implementation of IContactsRepository that uses EF), and another constructor that takes a parameter and will be used in unit tests to pass a mock repository.

For our integration tests, we want to test the real thing, so we call the parameterless constructor.

public class HomeController : Controller
{
    private readonly IContactsRepository _repository;

    public HomeController()
    {
        _repository = new ContactsRepository();
    }

    public HomeController(IContactsRepository repository)
    {
        _repository = repository;
    }

    public ViewResult Contact()
    {
        var model = _repository.ListContacts();
        return View(model);
    }
}

Finally, getting HttpContext out of the way

Our integration tests inherit from TestsBase, which provides some handy methods for our tests. The main issue we’re tackling here is that our controllers – in any real application – will definitely make use of HttpContext for one reason or another (to access the Profile, cookies or any other information about the HTTP request). We have a generic method that returns the Controller we will act on in our tests; the controller will make use of a mock HttpContext so that we can focus on the task at hand and forget about HttpContext.

public class TestsBase
{
    /// <summary>
    /// Returns a Controller of type T that uses a mocked ControllerContext
    /// that implements any dependencies on HttpContext that our controller needs.
    /// A variation of this method is very handy and useful for both integration tests and unit tests.
    /// </summary>
    /// <typeparam name="T">The type of the Controller to return</typeparam>
    /// <returns>Controller of T</returns>
    protected T GetController<T>() where T : Controller
    {
        var controllerMock = new Mock<T> {CallBase = true};
        var controller = controllerMock.Object;
        controller.ControllerContext = new ControllerContext(CreateContextBase(),
                                                                new RouteData(), controller);
        return controller;
    }


    private HttpContextBase CreateContextBase()
    {
        // ... basically, whatever you need to mock in order to get around your controller's use of HttpContext
        // This uses Moq, which is my favorite mocking framework, but any other one will do

        var server = new Mock<HttpServerUtilityBase>();
        var response = new Mock<HttpResponseBase>();
        var request = new Mock<HttpRequestBase>();
        request.Setup(r => r.UserHostAddress).Returns("127.0.0.1");
        request.Setup(r => r.Url).Returns(new Uri("http://www.myurl.sd"));
        var session = new Mock<HttpSessionStateBase>();
        session.Setup(s => s.SessionID).Returns(Guid.NewGuid().ToString());
        //.... everything related to HttpContext is mocked

        var context = new Mock<HttpContextBase>();
        context.SetupGet(c => c.Request).Returns(request.Object);
        context.SetupGet(c => c.Response).Returns(response.Object);
        context.SetupGet(c => c.Server).Returns(server.Object);
        context.SetupGet(c => c.Session).Returns(session.Object);

        return context.Object;
    }
}

Summary

Don’t think of this exercise as a waste of time; imagine the alternative: every time, you have to open a browser, attach a debugger, throw some Stopwatches in the code, write the results in an Excel sheet, compare, optimize and repeat. It’s a nightmare even with the smallest of applications. Now, with these tests, you can run them in your test runner of choice, from the console, or even make them part of your builds. They are invaluable.


Optimizing Entity Framework: Part 1

I was recently involved in a project where I used Entity Framework 4.3 (Code-First) for the first time. I must say I enjoyed it more than I expected: it seemed to fit nicely with the Domain-Driven Design (DDD) principles we were following, migrations were quite convenient for keeping our incremental updates under control (most of the time), but most importantly the models were very focused and clean, so it didn’t feel like we were going out of our way or bending our architecture/design to fulfil any constraints imposed by EF.

In this project – as it should be in most cases – we didn’t optimize prematurely. But towards the end of the project, with real data being used, it was obvious that EF was a bottleneck in our application, which was becoming noticeably slower than previous versions (which used a combination of ADO.NET with MSSQL stored procedures and LINQ to SQL). And this is when I began the interesting exercise of optimizing the performance of the application that I am sharing here.

Unfortunately, I can’t share the specific code as I don’t own it. But the application is a public-facing website that uses ASP.NET MVC for its architecture, and DDD completes the picture as mentioned previously. So at the front-end side of things, there are typically Controllers, highly ajaxified Views (making use of Knockout and some other cool stuff that I will hopefully discuss in other posts) and Models corresponding to the Code-First models. Controllers talk to Services, which talk to Repositories through their interfaces, which in turn talk to the database using EF as the ORM of choice.

Tools of the trade: Profiling

In my view, if you’re talking about performance and not putting numbers on the table, then it’s just chatter with no substance. With performance, speak numbers or remain silent.

So in this specific case, the aim was to decrease the response time from around 800ms-950ms to the range of 100ms. And this is the main reason why numbers are so important: in many scenarios, a response time of 1s might be acceptable; in our case, it just wasn’t good enough, especially when compared to the previous version of the application. I am sure that if you are developing for the scale of Facebook or Twitter, then even 100ms could easily become unacceptable. So measure it and decide.

To get started with the measurement exercise, you have one of two choices:

  1. Scatter Stopwatch instances around your code and keep track of them (a quick sketch of this option follows this list),
  2. or you can use a Profiler.
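
Option 1 looks something like this minimal sketch (timing the ListContacts call from the integration tests above); it works, but you end up repeating it for every code path you suspect:

// Manual timing: quick to add, tedious to maintain, and the results
// still have to be collected and compared by hand.
var stopWatch = Stopwatch.StartNew();

var contacts = new ContactsRepository().ListContacts();

stopWatch.Stop();
Debug.WriteLine("ListContacts took {0} ms", stopWatch.ElapsedMilliseconds);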

dotTrace: easily drill into the code stack and find your code bottlenecks

I used dotTrace from JetBrains as my profiler of choice. It is not free (nor cheap), but it provides a 10-day trial, which was more than enough for this exercise. Most of the profilers out there – code execution and performance profilers rather than memory profilers – are commercial and expensive (for personal use at least); the only free option seems to be EQATEC, which I only had a quick look at, but it seems like it could do the trick as well.

So, running dotTrace, I could see exactly where the bottlenecks in my code were. They were indeed around database access (don’t take that for granted) and Entity Framework.

Tracing EF-generated SQL

EF can surprise you when it comes to generated SQL, and this is probably where most of the optimization will happen. So you need an easy way to check what is being generated by EF. At the end of the day, whatever you’re writing in C# and LINQ – no matter how fancy, testable or well written it is – gets translated to the only language your database understands: SQL.

So here is the other amazing tool that you need to add to your toolset: Entity Framework Profiler. As with most of these tools, it is not free nor cheap, but you get a 30-day free trial. If you can’t afford it, the other option would be to use SQL Profiler to trace the SQL generated by EF (it is also not free, but your organization probably has it as part of their SQL Server edition anyhow). Worst-case scenario, you could go back to basics and use .ToTraceString() to have a peek at the generated SQL.
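
For reference, the back-to-basics option is a one-liner. ToTraceString() lives on ObjectQuery (the older ObjectContext API); with DbContext, as used throughout these posts, calling ToString() on the un-executed query achieves the same thing:

// With DbContext, ToString() on an IQueryable that hasn't been executed
// yet returns the SQL that EF would send to the server:
var query = context.Songs.Where(song => song.IsClassic);
Debug.WriteLine(query.ToString());

// With the older ObjectContext API, the equivalent is (someObjectQuery
// being a hypothetical ObjectContext-based query):
// string sql = ((ObjectQuery<Song>)someObjectQuery).ToTraceString();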

Entity Framework Profiler: Easy insight into what EF is generating on your behalf

Now, you could probably survive without both of these tools, by peeking into your code through the debugger, Stopwatch, trace messages or any other means. But since performance tuning is a highly iterative exercise where you need to repeat and compare many times before reaching a satisfactory conclusion, having such tools is highly rewarding. It frees you from one extra point of pain and lets you focus on the task at hand.


So I’ve decided to blog technical …

… something I’ve always wanted to do, but never felt confident enough to start. Now, I’ve decided to just do it. First, I believe that everyone has some knowledge/experience worth sharing. Second, and most importantly, as someone once put it, “to teach is to learn”, so by sharing and teaching the things I know about, I will hopefully learn something.

So my blog is now divided into two sections:

1. Rai* where I will continue to express my random thoughts about random issues, often about Sudan and being Sudanese. Posts in this category will continue to be rough, irrational and angry at times, but so are the topics they cover.

2. Teknologia for my techy ramblings. My aim – for a start – is to write consistently regardless of quality, and I am sure the quality of my entries will improve if I keep up the habit of blogging. A “relevant” post every two weeks is my initial target.

Language remains a problem, and I am undecided whether to write in Arabic or English. For technical subjects, I prefer English as it is – whether we like it or not – the language of technology, and non-English speakers (Arabic or otherwise) have to make the effort to learn it, at least in the context of IT, otherwise they will be left behind. As for the “Rai” pages, they will continue to be a mix of both depending on the topic, audience and mood.

* Raï (English pronunciation: /ˈraɪ/; Arabic: راي‎) is a form of folk music that originated in Oran, Algeria from Bedouin shepherds, mixed with Spanish, French, African and Arabic musical forms, which dates back to the 1930s.
Singers of raï are called cheb (shabab, young) as opposed to sheikh (shaykh, old), the name given to Chaabi singers. The tradition arose in cities like Oran, Relizane and Mostaganem, primarily among the poor. Traditionally sung by men, at the turn of the 20th century, female singers became common. The lyrics of raï have concerned social issues, such as disease and the policing of European colonies, that affected native populations. (Excerpt from wikipedia)