Optimizing numeric range (interval) searches in SQL Server












18















This question is similar to Optimizing IP Range Search? but that one is restricted to SQL Server 2000.



Suppose I have 10 million ranges provisionally stored in a table structured and populated as below.



CREATE TABLE MyTable
(
Id INT IDENTITY PRIMARY KEY,
RangeFrom INT NOT NULL,
RangeTo INT NOT NULL,
CHECK (RangeTo > RangeFrom),
INDEX IX1 (RangeFrom,RangeTo),
INDEX IX2 (RangeTo,RangeFrom)
);

WITH RandomNumbers
AS (SELECT TOP 10000000 ABS(CRYPT_GEN_RANDOM(4)%100000000) AS Num
FROM sys.all_objects o1,
sys.all_objects o2,
sys.all_objects o3,
sys.all_objects o4)
INSERT INTO MyTable
(RangeFrom,
RangeTo)
SELECT Num,
Num + 1 + CRYPT_GEN_RANDOM(1)
FROM RandomNumbers


I need to know all ranges containing the value 50,000,000. I try the following query



SELECT *
FROM MyTable
WHERE 50000000 BETWEEN RangeFrom AND RangeTo


SQL Server shows that there were 10,951 logical reads and nearly 5 million rows were read to return the 12 matching ones.



enter image description here



Can I improve on this performance? Any restructuring of the table or additional indexes is fine.










share|improve this question

























  • If I'm understanding the set up of the table correctly, you're picking random numbers uniformly to form your ranges, with no constraints on the "size" of each range. And your probe is for the middle of the overall range 1..100M. In that case - no apparent clustering due to uniform randomness - I don't know why an index on either lower bound or upper bound would be helpful. Can you explain that?

    – davidbak
    Dec 29 '18 at 1:28













  • @davidbak the conventional indexes on this table are indeed not very helpful in the worst case as it has to scan half the range hence asking for potential improvements on it. There is a nice improvement in the linked question for SQL Server 2000 with the introduction of the "granule" I was hoping spatial indexes could help here as they support contains queries and whilst they work well at reducing the amount of data read they seem to add other overhead that counteracts this.

    – Martin Smith
    Dec 29 '18 at 1:37













  • I don't have the facility to try it - but I wonder if two indexes - one on the lower bound, one on the upper - and then an inner join - would let the query optimizer work something out.

    – davidbak
    Dec 29 '18 at 1:56








  • 4





    Related: Select all overlapping ranges from starting range

    – Paul White
    Dec 29 '18 at 10:11
















18















This question is similar to Optimizing IP Range Search? but that one is restricted to SQL Server 2000.



Suppose I have 10 million ranges provisionally stored in a table structured and populated as below.



CREATE TABLE MyTable
(
Id INT IDENTITY PRIMARY KEY,
RangeFrom INT NOT NULL,
RangeTo INT NOT NULL,
CHECK (RangeTo > RangeFrom),
INDEX IX1 (RangeFrom,RangeTo),
INDEX IX2 (RangeTo,RangeFrom)
);

WITH RandomNumbers
AS (SELECT TOP 10000000 ABS(CRYPT_GEN_RANDOM(4)%100000000) AS Num
FROM sys.all_objects o1,
sys.all_objects o2,
sys.all_objects o3,
sys.all_objects o4)
INSERT INTO MyTable
(RangeFrom,
RangeTo)
SELECT Num,
Num + 1 + CRYPT_GEN_RANDOM(1)
FROM RandomNumbers


I need to know all ranges containing the value 50,000,000. I try the following query



SELECT *
FROM MyTable
WHERE 50000000 BETWEEN RangeFrom AND RangeTo


SQL Server shows that there were 10,951 logical reads and nearly 5 million rows were read to return the 12 matching ones.



enter image description here



Can I improve on this performance? Any restructuring of the table or additional indexes is fine.










share|improve this question

























  • If I'm understanding the set up of the table correctly, you're picking random numbers uniformly to form your ranges, with no constraints on the "size" of each range. And your probe is for the middle of the overall range 1..100M. In that case - no apparent clustering due to uniform randomness - I don't know why an index on either lower bound or upper bound would be helpful. Can you explain that?

    – davidbak
    Dec 29 '18 at 1:28













  • @davidbak the conventional indexes on this table are indeed not very helpful in the worst case as it has to scan half the range hence asking for potential improvements on it. There is a nice improvement in the linked question for SQL Server 2000 with the introduction of the "granule" I was hoping spatial indexes could help here as they support contains queries and whilst they work well at reducing the amount of data read they seem to add other overhead that counteracts this.

    – Martin Smith
    Dec 29 '18 at 1:37













  • I don't have the facility to try it - but I wonder if two indexes - one on the lower bound, one on the upper - and then an inner join - would let the query optimizer work something out.

    – davidbak
    Dec 29 '18 at 1:56








  • 4





    Related: Select all overlapping ranges from starting range

    – Paul White
    Dec 29 '18 at 10:11














18












18








18


7






This question is similar to Optimizing IP Range Search? but that one is restricted to SQL Server 2000.



Suppose I have 10 million ranges provisionally stored in a table structured and populated as below.



CREATE TABLE MyTable
(
Id INT IDENTITY PRIMARY KEY,
RangeFrom INT NOT NULL,
RangeTo INT NOT NULL,
CHECK (RangeTo > RangeFrom),
INDEX IX1 (RangeFrom,RangeTo),
INDEX IX2 (RangeTo,RangeFrom)
);

WITH RandomNumbers
AS (SELECT TOP 10000000 ABS(CRYPT_GEN_RANDOM(4)%100000000) AS Num
FROM sys.all_objects o1,
sys.all_objects o2,
sys.all_objects o3,
sys.all_objects o4)
INSERT INTO MyTable
(RangeFrom,
RangeTo)
SELECT Num,
Num + 1 + CRYPT_GEN_RANDOM(1)
FROM RandomNumbers


I need to know all ranges containing the value 50,000,000. I try the following query



SELECT *
FROM MyTable
WHERE 50000000 BETWEEN RangeFrom AND RangeTo


SQL Server shows that there were 10,951 logical reads and nearly 5 million rows were read to return the 12 matching ones.



enter image description here



Can I improve on this performance? Any restructuring of the table or additional indexes is fine.










share|improve this question
















This question is similar to Optimizing IP Range Search? but that one is restricted to SQL Server 2000.



Suppose I have 10 million ranges provisionally stored in a table structured and populated as below.



CREATE TABLE MyTable
(
Id INT IDENTITY PRIMARY KEY,
RangeFrom INT NOT NULL,
RangeTo INT NOT NULL,
CHECK (RangeTo > RangeFrom),
INDEX IX1 (RangeFrom,RangeTo),
INDEX IX2 (RangeTo,RangeFrom)
);

WITH RandomNumbers
AS (SELECT TOP 10000000 ABS(CRYPT_GEN_RANDOM(4)%100000000) AS Num
FROM sys.all_objects o1,
sys.all_objects o2,
sys.all_objects o3,
sys.all_objects o4)
INSERT INTO MyTable
(RangeFrom,
RangeTo)
SELECT Num,
Num + 1 + CRYPT_GEN_RANDOM(1)
FROM RandomNumbers


I need to know all ranges containing the value 50,000,000. I try the following query



SELECT *
FROM MyTable
WHERE 50000000 BETWEEN RangeFrom AND RangeTo


SQL Server shows that there were 10,951 logical reads and nearly 5 million rows were read to return the 12 matching ones.



enter image description here



Can I improve on this performance? Any restructuring of the table or additional indexes is fine.







sql-server optimization






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Dec 31 '18 at 9:34







Martin Smith

















asked Dec 28 '18 at 23:27









Martin SmithMartin Smith

63.9k10172256




63.9k10172256













  • If I'm understanding the set up of the table correctly, you're picking random numbers uniformly to form your ranges, with no constraints on the "size" of each range. And your probe is for the middle of the overall range 1..100M. In that case - no apparent clustering due to uniform randomness - I don't know why an index on either lower bound or upper bound would be helpful. Can you explain that?

    – davidbak
    Dec 29 '18 at 1:28













  • @davidbak the conventional indexes on this table are indeed not very helpful in the worst case as it has to scan half the range hence asking for potential improvements on it. There is a nice improvement in the linked question for SQL Server 2000 with the introduction of the "granule" I was hoping spatial indexes could help here as they support contains queries and whilst they work well at reducing the amount of data read they seem to add other overhead that counteracts this.

    – Martin Smith
    Dec 29 '18 at 1:37













  • I don't have the facility to try it - but I wonder if two indexes - one on the lower bound, one on the upper - and then an inner join - would let the query optimizer work something out.

    – davidbak
    Dec 29 '18 at 1:56








  • 4





    Related: Select all overlapping ranges from starting range

    – Paul White
    Dec 29 '18 at 10:11



















  • If I'm understanding the set up of the table correctly, you're picking random numbers uniformly to form your ranges, with no constraints on the "size" of each range. And your probe is for the middle of the overall range 1..100M. In that case - no apparent clustering due to uniform randomness - I don't know why an index on either lower bound or upper bound would be helpful. Can you explain that?

    – davidbak
    Dec 29 '18 at 1:28













  • @davidbak the conventional indexes on this table are indeed not very helpful in the worst case as it has to scan half the range hence asking for potential improvements on it. There is a nice improvement in the linked question for SQL Server 2000 with the introduction of the "granule" I was hoping spatial indexes could help here as they support contains queries and whilst they work well at reducing the amount of data read they seem to add other overhead that counteracts this.

    – Martin Smith
    Dec 29 '18 at 1:37













  • I don't have the facility to try it - but I wonder if two indexes - one on the lower bound, one on the upper - and then an inner join - would let the query optimizer work something out.

    – davidbak
    Dec 29 '18 at 1:56








  • 4





    Related: Select all overlapping ranges from starting range

    – Paul White
    Dec 29 '18 at 10:11

















If I'm understanding the set up of the table correctly, you're picking random numbers uniformly to form your ranges, with no constraints on the "size" of each range. And your probe is for the middle of the overall range 1..100M. In that case - no apparent clustering due to uniform randomness - I don't know why an index on either lower bound or upper bound would be helpful. Can you explain that?

– davidbak
Dec 29 '18 at 1:28







If I'm understanding the set up of the table correctly, you're picking random numbers uniformly to form your ranges, with no constraints on the "size" of each range. And your probe is for the middle of the overall range 1..100M. In that case - no apparent clustering due to uniform randomness - I don't know why an index on either lower bound or upper bound would be helpful. Can you explain that?

– davidbak
Dec 29 '18 at 1:28















@davidbak the conventional indexes on this table are indeed not very helpful in the worst case as it has to scan half the range hence asking for potential improvements on it. There is a nice improvement in the linked question for SQL Server 2000 with the introduction of the "granule" I was hoping spatial indexes could help here as they support contains queries and whilst they work well at reducing the amount of data read they seem to add other overhead that counteracts this.

– Martin Smith
Dec 29 '18 at 1:37







@davidbak the conventional indexes on this table are indeed not very helpful in the worst case as it has to scan half the range hence asking for potential improvements on it. There is a nice improvement in the linked question for SQL Server 2000 with the introduction of the "granule" I was hoping spatial indexes could help here as they support contains queries and whilst they work well at reducing the amount of data read they seem to add other overhead that counteracts this.

– Martin Smith
Dec 29 '18 at 1:37















I don't have the facility to try it - but I wonder if two indexes - one on the lower bound, one on the upper - and then an inner join - would let the query optimizer work something out.

– davidbak
Dec 29 '18 at 1:56







I don't have the facility to try it - but I wonder if two indexes - one on the lower bound, one on the upper - and then an inner join - would let the query optimizer work something out.

– davidbak
Dec 29 '18 at 1:56






4




4





Related: Select all overlapping ranges from starting range

– Paul White
Dec 29 '18 at 10:11





Related: Select all overlapping ranges from starting range

– Paul White
Dec 29 '18 at 10:11










7 Answers
7






active

oldest

votes


















11














Columnstore is very heplful here compared to a nonclustered index which scans half the table. A nonclustered columnstore index provides most of the benefit but inserting ordered data into a clustered columnstore index is even better.



DROP TABLE IF EXISTS dbo.MyTableCCI;

CREATE TABLE dbo.MyTableCCI
(
Id INT PRIMARY KEY,
RangeFrom INT NOT NULL,
RangeTo INT NOT NULL,
CHECK (RangeTo > RangeFrom),
INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.MyTableCCI
SELECT TOP (987654321) *
FROM dbo.MyTable
ORDER BY RangeFrom ASC
OPTION (MAXDOP 1);


By design I can get rowgroup elimination on the RangeFrom column which will eliminate half of my rowgroups. But due to the nature of the data I also get rowgroup elimination on the RangeTo column as well:



Table 'MyTableCCI'. Segment reads 1, segment skipped 9.


For larger tables with more variable data there are different ways to load the data to guarantee the best possible rowgroup elimination on both columns. For your data in particular, the query takes 1 ms.






share|improve this answer


























  • yep definitely looking for other approaches to consider without the 2000 restriction. Doesnt sound like that will be beaten.

    – Martin Smith
    Dec 29 '18 at 1:32



















9














Paul White pointed out an answer to a similar question containing a link to an interesting article by Itzik Ben Gan. This describes the "Static Relational Interval Tree" model that allows this to be done efficiently.



In summary this approach involves storing a computed ("forknode") value based on the interval values in the row. When searching for ranges that intersect another range it is possible to precalculate the possible forknode values that matching rows must have and use this to find the results with a maximum of 31 seek operations (the below supports integers in the range 0 to the max signed 32 bit int)



Based on this I restructured the table as below.



CREATE TABLE dbo.MyTable3
(
Id INT IDENTITY PRIMARY KEY,
RangeFrom INT NOT NULL,
RangeTo INT NOT NULL,
node AS RangeTo - RangeTo % POWER(2, FLOOR(LOG((RangeFrom - 1) ^ RangeTo, 2))) PERSISTED NOT NULL,
CHECK (RangeTo > RangeFrom)
);

CREATE INDEX ix1 ON dbo.MyTable3 (node, RangeFrom) INCLUDE (RangeTo);
CREATE INDEX ix2 ON dbo.MyTable3 (node, RangeTo) INCLUDE (RangeFrom);

SET IDENTITY_INSERT MyTable3 ON

INSERT INTO MyTable3
(Id,
RangeFrom,
RangeTo)
SELECT Id,
RangeFrom,
RangeTo
FROM MyTable

SET IDENTITY_INSERT MyTable3 OFF


And then used the following query (the article is looking for intersecting intervals so finding an interval containing a point is a degenerate case of this)



DECLARE @value INT = 50000000;

;WITH N AS
(
SELECT 30 AS Level,
CASE WHEN @value > POWER(2,30) THEN POWER(2,30) END AS selected_left_node,
CASE WHEN @value < POWER(2,30) THEN POWER(2,30) END AS selected_right_node,
(SIGN(@value - POWER(2,30)) * POWER(2,29)) + POWER(2,30) AS node
UNION ALL
SELECT N.Level-1,
CASE WHEN @value > node THEN node END AS selected_left_node,
CASE WHEN @value < node THEN node END AS selected_right_node,
(SIGN(@value - node) * POWER(2,N.Level-2)) + node AS node
FROM N
WHERE N.Level > 0
)
SELECT I.id, I.RangeFrom, I.RangeTo
FROM dbo.MyTable3 AS I
JOIN N AS L
ON I.node = L.selected_left_node
AND I.RangeTo >= @value
AND L.selected_left_node IS NOT NULL
UNION ALL
SELECT I.id, I.RangeFrom, I.RangeTo
FROM dbo.MyTable3 AS I
JOIN N AS R
ON I.node = R.selected_right_node
AND I.RangeFrom <= @value
AND R.selected_right_node IS NOT NULL
UNION ALL
SELECT I.id, I.RangeFrom, I.RangeTo
FROM dbo.MyTable3 AS I
WHERE node = @value;


This typically executes in 1mson my machine when all pages are in cache - with IO stats.



Table 'MyTable3'. Scan count 24, logical reads 72, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 4, logical reads 374, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


and plan



enter image description here



NB: The source uses multistatement TVFs rather than a recursive CTE to get the nodes to join on but in the interests of making my answer self contained I have opted for the latter. For production use I'd probably use the TVFs.






share|improve this answer

































    9














    I was able to find a row mode approach that is competitive with the N/CCI approach, but you need to know something about your data. Suppose that you had a column that contained the difference of RangeFrom and RangeTo and you indexed it along with RangeFrom:



    ALTER TABLE dbo.MyTableWithDiff ADD DiffOfColumns AS RangeTo-RangeFrom;

    CREATE INDEX IXDIFF ON dbo.MyTableWithDiff (DiffOfColumns,RangeFrom) INCLUDE (RangeTo);


    If you knew all of the distinct values of DiffOfColumns then you could perform a seek for every value of DiffOfColumns with a range filter on RangeTo to get all of the relevant data. For example, if we know that DiffOfColumns = 2 then the only allowed values for RangeFrom are 49999998, 49999999 and 50000000. Recursion can be used to get all of the distinct values of DiffOfColumns and it works well for your data set because there are only 256 of them. The query below takes about 6 ms on my machine:



    WITH RecursiveCTE
    AS
    (
    -- Anchor
    SELECT TOP (1)
    DiffOfColumns
    FROM dbo.MyTableWithDiff AS T
    ORDER BY
    T.DiffOfColumns

    UNION ALL

    -- Recursive
    SELECT R.DiffOfColumns
    FROM
    (
    -- Number the rows
    SELECT
    T.DiffOfColumns,
    rn = ROW_NUMBER() OVER (
    ORDER BY T.DiffOfColumns)
    FROM dbo.MyTableWithDiff AS T
    JOIN RecursiveCTE AS R
    ON R.DiffOfColumns < T.DiffOfColumns
    ) AS R
    WHERE
    -- Only the row that sorts lowest
    R.rn = 1
    )
    SELECT ca.*
    FROM RecursiveCTE rcte
    CROSS APPLY (
    SELECT mt.Id, mt.RangeFrom, mt.RangeTo
    FROM dbo.MyTableWithDiff mt
    WHERE mt.DiffOfColumns = rcte.DiffOfColumns
    AND mt.RangeFrom >= 50000000 - rcte.DiffOfColumns AND mt.RangeFrom <= 50000000
    ) ca
    OPTION (MAXRECURSION 0);


    You can see the usual recursive part along with the index seek for every distinct value:



    query plan 1



    The flaw with this approach is that it starts to get slow when there are too many distinct values for DiffOfColumns. Let's do the same test, but use CRYPT_GEN_RANDOM(2)instead of CRYPT_GEN_RANDOM(1).



    DROP TABLE IF EXISTS dbo.MyTableBigDiff;

    CREATE TABLE dbo.MyTableBigDiff
    (
    Id INT IDENTITY PRIMARY KEY,
    RangeFrom INT NOT NULL,
    RangeTo INT NOT NULL,
    CHECK (RangeTo > RangeFrom)
    );

    WITH RandomNumbers
    AS (SELECT TOP 10000000 ABS(CRYPT_GEN_RANDOM(4)%100000000) AS Num
    FROM sys.all_objects o1,
    sys.all_objects o2,
    sys.all_objects o3,
    sys.all_objects o4)
    INSERT INTO dbo.MyTableBigDiff
    (RangeFrom,
    RangeTo)
    SELECT Num,
    Num + 1 + CRYPT_GEN_RANDOM(2) -- note the 2
    FROM RandomNumbers;


    ALTER TABLE dbo.MyTableBigDiff ADD DiffOfColumns AS RangeTo-RangeFrom;

    CREATE INDEX IXDIFF ON dbo.MyTableBigDiff (DiffOfColumns,RangeFrom) INCLUDE (RangeTo);


    The same query now finds 65536 rows from the recursive part and takes 823 ms of CPU on my machine. There are PAGELATCH_SH waits and other bad things going on. I can improve performance by bucketing the diff values to keep the number of unique values under control and adjusting for the bucketing in the CROSS APPLY. For this data set I'll try 256 buckets:



    ALTER TABLE dbo.MyTableBigDiff ADD DiffOfColumns_bucket256 AS CAST(CEILING((RangeTo-RangeFrom) / 256.) AS INT);

    CREATE INDEX [IXDIFF😎] ON dbo.MyTableBigDiff (DiffOfColumns_bucket256, RangeFrom) INCLUDE (RangeTo);


    One way to avoid getting extra rows (now I'm comparing to a rounded value instead of the true value) is by filtering on RangeTo:



    CROSS APPLY (
    SELECT mt.Id, mt.RangeFrom, mt.RangeTo
    FROM dbo.MyTableBigDiff mt
    WHERE mt.DiffOfColumns_bucket256 = rcte.DiffOfColumns_bucket256
    AND mt.RangeFrom >= 50000000 - (256 * rcte.DiffOfColumns_bucket256)
    AND mt.RangeFrom <= 50000000
    AND mt.RangeTo >= 50000000
    ) ca


    The full query now takes 6 ms on my machine.






    share|improve this answer

































      8














      One alternative way of representing a range would be as points on a line.



      The below migrates all the data into a new table with the range represented as a geometry datatype.



      CREATE TABLE MyTable2
      (
      Id INT IDENTITY PRIMARY KEY,
      Range GEOMETRY NOT NULL,
      RangeFrom AS Range.STPointN(1).STX,
      RangeTo AS Range.STPointN(2).STX,
      CHECK (Range.STNumPoints() = 2 AND Range.STPointN(1).STY = 0 AND Range.STPointN(2).STY = 0)
      );

      SET IDENTITY_INSERT MyTable2 ON

      INSERT INTO MyTable2
      (Id,
      Range)
      SELECT ID,
      geometry::STLineFromText(CONCAT('LINESTRING(', RangeFrom, ' 0, ', RangeTo, ' 0)'), 0)
      FROM MyTable

      SET IDENTITY_INSERT MyTable2 OFF


      CREATE SPATIAL INDEX index_name
      ON MyTable2 ( Range )
      USING GEOMETRY_GRID
      WITH (
      BOUNDING_BOX = ( xmin=0, ymin=0, xmax=110000000, ymax=1 ),
      GRIDS = (HIGH, HIGH, HIGH, HIGH),
      CELLS_PER_OBJECT = 16);


      The equivalent query to find ranges containing the value 50,000,000 is below.



      SELECT Id,
      RangeFrom,
      RangeTo
      FROM MyTable2
      WHERE Range.STContains(geometry::STPointFromText ('POINT (50000000 0)', 0)) = 1


      The reads for this show an improvement on the 10,951 from the original query.



      Table 'MyTable2'. Scan count 0, logical reads 505, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
      Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
      Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
      Table 'extended_index_1797581442_384000'. Scan count 4, logical reads 17, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


      However there is no significant improvement over the original in terms of time elapsed. Typical execution results are 250 ms vs 252 ms.



      The execution plan is more complex as below



      enter image description here



      The only case where the rewrite reliably performs better for me is with a cold cache.



      So disappointing in this case and difficult to recommend this rewrite but publication of negative results can also be useful.






      share|improve this answer































        5














        As an homage to our new robot overlords, I decided to see if any of the new-ish R and Python functionality could help us out here. The answer is no, at least for the scripts that I could get working and returning correct results. If anyone with better knowledge comes along, well, feel free to spank me. My rates are reasonable.



        To do this, I set up a VM with 4 cores and 16 GB RAM, thinking this would be enough to deal with a ~200MB data set.



        Let's start with the language that doesn't exist in Boston!



        R



        EXEC sp_execute_external_script 
        @language = N'R',
        @script = N'
        tweener = 50000000
        MO = data.frame(MartinIn)
        MartinOut <- subset(MO, RangeFrom <= tweener & RangeTo >= tweener, select = c("Id","RangeFrom","RangeTo"))
        ',
        @input_data_1_name = N'MartinIn',
        @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable',
        @output_data_1_name = N'MartinOut',
        @parallel = 1
        WITH RESULT SETS ((ID INT, RangeFrom INT, RangeTo INT));


        This was a bad time.



        Table 'MyTable'. Scan count 1, logical reads 22400

        SQL Server Execution Times:
        CPU time = 3219 ms, elapsed time = 5349 ms.


        The execution plan is pretty uninteresting, though I don't know why the middle operator has to call us names.



        NUTS



        Next up, coding with crayons!



        Python



        EXEC sp_execute_external_script 
        @language = N'Python',
        @script = N'
        import pandas as pd
        MO = pd.DataFrame(MartinIn)
        tweener = 50000000
        MartinOut = MO[(MO.RangeFrom <= tweener) & (MO.RangeTo >= tweener)]
        ',
        @input_data_1_name = N'MartinIn',
        @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable',
        @output_data_1_name = N'MartinOut',
        @parallel = 1
        WITH RESULT SETS ((ID INT, RangeFrom INT, RangeTo INT));


        Just when you thought it couldn't get worse than R:



        Table 'MyTable'. Scan count 1, logical reads 22400

        SQL Server Execution Times:
        CPU time = 3797 ms, elapsed time = 10146 ms.


        Another foul-mouthed execution plan:



        NUTS



        Hmm and Hmmer



        So far, I'm not impressed. I can't wait to delete this VM.






        share|improve this answer



















        • 1





          You can pass in parameters too, eg DECLARE @input INT = 50000001; EXEC dbo.sp_execute_external_script @language = N'R', @script = N'OutputDataSet <- InputDataSet[which(x >= InputDataSet$RangeFrom & x <= InputDataSet$RangeTo) , ]', @parallel = 1, @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable;', @params = N'@x INT', @x = 50000001 WITH RESULT SETS ( ( Id INT NOT NULL, RangeFrom INT NOT NULL, RangeTo INT NOT NULL )); but yeah performance isn't great. I use R for stuff you can't do in SQL, say if you wanted to predict something.

          – wBob
          Dec 31 '18 at 3:47





















        4














        I found a pretty good solution using a computed column, however it's only good for a single value. That being said, if you've got a magic value, maybe it's enough.



        Starting with your given sample, then modifying the table:



        ALTER TABLE dbo.MyTable
        ADD curtis_jackson
        AS CONVERT(BIT, CASE
        WHEN RangeTo >= 50000000
        AND RangeFrom < 50000000
        THEN 1
        ELSE 0
        END);

        CREATE INDEX IX1_redo
        ON dbo.MyTable (curtis_jackson)
        INCLUDE (RangeFrom, RangeTo);


        The query simply becomes:



        SELECT *
        FROM MyTable
        WHERE curtis_jackson = 1;


        Which returns the same results as your starting query. With execution plans turned off, here are the stats (truncated for brevity):



        Table 'MyTable'. Scan count 1, logical reads 3...

        SQL Server Execution Times:
        CPU time = 0 ms, elapsed time = 0 ms.


        And here's the query plan:



        NUTS






        share|improve this answer
























        • Can you not overcome the imitation of computed column/filtered index with an index on WHERE (50000000 BETWEEN RangeFrom AND RangeTo) INCLUDE (..)?

          – ypercubeᵀᴹ
          Dec 30 '18 at 17:58






        • 3





          @yper-crazyhat-cubeᵀᴹ - yes. CREATE INDEX IX1_redo ON dbo.MyTable (curtis_jackson) INCLUDE (RangeFrom, RangeTo) WHERE RangeTo >= 50000000 AND RangeFrom <= 50000000 would work. And the query SELECT * FROM MyTable WHERE RangeTo >= 50000000 AND RangeFrom <= 50000000; uses it - so then not much need for poor Curtis

          – Martin Smith
          Dec 31 '18 at 9:19



















        2














        My solution is based on the observation that the interval has a known maximum width W. For the sample data this is one byte or 256 integers. Hence for a given search parameter value P we know the smallest RangeFrom that can be in the result set is P - W. Adding this to the predicate gives



        declare @P int = 50000000;
        declare @W int = 256;

        select
        *
        from MyTable
        where @P between RangeFrom and RangeTo
        and RangeFrom >= (@P - @W);


        Given the original set-up and query my machine (64 bit Windows 10, 4-core hyperthreaded i7, 2.8GHz, 16GB RAM) returns 13 rows. That query uses a parallel index seek of the (RangeFrom, RangeTo) index. The revised query also performs a parallel index seek on the same index.



        The measurements for the original and revised queries are



                                  Original  Revised
        -------- -------
        Stats IO Scan count 9 6
        Stats IO logical reads 11547 6

        Estimated number of rows 1643170 1216080
        Number of rows read 5109666 29
        QueryTimeStats CPU 344 2
        QueryTimeStats Elapsed 53 0


        For the original query the number of rows read is equal to the number of rows that are less than or equal to @P. The query optimizer (QO) has no alternative but read them all since it cannot determine in advance which if these rows will satisfy the predicate. The multi-column index on (RangeFrom, RangeTo) is not useful in eliminating rows that do not match RangeTo as there is no correlation between the first index key and the second that can be applied. For example, the first row may have a small interval and be eliminated whereas the second row has a large interval and is returned, or vice versa.



        In one failed attempt I tried to provide that certainty through a check constraint:



        alter table MyTable with check
        add constraint CK_MyTable_Interval
        check
        (
        RangeTo <= RangeFrom + 256
        );


        It made no difference.



        By incorporating my external knowledge of the data distribution into the predicate I can cause the QO to skip low-valued RangeFrom rows, which can never be part of the resultset, and traverse the leading column of the index to the admissible rows. This shows in the different seek predicate for each query.



        In a mirror argument the upper limit of RangeTo is P + W. This is not useful, however, because there is no correlation between RangeFrom and RangeTo that would allow the trailing column of a multi-column index to eliminate rows. Hence there is no benefit from adding this clause to the query.



        This approach gains most of its benefit from the small interval size. As the possible interval size increases the number of low-valued rows skipped decreases, though some will still be skipped. In the limiting case, with the interval as large as the data range, this approach is no worse than the orginal query (which is cold comfort I admit).



        I apologise for whatever off-by-one errors may exist in this answer.






        share|improve this answer

























          Your Answer








          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "182"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: false,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: null,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f225953%2foptimizing-numeric-range-interval-searches-in-sql-server%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          7 Answers
          7






          active

          oldest

          votes








          7 Answers
          7






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          11














          Columnstore is very heplful here compared to a nonclustered index which scans half the table. A nonclustered columnstore index provides most of the benefit but inserting ordered data into a clustered columnstore index is even better.



          DROP TABLE IF EXISTS dbo.MyTableCCI;

          CREATE TABLE dbo.MyTableCCI
          (
          Id INT PRIMARY KEY,
          RangeFrom INT NOT NULL,
          RangeTo INT NOT NULL,
          CHECK (RangeTo > RangeFrom),
          INDEX CCI CLUSTERED COLUMNSTORE
          );

          INSERT INTO dbo.MyTableCCI
          SELECT TOP (987654321) *
          FROM dbo.MyTable
          ORDER BY RangeFrom ASC
          OPTION (MAXDOP 1);


          By design I can get rowgroup elimination on the RangeFrom column which will eliminate half of my rowgroups. But due to the nature of the data I also get rowgroup elimination on the RangeTo column as well:



          Table 'MyTableCCI'. Segment reads 1, segment skipped 9.


          For larger tables with more variable data there are different ways to load the data to guarantee the best possible rowgroup elimination on both columns. For your data in particular, the query takes 1 ms.






          share|improve this answer


























          • yep definitely looking for other approaches to consider without the 2000 restriction. Doesnt sound like that will be beaten.

            – Martin Smith
            Dec 29 '18 at 1:32
















          11














          Columnstore is very heplful here compared to a nonclustered index which scans half the table. A nonclustered columnstore index provides most of the benefit but inserting ordered data into a clustered columnstore index is even better.



          DROP TABLE IF EXISTS dbo.MyTableCCI;

          CREATE TABLE dbo.MyTableCCI
          (
          Id INT PRIMARY KEY,
          RangeFrom INT NOT NULL,
          RangeTo INT NOT NULL,
          CHECK (RangeTo > RangeFrom),
          INDEX CCI CLUSTERED COLUMNSTORE
          );

          INSERT INTO dbo.MyTableCCI
          SELECT TOP (987654321) *
          FROM dbo.MyTable
          ORDER BY RangeFrom ASC
          OPTION (MAXDOP 1);


          By design I can get rowgroup elimination on the RangeFrom column which will eliminate half of my rowgroups. But due to the nature of the data I also get rowgroup elimination on the RangeTo column as well:



          Table 'MyTableCCI'. Segment reads 1, segment skipped 9.


          For larger tables with more variable data there are different ways to load the data to guarantee the best possible rowgroup elimination on both columns. For your data in particular, the query takes 1 ms.






          share|improve this answer


























          • yep definitely looking for other approaches to consider without the 2000 restriction. Doesnt sound like that will be beaten.

            – Martin Smith
            Dec 29 '18 at 1:32














          11












          11








          11







          Columnstore is very heplful here compared to a nonclustered index which scans half the table. A nonclustered columnstore index provides most of the benefit but inserting ordered data into a clustered columnstore index is even better.



          DROP TABLE IF EXISTS dbo.MyTableCCI;

          CREATE TABLE dbo.MyTableCCI
          (
          Id INT PRIMARY KEY,
          RangeFrom INT NOT NULL,
          RangeTo INT NOT NULL,
          CHECK (RangeTo > RangeFrom),
          INDEX CCI CLUSTERED COLUMNSTORE
          );

          INSERT INTO dbo.MyTableCCI
          SELECT TOP (987654321) *
          FROM dbo.MyTable
          ORDER BY RangeFrom ASC
          OPTION (MAXDOP 1);


          By design I can get rowgroup elimination on the RangeFrom column which will eliminate half of my rowgroups. But due to the nature of the data I also get rowgroup elimination on the RangeTo column as well:



          Table 'MyTableCCI'. Segment reads 1, segment skipped 9.


          For larger tables with more variable data there are different ways to load the data to guarantee the best possible rowgroup elimination on both columns. For your data in particular, the query takes 1 ms.






          share|improve this answer















          Columnstore is very heplful here compared to a nonclustered index which scans half the table. A nonclustered columnstore index provides most of the benefit but inserting ordered data into a clustered columnstore index is even better.



          DROP TABLE IF EXISTS dbo.MyTableCCI;

          CREATE TABLE dbo.MyTableCCI
          (
          Id INT PRIMARY KEY,
          RangeFrom INT NOT NULL,
          RangeTo INT NOT NULL,
          CHECK (RangeTo > RangeFrom),
          INDEX CCI CLUSTERED COLUMNSTORE
          );

          INSERT INTO dbo.MyTableCCI
          SELECT TOP (987654321) *
          FROM dbo.MyTable
          ORDER BY RangeFrom ASC
          OPTION (MAXDOP 1);


          By design I can get rowgroup elimination on the RangeFrom column which will eliminate half of my rowgroups. But due to the nature of the data I also get rowgroup elimination on the RangeTo column as well:



          Table 'MyTableCCI'. Segment reads 1, segment skipped 9.


          For larger tables with more variable data there are different ways to load the data to guarantee the best possible rowgroup elimination on both columns. For your data in particular, the query takes 1 ms.







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Dec 29 '18 at 3:06

























          answered Dec 29 '18 at 1:29









          Joe ObbishJoe Obbish

          21.3k43187




          21.3k43187













          • yep definitely looking for other approaches to consider without the 2000 restriction. Doesnt sound like that will be beaten.

            – Martin Smith
            Dec 29 '18 at 1:32



















          • yep definitely looking for other approaches to consider without the 2000 restriction. Doesnt sound like that will be beaten.

            – Martin Smith
            Dec 29 '18 at 1:32

















          yep definitely looking for other approaches to consider without the 2000 restriction. Doesnt sound like that will be beaten.

          – Martin Smith
          Dec 29 '18 at 1:32





          yep definitely looking for other approaches to consider without the 2000 restriction. Doesnt sound like that will be beaten.

          – Martin Smith
          Dec 29 '18 at 1:32













          9














          Paul White pointed out an answer to a similar question containing a link to an interesting article by Itzik Ben Gan. This describes the "Static Relational Interval Tree" model that allows this to be done efficiently.



          In summary this approach involves storing a computed ("forknode") value based on the interval values in the row. When searching for ranges that intersect another range it is possible to precalculate the possible forknode values that matching rows must have and use this to find the results with a maximum of 31 seek operations (the below supports integers in the range 0 to the max signed 32 bit int)



          Based on this I restructured the table as below.



          CREATE TABLE dbo.MyTable3
          (
          Id INT IDENTITY PRIMARY KEY,
          RangeFrom INT NOT NULL,
          RangeTo INT NOT NULL,
          node AS RangeTo - RangeTo % POWER(2, FLOOR(LOG((RangeFrom - 1) ^ RangeTo, 2))) PERSISTED NOT NULL,
          CHECK (RangeTo > RangeFrom)
          );

          CREATE INDEX ix1 ON dbo.MyTable3 (node, RangeFrom) INCLUDE (RangeTo);
          CREATE INDEX ix2 ON dbo.MyTable3 (node, RangeTo) INCLUDE (RangeFrom);

          SET IDENTITY_INSERT MyTable3 ON

          INSERT INTO MyTable3
          (Id,
          RangeFrom,
          RangeTo)
          SELECT Id,
          RangeFrom,
          RangeTo
          FROM MyTable

          SET IDENTITY_INSERT MyTable3 OFF


          And then used the following query (the article is looking for intersecting intervals so finding an interval containing a point is a degenerate case of this)



          DECLARE @value INT = 50000000;

          ;WITH N AS
          (
          SELECT 30 AS Level,
          CASE WHEN @value > POWER(2,30) THEN POWER(2,30) END AS selected_left_node,
          CASE WHEN @value < POWER(2,30) THEN POWER(2,30) END AS selected_right_node,
          (SIGN(@value - POWER(2,30)) * POWER(2,29)) + POWER(2,30) AS node
          UNION ALL
          SELECT N.Level-1,
          CASE WHEN @value > node THEN node END AS selected_left_node,
          CASE WHEN @value < node THEN node END AS selected_right_node,
          (SIGN(@value - node) * POWER(2,N.Level-2)) + node AS node
          FROM N
          WHERE N.Level > 0
          )
          SELECT I.id, I.RangeFrom, I.RangeTo
          FROM dbo.MyTable3 AS I
          JOIN N AS L
          ON I.node = L.selected_left_node
          AND I.RangeTo >= @value
          AND L.selected_left_node IS NOT NULL
          UNION ALL
          SELECT I.id, I.RangeFrom, I.RangeTo
          FROM dbo.MyTable3 AS I
          JOIN N AS R
          ON I.node = R.selected_right_node
          AND I.RangeFrom <= @value
          AND R.selected_right_node IS NOT NULL
          UNION ALL
          SELECT I.id, I.RangeFrom, I.RangeTo
          FROM dbo.MyTable3 AS I
          WHERE node = @value;


          This typically executes in 1mson my machine when all pages are in cache - with IO stats.



          Table 'MyTable3'. Scan count 24, logical reads 72, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
          Table 'Worktable'. Scan count 4, logical reads 374, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


          and plan



          enter image description here



          NB: The source uses multistatement TVFs rather than a recursive CTE to get the nodes to join on but in the interests of making my answer self contained I have opted for the latter. For production use I'd probably use the TVFs.






          share|improve this answer






























            9














            Paul White pointed out an answer to a similar question containing a link to an interesting article by Itzik Ben Gan. This describes the "Static Relational Interval Tree" model that allows this to be done efficiently.



            In summary this approach involves storing a computed ("forknode") value based on the interval values in the row. When searching for ranges that intersect another range it is possible to precalculate the possible forknode values that matching rows must have and use this to find the results with a maximum of 31 seek operations (the below supports integers in the range 0 to the max signed 32 bit int)



            Based on this I restructured the table as below.



            CREATE TABLE dbo.MyTable3
            (
            Id INT IDENTITY PRIMARY KEY,
            RangeFrom INT NOT NULL,
            RangeTo INT NOT NULL,
            node AS RangeTo - RangeTo % POWER(2, FLOOR(LOG((RangeFrom - 1) ^ RangeTo, 2))) PERSISTED NOT NULL,
            CHECK (RangeTo > RangeFrom)
            );

            CREATE INDEX ix1 ON dbo.MyTable3 (node, RangeFrom) INCLUDE (RangeTo);
            CREATE INDEX ix2 ON dbo.MyTable3 (node, RangeTo) INCLUDE (RangeFrom);

            SET IDENTITY_INSERT MyTable3 ON

            INSERT INTO MyTable3
            (Id,
            RangeFrom,
            RangeTo)
            SELECT Id,
            RangeFrom,
            RangeTo
            FROM MyTable

            SET IDENTITY_INSERT MyTable3 OFF


            And then used the following query (the article is looking for intersecting intervals so finding an interval containing a point is a degenerate case of this)



            DECLARE @value INT = 50000000;

            ;WITH N AS
            (
            SELECT 30 AS Level,
            CASE WHEN @value > POWER(2,30) THEN POWER(2,30) END AS selected_left_node,
            CASE WHEN @value < POWER(2,30) THEN POWER(2,30) END AS selected_right_node,
            (SIGN(@value - POWER(2,30)) * POWER(2,29)) + POWER(2,30) AS node
            UNION ALL
            SELECT N.Level-1,
            CASE WHEN @value > node THEN node END AS selected_left_node,
            CASE WHEN @value < node THEN node END AS selected_right_node,
            (SIGN(@value - node) * POWER(2,N.Level-2)) + node AS node
            FROM N
            WHERE N.Level > 0
            )
            SELECT I.id, I.RangeFrom, I.RangeTo
            FROM dbo.MyTable3 AS I
            JOIN N AS L
            ON I.node = L.selected_left_node
            AND I.RangeTo >= @value
            AND L.selected_left_node IS NOT NULL
            UNION ALL
            SELECT I.id, I.RangeFrom, I.RangeTo
            FROM dbo.MyTable3 AS I
            JOIN N AS R
            ON I.node = R.selected_right_node
            AND I.RangeFrom <= @value
            AND R.selected_right_node IS NOT NULL
            UNION ALL
            SELECT I.id, I.RangeFrom, I.RangeTo
            FROM dbo.MyTable3 AS I
            WHERE node = @value;


            This typically executes in 1mson my machine when all pages are in cache - with IO stats.



            Table 'MyTable3'. Scan count 24, logical reads 72, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
            Table 'Worktable'. Scan count 4, logical reads 374, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


            and plan



            enter image description here



            NB: The source uses multistatement TVFs rather than a recursive CTE to get the nodes to join on but in the interests of making my answer self contained I have opted for the latter. For production use I'd probably use the TVFs.






            share|improve this answer




























              9












              9








              9







              Paul White pointed out an answer to a similar question containing a link to an interesting article by Itzik Ben Gan. This describes the "Static Relational Interval Tree" model that allows this to be done efficiently.



              In summary this approach involves storing a computed ("forknode") value based on the interval values in the row. When searching for ranges that intersect another range it is possible to precalculate the possible forknode values that matching rows must have and use this to find the results with a maximum of 31 seek operations (the below supports integers in the range 0 to the max signed 32 bit int)



              Based on this I restructured the table as below.



              CREATE TABLE dbo.MyTable3
              (
              Id INT IDENTITY PRIMARY KEY,
              RangeFrom INT NOT NULL,
              RangeTo INT NOT NULL,
              node AS RangeTo - RangeTo % POWER(2, FLOOR(LOG((RangeFrom - 1) ^ RangeTo, 2))) PERSISTED NOT NULL,
              CHECK (RangeTo > RangeFrom)
              );

              CREATE INDEX ix1 ON dbo.MyTable3 (node, RangeFrom) INCLUDE (RangeTo);
              CREATE INDEX ix2 ON dbo.MyTable3 (node, RangeTo) INCLUDE (RangeFrom);

              SET IDENTITY_INSERT MyTable3 ON

              INSERT INTO MyTable3
              (Id,
              RangeFrom,
              RangeTo)
              SELECT Id,
              RangeFrom,
              RangeTo
              FROM MyTable

              SET IDENTITY_INSERT MyTable3 OFF


              And then used the following query (the article is looking for intersecting intervals so finding an interval containing a point is a degenerate case of this)



              DECLARE @value INT = 50000000;

              ;WITH N AS
              (
              SELECT 30 AS Level,
              CASE WHEN @value > POWER(2,30) THEN POWER(2,30) END AS selected_left_node,
              CASE WHEN @value < POWER(2,30) THEN POWER(2,30) END AS selected_right_node,
              (SIGN(@value - POWER(2,30)) * POWER(2,29)) + POWER(2,30) AS node
              UNION ALL
              SELECT N.Level-1,
              CASE WHEN @value > node THEN node END AS selected_left_node,
              CASE WHEN @value < node THEN node END AS selected_right_node,
              (SIGN(@value - node) * POWER(2,N.Level-2)) + node AS node
              FROM N
              WHERE N.Level > 0
              )
              SELECT I.id, I.RangeFrom, I.RangeTo
              FROM dbo.MyTable3 AS I
              JOIN N AS L
              ON I.node = L.selected_left_node
              AND I.RangeTo >= @value
              AND L.selected_left_node IS NOT NULL
              UNION ALL
              SELECT I.id, I.RangeFrom, I.RangeTo
              FROM dbo.MyTable3 AS I
              JOIN N AS R
              ON I.node = R.selected_right_node
              AND I.RangeFrom <= @value
              AND R.selected_right_node IS NOT NULL
              UNION ALL
              SELECT I.id, I.RangeFrom, I.RangeTo
              FROM dbo.MyTable3 AS I
              WHERE node = @value;


              This typically executes in 1mson my machine when all pages are in cache - with IO stats.



              Table 'MyTable3'. Scan count 24, logical reads 72, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
              Table 'Worktable'. Scan count 4, logical reads 374, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


              and plan



              enter image description here



              NB: The source uses multistatement TVFs rather than a recursive CTE to get the nodes to join on but in the interests of making my answer self contained I have opted for the latter. For production use I'd probably use the TVFs.






              share|improve this answer















              Paul White pointed out an answer to a similar question containing a link to an interesting article by Itzik Ben Gan. This describes the "Static Relational Interval Tree" model that allows this to be done efficiently.



              In summary this approach involves storing a computed ("forknode") value based on the interval values in the row. When searching for ranges that intersect another range it is possible to precalculate the possible forknode values that matching rows must have and use this to find the results with a maximum of 31 seek operations (the below supports integers in the range 0 to the max signed 32 bit int)



              Based on this I restructured the table as below.



              CREATE TABLE dbo.MyTable3
              (
              Id INT IDENTITY PRIMARY KEY,
              RangeFrom INT NOT NULL,
              RangeTo INT NOT NULL,
              node AS RangeTo - RangeTo % POWER(2, FLOOR(LOG((RangeFrom - 1) ^ RangeTo, 2))) PERSISTED NOT NULL,
              CHECK (RangeTo > RangeFrom)
              );

              CREATE INDEX ix1 ON dbo.MyTable3 (node, RangeFrom) INCLUDE (RangeTo);
              CREATE INDEX ix2 ON dbo.MyTable3 (node, RangeTo) INCLUDE (RangeFrom);

              SET IDENTITY_INSERT MyTable3 ON

              INSERT INTO MyTable3
              (Id,
              RangeFrom,
              RangeTo)
              SELECT Id,
              RangeFrom,
              RangeTo
              FROM MyTable

              SET IDENTITY_INSERT MyTable3 OFF


              And then used the following query (the article is looking for intersecting intervals so finding an interval containing a point is a degenerate case of this)



              DECLARE @value INT = 50000000;

              ;WITH N AS
              (
              SELECT 30 AS Level,
              CASE WHEN @value > POWER(2,30) THEN POWER(2,30) END AS selected_left_node,
              CASE WHEN @value < POWER(2,30) THEN POWER(2,30) END AS selected_right_node,
              (SIGN(@value - POWER(2,30)) * POWER(2,29)) + POWER(2,30) AS node
              UNION ALL
              SELECT N.Level-1,
              CASE WHEN @value > node THEN node END AS selected_left_node,
              CASE WHEN @value < node THEN node END AS selected_right_node,
              (SIGN(@value - node) * POWER(2,N.Level-2)) + node AS node
              FROM N
              WHERE N.Level > 0
              )
              SELECT I.id, I.RangeFrom, I.RangeTo
              FROM dbo.MyTable3 AS I
              JOIN N AS L
              ON I.node = L.selected_left_node
              AND I.RangeTo >= @value
              AND L.selected_left_node IS NOT NULL
              UNION ALL
              SELECT I.id, I.RangeFrom, I.RangeTo
              FROM dbo.MyTable3 AS I
              JOIN N AS R
              ON I.node = R.selected_right_node
              AND I.RangeFrom <= @value
              AND R.selected_right_node IS NOT NULL
              UNION ALL
              SELECT I.id, I.RangeFrom, I.RangeTo
              FROM dbo.MyTable3 AS I
              WHERE node = @value;


              This typically executes in 1mson my machine when all pages are in cache - with IO stats.



              Table 'MyTable3'. Scan count 24, logical reads 72, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
              Table 'Worktable'. Scan count 4, logical reads 374, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


              and plan



              enter image description here



              NB: The source uses multistatement TVFs rather than a recursive CTE to get the nodes to join on but in the interests of making my answer self contained I have opted for the latter. For production use I'd probably use the TVFs.







              share|improve this answer














              share|improve this answer



              share|improve this answer








              edited Dec 30 '18 at 23:15

























              answered Dec 29 '18 at 15:29









              Martin SmithMartin Smith

              63.9k10172256




              63.9k10172256























                  9














                  I was able to find a row mode approach that is competitive with the N/CCI approach, but you need to know something about your data. Suppose that you had a column that contained the difference of RangeFrom and RangeTo and you indexed it along with RangeFrom:



                  ALTER TABLE dbo.MyTableWithDiff ADD DiffOfColumns AS RangeTo-RangeFrom;

                  CREATE INDEX IXDIFF ON dbo.MyTableWithDiff (DiffOfColumns,RangeFrom) INCLUDE (RangeTo);


                  If you knew all of the distinct values of DiffOfColumns then you could perform a seek for every value of DiffOfColumns with a range filter on RangeTo to get all of the relevant data. For example, if we know that DiffOfColumns = 2 then the only allowed values for RangeFrom are 49999998, 49999999 and 50000000. Recursion can be used to get all of the distinct values of DiffOfColumns and it works well for your data set because there are only 256 of them. The query below takes about 6 ms on my machine:



                  WITH RecursiveCTE
                  AS
                  (
                  -- Anchor
                  SELECT TOP (1)
                  DiffOfColumns
                  FROM dbo.MyTableWithDiff AS T
                  ORDER BY
                  T.DiffOfColumns

                  UNION ALL

                  -- Recursive
                  SELECT R.DiffOfColumns
                  FROM
                  (
                  -- Number the rows
                  SELECT
                  T.DiffOfColumns,
                  rn = ROW_NUMBER() OVER (
                  ORDER BY T.DiffOfColumns)
                  FROM dbo.MyTableWithDiff AS T
                  JOIN RecursiveCTE AS R
                  ON R.DiffOfColumns < T.DiffOfColumns
                  ) AS R
                  WHERE
                  -- Only the row that sorts lowest
                  R.rn = 1
                  )
                  SELECT ca.*
                  FROM RecursiveCTE rcte
                  CROSS APPLY (
                  SELECT mt.Id, mt.RangeFrom, mt.RangeTo
                  FROM dbo.MyTableWithDiff mt
                  WHERE mt.DiffOfColumns = rcte.DiffOfColumns
                  AND mt.RangeFrom >= 50000000 - rcte.DiffOfColumns AND mt.RangeFrom <= 50000000
                  ) ca
                  OPTION (MAXRECURSION 0);


                  You can see the usual recursive part along with the index seek for every distinct value:



                  query plan 1



                  The flaw with this approach is that it starts to get slow when there are too many distinct values for DiffOfColumns. Let's do the same test, but use CRYPT_GEN_RANDOM(2)instead of CRYPT_GEN_RANDOM(1).



                  DROP TABLE IF EXISTS dbo.MyTableBigDiff;

                  CREATE TABLE dbo.MyTableBigDiff
                  (
                  Id INT IDENTITY PRIMARY KEY,
                  RangeFrom INT NOT NULL,
                  RangeTo INT NOT NULL,
                  CHECK (RangeTo > RangeFrom)
                  );

                  WITH RandomNumbers
                  AS (SELECT TOP 10000000 ABS(CRYPT_GEN_RANDOM(4)%100000000) AS Num
                  FROM sys.all_objects o1,
                  sys.all_objects o2,
                  sys.all_objects o3,
                  sys.all_objects o4)
                  INSERT INTO dbo.MyTableBigDiff
                  (RangeFrom,
                  RangeTo)
                  SELECT Num,
                  Num + 1 + CRYPT_GEN_RANDOM(2) -- note the 2
                  FROM RandomNumbers;


                  ALTER TABLE dbo.MyTableBigDiff ADD DiffOfColumns AS RangeTo-RangeFrom;

                  CREATE INDEX IXDIFF ON dbo.MyTableBigDiff (DiffOfColumns,RangeFrom) INCLUDE (RangeTo);


                  The same query now finds 65536 rows from the recursive part and takes 823 ms of CPU on my machine. There are PAGELATCH_SH waits and other bad things going on. I can improve performance by bucketing the diff values to keep the number of unique values under control and adjusting for the bucketing in the CROSS APPLY. For this data set I'll try 256 buckets:



                  ALTER TABLE dbo.MyTableBigDiff ADD DiffOfColumns_bucket256 AS CAST(CEILING((RangeTo-RangeFrom) / 256.) AS INT);

                  CREATE INDEX [IXDIFF😎] ON dbo.MyTableBigDiff (DiffOfColumns_bucket256, RangeFrom) INCLUDE (RangeTo);


                  One way to avoid getting extra rows (now I'm comparing to a rounded value instead of the true value) is by filtering on RangeTo:



                  CROSS APPLY (
                  SELECT mt.Id, mt.RangeFrom, mt.RangeTo
                  FROM dbo.MyTableBigDiff mt
                  WHERE mt.DiffOfColumns_bucket256 = rcte.DiffOfColumns_bucket256
                  AND mt.RangeFrom >= 50000000 - (256 * rcte.DiffOfColumns_bucket256)
                  AND mt.RangeFrom <= 50000000
                  AND mt.RangeTo >= 50000000
                  ) ca


                  The full query now takes 6 ms on my machine.






                  share|improve this answer






























                    9














                    I was able to find a row mode approach that is competitive with the N/CCI approach, but you need to know something about your data. Suppose that you had a column that contained the difference of RangeFrom and RangeTo and you indexed it along with RangeFrom:



                    ALTER TABLE dbo.MyTableWithDiff ADD DiffOfColumns AS RangeTo-RangeFrom;

                    CREATE INDEX IXDIFF ON dbo.MyTableWithDiff (DiffOfColumns,RangeFrom) INCLUDE (RangeTo);


                    If you knew all of the distinct values of DiffOfColumns then you could perform a seek for every value of DiffOfColumns with a range filter on RangeTo to get all of the relevant data. For example, if we know that DiffOfColumns = 2 then the only allowed values for RangeFrom are 49999998, 49999999 and 50000000. Recursion can be used to get all of the distinct values of DiffOfColumns and it works well for your data set because there are only 256 of them. The query below takes about 6 ms on my machine:



                    WITH RecursiveCTE
                    AS
                    (
                    -- Anchor
                    SELECT TOP (1)
                    DiffOfColumns
                    FROM dbo.MyTableWithDiff AS T
                    ORDER BY
                    T.DiffOfColumns

                    UNION ALL

                    -- Recursive
                    SELECT R.DiffOfColumns
                    FROM
                    (
                    -- Number the rows
                    SELECT
                    T.DiffOfColumns,
                    rn = ROW_NUMBER() OVER (
                    ORDER BY T.DiffOfColumns)
                    FROM dbo.MyTableWithDiff AS T
                    JOIN RecursiveCTE AS R
                    ON R.DiffOfColumns < T.DiffOfColumns
                    ) AS R
                    WHERE
                    -- Only the row that sorts lowest
                    R.rn = 1
                    )
                    SELECT ca.*
                    FROM RecursiveCTE rcte
                    CROSS APPLY (
                    SELECT mt.Id, mt.RangeFrom, mt.RangeTo
                    FROM dbo.MyTableWithDiff mt
                    WHERE mt.DiffOfColumns = rcte.DiffOfColumns
                    AND mt.RangeFrom >= 50000000 - rcte.DiffOfColumns AND mt.RangeFrom <= 50000000
                    ) ca
                    OPTION (MAXRECURSION 0);


                    You can see the usual recursive part along with the index seek for every distinct value:



                    query plan 1



                    The flaw with this approach is that it starts to get slow when there are too many distinct values for DiffOfColumns. Let's do the same test, but use CRYPT_GEN_RANDOM(2)instead of CRYPT_GEN_RANDOM(1).



                    DROP TABLE IF EXISTS dbo.MyTableBigDiff;

                    CREATE TABLE dbo.MyTableBigDiff
                    (
                    Id INT IDENTITY PRIMARY KEY,
                    RangeFrom INT NOT NULL,
                    RangeTo INT NOT NULL,
                    CHECK (RangeTo > RangeFrom)
                    );

                    WITH RandomNumbers
                    AS (SELECT TOP 10000000 ABS(CRYPT_GEN_RANDOM(4)%100000000) AS Num
                    FROM sys.all_objects o1,
                    sys.all_objects o2,
                    sys.all_objects o3,
                    sys.all_objects o4)
                    INSERT INTO dbo.MyTableBigDiff
                    (RangeFrom,
                    RangeTo)
                    SELECT Num,
                    Num + 1 + CRYPT_GEN_RANDOM(2) -- note the 2
                    FROM RandomNumbers;


                    ALTER TABLE dbo.MyTableBigDiff ADD DiffOfColumns AS RangeTo-RangeFrom;

                    CREATE INDEX IXDIFF ON dbo.MyTableBigDiff (DiffOfColumns,RangeFrom) INCLUDE (RangeTo);


                    The same query now finds 65536 rows from the recursive part and takes 823 ms of CPU on my machine. There are PAGELATCH_SH waits and other bad things going on. I can improve performance by bucketing the diff values to keep the number of unique values under control and adjusting for the bucketing in the CROSS APPLY. For this data set I'll try 256 buckets:



                    ALTER TABLE dbo.MyTableBigDiff ADD DiffOfColumns_bucket256 AS CAST(CEILING((RangeTo-RangeFrom) / 256.) AS INT);

                    CREATE INDEX [IXDIFF😎] ON dbo.MyTableBigDiff (DiffOfColumns_bucket256, RangeFrom) INCLUDE (RangeTo);


                    One way to avoid getting extra rows (now I'm comparing to a rounded value instead of the true value) is by filtering on RangeTo:



                    CROSS APPLY (
                    SELECT mt.Id, mt.RangeFrom, mt.RangeTo
                    FROM dbo.MyTableBigDiff mt
                    WHERE mt.DiffOfColumns_bucket256 = rcte.DiffOfColumns_bucket256
                    AND mt.RangeFrom >= 50000000 - (256 * rcte.DiffOfColumns_bucket256)
                    AND mt.RangeFrom <= 50000000
                    AND mt.RangeTo >= 50000000
                    ) ca


                    The full query now takes 6 ms on my machine.






                    share|improve this answer




























                      9












                      9








                      9







                      I was able to find a row mode approach that is competitive with the N/CCI approach, but you need to know something about your data. Suppose that you had a column that contained the difference of RangeFrom and RangeTo and you indexed it along with RangeFrom:



                      ALTER TABLE dbo.MyTableWithDiff ADD DiffOfColumns AS RangeTo-RangeFrom;

                      CREATE INDEX IXDIFF ON dbo.MyTableWithDiff (DiffOfColumns,RangeFrom) INCLUDE (RangeTo);


                      If you knew all of the distinct values of DiffOfColumns then you could perform a seek for every value of DiffOfColumns with a range filter on RangeTo to get all of the relevant data. For example, if we know that DiffOfColumns = 2 then the only allowed values for RangeFrom are 49999998, 49999999 and 50000000. Recursion can be used to get all of the distinct values of DiffOfColumns and it works well for your data set because there are only 256 of them. The query below takes about 6 ms on my machine:



                      WITH RecursiveCTE
                      AS
                      (
                      -- Anchor
                      SELECT TOP (1)
                      DiffOfColumns
                      FROM dbo.MyTableWithDiff AS T
                      ORDER BY
                      T.DiffOfColumns

                      UNION ALL

                      -- Recursive
                      SELECT R.DiffOfColumns
                      FROM
                      (
                      -- Number the rows
                      SELECT
                      T.DiffOfColumns,
                      rn = ROW_NUMBER() OVER (
                      ORDER BY T.DiffOfColumns)
                      FROM dbo.MyTableWithDiff AS T
                      JOIN RecursiveCTE AS R
                      ON R.DiffOfColumns < T.DiffOfColumns
                      ) AS R
                      WHERE
                      -- Only the row that sorts lowest
                      R.rn = 1
                      )
                      SELECT ca.*
                      FROM RecursiveCTE rcte
                      CROSS APPLY (
                      SELECT mt.Id, mt.RangeFrom, mt.RangeTo
                      FROM dbo.MyTableWithDiff mt
                      WHERE mt.DiffOfColumns = rcte.DiffOfColumns
                      AND mt.RangeFrom >= 50000000 - rcte.DiffOfColumns AND mt.RangeFrom <= 50000000
                      ) ca
                      OPTION (MAXRECURSION 0);


                      You can see the usual recursive part along with the index seek for every distinct value:



                      query plan 1



                      The flaw with this approach is that it starts to get slow when there are too many distinct values for DiffOfColumns. Let's do the same test, but use CRYPT_GEN_RANDOM(2)instead of CRYPT_GEN_RANDOM(1).



                      DROP TABLE IF EXISTS dbo.MyTableBigDiff;

                      CREATE TABLE dbo.MyTableBigDiff
                      (
                      Id INT IDENTITY PRIMARY KEY,
                      RangeFrom INT NOT NULL,
                      RangeTo INT NOT NULL,
                      CHECK (RangeTo > RangeFrom)
                      );

                      WITH RandomNumbers
                      AS (SELECT TOP 10000000 ABS(CRYPT_GEN_RANDOM(4)%100000000) AS Num
                      FROM sys.all_objects o1,
                      sys.all_objects o2,
                      sys.all_objects o3,
                      sys.all_objects o4)
                      INSERT INTO dbo.MyTableBigDiff
                      (RangeFrom,
                      RangeTo)
                      SELECT Num,
                      Num + 1 + CRYPT_GEN_RANDOM(2) -- note the 2
                      FROM RandomNumbers;


                      ALTER TABLE dbo.MyTableBigDiff ADD DiffOfColumns AS RangeTo-RangeFrom;

                      CREATE INDEX IXDIFF ON dbo.MyTableBigDiff (DiffOfColumns,RangeFrom) INCLUDE (RangeTo);


                      The same query now finds 65536 rows from the recursive part and takes 823 ms of CPU on my machine. There are PAGELATCH_SH waits and other bad things going on. I can improve performance by bucketing the diff values to keep the number of unique values under control and adjusting for the bucketing in the CROSS APPLY. For this data set I'll try 256 buckets:



                      ALTER TABLE dbo.MyTableBigDiff ADD DiffOfColumns_bucket256 AS CAST(CEILING((RangeTo-RangeFrom) / 256.) AS INT);

                      CREATE INDEX [IXDIFF😎] ON dbo.MyTableBigDiff (DiffOfColumns_bucket256, RangeFrom) INCLUDE (RangeTo);


                      One way to avoid getting extra rows (now I'm comparing to a rounded value instead of the true value) is by filtering on RangeTo:



                      CROSS APPLY (
                      SELECT mt.Id, mt.RangeFrom, mt.RangeTo
                      FROM dbo.MyTableBigDiff mt
                      WHERE mt.DiffOfColumns_bucket256 = rcte.DiffOfColumns_bucket256
                      AND mt.RangeFrom >= 50000000 - (256 * rcte.DiffOfColumns_bucket256)
                      AND mt.RangeFrom <= 50000000
                      AND mt.RangeTo >= 50000000
                      ) ca


                      The full query now takes 6 ms on my machine.






                      share|improve this answer















                      I was able to find a row mode approach that is competitive with the N/CCI approach, but you need to know something about your data. Suppose that you had a column that contained the difference of RangeFrom and RangeTo and you indexed it along with RangeFrom:



                      ALTER TABLE dbo.MyTableWithDiff ADD DiffOfColumns AS RangeTo-RangeFrom;

                      CREATE INDEX IXDIFF ON dbo.MyTableWithDiff (DiffOfColumns,RangeFrom) INCLUDE (RangeTo);


                      If you knew all of the distinct values of DiffOfColumns then you could perform a seek for every value of DiffOfColumns with a range filter on RangeTo to get all of the relevant data. For example, if we know that DiffOfColumns = 2 then the only allowed values for RangeFrom are 49999998, 49999999 and 50000000. Recursion can be used to get all of the distinct values of DiffOfColumns and it works well for your data set because there are only 256 of them. The query below takes about 6 ms on my machine:



                      WITH RecursiveCTE
                      AS
                      (
                      -- Anchor
                      SELECT TOP (1)
                      DiffOfColumns
                      FROM dbo.MyTableWithDiff AS T
                      ORDER BY
                      T.DiffOfColumns

                      UNION ALL

                      -- Recursive
                      SELECT R.DiffOfColumns
                      FROM
                      (
                      -- Number the rows
                      SELECT
                      T.DiffOfColumns,
                      rn = ROW_NUMBER() OVER (
                      ORDER BY T.DiffOfColumns)
                      FROM dbo.MyTableWithDiff AS T
                      JOIN RecursiveCTE AS R
                      ON R.DiffOfColumns < T.DiffOfColumns
                      ) AS R
                      WHERE
                      -- Only the row that sorts lowest
                      R.rn = 1
                      )
                      SELECT ca.*
                      FROM RecursiveCTE rcte
                      CROSS APPLY (
                      SELECT mt.Id, mt.RangeFrom, mt.RangeTo
                      FROM dbo.MyTableWithDiff mt
                      WHERE mt.DiffOfColumns = rcte.DiffOfColumns
                      AND mt.RangeFrom >= 50000000 - rcte.DiffOfColumns AND mt.RangeFrom <= 50000000
                      ) ca
                      OPTION (MAXRECURSION 0);


                      You can see the usual recursive part along with the index seek for every distinct value:



                      query plan 1



                      The flaw with this approach is that it starts to get slow when there are too many distinct values for DiffOfColumns. Let's do the same test, but use CRYPT_GEN_RANDOM(2)instead of CRYPT_GEN_RANDOM(1).



                      DROP TABLE IF EXISTS dbo.MyTableBigDiff;

                      CREATE TABLE dbo.MyTableBigDiff
                      (
                      Id INT IDENTITY PRIMARY KEY,
                      RangeFrom INT NOT NULL,
                      RangeTo INT NOT NULL,
                      CHECK (RangeTo > RangeFrom)
                      );

                      WITH RandomNumbers
                      AS (SELECT TOP 10000000 ABS(CRYPT_GEN_RANDOM(4)%100000000) AS Num
                      FROM sys.all_objects o1,
                      sys.all_objects o2,
                      sys.all_objects o3,
                      sys.all_objects o4)
                      INSERT INTO dbo.MyTableBigDiff
                      (RangeFrom,
                      RangeTo)
                      SELECT Num,
                      Num + 1 + CRYPT_GEN_RANDOM(2) -- note the 2
                      FROM RandomNumbers;


                      ALTER TABLE dbo.MyTableBigDiff ADD DiffOfColumns AS RangeTo-RangeFrom;

                      CREATE INDEX IXDIFF ON dbo.MyTableBigDiff (DiffOfColumns,RangeFrom) INCLUDE (RangeTo);


                      The same query now finds 65536 rows from the recursive part and takes 823 ms of CPU on my machine. There are PAGELATCH_SH waits and other bad things going on. I can improve performance by bucketing the diff values to keep the number of unique values under control and adjusting for the bucketing in the CROSS APPLY. For this data set I'll try 256 buckets:



                      ALTER TABLE dbo.MyTableBigDiff ADD DiffOfColumns_bucket256 AS CAST(CEILING((RangeTo-RangeFrom) / 256.) AS INT);

                      CREATE INDEX [IXDIFF😎] ON dbo.MyTableBigDiff (DiffOfColumns_bucket256, RangeFrom) INCLUDE (RangeTo);


                      One way to avoid getting extra rows (now I'm comparing to a rounded value instead of the true value) is by filtering on RangeTo:



                      CROSS APPLY (
                      SELECT mt.Id, mt.RangeFrom, mt.RangeTo
                      FROM dbo.MyTableBigDiff mt
                      WHERE mt.DiffOfColumns_bucket256 = rcte.DiffOfColumns_bucket256
                      AND mt.RangeFrom >= 50000000 - (256 * rcte.DiffOfColumns_bucket256)
                      AND mt.RangeFrom <= 50000000
                      AND mt.RangeTo >= 50000000
                      ) ca


                      The full query now takes 6 ms on my machine.







                      share|improve this answer














                      share|improve this answer



                      share|improve this answer








                      edited Jan 2 at 12:17









                      Michael Green

                      15k83162




                      15k83162










                      answered Dec 29 '18 at 3:02









                      Joe ObbishJoe Obbish

                      21.3k43187




                      21.3k43187























                          8














                          One alternative way of representing a range would be as points on a line.



                          The below migrates all the data into a new table with the range represented as a geometry datatype.



                          CREATE TABLE MyTable2
                          (
                          Id INT IDENTITY PRIMARY KEY,
                          Range GEOMETRY NOT NULL,
                          RangeFrom AS Range.STPointN(1).STX,
                          RangeTo AS Range.STPointN(2).STX,
                          CHECK (Range.STNumPoints() = 2 AND Range.STPointN(1).STY = 0 AND Range.STPointN(2).STY = 0)
                          );

                          SET IDENTITY_INSERT MyTable2 ON

                          INSERT INTO MyTable2
                          (Id,
                          Range)
                          SELECT ID,
                          geometry::STLineFromText(CONCAT('LINESTRING(', RangeFrom, ' 0, ', RangeTo, ' 0)'), 0)
                          FROM MyTable

                          SET IDENTITY_INSERT MyTable2 OFF


                          CREATE SPATIAL INDEX index_name
                          ON MyTable2 ( Range )
                          USING GEOMETRY_GRID
                          WITH (
                          BOUNDING_BOX = ( xmin=0, ymin=0, xmax=110000000, ymax=1 ),
                          GRIDS = (HIGH, HIGH, HIGH, HIGH),
                          CELLS_PER_OBJECT = 16);


                          The equivalent query to find ranges containing the value 50,000,000 is below.



                          SELECT Id,
                          RangeFrom,
                          RangeTo
                          FROM MyTable2
                          WHERE Range.STContains(geometry::STPointFromText ('POINT (50000000 0)', 0)) = 1


                          The reads for this show an improvement on the 10,951 from the original query.



                          Table 'MyTable2'. Scan count 0, logical reads 505, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
                          Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
                          Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
                          Table 'extended_index_1797581442_384000'. Scan count 4, logical reads 17, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


                          However there is no significant improvement over the original in terms of time elapsed. Typical execution results are 250 ms vs 252 ms.



                          The execution plan is more complex as below



                          enter image description here



                          The only case where the rewrite reliably performs better for me is with a cold cache.



                          So disappointing in this case and difficult to recommend this rewrite but publication of negative results can also be useful.






                          share|improve this answer




























                            8














                            One alternative way of representing a range would be as points on a line.



                            The below migrates all the data into a new table with the range represented as a geometry datatype.



                            CREATE TABLE MyTable2
                            (
                            Id INT IDENTITY PRIMARY KEY,
                            Range GEOMETRY NOT NULL,
                            RangeFrom AS Range.STPointN(1).STX,
                            RangeTo AS Range.STPointN(2).STX,
                            CHECK (Range.STNumPoints() = 2 AND Range.STPointN(1).STY = 0 AND Range.STPointN(2).STY = 0)
                            );

                            SET IDENTITY_INSERT MyTable2 ON

                            INSERT INTO MyTable2
                            (Id,
                            Range)
                            SELECT ID,
                            geometry::STLineFromText(CONCAT('LINESTRING(', RangeFrom, ' 0, ', RangeTo, ' 0)'), 0)
                            FROM MyTable

                            SET IDENTITY_INSERT MyTable2 OFF


                            CREATE SPATIAL INDEX index_name
                            ON MyTable2 ( Range )
                            USING GEOMETRY_GRID
                            WITH (
                            BOUNDING_BOX = ( xmin=0, ymin=0, xmax=110000000, ymax=1 ),
                            GRIDS = (HIGH, HIGH, HIGH, HIGH),
                            CELLS_PER_OBJECT = 16);


                            The equivalent query to find ranges containing the value 50,000,000 is below.



                            SELECT Id,
                            RangeFrom,
                            RangeTo
                            FROM MyTable2
                            WHERE Range.STContains(geometry::STPointFromText ('POINT (50000000 0)', 0)) = 1


                            The reads for this show an improvement on the 10,951 from the original query.



                            Table 'MyTable2'. Scan count 0, logical reads 505, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
                            Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
                            Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
                            Table 'extended_index_1797581442_384000'. Scan count 4, logical reads 17, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


                            However there is no significant improvement over the original in terms of time elapsed. Typical execution results are 250 ms vs 252 ms.



                            The execution plan is more complex as below



                            enter image description here



                            The only case where the rewrite reliably performs better for me is with a cold cache.



                            So disappointing in this case and difficult to recommend this rewrite but publication of negative results can also be useful.






                            share|improve this answer


























                              8












                              8








                              8







                              One alternative way of representing a range would be as points on a line.



                              The below migrates all the data into a new table with the range represented as a geometry datatype.



                              CREATE TABLE MyTable2
                              (
                              Id INT IDENTITY PRIMARY KEY,
                              Range GEOMETRY NOT NULL,
                              RangeFrom AS Range.STPointN(1).STX,
                              RangeTo AS Range.STPointN(2).STX,
                              CHECK (Range.STNumPoints() = 2 AND Range.STPointN(1).STY = 0 AND Range.STPointN(2).STY = 0)
                              );

                              SET IDENTITY_INSERT MyTable2 ON

                              INSERT INTO MyTable2
                              (Id,
                              Range)
                              SELECT ID,
                              geometry::STLineFromText(CONCAT('LINESTRING(', RangeFrom, ' 0, ', RangeTo, ' 0)'), 0)
                              FROM MyTable

                              SET IDENTITY_INSERT MyTable2 OFF


                              CREATE SPATIAL INDEX index_name
                              ON MyTable2 ( Range )
                              USING GEOMETRY_GRID
                              WITH (
                              BOUNDING_BOX = ( xmin=0, ymin=0, xmax=110000000, ymax=1 ),
                              GRIDS = (HIGH, HIGH, HIGH, HIGH),
                              CELLS_PER_OBJECT = 16);


                              The equivalent query to find ranges containing the value 50,000,000 is below.



                              SELECT Id,
                              RangeFrom,
                              RangeTo
                              FROM MyTable2
                              WHERE Range.STContains(geometry::STPointFromText ('POINT (50000000 0)', 0)) = 1


                              The reads for this show an improvement on the 10,951 from the original query.



                              Table 'MyTable2'. Scan count 0, logical reads 505, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
                              Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
                              Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
                              Table 'extended_index_1797581442_384000'. Scan count 4, logical reads 17, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


                              However there is no significant improvement over the original in terms of time elapsed. Typical execution results are 250 ms vs 252 ms.



                              The execution plan is more complex as below



                              enter image description here



                              The only case where the rewrite reliably performs better for me is with a cold cache.



                              So disappointing in this case and difficult to recommend this rewrite but publication of negative results can also be useful.






                              share|improve this answer













                              One alternative way of representing a range would be as points on a line.



                              The below migrates all the data into a new table with the range represented as a geometry datatype.



                              CREATE TABLE MyTable2
                              (
                              Id INT IDENTITY PRIMARY KEY,
                              Range GEOMETRY NOT NULL,
                              RangeFrom AS Range.STPointN(1).STX,
                              RangeTo AS Range.STPointN(2).STX,
                              CHECK (Range.STNumPoints() = 2 AND Range.STPointN(1).STY = 0 AND Range.STPointN(2).STY = 0)
                              );

                              SET IDENTITY_INSERT MyTable2 ON

                              INSERT INTO MyTable2
                              (Id,
                              Range)
                              SELECT ID,
                              geometry::STLineFromText(CONCAT('LINESTRING(', RangeFrom, ' 0, ', RangeTo, ' 0)'), 0)
                              FROM MyTable

                              SET IDENTITY_INSERT MyTable2 OFF


                              CREATE SPATIAL INDEX index_name
                              ON MyTable2 ( Range )
                              USING GEOMETRY_GRID
                              WITH (
                              BOUNDING_BOX = ( xmin=0, ymin=0, xmax=110000000, ymax=1 ),
                              GRIDS = (HIGH, HIGH, HIGH, HIGH),
                              CELLS_PER_OBJECT = 16);


                              The equivalent query to find ranges containing the value 50,000,000 is below.



                              SELECT Id,
                              RangeFrom,
                              RangeTo
                              FROM MyTable2
                              WHERE Range.STContains(geometry::STPointFromText ('POINT (50000000 0)', 0)) = 1


                              The reads for this show an improvement on the 10,951 from the original query.



                              Table 'MyTable2'. Scan count 0, logical reads 505, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
                              Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
                              Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
                              Table 'extended_index_1797581442_384000'. Scan count 4, logical reads 17, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


                              However there is no significant improvement over the original in terms of time elapsed. Typical execution results are 250 ms vs 252 ms.



                              The execution plan is more complex as below



                              enter image description here



                              The only case where the rewrite reliably performs better for me is with a cold cache.



                              So disappointing in this case and difficult to recommend this rewrite but publication of negative results can also be useful.







                              share|improve this answer












                              share|improve this answer



                              share|improve this answer










                              answered Dec 28 '18 at 23:27









                              Martin SmithMartin Smith

                              63.9k10172256




                              63.9k10172256























                                  5














                                  As an homage to our new robot overlords, I decided to see if any of the new-ish R and Python functionality could help us out here. The answer is no, at least for the scripts that I could get working and returning correct results. If anyone with better knowledge comes along, well, feel free to spank me. My rates are reasonable.



                                  To do this, I set up a VM with 4 cores and 16 GB RAM, thinking this would be enough to deal with a ~200MB data set.



                                  Let's start with the language that doesn't exist in Boston!



                                  R



                                  EXEC sp_execute_external_script 
                                  @language = N'R',
                                  @script = N'
                                  tweener = 50000000
                                  MO = data.frame(MartinIn)
                                  MartinOut <- subset(MO, RangeFrom <= tweener & RangeTo >= tweener, select = c("Id","RangeFrom","RangeTo"))
                                  ',
                                  @input_data_1_name = N'MartinIn',
                                  @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable',
                                  @output_data_1_name = N'MartinOut',
                                  @parallel = 1
                                  WITH RESULT SETS ((ID INT, RangeFrom INT, RangeTo INT));


                                  This was a bad time.



                                  Table 'MyTable'. Scan count 1, logical reads 22400

                                  SQL Server Execution Times:
                                  CPU time = 3219 ms, elapsed time = 5349 ms.


                                  The execution plan is pretty uninteresting, though I don't know why the middle operator has to call us names.



                                  NUTS



                                  Next up, coding with crayons!



                                  Python



                                  EXEC sp_execute_external_script 
                                  @language = N'Python',
                                  @script = N'
                                  import pandas as pd
                                  MO = pd.DataFrame(MartinIn)
                                  tweener = 50000000
                                  MartinOut = MO[(MO.RangeFrom <= tweener) & (MO.RangeTo >= tweener)]
                                  ',
                                  @input_data_1_name = N'MartinIn',
                                  @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable',
                                  @output_data_1_name = N'MartinOut',
                                  @parallel = 1
                                  WITH RESULT SETS ((ID INT, RangeFrom INT, RangeTo INT));


                                  Just when you thought it couldn't get worse than R:



                                  Table 'MyTable'. Scan count 1, logical reads 22400

                                  SQL Server Execution Times:
                                  CPU time = 3797 ms, elapsed time = 10146 ms.


                                  Another foul-mouthed execution plan:



                                  NUTS



                                  Hmm and Hmmer



                                  So far, I'm not impressed. I can't wait to delete this VM.






                                  share|improve this answer



















                                  • 1





                                    You can pass in parameters too, eg DECLARE @input INT = 50000001; EXEC dbo.sp_execute_external_script @language = N'R', @script = N'OutputDataSet <- InputDataSet[which(x >= InputDataSet$RangeFrom & x <= InputDataSet$RangeTo) , ]', @parallel = 1, @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable;', @params = N'@x INT', @x = 50000001 WITH RESULT SETS ( ( Id INT NOT NULL, RangeFrom INT NOT NULL, RangeTo INT NOT NULL )); but yeah performance isn't great. I use R for stuff you can't do in SQL, say if you wanted to predict something.

                                    – wBob
                                    Dec 31 '18 at 3:47


















                                  5














                                  As an homage to our new robot overlords, I decided to see if any of the new-ish R and Python functionality could help us out here. The answer is no, at least for the scripts that I could get working and returning correct results. If anyone with better knowledge comes along, well, feel free to spank me. My rates are reasonable.



                                  To do this, I set up a VM with 4 cores and 16 GB RAM, thinking this would be enough to deal with a ~200MB data set.



                                  Let's start with the language that doesn't exist in Boston!



                                  R



                                  EXEC sp_execute_external_script 
                                  @language = N'R',
                                  @script = N'
                                  tweener = 50000000
                                  MO = data.frame(MartinIn)
                                  MartinOut <- subset(MO, RangeFrom <= tweener & RangeTo >= tweener, select = c("Id","RangeFrom","RangeTo"))
                                  ',
                                  @input_data_1_name = N'MartinIn',
                                  @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable',
                                  @output_data_1_name = N'MartinOut',
                                  @parallel = 1
                                  WITH RESULT SETS ((ID INT, RangeFrom INT, RangeTo INT));


                                  This was a bad time.



                                  Table 'MyTable'. Scan count 1, logical reads 22400

                                  SQL Server Execution Times:
                                  CPU time = 3219 ms, elapsed time = 5349 ms.


                                  The execution plan is pretty uninteresting, though I don't know why the middle operator has to call us names.



                                  NUTS



                                  Next up, coding with crayons!



                                  Python



                                  EXEC sp_execute_external_script 
                                  @language = N'Python',
                                  @script = N'
                                  import pandas as pd
                                  MO = pd.DataFrame(MartinIn)
                                  tweener = 50000000
                                  MartinOut = MO[(MO.RangeFrom <= tweener) & (MO.RangeTo >= tweener)]
                                  ',
                                  @input_data_1_name = N'MartinIn',
                                  @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable',
                                  @output_data_1_name = N'MartinOut',
                                  @parallel = 1
                                  WITH RESULT SETS ((ID INT, RangeFrom INT, RangeTo INT));


                                  Just when you thought it couldn't get worse than R:



                                  Table 'MyTable'. Scan count 1, logical reads 22400

                                  SQL Server Execution Times:
                                  CPU time = 3797 ms, elapsed time = 10146 ms.


                                  Another foul-mouthed execution plan:



                                  NUTS



                                  Hmm and Hmmer



                                  So far, I'm not impressed. I can't wait to delete this VM.






                                  share|improve this answer



















                                  • 1





                                    You can pass in parameters too, eg DECLARE @input INT = 50000001; EXEC dbo.sp_execute_external_script @language = N'R', @script = N'OutputDataSet <- InputDataSet[which(x >= InputDataSet$RangeFrom & x <= InputDataSet$RangeTo) , ]', @parallel = 1, @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable;', @params = N'@x INT', @x = 50000001 WITH RESULT SETS ( ( Id INT NOT NULL, RangeFrom INT NOT NULL, RangeTo INT NOT NULL )); but yeah performance isn't great. I use R for stuff you can't do in SQL, say if you wanted to predict something.

                                    – wBob
                                    Dec 31 '18 at 3:47
















                                  5












                                  5








                                  5







                                  As an homage to our new robot overlords, I decided to see if any of the new-ish R and Python functionality could help us out here. The answer is no, at least for the scripts that I could get working and returning correct results. If anyone with better knowledge comes along, well, feel free to spank me. My rates are reasonable.



                                  To do this, I set up a VM with 4 cores and 16 GB RAM, thinking this would be enough to deal with a ~200MB data set.



                                  Let's start with the language that doesn't exist in Boston!



                                  R



                                  EXEC sp_execute_external_script 
                                  @language = N'R',
                                  @script = N'
                                  tweener = 50000000
                                  MO = data.frame(MartinIn)
                                  MartinOut <- subset(MO, RangeFrom <= tweener & RangeTo >= tweener, select = c("Id","RangeFrom","RangeTo"))
                                  ',
                                  @input_data_1_name = N'MartinIn',
                                  @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable',
                                  @output_data_1_name = N'MartinOut',
                                  @parallel = 1
                                  WITH RESULT SETS ((ID INT, RangeFrom INT, RangeTo INT));


                                  This was a bad time.



                                  Table 'MyTable'. Scan count 1, logical reads 22400

                                  SQL Server Execution Times:
                                  CPU time = 3219 ms, elapsed time = 5349 ms.


                                  The execution plan is pretty uninteresting, though I don't know why the middle operator has to call us names.



                                  NUTS



                                  Next up, coding with crayons!



                                  Python



                                  EXEC sp_execute_external_script 
                                  @language = N'Python',
                                  @script = N'
                                  import pandas as pd
                                  MO = pd.DataFrame(MartinIn)
                                  tweener = 50000000
                                  MartinOut = MO[(MO.RangeFrom <= tweener) & (MO.RangeTo >= tweener)]
                                  ',
                                  @input_data_1_name = N'MartinIn',
                                  @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable',
                                  @output_data_1_name = N'MartinOut',
                                  @parallel = 1
                                  WITH RESULT SETS ((ID INT, RangeFrom INT, RangeTo INT));


                                  Just when you thought it couldn't get worse than R:



                                  Table 'MyTable'. Scan count 1, logical reads 22400

                                  SQL Server Execution Times:
                                  CPU time = 3797 ms, elapsed time = 10146 ms.


                                  Another foul-mouthed execution plan:



                                  NUTS



                                  Hmm and Hmmer



                                  So far, I'm not impressed. I can't wait to delete this VM.






                                  share|improve this answer













                                  As an homage to our new robot overlords, I decided to see if any of the new-ish R and Python functionality could help us out here. The answer is no, at least for the scripts that I could get working and returning correct results. If anyone with better knowledge comes along, well, feel free to spank me. My rates are reasonable.



                                  To do this, I set up a VM with 4 cores and 16 GB RAM, thinking this would be enough to deal with a ~200MB data set.



                                  Let's start with the language that doesn't exist in Boston!



                                  R



                                  EXEC sp_execute_external_script 
                                  @language = N'R',
                                  @script = N'
                                  tweener = 50000000
                                  MO = data.frame(MartinIn)
                                  MartinOut <- subset(MO, RangeFrom <= tweener & RangeTo >= tweener, select = c("Id","RangeFrom","RangeTo"))
                                  ',
                                  @input_data_1_name = N'MartinIn',
                                  @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable',
                                  @output_data_1_name = N'MartinOut',
                                  @parallel = 1
                                  WITH RESULT SETS ((ID INT, RangeFrom INT, RangeTo INT));


                                  This was a bad time.



                                  Table 'MyTable'. Scan count 1, logical reads 22400

                                  SQL Server Execution Times:
                                  CPU time = 3219 ms, elapsed time = 5349 ms.


                                  The execution plan is pretty uninteresting, though I don't know why the middle operator has to call us names.



                                  NUTS



                                  Next up, coding with crayons!



                                  Python



                                  EXEC sp_execute_external_script 
                                  @language = N'Python',
                                  @script = N'
                                  import pandas as pd
                                  MO = pd.DataFrame(MartinIn)
                                  tweener = 50000000
                                  MartinOut = MO[(MO.RangeFrom <= tweener) & (MO.RangeTo >= tweener)]
                                  ',
                                  @input_data_1_name = N'MartinIn',
                                  @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable',
                                  @output_data_1_name = N'MartinOut',
                                  @parallel = 1
                                  WITH RESULT SETS ((ID INT, RangeFrom INT, RangeTo INT));


                                  Just when you thought it couldn't get worse than R:



                                  Table 'MyTable'. Scan count 1, logical reads 22400

                                  SQL Server Execution Times:
                                  CPU time = 3797 ms, elapsed time = 10146 ms.


                                  Another foul-mouthed execution plan:



                                  NUTS



                                  Hmm and Hmmer



                                  So far, I'm not impressed. I can't wait to delete this VM.







                                  share|improve this answer












                                  share|improve this answer



                                  share|improve this answer










                                  answered Dec 30 '18 at 17:22









                                  Erik DarlingErik Darling

                                  21.6k1267108




                                  21.6k1267108








                                  • 1





                                    You can pass in parameters too, eg DECLARE @input INT = 50000001; EXEC dbo.sp_execute_external_script @language = N'R', @script = N'OutputDataSet <- InputDataSet[which(x >= InputDataSet$RangeFrom & x <= InputDataSet$RangeTo) , ]', @parallel = 1, @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable;', @params = N'@x INT', @x = 50000001 WITH RESULT SETS ( ( Id INT NOT NULL, RangeFrom INT NOT NULL, RangeTo INT NOT NULL )); but yeah performance isn't great. I use R for stuff you can't do in SQL, say if you wanted to predict something.

                                    – wBob
                                    Dec 31 '18 at 3:47
















                                  • 1





                                    You can pass in parameters too, eg DECLARE @input INT = 50000001; EXEC dbo.sp_execute_external_script @language = N'R', @script = N'OutputDataSet <- InputDataSet[which(x >= InputDataSet$RangeFrom & x <= InputDataSet$RangeTo) , ]', @parallel = 1, @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable;', @params = N'@x INT', @x = 50000001 WITH RESULT SETS ( ( Id INT NOT NULL, RangeFrom INT NOT NULL, RangeTo INT NOT NULL )); but yeah performance isn't great. I use R for stuff you can't do in SQL, say if you wanted to predict something.

                                    – wBob
                                    Dec 31 '18 at 3:47










                                  1




                                  1





                                  You can pass in parameters too, eg DECLARE @input INT = 50000001; EXEC dbo.sp_execute_external_script @language = N'R', @script = N'OutputDataSet <- InputDataSet[which(x >= InputDataSet$RangeFrom & x <= InputDataSet$RangeTo) , ]', @parallel = 1, @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable;', @params = N'@x INT', @x = 50000001 WITH RESULT SETS ( ( Id INT NOT NULL, RangeFrom INT NOT NULL, RangeTo INT NOT NULL )); but yeah performance isn't great. I use R for stuff you can't do in SQL, say if you wanted to predict something.

                                  – wBob
                                  Dec 31 '18 at 3:47







                                  You can pass in parameters too, eg DECLARE @input INT = 50000001; EXEC dbo.sp_execute_external_script @language = N'R', @script = N'OutputDataSet <- InputDataSet[which(x >= InputDataSet$RangeFrom & x <= InputDataSet$RangeTo) , ]', @parallel = 1, @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable;', @params = N'@x INT', @x = 50000001 WITH RESULT SETS ( ( Id INT NOT NULL, RangeFrom INT NOT NULL, RangeTo INT NOT NULL )); but yeah performance isn't great. I use R for stuff you can't do in SQL, say if you wanted to predict something.

                                  – wBob
                                  Dec 31 '18 at 3:47













                                  4














                                  I found a pretty good solution using a computed column, however it's only good for a single value. That being said, if you've got a magic value, maybe it's enough.



                                  Starting with your given sample, then modifying the table:



                                  ALTER TABLE dbo.MyTable
                                  ADD curtis_jackson
                                  AS CONVERT(BIT, CASE
                                  WHEN RangeTo >= 50000000
                                  AND RangeFrom < 50000000
                                  THEN 1
                                  ELSE 0
                                  END);

                                  CREATE INDEX IX1_redo
                                  ON dbo.MyTable (curtis_jackson)
                                  INCLUDE (RangeFrom, RangeTo);


                                  The query simply becomes:



                                  SELECT *
                                  FROM MyTable
                                  WHERE curtis_jackson = 1;


                                  Which returns the same results as your starting query. With execution plans turned off, here are the stats (truncated for brevity):



                                  Table 'MyTable'. Scan count 1, logical reads 3...

                                  SQL Server Execution Times:
                                  CPU time = 0 ms, elapsed time = 0 ms.


                                  And here's the query plan:



                                  NUTS






                                  share|improve this answer
























                                  • Can you not overcome the imitation of computed column/filtered index with an index on WHERE (50000000 BETWEEN RangeFrom AND RangeTo) INCLUDE (..)?

                                    – ypercubeᵀᴹ
                                    Dec 30 '18 at 17:58






                                  • 3





                                    @yper-crazyhat-cubeᵀᴹ - yes. CREATE INDEX IX1_redo ON dbo.MyTable (curtis_jackson) INCLUDE (RangeFrom, RangeTo) WHERE RangeTo >= 50000000 AND RangeFrom <= 50000000 would work. And the query SELECT * FROM MyTable WHERE RangeTo >= 50000000 AND RangeFrom <= 50000000; uses it - so then not much need for poor Curtis

                                    – Martin Smith
                                    Dec 31 '18 at 9:19
















                                  4














                                  I found a pretty good solution using a computed column, however it's only good for a single value. That being said, if you've got a magic value, maybe it's enough.



                                  Starting with your given sample, then modifying the table:



                                  ALTER TABLE dbo.MyTable
                                  ADD curtis_jackson
                                  AS CONVERT(BIT, CASE
                                  WHEN RangeTo >= 50000000
                                  AND RangeFrom < 50000000
                                  THEN 1
                                  ELSE 0
                                  END);

                                  CREATE INDEX IX1_redo
                                  ON dbo.MyTable (curtis_jackson)
                                  INCLUDE (RangeFrom, RangeTo);


                                  The query simply becomes:



                                  SELECT *
                                  FROM MyTable
                                  WHERE curtis_jackson = 1;


                                  Which returns the same results as your starting query. With execution plans turned off, here are the stats (truncated for brevity):



                                  Table 'MyTable'. Scan count 1, logical reads 3...

                                  SQL Server Execution Times:
                                  CPU time = 0 ms, elapsed time = 0 ms.


                                  And here's the query plan:



                                  NUTS






                                  share|improve this answer
























                                  • Can you not overcome the imitation of computed column/filtered index with an index on WHERE (50000000 BETWEEN RangeFrom AND RangeTo) INCLUDE (..)?

                                    – ypercubeᵀᴹ
                                    Dec 30 '18 at 17:58






                                  • 3





                                    @yper-crazyhat-cubeᵀᴹ - yes. CREATE INDEX IX1_redo ON dbo.MyTable (curtis_jackson) INCLUDE (RangeFrom, RangeTo) WHERE RangeTo >= 50000000 AND RangeFrom <= 50000000 would work. And the query SELECT * FROM MyTable WHERE RangeTo >= 50000000 AND RangeFrom <= 50000000; uses it - so then not much need for poor Curtis

                                    – Martin Smith
                                    Dec 31 '18 at 9:19














                                  4












                                  4








                                  4







                                  I found a pretty good solution using a computed column, however it's only good for a single value. That being said, if you've got a magic value, maybe it's enough.



                                  Starting with your given sample, then modifying the table:



                                  ALTER TABLE dbo.MyTable
                                  ADD curtis_jackson
                                  AS CONVERT(BIT, CASE
                                  WHEN RangeTo >= 50000000
                                  AND RangeFrom < 50000000
                                  THEN 1
                                  ELSE 0
                                  END);

                                  CREATE INDEX IX1_redo
                                  ON dbo.MyTable (curtis_jackson)
                                  INCLUDE (RangeFrom, RangeTo);


                                  The query simply becomes:



                                  SELECT *
                                  FROM MyTable
                                  WHERE curtis_jackson = 1;


                                  Which returns the same results as your starting query. With execution plans turned off, here are the stats (truncated for brevity):



                                  Table 'MyTable'. Scan count 1, logical reads 3...

                                  SQL Server Execution Times:
                                  CPU time = 0 ms, elapsed time = 0 ms.


                                  And here's the query plan:



                                  NUTS






                                  share|improve this answer













                                  I found a pretty good solution using a computed column, however it's only good for a single value. That being said, if you've got a magic value, maybe it's enough.



                                  Starting with your given sample, then modifying the table:



                                  ALTER TABLE dbo.MyTable
                                  ADD curtis_jackson
                                  AS CONVERT(BIT, CASE
                                  WHEN RangeTo >= 50000000
                                  AND RangeFrom < 50000000
                                  THEN 1
                                  ELSE 0
                                  END);

                                  CREATE INDEX IX1_redo
                                  ON dbo.MyTable (curtis_jackson)
                                  INCLUDE (RangeFrom, RangeTo);


                                  The query simply becomes:



                                  SELECT *
                                  FROM MyTable
                                  WHERE curtis_jackson = 1;


                                  Which returns the same results as your starting query. With execution plans turned off, here are the stats (truncated for brevity):



                                  Table 'MyTable'. Scan count 1, logical reads 3...

                                  SQL Server Execution Times:
                                  CPU time = 0 ms, elapsed time = 0 ms.


                                  And here's the query plan:



                                  NUTS







                                  share|improve this answer












                                  share|improve this answer



                                  share|improve this answer










                                  answered Dec 29 '18 at 20:20









                                  Erik DarlingErik Darling

                                  21.6k1267108




                                  21.6k1267108













                                  • Can you not overcome the imitation of computed column/filtered index with an index on WHERE (50000000 BETWEEN RangeFrom AND RangeTo) INCLUDE (..)?

                                    – ypercubeᵀᴹ
                                    Dec 30 '18 at 17:58






                                  • 3





                                    @yper-crazyhat-cubeᵀᴹ - yes. CREATE INDEX IX1_redo ON dbo.MyTable (curtis_jackson) INCLUDE (RangeFrom, RangeTo) WHERE RangeTo >= 50000000 AND RangeFrom <= 50000000 would work. And the query SELECT * FROM MyTable WHERE RangeTo >= 50000000 AND RangeFrom <= 50000000; uses it - so then not much need for poor Curtis

                                    – Martin Smith
                                    Dec 31 '18 at 9:19



















                                  • Can you not overcome the imitation of computed column/filtered index with an index on WHERE (50000000 BETWEEN RangeFrom AND RangeTo) INCLUDE (..)?

                                    – ypercubeᵀᴹ
                                    Dec 30 '18 at 17:58






                                  • 3





                                    @yper-crazyhat-cubeᵀᴹ - yes. CREATE INDEX IX1_redo ON dbo.MyTable (curtis_jackson) INCLUDE (RangeFrom, RangeTo) WHERE RangeTo >= 50000000 AND RangeFrom <= 50000000 would work. And the query SELECT * FROM MyTable WHERE RangeTo >= 50000000 AND RangeFrom <= 50000000; uses it - so then not much need for poor Curtis

                                    – Martin Smith
                                    Dec 31 '18 at 9:19

















                                  Can you not overcome the imitation of computed column/filtered index with an index on WHERE (50000000 BETWEEN RangeFrom AND RangeTo) INCLUDE (..)?

                                  – ypercubeᵀᴹ
                                  Dec 30 '18 at 17:58





                                  Can you not overcome the imitation of computed column/filtered index with an index on WHERE (50000000 BETWEEN RangeFrom AND RangeTo) INCLUDE (..)?

                                  – ypercubeᵀᴹ
                                  Dec 30 '18 at 17:58




                                  3




                                  3





                                  @yper-crazyhat-cubeᵀᴹ - yes. CREATE INDEX IX1_redo ON dbo.MyTable (curtis_jackson) INCLUDE (RangeFrom, RangeTo) WHERE RangeTo >= 50000000 AND RangeFrom <= 50000000 would work. And the query SELECT * FROM MyTable WHERE RangeTo >= 50000000 AND RangeFrom <= 50000000; uses it - so then not much need for poor Curtis

                                  – Martin Smith
                                  Dec 31 '18 at 9:19





                                  @yper-crazyhat-cubeᵀᴹ - yes. CREATE INDEX IX1_redo ON dbo.MyTable (curtis_jackson) INCLUDE (RangeFrom, RangeTo) WHERE RangeTo >= 50000000 AND RangeFrom <= 50000000 would work. And the query SELECT * FROM MyTable WHERE RangeTo >= 50000000 AND RangeFrom <= 50000000; uses it - so then not much need for poor Curtis

                                  – Martin Smith
                                  Dec 31 '18 at 9:19











                                  2














                                  My solution is based on the observation that the interval has a known maximum width W. For the sample data this is one byte or 256 integers. Hence for a given search parameter value P we know the smallest RangeFrom that can be in the result set is P - W. Adding this to the predicate gives



                                  declare @P int = 50000000;
                                  declare @W int = 256;

                                  select
                                  *
                                  from MyTable
                                  where @P between RangeFrom and RangeTo
                                  and RangeFrom >= (@P - @W);


                                  Given the original set-up and query my machine (64 bit Windows 10, 4-core hyperthreaded i7, 2.8GHz, 16GB RAM) returns 13 rows. That query uses a parallel index seek of the (RangeFrom, RangeTo) index. The revised query also performs a parallel index seek on the same index.



                                  The measurements for the original and revised queries are



                                                            Original  Revised
                                  -------- -------
                                  Stats IO Scan count 9 6
                                  Stats IO logical reads 11547 6

                                  Estimated number of rows 1643170 1216080
                                  Number of rows read 5109666 29
                                  QueryTimeStats CPU 344 2
                                  QueryTimeStats Elapsed 53 0


                                  For the original query the number of rows read is equal to the number of rows that are less than or equal to @P. The query optimizer (QO) has no alternative but read them all since it cannot determine in advance which if these rows will satisfy the predicate. The multi-column index on (RangeFrom, RangeTo) is not useful in eliminating rows that do not match RangeTo as there is no correlation between the first index key and the second that can be applied. For example, the first row may have a small interval and be eliminated whereas the second row has a large interval and is returned, or vice versa.



                                  In one failed attempt I tried to provide that certainty through a check constraint:



                                  alter table MyTable with check
                                  add constraint CK_MyTable_Interval
                                  check
                                  (
                                  RangeTo <= RangeFrom + 256
                                  );


                                  It made no difference.



                                  By incorporating my external knowledge of the data distribution into the predicate I can cause the QO to skip low-valued RangeFrom rows, which can never be part of the resultset, and traverse the leading column of the index to the admissible rows. This shows in the different seek predicate for each query.



                                  In a mirror argument the upper limit of RangeTo is P + W. This is not useful, however, because there is no correlation between RangeFrom and RangeTo that would allow the trailing column of a multi-column index to eliminate rows. Hence there is no benefit from adding this clause to the query.



                                  This approach gains most of its benefit from the small interval size. As the possible interval size increases the number of low-valued rows skipped decreases, though some will still be skipped. In the limiting case, with the interval as large as the data range, this approach is no worse than the orginal query (which is cold comfort I admit).



                                  I apologise for whatever off-by-one errors may exist in this answer.






                                  share|improve this answer






























                                    2














                                    My solution is based on the observation that the interval has a known maximum width W. For the sample data this is one byte or 256 integers. Hence for a given search parameter value P we know the smallest RangeFrom that can be in the result set is P - W. Adding this to the predicate gives



                                    declare @P int = 50000000;
                                    declare @W int = 256;

                                    select
                                    *
                                    from MyTable
                                    where @P between RangeFrom and RangeTo
                                    and RangeFrom >= (@P - @W);


                                    Given the original set-up and query my machine (64 bit Windows 10, 4-core hyperthreaded i7, 2.8GHz, 16GB RAM) returns 13 rows. That query uses a parallel index seek of the (RangeFrom, RangeTo) index. The revised query also performs a parallel index seek on the same index.



                                    The measurements for the original and revised queries are



                                                              Original  Revised
                                    -------- -------
                                    Stats IO Scan count 9 6
                                    Stats IO logical reads 11547 6

                                    Estimated number of rows 1643170 1216080
                                    Number of rows read 5109666 29
                                    QueryTimeStats CPU 344 2
                                    QueryTimeStats Elapsed 53 0


                                    For the original query the number of rows read is equal to the number of rows that are less than or equal to @P. The query optimizer (QO) has no alternative but read them all since it cannot determine in advance which if these rows will satisfy the predicate. The multi-column index on (RangeFrom, RangeTo) is not useful in eliminating rows that do not match RangeTo as there is no correlation between the first index key and the second that can be applied. For example, the first row may have a small interval and be eliminated whereas the second row has a large interval and is returned, or vice versa.



                                    In one failed attempt I tried to provide that certainty through a check constraint:



                                    alter table MyTable with check
                                    add constraint CK_MyTable_Interval
                                    check
                                    (
                                    RangeTo <= RangeFrom + 256
                                    );


                                    It made no difference.



                                    By incorporating my external knowledge of the data distribution into the predicate I can cause the QO to skip low-valued RangeFrom rows, which can never be part of the resultset, and traverse the leading column of the index to the admissible rows. This shows in the different seek predicate for each query.



                                    In a mirror argument the upper limit of RangeTo is P + W. This is not useful, however, because there is no correlation between RangeFrom and RangeTo that would allow the trailing column of a multi-column index to eliminate rows. Hence there is no benefit from adding this clause to the query.



                                    This approach gains most of its benefit from the small interval size. As the possible interval size increases the number of low-valued rows skipped decreases, though some will still be skipped. In the limiting case, with the interval as large as the data range, this approach is no worse than the orginal query (which is cold comfort I admit).



                                    I apologise for whatever off-by-one errors may exist in this answer.






                                    share|improve this answer




























                                      2












                                      2








                                      2







                                      My solution is based on the observation that the interval has a known maximum width W. For the sample data this is one byte or 256 integers. Hence for a given search parameter value P we know the smallest RangeFrom that can be in the result set is P - W. Adding this to the predicate gives



                                      declare @P int = 50000000;
                                      declare @W int = 256;

                                      select
                                      *
                                      from MyTable
                                      where @P between RangeFrom and RangeTo
                                      and RangeFrom >= (@P - @W);


                                      Given the original set-up and query my machine (64 bit Windows 10, 4-core hyperthreaded i7, 2.8GHz, 16GB RAM) returns 13 rows. That query uses a parallel index seek of the (RangeFrom, RangeTo) index. The revised query also performs a parallel index seek on the same index.



                                      The measurements for the original and revised queries are



                                                                Original  Revised
                                      -------- -------
                                      Stats IO Scan count 9 6
                                      Stats IO logical reads 11547 6

                                      Estimated number of rows 1643170 1216080
                                      Number of rows read 5109666 29
                                      QueryTimeStats CPU 344 2
                                      QueryTimeStats Elapsed 53 0


                                      For the original query the number of rows read is equal to the number of rows that are less than or equal to @P. The query optimizer (QO) has no alternative but read them all since it cannot determine in advance which if these rows will satisfy the predicate. The multi-column index on (RangeFrom, RangeTo) is not useful in eliminating rows that do not match RangeTo as there is no correlation between the first index key and the second that can be applied. For example, the first row may have a small interval and be eliminated whereas the second row has a large interval and is returned, or vice versa.



                                      In one failed attempt I tried to provide that certainty through a check constraint:



                                      alter table MyTable with check
                                      add constraint CK_MyTable_Interval
                                      check
                                      (
                                      RangeTo <= RangeFrom + 256
                                      );


                                      It made no difference.



                                      By incorporating my external knowledge of the data distribution into the predicate I can cause the QO to skip low-valued RangeFrom rows, which can never be part of the resultset, and traverse the leading column of the index to the admissible rows. This shows in the different seek predicate for each query.



                                      In a mirror argument the upper limit of RangeTo is P + W. This is not useful, however, because there is no correlation between RangeFrom and RangeTo that would allow the trailing column of a multi-column index to eliminate rows. Hence there is no benefit from adding this clause to the query.



                                      This approach gains most of its benefit from the small interval size. As the possible interval size increases the number of low-valued rows skipped decreases, though some will still be skipped. In the limiting case, with the interval as large as the data range, this approach is no worse than the orginal query (which is cold comfort I admit).



                                      I apologise for whatever off-by-one errors may exist in this answer.






                                      share|improve this answer















                                      My solution is based on the observation that the interval has a known maximum width W. For the sample data this is one byte or 256 integers. Hence for a given search parameter value P we know the smallest RangeFrom that can be in the result set is P - W. Adding this to the predicate gives



                                      declare @P int = 50000000;
                                      declare @W int = 256;

                                      select
                                      *
                                      from MyTable
                                      where @P between RangeFrom and RangeTo
                                      and RangeFrom >= (@P - @W);


                                      Given the original set-up and query my machine (64 bit Windows 10, 4-core hyperthreaded i7, 2.8GHz, 16GB RAM) returns 13 rows. That query uses a parallel index seek of the (RangeFrom, RangeTo) index. The revised query also performs a parallel index seek on the same index.



                                      The measurements for the original and revised queries are



                                                                Original  Revised
                                      -------- -------
                                      Stats IO Scan count 9 6
                                      Stats IO logical reads 11547 6

                                      Estimated number of rows 1643170 1216080
                                      Number of rows read 5109666 29
                                      QueryTimeStats CPU 344 2
                                      QueryTimeStats Elapsed 53 0


                                      For the original query the number of rows read is equal to the number of rows that are less than or equal to @P. The query optimizer (QO) has no alternative but read them all since it cannot determine in advance which if these rows will satisfy the predicate. The multi-column index on (RangeFrom, RangeTo) is not useful in eliminating rows that do not match RangeTo as there is no correlation between the first index key and the second that can be applied. For example, the first row may have a small interval and be eliminated whereas the second row has a large interval and is returned, or vice versa.



                                      In one failed attempt I tried to provide that certainty through a check constraint:



                                      alter table MyTable with check
                                      add constraint CK_MyTable_Interval
                                      check
                                      (
                                      RangeTo <= RangeFrom + 256
                                      );


                                      It made no difference.



                                      By incorporating my external knowledge of the data distribution into the predicate I can cause the QO to skip low-valued RangeFrom rows, which can never be part of the resultset, and traverse the leading column of the index to the admissible rows. This shows in the different seek predicate for each query.



                                      In a mirror argument the upper limit of RangeTo is P + W. This is not useful, however, because there is no correlation between RangeFrom and RangeTo that would allow the trailing column of a multi-column index to eliminate rows. Hence there is no benefit from adding this clause to the query.



                                      This approach gains most of its benefit from the small interval size. As the possible interval size increases the number of low-valued rows skipped decreases, though some will still be skipped. In the limiting case, with the interval as large as the data range, this approach is no worse than the orginal query (which is cold comfort I admit).



                                      I apologise for whatever off-by-one errors may exist in this answer.







                                      share|improve this answer














                                      share|improve this answer



                                      share|improve this answer








                                      edited Jan 4 at 12:42

























                                      answered Jan 4 at 12:06









                                      Michael GreenMichael Green

                                      15k83162




                                      15k83162






























                                          draft saved

                                          draft discarded




















































                                          Thanks for contributing an answer to Database Administrators Stack Exchange!


                                          • Please be sure to answer the question. Provide details and share your research!

                                          But avoid



                                          • Asking for help, clarification, or responding to other answers.

                                          • Making statements based on opinion; back them up with references or personal experience.


                                          To learn more, see our tips on writing great answers.




                                          draft saved


                                          draft discarded














                                          StackExchange.ready(
                                          function () {
                                          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f225953%2foptimizing-numeric-range-interval-searches-in-sql-server%23new-answer', 'question_page');
                                          }
                                          );

                                          Post as a guest















                                          Required, but never shown





















































                                          Required, but never shown














                                          Required, but never shown












                                          Required, but never shown







                                          Required, but never shown

































                                          Required, but never shown














                                          Required, but never shown












                                          Required, but never shown







                                          Required, but never shown







                                          Popular posts from this blog

                                          Wiesbaden

                                          Marschland

                                          Dieringhausen