How can I retrieve data from a Hive table from two columns with non null values and top 500 records in one...












0















I have a Hive table (my_table) which is in ORC format and has 30 columns. Two of the columns (col_us, col_ds) store numeric values which can be 0 or null or some integer. The table is partitioned on the bases of day and hourly.
The table has approx. 8 Million x 96 records in a days partition and I am referring to 15 daily partitions



Currently I am running separate queries to retrieve top 500 records with value greater than 0 using a rank function. One query to retrieve col_us and other for col_ds



It is possible that clo_US may have a numeric value while col_DS is 0 or null



Question:
I want to retrieve top 500 non null and non 0 records from each of these columns from one query.



My Query:



From(
SELECT D.COL_US, D.DATESTAMP,
ROW_NUMBER() OVER (PARTITION BY D.ID,D.SUB_ID ORDER BY CONCAT (D.DATESTAMP,D.HOURSTAMP,D.TIMESTAMP) DESC) AS RNK
FROM ${wf_table_name} D
WHERE DATESTAMP >= '${datestamp_15}' AND DATESTAMP < '${datestamp}'
AND COL_US > 0)T
INSERT OVERWRITE TABLE ${wf_us_table}
SELECT T.COL_US, T.DATESTAMP, T.RNK WHERE T.RNK < 500;









share|improve this question



























    0















    I have a Hive table (my_table) which is in ORC format and has 30 columns. Two of the columns (col_us, col_ds) store numeric values which can be 0 or null or some integer. The table is partitioned on the bases of day and hourly.
    The table has approx. 8 Million x 96 records in a days partition and I am referring to 15 daily partitions



    Currently I am running separate queries to retrieve top 500 records with value greater than 0 using a rank function. One query to retrieve col_us and other for col_ds



    It is possible that clo_US may have a numeric value while col_DS is 0 or null



    Question:
    I want to retrieve top 500 non null and non 0 records from each of these columns from one query.



    My Query:



    From(
    SELECT D.COL_US, D.DATESTAMP,
    ROW_NUMBER() OVER (PARTITION BY D.ID,D.SUB_ID ORDER BY CONCAT (D.DATESTAMP,D.HOURSTAMP,D.TIMESTAMP) DESC) AS RNK
    FROM ${wf_table_name} D
    WHERE DATESTAMP >= '${datestamp_15}' AND DATESTAMP < '${datestamp}'
    AND COL_US > 0)T
    INSERT OVERWRITE TABLE ${wf_us_table}
    SELECT T.COL_US, T.DATESTAMP, T.RNK WHERE T.RNK < 500;









    share|improve this question

























      0












      0








      0








      I have a Hive table (my_table) which is in ORC format and has 30 columns. Two of the columns (col_us, col_ds) store numeric values which can be 0 or null or some integer. The table is partitioned on the bases of day and hourly.
      The table has approx. 8 Million x 96 records in a days partition and I am referring to 15 daily partitions



      Currently I am running separate queries to retrieve top 500 records with value greater than 0 using a rank function. One query to retrieve col_us and other for col_ds



      It is possible that clo_US may have a numeric value while col_DS is 0 or null



      Question:
      I want to retrieve top 500 non null and non 0 records from each of these columns from one query.



      My Query:



      From(
      SELECT D.COL_US, D.DATESTAMP,
      ROW_NUMBER() OVER (PARTITION BY D.ID,D.SUB_ID ORDER BY CONCAT (D.DATESTAMP,D.HOURSTAMP,D.TIMESTAMP) DESC) AS RNK
      FROM ${wf_table_name} D
      WHERE DATESTAMP >= '${datestamp_15}' AND DATESTAMP < '${datestamp}'
      AND COL_US > 0)T
      INSERT OVERWRITE TABLE ${wf_us_table}
      SELECT T.COL_US, T.DATESTAMP, T.RNK WHERE T.RNK < 500;









      share|improve this question














      I have a Hive table (my_table) which is in ORC format and has 30 columns. Two of the columns (col_us, col_ds) store numeric values which can be 0 or null or some integer. The table is partitioned on the bases of day and hourly.
      The table has approx. 8 Million x 96 records in a days partition and I am referring to 15 daily partitions



      Currently I am running separate queries to retrieve top 500 records with value greater than 0 using a rank function. One query to retrieve col_us and other for col_ds



      It is possible that clo_US may have a numeric value while col_DS is 0 or null



      Question:
      I want to retrieve top 500 non null and non 0 records from each of these columns from one query.



      My Query:



      From(
      SELECT D.COL_US, D.DATESTAMP,
      ROW_NUMBER() OVER (PARTITION BY D.ID,D.SUB_ID ORDER BY CONCAT (D.DATESTAMP,D.HOURSTAMP,D.TIMESTAMP) DESC) AS RNK
      FROM ${wf_table_name} D
      WHERE DATESTAMP >= '${datestamp_15}' AND DATESTAMP < '${datestamp}'
      AND COL_US > 0)T
      INSERT OVERWRITE TABLE ${wf_us_table}
      SELECT T.COL_US, T.DATESTAMP, T.RNK WHERE T.RNK < 500;






      hive






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Nov 26 '18 at 8:22









      K.KetanK.Ketan

      146




      146
























          1 Answer
          1






          active

          oldest

          votes


















          0














          As per your query I can guess that you are trying to get top 500 rows from your table based on date/time that means latest 500 rows where col_us, col_ds both have a value which is >0 but not top 500 from each of these columns.
          As per your question your table may have 2 type of value. for example.



          col_us



          0

          NULL

          10

          5



          col_ds



          5



          10



          0



          NULL



          or both column may have >0 value.
          So instead of 'AND COL_US > 0' under WHERE clause use 'AND (COL_US > 0 and col_ds > 0)'
          But with this condition you will not get any value from above stated 4 rows.
          So if you want to get 10,5 from col_us along with 5,10 col_ds then I should say it's not possible using a single query.



          Again, as per your question stated "I want to retrieve top 500 non null and non 0 records from each of these columns from one query." ,
          I can guess that you want to get top 500 records from col_us, col_ds depends on the value of col_us/col_ds then you must have to use these columns within rank clause instead of date/time.



          What you want to retrieve you may get by UPDATE query depending on other available columns but before that I want to request you to share exactly what you want (top 500 based on col_us/col_ds or latest 500) along with your base and target table structure.






          share|improve this answer
























          • This is exactly what I am looking for: "So if you want to get 10,5 from col_us along with 5,10 col_ds then I should say it's not possible using a single query"

            – K.Ketan
            Nov 29 '18 at 10:18













          • Yes you are absolutely right. Because you can fetch data row wise only. You will never get data as below using a single query, col_us 10 5 col_ds 5 10

            – Pallab
            Dec 1 '18 at 8:26














          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53477101%2fhow-can-i-retrieve-data-from-a-hive-table-from-two-columns-with-non-null-values%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          0














          As per your query I can guess that you are trying to get top 500 rows from your table based on date/time that means latest 500 rows where col_us, col_ds both have a value which is >0 but not top 500 from each of these columns.
          As per your question your table may have 2 type of value. for example.



          col_us



          0

          NULL

          10

          5



          col_ds



          5



          10



          0



          NULL



          or both column may have >0 value.
          So instead of 'AND COL_US > 0' under WHERE clause use 'AND (COL_US > 0 and col_ds > 0)'
          But with this condition you will not get any value from above stated 4 rows.
          So if you want to get 10,5 from col_us along with 5,10 col_ds then I should say it's not possible using a single query.



          Again, as per your question stated "I want to retrieve top 500 non null and non 0 records from each of these columns from one query." ,
          I can guess that you want to get top 500 records from col_us, col_ds depends on the value of col_us/col_ds then you must have to use these columns within rank clause instead of date/time.



          What you want to retrieve you may get by UPDATE query depending on other available columns but before that I want to request you to share exactly what you want (top 500 based on col_us/col_ds or latest 500) along with your base and target table structure.






          share|improve this answer
























          • This is exactly what I am looking for: "So if you want to get 10,5 from col_us along with 5,10 col_ds then I should say it's not possible using a single query"

            – K.Ketan
            Nov 29 '18 at 10:18













          • Yes you are absolutely right. Because you can fetch data row wise only. You will never get data as below using a single query, col_us 10 5 col_ds 5 10

            – Pallab
            Dec 1 '18 at 8:26


















          0














          As per your query I can guess that you are trying to get top 500 rows from your table based on date/time that means latest 500 rows where col_us, col_ds both have a value which is >0 but not top 500 from each of these columns.
          As per your question your table may have 2 type of value. for example.



          col_us



          0

          NULL

          10

          5



          col_ds



          5



          10



          0



          NULL



          or both column may have >0 value.
          So instead of 'AND COL_US > 0' under WHERE clause use 'AND (COL_US > 0 and col_ds > 0)'
          But with this condition you will not get any value from above stated 4 rows.
          So if you want to get 10,5 from col_us along with 5,10 col_ds then I should say it's not possible using a single query.



          Again, as per your question stated "I want to retrieve top 500 non null and non 0 records from each of these columns from one query." ,
          I can guess that you want to get top 500 records from col_us, col_ds depends on the value of col_us/col_ds then you must have to use these columns within rank clause instead of date/time.



          What you want to retrieve you may get by UPDATE query depending on other available columns but before that I want to request you to share exactly what you want (top 500 based on col_us/col_ds or latest 500) along with your base and target table structure.






          share|improve this answer
























          • This is exactly what I am looking for: "So if you want to get 10,5 from col_us along with 5,10 col_ds then I should say it's not possible using a single query"

            – K.Ketan
            Nov 29 '18 at 10:18













          • Yes you are absolutely right. Because you can fetch data row wise only. You will never get data as below using a single query, col_us 10 5 col_ds 5 10

            – Pallab
            Dec 1 '18 at 8:26
















          0












          0








          0







          As per your query I can guess that you are trying to get top 500 rows from your table based on date/time that means latest 500 rows where col_us, col_ds both have a value which is >0 but not top 500 from each of these columns.
          As per your question your table may have 2 type of value. for example.



          col_us



          0

          NULL

          10

          5



          col_ds



          5



          10



          0



          NULL



          or both column may have >0 value.
          So instead of 'AND COL_US > 0' under WHERE clause use 'AND (COL_US > 0 and col_ds > 0)'
          But with this condition you will not get any value from above stated 4 rows.
          So if you want to get 10,5 from col_us along with 5,10 col_ds then I should say it's not possible using a single query.



          Again, as per your question stated "I want to retrieve top 500 non null and non 0 records from each of these columns from one query." ,
          I can guess that you want to get top 500 records from col_us, col_ds depends on the value of col_us/col_ds then you must have to use these columns within rank clause instead of date/time.



          What you want to retrieve you may get by UPDATE query depending on other available columns but before that I want to request you to share exactly what you want (top 500 based on col_us/col_ds or latest 500) along with your base and target table structure.






          share|improve this answer













          As per your query I can guess that you are trying to get top 500 rows from your table based on date/time that means latest 500 rows where col_us, col_ds both have a value which is >0 but not top 500 from each of these columns.
          As per your question your table may have 2 type of value. for example.



          col_us



          0

          NULL

          10

          5



          col_ds



          5



          10



          0



          NULL



          or both column may have >0 value.
          So instead of 'AND COL_US > 0' under WHERE clause use 'AND (COL_US > 0 and col_ds > 0)'
          But with this condition you will not get any value from above stated 4 rows.
          So if you want to get 10,5 from col_us along with 5,10 col_ds then I should say it's not possible using a single query.



          Again, as per your question stated "I want to retrieve top 500 non null and non 0 records from each of these columns from one query." ,
          I can guess that you want to get top 500 records from col_us, col_ds depends on the value of col_us/col_ds then you must have to use these columns within rank clause instead of date/time.



          What you want to retrieve you may get by UPDATE query depending on other available columns but before that I want to request you to share exactly what you want (top 500 based on col_us/col_ds or latest 500) along with your base and target table structure.







          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Nov 26 '18 at 10:36









          PallabPallab

          262




          262













          • This is exactly what I am looking for: "So if you want to get 10,5 from col_us along with 5,10 col_ds then I should say it's not possible using a single query"

            – K.Ketan
            Nov 29 '18 at 10:18













          • Yes you are absolutely right. Because you can fetch data row wise only. You will never get data as below using a single query, col_us 10 5 col_ds 5 10

            – Pallab
            Dec 1 '18 at 8:26





















          • This is exactly what I am looking for: "So if you want to get 10,5 from col_us along with 5,10 col_ds then I should say it's not possible using a single query"

            – K.Ketan
            Nov 29 '18 at 10:18













          • Yes you are absolutely right. Because you can fetch data row wise only. You will never get data as below using a single query, col_us 10 5 col_ds 5 10

            – Pallab
            Dec 1 '18 at 8:26



















          This is exactly what I am looking for: "So if you want to get 10,5 from col_us along with 5,10 col_ds then I should say it's not possible using a single query"

          – K.Ketan
          Nov 29 '18 at 10:18







          This is exactly what I am looking for: "So if you want to get 10,5 from col_us along with 5,10 col_ds then I should say it's not possible using a single query"

          – K.Ketan
          Nov 29 '18 at 10:18















          Yes you are absolutely right. Because you can fetch data row wise only. You will never get data as below using a single query, col_us 10 5 col_ds 5 10

          – Pallab
          Dec 1 '18 at 8:26







          Yes you are absolutely right. Because you can fetch data row wise only. You will never get data as below using a single query, col_us 10 5 col_ds 5 10

          – Pallab
          Dec 1 '18 at 8:26






















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53477101%2fhow-can-i-retrieve-data-from-a-hive-table-from-two-columns-with-non-null-values%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          Wiesbaden

          Marschland

          Dieringhausen