Loss of rows when running nested loops on a pandas DataFrame



























I am trying to replicate a pandas DataFrame, named df, by running nested loops. I simply filter the data by the unique values in each column and then append the results to an empty DataFrame. df has 627,974 rows, but df_new comes back with only 2,275 rows. Here are the top 10 rows of the data:



         cid    run_bal  last_tran_date   mark_seg       province  local_body    branch               acc_type        int_rate
    0  11276   495000.0  2013/1/4         Institutional  4         Municipality  KULESHWOR BRANCH     NORMAL SAVINGS  0.065
    1  11277  1030000.0  2013/1/2         Institutional  1         Municipality  CHABAHIL BRANCH      NORMAL SAVINGS  0.065
    2  11278  1225000.0  2013/1/20        Institutional  1         Municipality  TINKUNE BRANCH       NORMAL SAVINGS  0.065
    3  11278    50000.0  2013/1/8         Institutional  4         Municipality  Jorpati Branch       NORMAL SAVINGS  0.065
    4  11279   320000.0  2013/1/28        Institutional  3         Gaunpalika    TRIPURESHWAR BRANCH  NORMAL SAVINGS  0.065
    5  11281   100000.0  2013/1/14        Institutional  3         Municipality  KANTIPATH BRANCH     NORMAL SAVINGS  0.06
    6  11282    10000.0  2013/1/1         Institutional  5         Municipality  Gwarko               NORMAL SAVINGS  0.06
    7  11283    10000.0  2013/1/28        Institutional  1         Municipality  SWOYAMBHU BRANCH     CALL DEPOSIT    0.04
    8  11284   200000.0  2013/1/15        Institutional  3         Gaunpalika    Anamnagar Branch     NORMAL SAVINGS  0.065
    9  11284   545000.0  2013/1/21        Institutional  3         Municipality  Bouddha Branch       CALL DEPOSIT    0.04


Please find the code below:



    def bin_cus(df):
        df_new = pd.DataFrame(columns=col_names)
        mk_unq = df.mark_seg.unique()
        for mk in mk_unq:
            df_f = df.loc[df.mark_seg == mk]
            prc_unq = df_f.province.unique()
            for prc in prc_unq:
                df_f = df_f.loc[df_f.province == prc]
                lb_unq = df_f.local_body.unique()
                for lb in lb_unq:
                    df_f = df_f.loc[df_f.local_body == lb]
                    brn_unq = df_f.branch.unique()
                    for brn in brn_unq:
                        df_f = df_f.loc[df_f.branch == brn]
                        acc_unq = df_f.acc_type.unique()
                        for acc in acc_unq:
                            df_f = df_f.loc[df_f.acc_type == acc]

                            df_new = pd.concat([df_new, df_f], axis=0)
        return df_new
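
The loss is easy to reproduce on a tiny, made-up frame. The data below is hypothetical and only meant to show the row counts; it assumes bin_cus and col_names are defined as above.

    import pandas as pd

    # Hypothetical toy data -- the real frame has 627,974 rows.
    df = pd.DataFrame({
        "mark_seg":   ["Institutional"] * 6,
        "province":   [1, 1, 1, 3, 3, 3],
        "local_body": ["Municipality", "Municipality", "Gaunpalika",
                       "Municipality", "Gaunpalika", "Gaunpalika"],
        "branch":     ["A", "B", "A", "C", "C", "D"],
        "acc_type":   ["NORMAL SAVINGS", "CALL DEPOSIT", "NORMAL SAVINGS",
                       "NORMAL SAVINGS", "CALL DEPOSIT", "NORMAL SAVINGS"],
        "run_bal":    [100.0, 200.0, 300.0, 400.0, 500.0, 600.0],
    })
    col_names = df.columns  # bin_cus reads this as a global

    print(len(df))           # 6
    print(len(bin_cus(df)))  # far fewer than 6 -- rows are lost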









      python-3.x pandas






edited Nov 24 '18 at 11:47 by Srce Cde

asked Nov 24 '18 at 11:46 by Rishi
























1 Answer














You are reassigning the same DataFrame, df_f, at every level of the loop, so each later iteration of an outer loop filters an already-narrowed frame instead of the full one, and most rows are silently dropped. If you rewrite the code as shown below, giving each level its own variable, it will fix your problem.



    dep_cus_pandas_df = dep_cus_spark_df.toPandas()
    dep_cus_pandas_df["cus_seg"] = ""
    col_names = dep_cus_pandas_df.columns
    df = dep_cus_pandas_df
    df_new = pd.DataFrame(columns=col_names)
    mk_unq = df.mark_seg.unique()
    for mk in mk_unq:
        df_f_m = df.loc[df.mark_seg == mk]
        prc_unq = df_f_m.province.unique()
        for prc in prc_unq:
            df_f_p = df_f_m.loc[df_f_m.province == prc]
            lb_unq = df_f_p.local_body.unique()
            for lb in lb_unq:
                df_f_lb = df_f_p.loc[df_f_p.local_body == lb]
                brn_unq = df_f_lb.branch.unique()
                for brn in brn_unq:
                    df_f_brn = df_f_lb.loc[df_f_lb.branch == brn]
                    acc_unq = df_f_brn.acc_type.unique()
                    for acc in acc_unq:
                        df_f_acc = df_f_brn.loc[df_f_brn.acc_type == acc]
                        df_f_acc.cus_seg = pd.cut(df_f_acc.run_bal.sort_values(), 4,
                                                  labels=["min_q1", "q1_q2", "q2_q3", "q3_max"],
                                                  duplicates="drop")
                        df_new = pd.concat([df_new, df_f_acc], axis=0)

    dep_cus_spark_df_new = sc_sql.createDataFrame(df_new, schema=["cid", "run_bal", "last_tran_date",
                                                                  "mark_seg", "province", "local_body",
                                                                  "branch", "acc_type", "int_rate", "cus_seg"])





answered Nov 29 '18 at 3:19 by Dipesh Poudel































