Loss of rows when running Nested Loops in pandas dataframe
I am trying to replicate a pandas dataframe, named df, by running a nested loop: I filter the data by the unique values in each column and append the results to an initially empty dataframe. df has 627974 rows, but df_new ends up with only 2275. Here are the top 10 rows of the data:
cid run_bal last_tran_date mark_seg province local_body branch acc_type int_rate
0 11276 495000.0 2013/1/4 Institutional 4 Municipality KULESHWOR BRANCH NORMAL SAVINGS 0.065
1 11277 1030000.0 2013/1/2 Institutional 1 Municipality CHABAHIL BRANCH NORMAL SAVINGS 0.065
2 11278 1225000.0 2013/1/20 Institutional 1 Municipality TINKUNE BRANCH NORMAL SAVINGS 0.065
3 11278 50000.0 2013/1/8 Institutional 4 Municipality Jorpati Branch NORMAL SAVINGS 0.065
4 11279 320000.0 2013/1/28 Institutional 3 Gaunpalika TRIPURESHWAR BRANCH NORMAL SAVINGS 0.065
5 11281 100000.0 2013/1/14 Institutional 3 Municipality KANTIPATH BRANCH NORMAL SAVINGS 0.06
6 11282 10000.0 2013/1/1 Institutional 5 Municipality Gwarko NORMAL SAVINGS 0.06
7 11283 10000.0 2013/1/28 Institutional 1 Municipality SWOYAMBHU BRANCH CALL DEPOSIT 0.04
8 11284 200000.0 2013/1/15 Institutional 3 Gaunpalika Anamnagar Branch NORMAL SAVINGS 0.065
9 11284 545000.0 2013/1/21 Institutional 3 Municipality Bouddha Branch CALL DEPOSIT 0.04
Here is the code:
def bin_cus(df):
    df_new = pd.DataFrame(columns=col_names)
    mk_unq = df.mark_seg.unique()
    for mk in mk_unq:
        df_f = df.loc[df.mark_seg == mk]
        prc_unq = df_f.province.unique()
        for prc in prc_unq:
            df_f = df_f.loc[df_f.province == prc]
            lb_unq = df_f.local_body.unique()
            for lb in lb_unq:
                df_f = df_f.loc[df_f.local_body == lb]
                brn_unq = df_f.branch.unique()
                for brn in brn_unq:
                    df_f = df_f.loc[df_f.branch == brn]
                    acc_unq = df_f.acc_type.unique()
                    for acc in acc_unq:
                        df_f = df_f.loc[df_f.acc_type == acc]
                        df_new = pd.concat([df_new, df_f], axis=0)
    return df_new
python-3.x pandas
edited Nov 24 '18 at 11:47
Srce Cde
asked Nov 24 '18 at 11:46
Rishi
1 Answer
You are filtering the same dataframe variable, df_f, inside every loop, so after the first pass of an inner loop df_f no longer holds the full outer group, and the remaining combinations are filtered against an already-shrunken frame. Giving each loop level its own variable fixes the problem:
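The failure mode is easy to reproduce on a toy frame (hypothetical two-column data, not from the question):

```python
import pandas as pd

# Minimal repro of the row loss: reusing df_f means the second iteration of
# the inner loop filters a frame that was already narrowed to the previous
# group's rows.
df = pd.DataFrame({"a": ["x", "x", "y"], "b": [1, 2, 2]})

df_new = pd.DataFrame(columns=df.columns)
for a in df.a.unique():
    df_f = df.loc[df.a == a]
    for b in df_f.b.unique():           # unique values captured here...
        df_f = df_f.loc[df_f.b == b]    # ...but df_f is overwritten here
        df_new = pd.concat([df_new, df_f], axis=0)

# For a == "x": b == 1 keeps one row, then df_f is only that row, so the
# b == 2 filter finds nothing and row ("x", 2) is silently lost.
print(len(df), len(df_new))  # 3 2
```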
dep_cus_pandas_df = dep_cus_spark_df.toPandas()
dep_cus_pandas_df["cus_seg"] = ""
col_names = dep_cus_pandas_df.columns
df = dep_cus_pandas_df

df_new = pd.DataFrame(columns=col_names)
mk_unq = df.mark_seg.unique()
for mk in mk_unq:
    df_f_m = df.loc[df.mark_seg == mk]
    prc_unq = df_f_m.province.unique()
    for prc in prc_unq:
        df_f_p = df_f_m.loc[df_f_m.province == prc]
        lb_unq = df_f_p.local_body.unique()
        for lb in lb_unq:
            df_f_lb = df_f_p.loc[df_f_p.local_body == lb]
            brn_unq = df_f_lb.branch.unique()
            for brn in brn_unq:
                df_f_brn = df_f_lb.loc[df_f_lb.branch == brn]
                acc_unq = df_f_brn.acc_type.unique()
                for acc in acc_unq:
                    # .copy() so the column assignment below does not trigger
                    # SettingWithCopyWarning on a slice of df_f_brn
                    df_f_acc = df_f_brn.loc[df_f_brn.acc_type == acc].copy()
                    # pd.cut does not require sorted input
                    df_f_acc["cus_seg"] = pd.cut(df_f_acc.run_bal, 4,
                                                 labels=["min_q1", "q1_q2", "q2_q3", "q3_max"],
                                                 duplicates="drop")
                    df_new = pd.concat([df_new, df_f_acc], axis=0)

dep_cus_spark_df_new = sc_sql.createDataFrame(df_new, schema=["cid", "run_bal", "last_tran_date",
                                                              "mark_seg", "province", "local_body",
                                                              "branch", "acc_type", "int_rate", "cus_seg"])
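As a side note, the five nested loops can be collapsed into a single groupby over the same columns, which visits every combination exactly once and cannot drop rows. A minimal sketch with hypothetical data (column names from the question, values made up; binning into 2 buckets instead of 4 just to keep the toy frame small):

```python
import pandas as pd

# Toy frame with the question's grouping columns (values are made up).
df = pd.DataFrame({
    "mark_seg": ["Institutional"] * 4,
    "province": [1, 1, 3, 3],
    "local_body": ["Municipality"] * 4,
    "branch": ["A", "A", "B", "B"],
    "acc_type": ["NORMAL SAVINGS"] * 4,
    "run_bal": [100.0, 200.0, 300.0, 400.0],
})

group_cols = ["mark_seg", "province", "local_body", "branch", "acc_type"]

# One loop over the groups replaces all five nested loops.
parts = []
for _, grp in df.groupby(group_cols):
    grp = grp.copy()
    grp["cus_seg"] = pd.cut(grp["run_bal"], 2, labels=["low", "high"])
    parts.append(grp)
df_new = pd.concat(parts, axis=0)

# Every input row appears exactly once in the output.
print(len(df), len(df_new))  # 4 4
```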
answered Nov 29 '18 at 3:19
Dipesh Poudel