Create a column in a dataframe that is a string of characters summarizing data in other columns





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}







3















I have a dataframe like this where the columns are the scores of some metrics:



A B C D  
4 3 3 1
2 5 2 2
3 5 2 4


I want to create a new column to summarize which metrics each row scored over a set threshold in, using the column name as a string. So if the threshold was A > 2, B > 3, C > 1, D > 3, I would want the new column to look like this:



A B C D NewCol  
4 3 3 1 AC
2 5 2 2 BC
3 5 2 4 ABCD


I tried using a series of np.where:



df[NewCol] = np.where(df['A'] > 2, 'A', '')  
df[NewCol] = np.where(df['B'] > 3, 'B', '')


etc.



but realized the result was overwriting with the last metric any time all four metrics didn't meet the conditions, like so:



A B C D NewCol  
4 3 3 1 C
2 5 2 2 C
3 5 2 4 ABCD


I am pretty sure there is an easier and correct way to do this.










share|improve this question































    3















    I have a dataframe like this where the columns are the scores of some metrics:



    A B C D  
    4 3 3 1
    2 5 2 2
    3 5 2 4


    I want to create a new column to summarize which metrics each row scored over a set threshold in, using the column name as a string. So if the threshold was A > 2, B > 3, C > 1, D > 3, I would want the new column to look like this:



    A B C D NewCol  
    4 3 3 1 AC
    2 5 2 2 BC
    3 5 2 4 ABCD


    I tried using a series of np.where:



    df[NewCol] = np.where(df['A'] > 2, 'A', '')  
    df[NewCol] = np.where(df['B'] > 3, 'B', '')


    etc.



    but realized the result was overwriting with the last metric any time all four metrics didn't meet the conditions, like so:



    A B C D NewCol  
    4 3 3 1 C
    2 5 2 2 C
    3 5 2 4 ABCD


    I am pretty sure there is an easier and correct way to do this.










    share|improve this question



























      3












      3








      3








      I have a dataframe like this where the columns are the scores of some metrics:



      A B C D  
      4 3 3 1
      2 5 2 2
      3 5 2 4


      I want to create a new column to summarize which metrics each row scored over a set threshold in, using the column name as a string. So if the threshold was A > 2, B > 3, C > 1, D > 3, I would want the new column to look like this:



      A B C D NewCol  
      4 3 3 1 AC
      2 5 2 2 BC
      3 5 2 4 ABCD


      I tried using a series of np.where:



      df[NewCol] = np.where(df['A'] > 2, 'A', '')  
      df[NewCol] = np.where(df['B'] > 3, 'B', '')


      etc.



      but realized the result was overwriting with the last metric any time all four metrics didn't meet the conditions, like so:



      A B C D NewCol  
      4 3 3 1 C
      2 5 2 2 C
      3 5 2 4 ABCD


      I am pretty sure there is an easier and correct way to do this.










      share|improve this question
















      I have a dataframe like this where the columns are the scores of some metrics:



      A B C D  
      4 3 3 1
      2 5 2 2
      3 5 2 4


      I want to create a new column to summarize which metrics each row scored over a set threshold in, using the column name as a string. So if the threshold was A > 2, B > 3, C > 1, D > 3, I would want the new column to look like this:



      A B C D NewCol  
      4 3 3 1 AC
      2 5 2 2 BC
      3 5 2 4 ABCD


      I tried using a series of np.where:



      df[NewCol] = np.where(df['A'] > 2, 'A', '')  
      df[NewCol] = np.where(df['B'] > 3, 'B', '')


      etc.



      but realized the result was overwriting with the last metric any time all four metrics didn't meet the conditions, like so:



      A B C D NewCol  
      4 3 3 1 C
      2 5 2 2 C
      3 5 2 4 ABCD


      I am pretty sure there is an easier and correct way to do this.







      python pandas numpy dataframe






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 27 '18 at 8:45









      WhatsThePoint

      2,32162239




      2,32162239










      asked Nov 27 '18 at 0:54









      J.S.P.J.S.P.

      203




      203
























          3 Answers
          3






          active

          oldest

          votes


















          2














          You could do:



          import pandas as pd

          data = [[4, 3, 3, 1],
          [2, 5, 2, 2],
          [3, 5, 2, 4]]

          df = pd.DataFrame(data=data, columns=['A', 'B', 'C', 'D'])

          th = {'A': 2, 'B': 3, 'C': 1, 'D': 3}

          df['result'] = [''.join(k for k in df.columns if record[k] > th[k]) for record in df.to_dict('records')]

          print(df)


          Output



             A  B  C  D result
          0 4 3 3 1 AC
          1 2 5 2 2 BC
          2 3 5 2 4 ABCD





          share|improve this answer
























          • Thanks Daniel. This worked perfectly.

            – J.S.P.
            Nov 27 '18 at 19:38



















          2














          Using dot



          s=pd.Series([2,3,1,3],index=df.columns)
          df.gt(s,1).dot(df.columns)
          Out[179]:
          0 AC
          1 BC
          2 ABCD
          dtype: object

          #df['New']=df.gt(s,1).dot(df.columns)





          share|improve this answer
























          • Thanks W-B. I wasn’t able to implement your answer because I don’t really understand it and I struggled trying to research the numpy docs. It was fine until my data changed a little (new columns). Once it broke I didn’t understand enough to figure out how to fix it.

            – J.S.P.
            Nov 27 '18 at 19:41



















          1














          Another option that operates in an array fashion. It would be interesting to compare performance.



          import pandas as pd
          import numpy as np

          # Data to test.

          data = pd.DataFrame(
          [
          [4, 3, 3, 1],
          [2, 5, 2, 2],
          [3, 5, 2, 4]
          ]
          , columns = ['A', 'B', 'C', 'D']
          )

          # Series to hold the thresholds.

          thresholds = pd.Series([2, 3, 1, 3], index = ['A', 'B', 'C', 'D'])

          # Subtract the series from the data, broadcasting, and then use sum to concatenate the strings.

          data['result'] = np.where(data - thresholds > 0, data.columns, '').sum(axis = 1)

          print(data)


          Gives:



             A  B  C  D result
          0 4 3 3 1 AC
          1 2 5 2 2 BC
          2 3 5 2 4 ABCD





          share|improve this answer
























          • Thanks smj. If my data gets bigger maybe I will try the comparison.

            – J.S.P.
            Nov 27 '18 at 19:42












          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53491264%2fcreate-a-column-in-a-dataframe-that-is-a-string-of-characters-summarizing-data-i%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          3 Answers
          3






          active

          oldest

          votes








          3 Answers
          3






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          2














          You could do:



          import pandas as pd

          data = [[4, 3, 3, 1],
          [2, 5, 2, 2],
          [3, 5, 2, 4]]

          df = pd.DataFrame(data=data, columns=['A', 'B', 'C', 'D'])

          th = {'A': 2, 'B': 3, 'C': 1, 'D': 3}

          df['result'] = [''.join(k for k in df.columns if record[k] > th[k]) for record in df.to_dict('records')]

          print(df)


          Output



             A  B  C  D result
          0 4 3 3 1 AC
          1 2 5 2 2 BC
          2 3 5 2 4 ABCD





          share|improve this answer
























          • Thanks Daniel. This worked perfectly.

            – J.S.P.
            Nov 27 '18 at 19:38
















          2














          You could do:



          import pandas as pd

          data = [[4, 3, 3, 1],
          [2, 5, 2, 2],
          [3, 5, 2, 4]]

          df = pd.DataFrame(data=data, columns=['A', 'B', 'C', 'D'])

          th = {'A': 2, 'B': 3, 'C': 1, 'D': 3}

          df['result'] = [''.join(k for k in df.columns if record[k] > th[k]) for record in df.to_dict('records')]

          print(df)


          Output



             A  B  C  D result
          0 4 3 3 1 AC
          1 2 5 2 2 BC
          2 3 5 2 4 ABCD





          share|improve this answer
























          • Thanks Daniel. This worked perfectly.

            – J.S.P.
            Nov 27 '18 at 19:38














          2












          2








          2







          You could do:



          import pandas as pd

          data = [[4, 3, 3, 1],
          [2, 5, 2, 2],
          [3, 5, 2, 4]]

          df = pd.DataFrame(data=data, columns=['A', 'B', 'C', 'D'])

          th = {'A': 2, 'B': 3, 'C': 1, 'D': 3}

          df['result'] = [''.join(k for k in df.columns if record[k] > th[k]) for record in df.to_dict('records')]

          print(df)


          Output



             A  B  C  D result
          0 4 3 3 1 AC
          1 2 5 2 2 BC
          2 3 5 2 4 ABCD





          share|improve this answer













          You could do:



          import pandas as pd

          data = [[4, 3, 3, 1],
          [2, 5, 2, 2],
          [3, 5, 2, 4]]

          df = pd.DataFrame(data=data, columns=['A', 'B', 'C', 'D'])

          th = {'A': 2, 'B': 3, 'C': 1, 'D': 3}

          df['result'] = [''.join(k for k in df.columns if record[k] > th[k]) for record in df.to_dict('records')]

          print(df)


          Output



             A  B  C  D result
          0 4 3 3 1 AC
          1 2 5 2 2 BC
          2 3 5 2 4 ABCD






          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Nov 27 '18 at 1:07









          Daniel MesejoDaniel Mesejo

          18.8k21533




          18.8k21533













          • Thanks Daniel. This worked perfectly.

            – J.S.P.
            Nov 27 '18 at 19:38



















          • Thanks Daniel. This worked perfectly.

            – J.S.P.
            Nov 27 '18 at 19:38

















          Thanks Daniel. This worked perfectly.

          – J.S.P.
          Nov 27 '18 at 19:38





          Thanks Daniel. This worked perfectly.

          – J.S.P.
          Nov 27 '18 at 19:38













          2














          Using dot



          s=pd.Series([2,3,1,3],index=df.columns)
          df.gt(s,1).dot(df.columns)
          Out[179]:
          0 AC
          1 BC
          2 ABCD
          dtype: object

          #df['New']=df.gt(s,1).dot(df.columns)





          share|improve this answer
























          • Thanks W-B. I wasn’t able to implement your answer because I don’t really understand it and I struggled trying to research the numpy docs. It was fine until my data changed a little (new columns). Once it broke I didn’t understand enough to figure out how to fix it.

            – J.S.P.
            Nov 27 '18 at 19:41
















          2














          Using dot



          s=pd.Series([2,3,1,3],index=df.columns)
          df.gt(s,1).dot(df.columns)
          Out[179]:
          0 AC
          1 BC
          2 ABCD
          dtype: object

          #df['New']=df.gt(s,1).dot(df.columns)





          share|improve this answer
























          • Thanks W-B. I wasn’t able to implement your answer because I don’t really understand it and I struggled trying to research the numpy docs. It was fine until my data changed a little (new columns). Once it broke I didn’t understand enough to figure out how to fix it.

            – J.S.P.
            Nov 27 '18 at 19:41














          2












          2








          2







          Using dot



          s=pd.Series([2,3,1,3],index=df.columns)
          df.gt(s,1).dot(df.columns)
          Out[179]:
          0 AC
          1 BC
          2 ABCD
          dtype: object

          #df['New']=df.gt(s,1).dot(df.columns)





          share|improve this answer













          Using dot



          s=pd.Series([2,3,1,3],index=df.columns)
          df.gt(s,1).dot(df.columns)
          Out[179]:
          0 AC
          1 BC
          2 ABCD
          dtype: object

          #df['New']=df.gt(s,1).dot(df.columns)






          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Nov 27 '18 at 1:36









          Wen-BenWen-Ben

          126k83872




          126k83872













          • Thanks W-B. I wasn’t able to implement your answer because I don’t really understand it and I struggled trying to research the numpy docs. It was fine until my data changed a little (new columns). Once it broke I didn’t understand enough to figure out how to fix it.

            – J.S.P.
            Nov 27 '18 at 19:41



















          • Thanks W-B. I wasn’t able to implement your answer because I don’t really understand it and I struggled trying to research the numpy docs. It was fine until my data changed a little (new columns). Once it broke I didn’t understand enough to figure out how to fix it.

            – J.S.P.
            Nov 27 '18 at 19:41

















          Thanks W-B. I wasn’t able to implement your answer because I don’t really understand it and I struggled trying to research the numpy docs. It was fine until my data changed a little (new columns). Once it broke I didn’t understand enough to figure out how to fix it.

          – J.S.P.
          Nov 27 '18 at 19:41





          Thanks W-B. I wasn’t able to implement your answer because I don’t really understand it and I struggled trying to research the numpy docs. It was fine until my data changed a little (new columns). Once it broke I didn’t understand enough to figure out how to fix it.

          – J.S.P.
          Nov 27 '18 at 19:41











          1














          Another option that operates in an array fashion. It would be interesting to compare performance.



          import pandas as pd
          import numpy as np

          # Data to test.

          data = pd.DataFrame(
          [
          [4, 3, 3, 1],
          [2, 5, 2, 2],
          [3, 5, 2, 4]
          ]
          , columns = ['A', 'B', 'C', 'D']
          )

          # Series to hold the thresholds.

          thresholds = pd.Series([2, 3, 1, 3], index = ['A', 'B', 'C', 'D'])

          # Subtract the series from the data, broadcasting, and then use sum to concatenate the strings.

          data['result'] = np.where(data - thresholds > 0, data.columns, '').sum(axis = 1)

          print(data)


          Gives:



             A  B  C  D result
          0 4 3 3 1 AC
          1 2 5 2 2 BC
          2 3 5 2 4 ABCD





          share|improve this answer
























          • Thanks smj. If my data gets bigger maybe I will try the comparison.

            – J.S.P.
            Nov 27 '18 at 19:42
















          1














          Another option that operates in an array fashion. It would be interesting to compare performance.



          import pandas as pd
          import numpy as np

          # Data to test.

          data = pd.DataFrame(
          [
          [4, 3, 3, 1],
          [2, 5, 2, 2],
          [3, 5, 2, 4]
          ]
          , columns = ['A', 'B', 'C', 'D']
          )

          # Series to hold the thresholds.

          thresholds = pd.Series([2, 3, 1, 3], index = ['A', 'B', 'C', 'D'])

          # Subtract the series from the data, broadcasting, and then use sum to concatenate the strings.

          data['result'] = np.where(data - thresholds > 0, data.columns, '').sum(axis = 1)

          print(data)


          Gives:



             A  B  C  D result
          0 4 3 3 1 AC
          1 2 5 2 2 BC
          2 3 5 2 4 ABCD





          share|improve this answer
























          • Thanks smj. If my data gets bigger maybe I will try the comparison.

            – J.S.P.
            Nov 27 '18 at 19:42














          1












          1








          1







          Another option that operates in an array fashion. It would be interesting to compare performance.



          import pandas as pd
          import numpy as np

          # Data to test.

          data = pd.DataFrame(
          [
          [4, 3, 3, 1],
          [2, 5, 2, 2],
          [3, 5, 2, 4]
          ]
          , columns = ['A', 'B', 'C', 'D']
          )

          # Series to hold the thresholds.

          thresholds = pd.Series([2, 3, 1, 3], index = ['A', 'B', 'C', 'D'])

          # Subtract the series from the data, broadcasting, and then use sum to concatenate the strings.

          data['result'] = np.where(data - thresholds > 0, data.columns, '').sum(axis = 1)

          print(data)


          Gives:



             A  B  C  D result
          0 4 3 3 1 AC
          1 2 5 2 2 BC
          2 3 5 2 4 ABCD





          share|improve this answer













          Another option that operates in an array fashion. It would be interesting to compare performance.



          import pandas as pd
          import numpy as np

          # Data to test.

          data = pd.DataFrame(
          [
          [4, 3, 3, 1],
          [2, 5, 2, 2],
          [3, 5, 2, 4]
          ]
          , columns = ['A', 'B', 'C', 'D']
          )

          # Series to hold the thresholds.

          thresholds = pd.Series([2, 3, 1, 3], index = ['A', 'B', 'C', 'D'])

          # Subtract the series from the data, broadcasting, and then use sum to concatenate the strings.

          data['result'] = np.where(data - thresholds > 0, data.columns, '').sum(axis = 1)

          print(data)


          Gives:



             A  B  C  D result
          0 4 3 3 1 AC
          1 2 5 2 2 BC
          2 3 5 2 4 ABCD






          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Nov 27 '18 at 1:29









          smjsmj

          1,146613




          1,146613













          • Thanks smj. If my data gets bigger maybe I will try the comparison.

            – J.S.P.
            Nov 27 '18 at 19:42



















          • Thanks smj. If my data gets bigger maybe I will try the comparison.

            – J.S.P.
            Nov 27 '18 at 19:42

















          Thanks smj. If my data gets bigger maybe I will try the comparison.

          – J.S.P.
          Nov 27 '18 at 19:42





          Thanks smj. If my data gets bigger maybe I will try the comparison.

          – J.S.P.
          Nov 27 '18 at 19:42


















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53491264%2fcreate-a-column-in-a-dataframe-that-is-a-string-of-characters-summarizing-data-i%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          Tonle Sap (See)

          I get strange results when I access the Sqlitedatabase with Unity C# via XAMPP

          Guatemaltekische Davis-Cup-Mannschaft