Convert pandas.DataFrame to numpy tensor using factor levels for shape [duplicate]

This question already has an answer here:

  • Transform Pandas DataFrame with n-level hierarchical index into n-D Numpy array (1 answer)

I have data from a full factorial experiment. For example, for each of N samples, I have J types of measurement and K measurement loci. I receive this data in long format, for example,



import numpy as np
import pandas as pd
import itertools
from numpy.random import normal as rnorm

# [[N], [J], [K]]
levels = [[1,2,3,4], ['start', 'stop'], ['gene1', 'gene2', 'gene3']]

# fully crossed
exp_design = list(itertools.product(*levels))

df = pd.DataFrame(exp_design, columns=["sample", "mode", "gene"])

# some fake data
df['x'] = rnorm(size=len(exp_design))


which results in 24 observations (x) with a column for each of the three factors.



> df.head()
   sample   mode   gene         x
0       1  start  gene1 -1.229370
1       1  start  gene2  1.129773
2       1  start  gene3 -1.155202
3       1   stop  gene1 -0.757551
4       1   stop  gene2 -0.166129


I want to convert these observations to the corresponding (N,J,K)-shaped tensor (numpy array). I thought that pivoting to wide format with a MultiIndex and then extracting the values would produce the correct tensor, but the result is just a column vector:



> df.pivot_table(values='x', index=['sample', 'mode', 'gene']).values
array([[-1.22936989],
       [ 1.12977346],
       [-1.15520216],
       ...,
       [-0.1031641 ],
       [ 1.1296491 ],
       [ 1.31113584]])
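
To make the symptom concrete, here is a minimal sketch that rebuilds the example frame and inspects the pivoted result. The seeded generator is an assumption added only for reproducibility; it is not part of the original data. It suggests that the factor structure survives in the MultiIndex even though `.values` is flat:

```python
import itertools

import numpy as np
import pandas as pd

levels = [[1, 2, 3, 4], ["start", "stop"], ["gene1", "gene2", "gene3"]]
df = pd.DataFrame(list(itertools.product(*levels)), columns=["sample", "mode", "gene"])
# Assumption: a seeded generator stands in for the unseeded rnorm call.
df["x"] = np.random.default_rng(0).normal(size=len(df))

wide = df.pivot_table(values="x", index=["sample", "mode", "gene"])

# The pivot keeps one row per (sample, mode, gene) combination and one column.
print(wide.shape)           # (24, 1)
# The index still records how many levels each factor has.
print(wide.index.levshape)  # (4, 2, 3)
```

So the three factors end up as index levels rather than array axes, which is why `.values` has 24 rows and a single column.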


Is there a quick way to get tensor-formatted data from a long-format pandas.DataFrame?

marked as duplicate by merv, Community Nov 29 '18 at 2:37

This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.

Tags: python, pandas, numpy, tensor, numpy-ndarray

asked Nov 24 '18 at 3:59 by merv (edited Nov 24 '18 at 4:13)

1 Answer

Try counting the unique levels of each factor with nunique, then reshaping the values to those counts. This relies on the rows already being ordered by sample, then mode, then gene, as they are in the question's example.

df.agg('nunique')

Out[69]:
sample     4
mode       2
gene       3
x         24
dtype: int64

s = df.agg('nunique')
df.x.values.reshape(s['sample'], s['mode'], s['gene'])

Out[71]:
array([[[-2.78133759e-01, -1.42234420e+00,  5.42439121e-01],
        [ 2.15359867e+00,  6.55837886e-01, -1.01293568e+00]],

       [[ 7.92306679e-01, -1.62539763e-01, -6.13120335e-01],
        [-2.91567999e-01, -4.01257702e-01,  7.96422763e-01]],

       [[ 1.05088264e-01, -7.23400925e-02,  2.78515041e-01],
        [ 2.63088568e-01,  1.47477886e+00, -2.10735619e+00]],

       [[-1.71756374e+00,  6.12224005e-04, -3.11562798e-02],
        [ 5.26028807e-01, -1.18502045e+00,  1.88633760e+00]]])

• @merv check the update
  – Wen-Ben, Nov 24 '18 at 5:21

• I think it's important to note here that this assumes the data frame is first sorted, e.g. df.sort_values(by=['sample', 'mode', 'gene'])
  – merv, Nov 24 '18 at 5:46

• @merv yes, you are right
  – Wen-Ben, Nov 24 '18 at 6:33
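
Following up on the sorting caveat raised in the comments, one way to avoid depending on row order is to align the values to the full factorial grid before reshaping. This is only a sketch, not the answer's method: the helper name `long_to_tensor` is hypothetical, and it assumes at most one row per factor combination (combinations missing from the frame would come out as NaN).

```python
import itertools

import numpy as np
import pandas as pd


def long_to_tensor(df, factors, value):
    """Hypothetical helper: reshape a long-format frame into an n-D array."""
    s = df.set_index(factors)[value]
    # Build the full factorial grid from the levels of each factor.
    full = pd.MultiIndex.from_product(s.index.levels, names=factors)
    # Align rows to the grid, whatever order they arrived in.
    s = s.reindex(full)
    labels = [list(level) for level in full.levels]
    return s.to_numpy().reshape(full.levshape), labels


levels = [[1, 2, 3, 4], ["start", "stop"], ["gene1", "gene2", "gene3"]]
df = pd.DataFrame(list(itertools.product(*levels)), columns=["sample", "mode", "gene"])
# Assumption: sequential values instead of random ones, so positions are easy to check.
df["x"] = np.arange(len(df), dtype=float)

# Shuffle the rows to show that the result does not depend on their order.
shuffled = df.sample(frac=1, random_state=0)
tensor, axis_labels = long_to_tensor(shuffled, ["sample", "mode", "gene"], "x")
print(tensor.shape)  # (4, 2, 3)
```

The second return value lists the labels along each axis, so `tensor[i, j, k]` can be traced back to `axis_labels[0][i]`, `axis_labels[1][j]` and `axis_labels[2][k]`.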


















answered Nov 24 '18 at 4:36 by Wen-Ben (edited Nov 24 '18 at 5:21)