How to find the count of consecutive same string values in a pandas dataframe?











Assume that we have the following pandas dataframe:



import pandas as pd

df = pd.DataFrame({'col1': ['A>G','C>T','C>T','G>T','C>T','A>G','A>G','A>G'],
                   'col2': ['TCT','ACA','TCA','TCA','GCT','ACT','CTG','ATG'],
                   'start': [1000,2000,3000,4000,5000,6000,10000,20000]})

input:
   col1 col2  start
0   A>G  TCT   1000
1   C>T  ACA   2000
2   C>T  TCA   3000
3   G>T  TCA   4000
4   C>T  GCT   5000
5   A>G  ACT   6000
6   A>G  CTG  10000
7   A>G  ATG  20000
8   C>A  TCT  10000
9   C>T  ACA   2000
10  C>T  TCA   3000
11  C>T  TCA   4000


What I want is, for each run of consecutive identical values in col1, the run's value (type), the length of the run, and the difference between the last element's start and the first element's start:



output:
  type  length   diff
0  C>T       2   1000
1  A>G       3  14000
2  C>T       3   2000









python dataframe






asked Nov 19 at 21:56 by burcak












• The data frame defined in df = ... is missing some rows compared to the example below.
  – Matthias Ossadnik, Nov 19 at 22:48






























4 Answers






Answer by coldspeed (accepted, score 2), answered Nov 19 at 22:48










With a little setup, you can 100% vectorise this using GroupBy.agg:



# rename-and-aggregate spec: for each run, take the first col1 value as "type",
# the row count as "length", and the start span as "diff"
aggfunc = {
    'col1': [('type', 'first'), ('length', 'count')],
    'start': [('diff', lambda x: abs(x.iat[-1] - x.iat[0]))]
}

# label consecutive runs: a new label starts wherever col1 differs from the previous row
grouper = df.col1.ne(df.col1.shift()).cumsum()

v = df.assign(key=grouper).groupby('key').agg(aggfunc)
v.columns = v.columns.droplevel(0)          # drop the original column names, keep the new ones
v[v['diff'].ne(0)].reset_index(drop=True)   # drop single-row runs (their start span is 0)

  type  length   diff
0  C>T       2   1000
1  A>G       3  14000
2  C>T       3   2000
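For anyone puzzled by the grouper line: it is the standard run-labelling trick, and a small sketch (using the 8-row frame from the question's df definition, so the labels below are what the expression evaluates to there) makes it concrete:

# Sketch of how the run labels are built. col1.ne(col1.shift()) is True
# wherever col1 differs from the previous row, and cumsum() turns those
# change points into run ids.
changes = df.col1.ne(df.col1.shift())  # True, True, False, True, True, True, False, False
run_id = changes.cumsum()              # 1, 2, 2, 3, 4, 5, 5, 5
# Rows 1-2 share label 2 (the first C>T run) and rows 5-7 share label 5
# (the A>G run), so grouping by this key groups consecutive equal values.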





• up voted. imo, this is the most concise and optimized solution.
  – teng, Nov 19 at 22:54
• @teng Thanks, returned :)
  – coldspeed, Nov 19 at 22:55
• Thanks. How does this aggfunc work here? Could you please explain?
  – burcak, Nov 20 at 18:59
• @burcak The keys are columns to aggregate. The values are a list of tuples. The first element is the column name of the output column, and the second is a function (or function name as a string) that does the aggregation.
  – coldspeed, Nov 20 at 21:17
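As a complement to that explanation: on pandas 0.25 and newer the same renaming can be expressed with named aggregation, which skips the MultiIndex and the droplevel(0) step. A minimal sketch mirroring the accepted answer (assuming pandas >= 0.25):

# Minimal sketch, assuming pandas >= 0.25: keyword-style named aggregation
# instead of the list-of-tuples form, so no droplevel(0) is needed.
key = df.col1.ne(df.col1.shift()).cumsum()
out = (df.groupby(key)
         .agg(type=('col1', 'first'),
              length=('col1', 'count'),
              diff=('start', lambda x: abs(x.iat[-1] - x.iat[0]))))
out = out[out['diff'].ne(0)].reset_index(drop=True)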




















Answer by teng (score 2), answered Nov 19 at 22:34













probably something like the below:



import pandas as pd
from itertools import groupby

df = pd.DataFrame({
    'col1': ['A>G','C>T','C>T','G>T','C>T','A>G','A>G','A>G','C>T','C>T','C>T'],
    'col2': ['TCT','ACA','TCA','TCA','GCT','ACT','CTG','ATG','ACA','TCA','TCA'],
    'start': [1000,2000,3000,4000,5000,6000,10000,20000,2000,3000,4000]})

final = []
pos = 0
# group consecutive equal col1 values; keep only runs longer than one row
for k, g in groupby([row.col1 for n, row in df.iterrows()]):
    glist = [x for x in g]
    first_pos = pos
    last_pos = pos + len(glist) - 1
    if len(glist) > 1:
        print(glist)
        val = df.iloc[first_pos].col1
        first = df.iloc[first_pos].start
        last = df.iloc[last_pos].start
        final.append({'type': val, 'length': len(glist), 'diff': last - first})
    pos = last_pos + 1
final = pd.DataFrame(final)
print(final)


output:



    diff  length type
0   1000       2  C>T
1  14000       3  A>G
2   2000       3  C>T





Answer by Matthias Ossadnik (score 0), answered Nov 19 at 22:38













Here is a two-step solution, first creating an auxiliary column that labels consecutive occurrences of the same string, and then using standard pandas groupby:



import numpy as np

# add a group variable
values = df['col1'].values
# get locations where the value changes
change = np.zeros(values.size, dtype=bool)
change[1:] = values[:-1] != values[1:]
df['group'] = change.cumsum()  # summing change points yields the label

# do the aggregation
res = (df
       .groupby('group')
       .agg({'start': lambda x: x.max() - x.min(), 'col1': 'first', 'col2': 'size'})
       .rename(columns={'col1': 'type', 'col2': 'length', 'start': 'diff'})
       )
# filter on more than one consecutive value
res = res[res['length'] > 1]

print(res)

        diff type  length
group
1       1000  C>T       2
4      14000  A>G       3
5       2000  C>T       3
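If the group index and column order should match the output requested in the question, one small follow-up step on res should be enough; this is just a sketch on top of the result computed above:

# Sketch: drop the group index and reorder the columns to match the
# requested output layout (type, length, diff).
res = res.reset_index(drop=True)[['type', 'length', 'diff']]
print(res)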





Answer by Eric Wang (score 0), answered Nov 19 at 22:38, edited Nov 19 at 22:44













You can use pandas groupby and more_itertools:



import more_itertools as mit

def f(g):
    result = pd.DataFrame(columns=['type', 'length', 'diff'])
    tp = g['col1'].iloc[0]
    # consecutive index positions within this col1 group are consecutive rows
    for group in mit.consecutive_groups(g.index):
        group = list(group)
        if len(group) == 1:
            continue
        cur_df = pd.DataFrame({'type': [tp], 'length': [len(group)],
                               'diff': g.loc[group[-1]]['start'] - g.loc[group[0]]['start']})
        result = pd.concat([result, cur_df], ignore_index=True)
    return result

df.groupby('col1').apply(f).reset_index(drop=True)
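For readers unfamiliar with the helper used here: more_itertools.consecutive_groups splits an iterable of integers into runs of consecutive values, which is why it recovers the consecutive row positions inside each col1 group. A tiny illustration (the index values below are made up for the example):

# Illustration of more_itertools.consecutive_groups on a hypothetical index.
import more_itertools as mit

idx = [1, 2, 4, 5, 6, 9]
print([list(g) for g in mit.consecutive_groups(idx)])
# [[1, 2], [4, 5, 6], [9]]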




