How to find the count of consecutive same string values in a pandas dataframe?











Assume that we have the following pandas dataframe:



import pandas as pd

df = pd.DataFrame({'col1': ['A>G','C>T','C>T','G>T','C>T','A>G','A>G','A>G'],
                   'col2': ['TCT','ACA','TCA','TCA','GCT','ACT','CTG','ATG'],
                   'start': [1000,2000,3000,4000,5000,6000,10000,20000]})

input:
   col1 col2  start
0   A>G  TCT   1000
1   C>T  ACA   2000
2   C>T  TCA   3000
3   G>T  TCA   4000
4   C>T  GCT   5000
5   A>G  ACT   6000
6   A>G  CTG  10000
7   A>G  ATG  20000
8   C>A  TCT  10000
9   C>T  ACA   2000
10  C>T  TCA   3000
11  C>T  TCA   4000


What I want is, for each run of consecutive identical values in col1, the run's value (type), the length of the run, and the difference between the last element's start and the first element's start:



output:
  type  length   diff
0  C>T       2   1000
1  A>G       3  14000
2  C>T       3   2000









python dataframe






asked Nov 19 at 21:56 by burcak












• The data frame defined in df = ... is missing some rows compared to the example below.
  – Matthias Ossadnik, Nov 19 at 22:48






























4 Answers






Answer by coldspeed (accepted, score 2), answered Nov 19 at 22:48










With a little setup, you can 100% vectorise this using GroupBy.agg:



# rename-and-aggregate spec: for each run, take the first col1 value as "type",
# the row count as "length", and the start span as "diff"
aggfunc = {
    'col1': [('type', 'first'), ('length', 'count')],
    'start': [('diff', lambda x: abs(x.iat[-1] - x.iat[0]))]
}

# label consecutive runs: a new label starts wherever col1 differs from the previous row
grouper = df.col1.ne(df.col1.shift()).cumsum()

v = df.assign(key=grouper).groupby('key').agg(aggfunc)
v.columns = v.columns.droplevel(0)          # drop the original column names, keep the new ones
v[v['diff'].ne(0)].reset_index(drop=True)   # drop single-row runs (their start span is 0)

  type  length   diff
0  C>T       2   1000
1  A>G       3  14000
2  C>T       3   2000
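For anyone puzzled by the grouper line: it is the standard run-labelling trick, and a small sketch (using the 8-row frame from the question's df definition, so the labels below are what the expression evaluates to there) makes it concrete:

# Sketch of how the run labels are built. col1.ne(col1.shift()) is True
# wherever col1 differs from the previous row, and cumsum() turns those
# change points into run ids.
changes = df.col1.ne(df.col1.shift())  # True, True, False, True, True, True, False, False
run_id = changes.cumsum()              # 1, 2, 2, 3, 4, 5, 5, 5
# Rows 1-2 share label 2 (the first C>T run) and rows 5-7 share label 5
# (the A>G run), so grouping by this key groups consecutive equal values.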





• up voted. imo, this is the most concise and optimized solution.
  – teng, Nov 19 at 22:54
• @teng Thanks, returned :)
  – coldspeed, Nov 19 at 22:55
• Thanks. How does this aggfunc work here? Could you please explain?
  – burcak, Nov 20 at 18:59
• @burcak The keys are columns to aggregate. The values are a list of tuples. The first element is the column name of the output column, and the second is a function (or function name as a string) that does the aggregation.
  – coldspeed, Nov 20 at 21:17
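As a complement to that explanation: on pandas 0.25 and newer the same renaming can be expressed with named aggregation, which skips the MultiIndex and the droplevel(0) step. A minimal sketch mirroring the accepted answer (assuming pandas >= 0.25):

# Minimal sketch, assuming pandas >= 0.25: keyword-style named aggregation
# instead of the list-of-tuples form, so no droplevel(0) is needed.
key = df.col1.ne(df.col1.shift()).cumsum()
out = (df.groupby(key)
         .agg(type=('col1', 'first'),
              length=('col1', 'count'),
              diff=('start', lambda x: abs(x.iat[-1] - x.iat[0]))))
out = out[out['diff'].ne(0)].reset_index(drop=True)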




















Answer by teng (score 2), answered Nov 19 at 22:34













probably something like the below:



import pandas as pd
from itertools import groupby

df = pd.DataFrame({
    'col1': ['A>G','C>T','C>T','G>T','C>T','A>G','A>G','A>G','C>T','C>T','C>T'],
    'col2': ['TCT','ACA','TCA','TCA','GCT','ACT','CTG','ATG','ACA','TCA','TCA'],
    'start': [1000,2000,3000,4000,5000,6000,10000,20000,2000,3000,4000]})

final = []
pos = 0
# group consecutive equal col1 values; keep only runs longer than one row
for k, g in groupby([row.col1 for n, row in df.iterrows()]):
    glist = [x for x in g]
    first_pos = pos
    last_pos = pos + len(glist) - 1
    if len(glist) > 1:
        print(glist)
        val = df.iloc[first_pos].col1
        first = df.iloc[first_pos].start
        last = df.iloc[last_pos].start
        final.append({'type': val, 'length': len(glist), 'diff': last - first})
    pos = last_pos + 1
final = pd.DataFrame(final)
print(final)


output:



    diff  length type
0   1000       2  C>T
1  14000       3  A>G
2   2000       3  C>T





Answer by Matthias Ossadnik (score 0), answered Nov 19 at 22:38













Here is a two-step solution, first creating an auxiliary column that labels consecutive occurrences of the same string, and then using standard pandas groupby:



import numpy as np

# add a group variable
values = df['col1'].values
# get locations where the value changes
change = np.zeros(values.size, dtype=bool)
change[1:] = values[:-1] != values[1:]
df['group'] = change.cumsum()  # summing change points yields the label

# do the aggregation
res = (df
       .groupby('group')
       .agg({'start': lambda x: x.max() - x.min(), 'col1': 'first', 'col2': 'size'})
       .rename(columns={'col1': 'type', 'col2': 'length', 'start': 'diff'})
       )
# filter on more than one consecutive value
res = res[res['length'] > 1]

print(res)

        diff type  length
group
1       1000  C>T       2
4      14000  A>G       3
5       2000  C>T       3
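If the group index and column order should match the output requested in the question, one small follow-up step on res should be enough; this is just a sketch on top of the result computed above:

# Sketch: drop the group index and reorder the columns to match the
# requested output layout (type, length, diff).
res = res.reset_index(drop=True)[['type', 'length', 'diff']]
print(res)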





Answer by Eric Wang (score 0), answered Nov 19 at 22:38, edited Nov 19 at 22:44













You can use pandas groupby and more_itertools:



import more_itertools as mit

def f(g):
    result = pd.DataFrame(columns=['type', 'length', 'diff'])
    tp = g['col1'].iloc[0]
    # consecutive index positions within this col1 group are consecutive rows
    for group in mit.consecutive_groups(g.index):
        group = list(group)
        if len(group) == 1:
            continue
        cur_df = pd.DataFrame({'type': [tp], 'length': [len(group)],
                               'diff': g.loc[group[-1]]['start'] - g.loc[group[0]]['start']})
        result = pd.concat([result, cur_df], ignore_index=True)
    return result

df.groupby('col1').apply(f).reset_index(drop=True)
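For readers unfamiliar with the helper used here: more_itertools.consecutive_groups splits an iterable of integers into runs of consecutive values, which is why it recovers the consecutive row positions inside each col1 group. A tiny illustration (the index values below are made up for the example):

# Illustration of more_itertools.consecutive_groups on a hypothetical index.
import more_itertools as mit

idx = [1, 2, 4, 5, 6, 9]
print([list(g) for g in mit.consecutive_groups(idx)])
# [[1, 2], [4, 5, 6], [9]]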




