Splitting a string into words and punctuation

I'm trying to split a string up into words and punctuation, adding the punctuation to the list produced by the split.

For instance:

>>> c = "help, me"
>>> print c.split()
['help,', 'me']

What I really want the list to look like is:

['help', ',', 'me']

So, I want the string split at whitespace with the punctuation split from the words.

I've tried to parse the string first and then run the split:

>>> separatedPunctuation = ""
>>> for character in c:
...     if character in ".,;!?":
...         outputCharacter = " %s" % character
...     else:
...         outputCharacter = character
...     separatedPunctuation += outputCharacter
...
>>> print separatedPunctuation
help , me
>>> print separatedPunctuation.split()
['help', ',', 'me']

This produces the result I want, but is painfully slow on large files.

Is there a way to do this more efficiently?

edited Dec 14 '08 at 23:56

Fionnuala

asked Dec 14 '08 at 23:30

David A

For this example (not the general case): c.replace(' ','').partition(',')

– Chris_Rands
Nov 21 '16 at 8:59

python string split


10 Answers

This is more or less the way to do it:

>>> import re
>>> re.findall(r"[\w']+|[.,!?;]", "Hello, I'm a string!")
['Hello', ',', "I'm", 'a', 'string', '!']

The trick is not to think about where to split the string, but about what to include in the tokens.

Caveats:

The underscore (_) is considered an inner-word character. Replace \w if you don't want that.

This will not work with (single) quotes in the string: because the apostrophe is part of the word class, a quote mark stays attached to the word next to it.

Put any additional punctuation marks you want to use in the right half of the regular expression.

Anything not explicitly mentioned in the re is silently dropped.
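To make the caveats above concrete, here is a minimal sketch written for Python 3 (the thread itself uses Python 2 syntax). The names TOKEN_RE and tokenize are only illustrative and are not part of the original answer.

```python
import re

# A token is either a run of word characters and apostrophes,
# or exactly one of the listed punctuation marks.
TOKEN_RE = re.compile(r"[\w']+|[.,!?;]")

def tokenize(text):
    """Return the words and listed punctuation marks found in text."""
    return TOKEN_RE.findall(text)

print(tokenize("Hello, I'm a string!"))  # ['Hello', ',', "I'm", 'a', 'string', '!']

# The underscore counts as a word character, and ':', '(', '+', ')'
# are not listed in the pattern, so they are silently dropped.
print(tokenize("snake_case: (a + b)"))   # ['snake_case', 'a', 'b']
```

If dropping unlisted characters is not acceptable, the right-hand character class has to be widened, which is what some of the other answers do.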

edited Aug 19 '11 at 23:01

answered Dec 15 '08 at 1:53

user3850

If you want to split at ANY punctuation, including ', try re.findall(r"[\w]+|[^\s\w]", "Hello, I'm a string!"). The result is ['Hello', ',', 'I', "'", 'm', 'a', 'string', '!']. Note also that digits are included in the word match.

– Codie CodeMonkey
May 15 '12 at 8:21

Sorry! could you explain how exactly this is working?

– Curious
Feb 5 '16 at 2:36

@Curious: to be honest, no I could not. Because, where should I start? What do you know? Which part is a problem for you? What do you want to achieve?

– user3850
Feb 5 '16 at 19:01

Never mind! I understood this myself! Thanks for the reply :)

– Curious
Feb 5 '16 at 20:39

Here is a Unicode-aware version:

re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

The first alternative catches sequences of word characters (as defined by Unicode, so "résumé" won't turn into ['r', 'sum']); the second catches individual non-word characters, ignoring whitespace.

Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.
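A small sketch of the same pattern under Python 3, where str patterns match Unicode word characters by default, so the re.UNICODE flag is left out here. The function name tokenize is just for illustration.

```python
import re

def tokenize(text):
    # \w+ grabs runs of word characters; [^\w\s] grabs any single
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

# "\u00e9" is the precomposed letter e-acute, so the word is "resume" with accents.
print(tokenize("r\u00e9sum\u00e9, I'm"))  # ['résumé', ',', 'I', "'", 'm']
```

Note that an accented letter written as a base letter plus a combining mark may behave differently, since the combining mark is not a word character.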

answered Jan 19 '12 at 17:58

LaC

10.4k53138

2

Upvoted because the w+|[^ws] construct is more generic than the accepted answer but afaik in python 3 the re.UNICODE shouldn't be necessary

– rloth
Jan 5 '15 at 16:21

add a comment |

In perl-style regular expression syntax, \b matches a word boundary. This should come in handy for doing a regex-based split.

edit: I have been informed by hop that "empty matches" do not work in the split function of Python's re module. I will leave this here as information for anyone else getting stumped by this "feature".
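For readers on a newer interpreter: since Python 3.7, re.split does split on empty matches, so the \b idea can be tried directly. The sketch below assumes Python 3.7 or later; split_on_boundaries is a made-up name.

```python
import re

def split_on_boundaries(text):
    # Requires Python 3.7+: earlier versions do not split on a pattern
    # that only matches the empty string. The pieces keep their
    # surrounding whitespace, so strip each one and drop empty pieces.
    pieces = re.split(r"\b", text)
    return [p.strip() for p in pieces if p.strip()]

print(split_on_boundaries("help, me"))  # ['help', ',', 'me']
```

Adjacent punctuation stays together with this approach (for example "!!!" comes back as one token), so it is not a drop-in replacement for the findall-based answers.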

edited Dec 15 '08 at 9:41

answered Dec 15 '08 at 0:25

Svante

only it doesn't because re.split will not work with r'\b'...

– user3850
Dec 15 '08 at 1:09

What the hell? Is that a bug in re.split? In Perl, split /\b\s*/ works without any problem.

– Svante
Dec 15 '08 at 1:29

it's kind of documented that re.split() won't split on empty matches... so, no, not /really/ a bug.

– user3850
Dec 15 '08 at 1:51

"kind of documented"? Even if it is really documented, it is still not helpful in any way, so I guess it is, in fact, a bug-redeclared-feature.

– Svante
Dec 15 '08 at 2:08

maybe. i don't know the rationale behind it. you should have checked whether it worked in any case! i cannot remove the downvote anymore, but please consider rewording the passive-aggressive edit -- doesn't help anyone.

– user3850
Dec 15 '08 at 9:16

Here's my entry.

I have my doubts as to how well this will hold up in the sense of efficiency, or if it catches all cases (note the "!!!" grouped together; this may or may not be a good thing).

>>> import re
>>> import string
>>> s = "Helo, my name is Joe! and i live!!! in a button; factory:"
>>> l = [item for item in map(string.strip, re.split(r"(\W+)", s)) if len(item) > 0]
>>> l
['Helo', ',', 'my', 'name', 'is', 'Joe', '!', 'and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']
>>>

One obvious optimization would be to compile the regex beforehand (using re.compile) if you're going to be doing this on a line-by-line basis.
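The compile-once suggestion could look roughly like this in Python 3. SPLIT_RE and tokenize_lines are hypothetical names, and the generator shape is my own choice rather than something from the answer.

```python
import re

# Compile once and reuse for every line. The capturing group makes
# re.split keep the separators (runs of non-word characters).
SPLIT_RE = re.compile(r"(\W+)")

def tokenize_lines(lines):
    """Yield words and punctuation groups from an iterable of lines."""
    for line in lines:
        for piece in SPLIT_RE.split(line):
            piece = piece.strip()
            if piece:  # skip pieces that were only whitespace
                yield piece

sample = ["Helo, my name is Joe!", "and i live!!! in a button; factory:"]
print(list(tokenize_lines(sample)))
```

Because it is a generator, the same function can be fed an open file object and will process it one line at a time instead of loading everything into memory.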

answered Dec 15 '08 at 1:30

Chris Cameron

plus 1 for grouping punctuation.

– UnsignedByte
Apr 4 '18 at 3:27

Here's a minor update to your implementation. If you're trying to do anything more detailed, I suggest looking into the NLTK that le dorfier suggested.

This may be only slightly faster: it uses ''.join() in place of +=, and join is generally considered the faster way to build strings.

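import string

d = "Hello, I'm a string!"

result = []
word = ''

for char in d:
    if char not in string.whitespace:
        if char not in string.ascii_letters + "'":
            # punctuation (or a digit): flush the current word, then emit the character
            if word:
                result.append(word)
            result.append(char)
            word = ''
        else:
            word = ''.join([word, char])
    else:
        # whitespace ends the current word
        if word:
            result.append(word)
            word = ''

# flush a trailing word that is not followed by punctuation or whitespace
if word:
    result.append(word)

print result

['Hello', ',', "I'm", 'a', 'string', '!']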
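A variant of the loop above, sketched in Python 3, that remembers where the current word started and slices it out in one go instead of rebuilding the word one character at a time. WORD_CHARS and tokenize are illustrative names, and I have not measured whether this is actually faster.

```python
import string

# Same word alphabet as the loop above: ASCII letters plus the apostrophe.
WORD_CHARS = set(string.ascii_letters + "'")

def tokenize(text):
    result = []
    start = None  # index where the current word began, or None
    for i, char in enumerate(text):
        if char in WORD_CHARS:
            if start is None:
                start = i
            continue
        # Any other character ends the current word, if there is one.
        if start is not None:
            result.append(text[start:i])
            start = None
        if char not in string.whitespace:
            result.append(char)
    # Flush a word that runs to the end of the string.
    if start is not None:
        result.append(text[start:])
    return result

print(tokenize("Hello, I'm a string"))  # ['Hello', ',', "I'm", 'a', 'string']
```

As in the original loop, digits and any non-ASCII letters are treated as single-character tokens, which may or may not be what you want.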

answered Dec 15 '08 at 1:05

monkut

i have not profiled this, but i guess the main problem is with the char-by-char concatenation of word. i'd instead use an index and slices.

– user3850
Dec 15 '08 at 10:24

With tricks i can shave 50% off the execution time of your solution. my solution with re.findall() is still twice as fast.

– user3850
Dec 15 '08 at 12:17

You need to call if word: result.append(word) after the loop ends, else the last word is not in result.

– Roland Pihlakas
May 25 '17 at 12:15

I think you can find all the help you can imagine in the NLTK, especially since you are using Python. There's a good, comprehensive discussion of this issue in the tutorial.

answered Dec 15 '08 at 0:34

dkretz

I came up with a way to tokenize all words and \W+ patterns using \b, which doesn't need hardcoding:

>>> import re
>>> sentence = 'Hello, world!'
>>> tokens = [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', sentence)]
>>> tokens
['Hello', ',', 'world', '!']

Here .*?\S.*? matches the shortest stretch of text that contains at least one non-space character, and $ is added to match the last token in the string if it is a punctuation symbol.

Note the following though -- this will group punctuation that consists of more than one symbol:

>>> print [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"Oh no", she said')]
['Oh', 'no', '",', 'she', 'said']

Of course, you can find and split such groups with:

>>> for token in [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"You can", she said')]:
...     print re.findall(r'(?:\w+|\W)', token)
...
['You']
['can']
['"', ',']
['she']
['said']

edited Apr 15 '14 at 19:16

answered Apr 15 '14 at 19:11

FrauHahnhen

Try this:

string_big = "One of Python's coolest features is the string format operator  This operator is unique to strings"

my_list = []
x = len(string_big)
poistion_ofspace = 0

while poistion_ofspace < x:
    # scan forward to the next space (or to the end of the string)
    for i in range(poistion_ofspace, x):
        if string_big[i] == ' ':
            break
    # the slice keeps the trailing space, if there is one
    print string_big[poistion_ofspace:(i+1)]
    my_list.append(string_big[poistion_ofspace:(i+1)])
    poistion_ofspace = i+1

print my_list

edited Apr 18 '17 at 9:28

Aurasphere

answered Apr 18 '17 at 9:03

Siddharth Sonone

If you are going to work in English (or some other common languages), you can use NLTK (there are many other tools to do this such as FreeLing).

import nltk
sentence = "help, me"
nltk.word_tokenize(sentence)

edited Nov 9 '18 at 10:30

answered Nov 8 '18 at 16:16

Fernando S. Peregrino

Have you tried using a regex?

http://docs.python.org/library/re.html#re-syntax

By the way, why do you need the "," as the second element? You already know it follows each piece of text, i.e.

[0]

","

[1]

","

So if you want to add the ",", you can just do it after each iteration when you use the array.

answered Dec 14 '08 at 23:34

Filip Ekberg

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f367155%2fsplitting-a-string-into-words-and-punctuation%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

10 Answers
10

active

oldest

votes

10 Answers
10

active

oldest

votes

This is more or less the way to do it:

>>> import re

>>> re.findall(r"[w']+|[.,!?;]", "Hello, I'm a string!")

['Hello', ',', "I'm", 'a', 'string', '!']

The trick is, not to think about where to split the string, but what to include in the tokens.

Caveats:

The underscore (_) is considered an inner-word character. Replace w, if you don't want that.

This will not work with (single) quotes in the string.

Put any additional punctuation marks you want to use in the right half of the regular expression.

Anything not explicitely mentioned in the re is silently dropped.

edited Aug 19 '11 at 23:01

answered Dec 15 '08 at 1:53

user3850

2

If you want to split at ANY punctuation, including ', try re.findall(r"[w]+|[^sw]", "Hello, I'm a string!"). The result is ['Hello', ',', 'I', "'", 'm', 'a', 'string', '!'] Note also that digits are included in the word match.

– Codie CodeMonkey
May 15 '12 at 8:21

Sorry! could you explain how exactly this is working?

– Curious
Feb 5 '16 at 2:36

@Curious: to be honest, no I coiuld not. Because, where should I start? What do you know? Which part is a problem for you? What do you want to achieve?

– user3850
Feb 5 '16 at 19:01

Never mind! I understood this myself! Thanks for the reply :)

– Curious
Feb 5 '16 at 20:39

add a comment |

This is more or less the way to do it:

>>> import re

>>> re.findall(r"[w']+|[.,!?;]", "Hello, I'm a string!")

['Hello', ',', "I'm", 'a', 'string', '!']

The trick is, not to think about where to split the string, but what to include in the tokens.

Caveats:

The underscore (_) is considered an inner-word character. Replace w, if you don't want that.

This will not work with (single) quotes in the string.

Put any additional punctuation marks you want to use in the right half of the regular expression.

Anything not explicitely mentioned in the re is silently dropped.

edited Aug 19 '11 at 23:01

answered Dec 15 '08 at 1:53

user3850

2

If you want to split at ANY punctuation, including ', try re.findall(r"[w]+|[^sw]", "Hello, I'm a string!"). The result is ['Hello', ',', 'I', "'", 'm', 'a', 'string', '!'] Note also that digits are included in the word match.

– Codie CodeMonkey
May 15 '12 at 8:21

Sorry! could you explain how exactly this is working?

– Curious
Feb 5 '16 at 2:36

@Curious: to be honest, no I coiuld not. Because, where should I start? What do you know? Which part is a problem for you? What do you want to achieve?

– user3850
Feb 5 '16 at 19:01

Never mind! I understood this myself! Thanks for the reply :)

– Curious
Feb 5 '16 at 20:39

add a comment |

This is more or less the way to do it:

>>> import re

>>> re.findall(r"[w']+|[.,!?;]", "Hello, I'm a string!")

['Hello', ',', "I'm", 'a', 'string', '!']

The trick is, not to think about where to split the string, but what to include in the tokens.

Caveats:

The underscore (_) is considered an inner-word character. Replace w, if you don't want that.

This will not work with (single) quotes in the string.

Put any additional punctuation marks you want to use in the right half of the regular expression.

Anything not explicitely mentioned in the re is silently dropped.

edited Aug 19 '11 at 23:01

answered Dec 15 '08 at 1:53

user3850

This is more or less the way to do it:

>>> import re

>>> re.findall(r"[w']+|[.,!?;]", "Hello, I'm a string!")

['Hello', ',', "I'm", 'a', 'string', '!']

The trick is, not to think about where to split the string, but what to include in the tokens.

Caveats:

The underscore (_) is considered an inner-word character. Replace w, if you don't want that.

This will not work with (single) quotes in the string.

Put any additional punctuation marks you want to use in the right half of the regular expression.

Anything not explicitely mentioned in the re is silently dropped.

edited Aug 19 '11 at 23:01

answered Dec 15 '08 at 1:53

user3850

edited Aug 19 '11 at 23:01

answered Dec 15 '08 at 1:53

user3850

answered Dec 15 '08 at 1:53

user3850

answered Dec 15 '08 at 1:53

user3850

2

If you want to split at ANY punctuation, including ', try re.findall(r"[w]+|[^sw]", "Hello, I'm a string!"). The result is ['Hello', ',', 'I', "'", 'm', 'a', 'string', '!'] Note also that digits are included in the word match.

– Codie CodeMonkey
May 15 '12 at 8:21

Sorry! could you explain how exactly this is working?

– Curious
Feb 5 '16 at 2:36

@Curious: to be honest, no I coiuld not. Because, where should I start? What do you know? Which part is a problem for you? What do you want to achieve?

– user3850
Feb 5 '16 at 19:01

Never mind! I understood this myself! Thanks for the reply :)

– Curious
Feb 5 '16 at 20:39

add a comment |

2

If you want to split at ANY punctuation, including ', try re.findall(r"[w]+|[^sw]", "Hello, I'm a string!"). The result is ['Hello', ',', 'I', "'", 'm', 'a', 'string', '!'] Note also that digits are included in the word match.

– Codie CodeMonkey
May 15 '12 at 8:21

Sorry! could you explain how exactly this is working?

– Curious
Feb 5 '16 at 2:36

@Curious: to be honest, no I coiuld not. Because, where should I start? What do you know? Which part is a problem for you? What do you want to achieve?

– user3850
Feb 5 '16 at 19:01

Never mind! I understood this myself! Thanks for the reply :)

– Curious
Feb 5 '16 at 20:39

If you want to split at ANY punctuation, including ', try re.findall(r"[w]+|[^sw]", "Hello, I'm a string!"). The result is ['Hello', ',', 'I', "'", 'm', 'a', 'string', '!'] Note also that digits are included in the word match.

– Codie CodeMonkey
May 15 '12 at 8:21

Sorry! could you explain how exactly this is working?

– Curious
Feb 5 '16 at 2:36

@Curious: to be honest, no I coiuld not. Because, where should I start? What do you know? Which part is a problem for you? What do you want to achieve?

– user3850
Feb 5 '16 at 19:01

Never mind! I understood this myself! Thanks for the reply :)

– Curious
Feb 5 '16 at 20:39

add a comment |

Here is a Unicode-aware version:

re.findall(r"w+|[^ws]", text, re.UNICODE)

Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.

answered Jan 19 '12 at 17:58

LaC

10.4k53138

2

Upvoted because the w+|[^ws] construct is more generic than the accepted answer but afaik in python 3 the re.UNICODE shouldn't be necessary

– rloth
Jan 5 '15 at 16:21

add a comment |

Here is a Unicode-aware version:

re.findall(r"w+|[^ws]", text, re.UNICODE)

Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.

answered Jan 19 '12 at 17:58

LaC

10.4k53138

2

Upvoted because the w+|[^ws] construct is more generic than the accepted answer but afaik in python 3 the re.UNICODE shouldn't be necessary

– rloth
Jan 5 '15 at 16:21

add a comment |

Here is a Unicode-aware version:

re.findall(r"w+|[^ws]", text, re.UNICODE)

Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.

answered Jan 19 '12 at 17:58

LaC

10.4k53138

Here is a Unicode-aware version:

re.findall(r"w+|[^ws]", text, re.UNICODE)

Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.

answered Jan 19 '12 at 17:58

LaC

10.4k53138

answered Jan 19 '12 at 17:58

LaC

10.4k53138

answered Jan 19 '12 at 17:58

LaC

10.4k53138

answered Jan 19 '12 at 17:58

LaC

10.4k53138

2

Upvoted because the w+|[^ws] construct is more generic than the accepted answer but afaik in python 3 the re.UNICODE shouldn't be necessary

– rloth
Jan 5 '15 at 16:21

add a comment |

2

Upvoted because the w+|[^ws] construct is more generic than the accepted answer but afaik in python 3 the re.UNICODE shouldn't be necessary

– rloth
Jan 5 '15 at 16:21

Upvoted because the w+|[^ws] construct is more generic than the accepted answer but afaik in python 3 the re.UNICODE shouldn't be necessary

– rloth
Jan 5 '15 at 16:21

add a comment |

In perl-style regular expression syntax, b matches a word boundary. This should come in handy for doing a regex-based split.

edited Dec 15 '08 at 9:41

answered Dec 15 '08 at 0:25

Svante

40k664111

1

only it doesn't because re.split will not work with r'b'...

– user3850
Dec 15 '08 at 1:09

What the hell? Is that a bug in re.split? In Perl, split /bs*/ works without any problem.

– Svante
Dec 15 '08 at 1:29

it's kind of documented that re.split() won't split on empty matches... so, no, not /really/ a bug.

– user3850
Dec 15 '08 at 1:51

1

"kind of documented"? Even if it is really documented, it is still not helpful in any way, so I guess it is, in fact, a bug-redeclared-feature.

– Svante
Dec 15 '08 at 2:08

maybe. i don't know the rationale behind it. you should have checked whether it worked in any case! i cannot remove the downvote anymore, but please consider rewording the passive-aggressive edit -- doesn't help anyone.

– user3850
Dec 15 '08 at 9:16

|
show 1 more comment

In perl-style regular expression syntax, b matches a word boundary. This should come in handy for doing a regex-based split.

edited Dec 15 '08 at 9:41

answered Dec 15 '08 at 0:25

Svante

40k664111

1

only it doesn't because re.split will not work with r'b'...

– user3850
Dec 15 '08 at 1:09

What the hell? Is that a bug in re.split? In Perl, split /bs*/ works without any problem.

– Svante
Dec 15 '08 at 1:29

it's kind of documented that re.split() won't split on empty matches... so, no, not /really/ a bug.

– user3850
Dec 15 '08 at 1:51

1

"kind of documented"? Even if it is really documented, it is still not helpful in any way, so I guess it is, in fact, a bug-redeclared-feature.

– Svante
Dec 15 '08 at 2:08

maybe. i don't know the rationale behind it. you should have checked whether it worked in any case! i cannot remove the downvote anymore, but please consider rewording the passive-aggressive edit -- doesn't help anyone.

– user3850
Dec 15 '08 at 9:16

|
show 1 more comment

In perl-style regular expression syntax, b matches a word boundary. This should come in handy for doing a regex-based split.

edited Dec 15 '08 at 9:41

answered Dec 15 '08 at 0:25

Svante

40k664111

In perl-style regular expression syntax, b matches a word boundary. This should come in handy for doing a regex-based split.

edited Dec 15 '08 at 9:41

answered Dec 15 '08 at 0:25

Svante

40k664111

edited Dec 15 '08 at 9:41

answered Dec 15 '08 at 0:25

Svante

40k664111

answered Dec 15 '08 at 0:25

Svante

40k664111

answered Dec 15 '08 at 0:25

Svante

40k664111

1

only it doesn't because re.split will not work with r'b'...

– user3850
Dec 15 '08 at 1:09

What the hell? Is that a bug in re.split? In Perl, split /bs*/ works without any problem.

– Svante
Dec 15 '08 at 1:29

it's kind of documented that re.split() won't split on empty matches... so, no, not /really/ a bug.

– user3850
Dec 15 '08 at 1:51

1

"kind of documented"? Even if it is really documented, it is still not helpful in any way, so I guess it is, in fact, a bug-redeclared-feature.

– Svante
Dec 15 '08 at 2:08

maybe. i don't know the rationale behind it. you should have checked whether it worked in any case! i cannot remove the downvote anymore, but please consider rewording the passive-aggressive edit -- doesn't help anyone.

– user3850
Dec 15 '08 at 9:16

|
show 1 more comment

1

only it doesn't because re.split will not work with r'b'...

– user3850
Dec 15 '08 at 1:09

What the hell? Is that a bug in re.split? In Perl, split /bs*/ works without any problem.

– Svante
Dec 15 '08 at 1:29

it's kind of documented that re.split() won't split on empty matches... so, no, not /really/ a bug.

– user3850
Dec 15 '08 at 1:51

1

"kind of documented"? Even if it is really documented, it is still not helpful in any way, so I guess it is, in fact, a bug-redeclared-feature.

– Svante
Dec 15 '08 at 2:08

maybe. i don't know the rationale behind it. you should have checked whether it worked in any case! i cannot remove the downvote anymore, but please consider rewording the passive-aggressive edit -- doesn't help anyone.

– user3850
Dec 15 '08 at 9:16

only it doesn't because re.split will not work with r'b'...

– user3850
Dec 15 '08 at 1:09

What the hell? Is that a bug in re.split? In Perl, split /bs*/ works without any problem.

– Svante
Dec 15 '08 at 1:29

it's kind of documented that re.split() won't split on empty matches... so, no, not /really/ a bug.

– user3850
Dec 15 '08 at 1:51

"kind of documented"? Even if it is really documented, it is still not helpful in any way, so I guess it is, in fact, a bug-redeclared-feature.

– Svante
Dec 15 '08 at 2:08

maybe. i don't know the rationale behind it. you should have checked whether it worked in any case! i cannot remove the downvote anymore, but please consider rewording the passive-aggressive edit -- doesn't help anyone.

– user3850
Dec 15 '08 at 9:16

|
show 1 more comment

Here's my entry.

I have my doubts as to how well this will hold up in the sense of efficiency, or if it catches all cases (note the "!!!" grouped together; this may or may not be a good thing).

>>> import re

>>> import string

>>> s = "Helo, my name is Joe! and i live!!! in a button; factory:"

>>> l = [item for item in map(string.strip, re.split("(W+)", s)) if len(item) > 0]

>>> l

['Helo', ',', 'my', 'name', 'is', 'Joe', '!', 'and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']

>>>

One obvious optimization would be to compile the regex before hand (using re.compile) if you're going to be doing this on a line-by-line basis.

answered Dec 15 '08 at 1:30

Chris Cameron

6,33332646

plus 1 for grouping punctuation.

– UnsignedByte
Apr 4 '18 at 3:27

add a comment |

Here's my entry.

I have my doubts as to how well this will hold up in the sense of efficiency, or if it catches all cases (note the "!!!" grouped together; this may or may not be a good thing).

>>> import re

>>> import string

>>> s = "Helo, my name is Joe! and i live!!! in a button; factory:"

>>> l = [item for item in map(string.strip, re.split("(W+)", s)) if len(item) > 0]

>>> l

['Helo', ',', 'my', 'name', 'is', 'Joe', '!', 'and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']

>>>

One obvious optimization would be to compile the regex before hand (using re.compile) if you're going to be doing this on a line-by-line basis.

answered Dec 15 '08 at 1:30

Chris Cameron

6,33332646

plus 1 for grouping punctuation.

– UnsignedByte
Apr 4 '18 at 3:27

add a comment |

Here's my entry.

I have my doubts as to how well this will hold up in the sense of efficiency, or if it catches all cases (note the "!!!" grouped together; this may or may not be a good thing).

>>> import re

>>> import string

>>> s = "Helo, my name is Joe! and i live!!! in a button; factory:"

>>> l = [item for item in map(string.strip, re.split("(W+)", s)) if len(item) > 0]

>>> l

['Helo', ',', 'my', 'name', 'is', 'Joe', '!', 'and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']

>>>

One obvious optimization would be to compile the regex before hand (using re.compile) if you're going to be doing this on a line-by-line basis.

answered Dec 15 '08 at 1:30

Chris Cameron

6,33332646

Here's my entry.

I have my doubts as to how well this will hold up in the sense of efficiency, or if it catches all cases (note the "!!!" grouped together; this may or may not be a good thing).

>>> import re

>>> import string

>>> s = "Helo, my name is Joe! and i live!!! in a button; factory:"

>>> l = [item for item in map(string.strip, re.split("(W+)", s)) if len(item) > 0]

>>> l

['Helo', ',', 'my', 'name', 'is', 'Joe', '!', 'and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']

>>>

One obvious optimization would be to compile the regex before hand (using re.compile) if you're going to be doing this on a line-by-line basis.

answered Dec 15 '08 at 1:30

Chris Cameron

6,33332646

answered Dec 15 '08 at 1:30

Chris Cameron

6,33332646

answered Dec 15 '08 at 1:30

Chris Cameron

6,33332646

answered Dec 15 '08 at 1:30

Chris Cameron

6,33332646

plus 1 for grouping punctuation.

– UnsignedByte
Apr 4 '18 at 3:27

add a comment |

plus 1 for grouping punctuation.

– UnsignedByte
Apr 4 '18 at 3:27

plus 1 for grouping punctuation.

– UnsignedByte
Apr 4 '18 at 3:27

add a comment |

Here's a minor update to your implementation. If your trying to doing anything more detailed I suggest looking into the NLTK that le dorfier suggested.

This might only be a little faster since ''.join() is used in place of +=, which is known to be faster.

import string



d = "Hello, I'm a string!"



result = 

word = ''



for char in d:

    if char not in string.whitespace:

        if char not in string.ascii_letters + "'":

            if word:

                    result.append(word)

            result.append(char)

            word = ''

        else:

            word = ''.join([word,char])



    else:

        if word:

            result.append(word)

            word = ''

print result

['Hello', ',', "I'm", 'a', 'string', '!']

answered Dec 15 '08 at 1:05

monkut

I have not profiled this, but I guess the main problem is the char-by-char concatenation of word. I'd use an index and slices instead.

– user3850
Dec 15 '08 at 10:24

With tricks I can shave 50% off the execution time of your solution. My solution with re.findall() is still twice as fast.

– user3850
Dec 15 '08 at 12:17

You need to call if word: result.append(word) after the loop ends, else the last word is not in result.

– Roland Pihlakas
May 25 '17 at 12:15
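
The index-and-slices idea from the comments could look roughly like the sketch below. It is written in Python 3 syntax, the function name is made up for illustration, and it has not been profiled against the other answers. It keeps the same rules as the loop above (letters and apostrophes form words, every other non-whitespace character is its own token) and includes the final flush for a trailing word.

```python
import string

WORD_CHARS = set(string.ascii_letters + "'")


def split_with_slices(text):
    """Split text into words and single punctuation characters."""
    result = []
    start = None  # index where the current word began, or None if not in a word
    for i, char in enumerate(text):
        if char in WORD_CHARS:
            if start is None:
                start = i
            continue
        # Any other character ends the current word: take it as one slice.
        if start is not None:
            result.append(text[start:i])
            start = None
        if char not in string.whitespace:
            result.append(char)
    # Flush a word that runs to the end of the string.
    if start is not None:
        result.append(text[start:])
    return result


print(split_with_slices("Hello, I'm a string!"))
```

Each word is copied once, as a slice, instead of being rebuilt one character at a time.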


I think you can find all the help you can imagine in the NLTK, especially since you are using Python. There's a good, comprehensive discussion of this issue in the tutorial.

answered Dec 15 '08 at 0:34

dkretz


I came up with a way to tokenize all words and \W+ patterns using \b, which doesn't need hardcoding:

>>> import re
>>> sentence = 'Hello, world!'
>>> tokens = [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', sentence)]
>>> tokens
['Hello', ',', 'world', '!']

Here .*?\S.*? matches a run that contains at least one non-whitespace character, and $ is added so that the last token is still matched when it is a punctuation symbol.

Note, though, that this groups punctuation consisting of more than one symbol:

>>> print [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"Oh no", she said')]
['Oh', 'no', '",', 'she', 'said']

Of course, you can find and split such groups with:

>>> for token in [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"You can", she said')]:
...     print re.findall(r'(?:\w+|\W)', token)
...
['You']
['can']
['"', ',']
['she']
['said']
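
The two steps can be folded into one helper. This is only a sketch in Python 3 syntax; the names TOKEN, PIECE and tokenize are made up for illustration. As in the examples above, punctuation at the very start of the string is skipped, because there is no \b in front of it.

```python
import re

# First pass: runs that start at a word boundary and contain at least
# one non-whitespace character.
TOKEN = re.compile(r'\b.*?\S.*?(?:\b|$)')
# Second pass: split a run into whole words or single non-word characters.
PIECE = re.compile(r'\w+|\W')


def tokenize(sentence):
    """Return words and individual punctuation symbols as a flat list."""
    tokens = []
    for chunk in TOKEN.findall(sentence):
        tokens.extend(PIECE.findall(chunk.strip()))
    return tokens


print(tokenize('"You can", she said'))
```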

edited Apr 15 '14 at 19:16

answered Apr 15 '14 at 19:11

FrauHahnhen


Try this:

string_big = "One of Python's coolest features is the string format operator  This operator is unique to strings"
my_list = []
x = len(string_big)
position_of_space = 0

while position_of_space < x:
    # Scan forward to the next space, or to the end of the string.
    for i in range(position_of_space, x):
        if string_big[i] == ' ':
            break
    # Each piece keeps its trailing space, if it has one.
    print string_big[position_of_space:(i+1)]
    my_list.append(string_big[position_of_space:(i+1)])
    position_of_space = i+1

print my_list

edited Apr 18 '17 at 9:28

Aurasphere

answered Apr 18 '17 at 9:03

Siddharth Sonone


If you are going to work in English (or another common language), you can use NLTK. There are many other tools for this as well, such as FreeLing.

import nltk

sentence = "help, me"
# word_tokenize needs NLTK's tokenizer data, which is downloaded separately from the package.
nltk.word_tokenize(sentence)

edited Nov 9 '18 at 10:30

answered Nov 8 '18 at 16:16

Fernando S. Peregrino

Have you tried using a regex?

http://docs.python.org/library/re.html#re-syntax
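
A minimal sketch of what a regex-based approach could look like (Python 3 syntax; the names TOKEN and words_and_punctuation are made up for illustration, and the pattern is one possible choice, not the only one):

```python
import re

# A token is either a run of word characters or a single character that is
# neither a word character nor whitespace (i.e. one punctuation symbol).
TOKEN = re.compile(r"\w+|[^\w\s]")


def words_and_punctuation(text):
    """Return the words and punctuation symbols of text, in order."""
    return TOKEN.findall(text)


print(words_and_punctuation("help, me"))
```

With this pattern an apostrophe counts as punctuation, so "I'm" comes out as three tokens; extending the first alternative to [\w']+ would keep contractions together.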

By the way, why do you need the "," as a separate element? You already know that one follows each piece of text, i.e.

[0]
","
[1]
","

So if you want the ",", you can just add it after each element when you iterate over the array.

answered Dec 14 '08 at 23:34

Filip Ekberg
