Python function for splitting strung together word into individual words
I am trying to come up with a function that takes entries like
"businessidentifier", "firstname", "streetaddress"
and outputs
"business identifier", "first name", "street address"
This seems to be a fairly complicated problem involving NLP, since the function will have to iterate over a string and test against a vocabulary to see when it arrives at a word in the vocabulary, but for the first example "businessidentifier" might be seen first as "bus I ness identifier". Has anyone come across a function that accomplishes this task?
python nlp
You might need to look into a language model... a naive approach might be to try all possible ways to break the sentence such that you end up with valid words, then use a language model to tell you which possible sentence is most likely. I doubt there's any simple function you can call which does this for you, though
– scnerd
Nov 26 '18 at 19:40
I think this is a two step problem: 1) You're going to need to identify possible words which shouldn't be too hard given enough compute time and a corpus. 2) Given all the possible combos of words identify the "most" correct sentence given some NLP or other technique. Second step is definitely non-trivial and the first step will take exponentially long given large string(s) and a representative corpus.
– n8sty
Nov 26 '18 at 19:43
this is what I figured. thank you both for confirming my suspicions, I will give that a try.
– jgcello
Nov 26 '18 at 19:58
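The naive approach described in the comments above can be sketched as follows. This is a minimal illustration, not code from the question or answer: the small `VOCAB` set here is a hypothetical stand-in for a real word list (such as nltk's corpus), and the function simply enumerates every way the input can be broken into vocabulary words.

```python
# Naive enumeration of every possible segmentation, as suggested in the
# comments. VOCAB is a toy stand-in for a real corpus-derived word list.
VOCAB = {"bus", "business", "identifier", "first", "name",
         "street", "address"}

def all_segmentations(s):
    """Return every list of VOCAB words whose concatenation equals s."""
    if not s:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        if prefix in VOCAB:
            # Keep this prefix and recurse on the remainder.
            for rest in all_segmentations(s[i:]):
                results.append([prefix] + rest)
    return results

print(all_segmentations("businessidentifier"))
# -> [['business', 'identifier']]
```

A language model (or simpler heuristic) would then score the candidate segmentations to pick the most plausible one; with a realistic vocabulary, many strings admit more than one valid split.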
asked Nov 26 '18 at 19:34
jgcello
1 Answer
First we need a large set of English words; I used nltk here. I loaded the words into a dict, eng_dict, keyed by first letter, so that, for example, every word starting with 'a' sits under key 'a'. That makes looking up candidate words faster. Within each bucket I sorted the words by length, longest first, so that when we scan the sentence we try the longest candidates before shorter ones: given 'businessidentifier' we check 'business' before, say, 'bus'.
With the words in this shape we can write a function that matches the sentence against them. The recursive function below tries every word that starts with the same letter as the remaining string; when one matches, it is appended to the result list and the function recurses on the rest.
from nltk.corpus import words

word_list = words.words()
# Bucket words by first letter, sorted longest first, so the longest match is tried first.
eng_dict = {chr(i): sorted([word for word in word_list if word[0] == chr(i)], key=len, reverse=True) for i in range(ord('a'), ord('z')+1)}

def split_into_words(x):
    ret = []
    for word in eng_dict[x[0]]:
        if x.startswith(word):
            ret.append(word)
            x = x[len(word):]
            break
    if len(x) != 0:
        ret.extend(split_into_words(x))
    return ret

raw_sentences = ["businessidentifier", "firstname", "streetaddress"]
final_sentence = [split_into_words(i) for i in raw_sentences]
print(final_sentence)
Output:
[['business', 'identifier'], ['first', 'name'], ['street', 'address']]
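One caveat worth noting: the greedy longest-first match commits to the first matching prefix and never reconsiders, so it can dead-end when the longest prefix is not the right split. A backtracking variant fixes this; the sketch below is illustrative, with a tiny hypothetical `VOCAB` standing in for the nltk word list so it stays self-contained.

```python
# Backtracking variant of the greedy matcher: still prefers longer
# prefixes, but backs out of dead ends instead of failing.
# VOCAB is a toy stand-in for the nltk word list.
VOCAB = {"at", "ate", "end", "business", "identifier"}

def split_backtracking(s):
    """Return one segmentation of s into VOCAB words, or None."""
    if not s:
        return []
    # Try longer prefixes first, mirroring the greedy preference.
    for i in range(len(s), 0, -1):
        prefix = s[:i]
        if prefix in VOCAB:
            rest = split_backtracking(s[i:])
            if rest is not None:  # remainder segmented successfully
                return [prefix] + rest
    return None  # no valid split from this position

# Greedy would consume "ate" and strand "nd"; backtracking recovers.
print(split_backtracking("atend"))
# -> ['at', 'end']
```

For large inputs this can be exponential in the worst case; memoizing on the remaining string (e.g. with functools.lru_cache) keeps it polynomial.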
this is great! basically what I was in the process of creating, but much more concise. Thanks so much for writing this!
– jgcello
Nov 26 '18 at 20:09
Glad to have been of help!
– Filip Młynarski
Nov 26 '18 at 20:10
answered Nov 26 '18 at 20:00
Filip Młynarski