Python function for splitting a strung-together word into individual words




















I am trying to come up with a function that takes entries like




"businessidentifier", "firstname", "streetaddress"




and outputs




"business identifier", "first name", "street address"




This seems to be a fairly complicated problem involving NLP, since the function has to scan the string and test substrings against a vocabulary until it reaches a known word; the catch is that for the first example, "businessidentifier" might first be segmented as "bus I ness identifier". Has anyone come across a function that accomplishes this task?

































  • You might need to look into a language model... a naive approach might be to try all possible ways to break the sentence such that you end up with valid words, then use a language model to tell you which possible sentence is most likely. I doubt there's any simple function you can call which does this for you, though

    – scnerd
    Nov 26 '18 at 19:40











  • I think this is a two step problem: 1) You're going to need to identify possible words which shouldn't be too hard given enough compute time and a corpus. 2) Given all the possible combos of words identify the "most" correct sentence given some NLP or other technique. Second step is definitely non-trivial and the first step will take exponentially long given large string(s) and a representative corpus.

    – n8sty
    Nov 26 '18 at 19:43











  • this is what I figured. thank you both for confirming my suspicions, I will give that a try.

    – jgcello
    Nov 26 '18 at 19:58
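The two-step idea from the comments (enumerate possible segmentations, then score them) can be sketched with dynamic programming over a word-frequency table. This is only an illustration: the `word_freq` table below is a tiny hand-made stand-in, not a real corpus, and real use would load frequencies from actual data.

```python
import math

# Illustrative word frequencies; a real implementation would load a corpus.
word_freq = {
    "business": 5000, "identifier": 300, "bus": 8000, "i": 90000,
    "ness": 10, "first": 20000, "name": 15000, "street": 9000,
    "address": 7000, "dent": 50,
}
total = sum(word_freq.values())

def cost(word):
    # Negative log probability; words not in the table are forbidden.
    return -math.log(word_freq[word] / total) if word in word_freq else math.inf

def segment(s):
    # best[i] = (cost, words) for the cheapest segmentation of s[:i]
    best = [(0.0, [])] + [(math.inf, None)] * len(s)
    for i in range(1, len(s) + 1):
        for j in range(max(0, i - 20), i):  # cap candidate word length at 20
            c = best[j][0] + cost(s[j:i])
            if c < best[i][0]:
                best[i] = (c, best[j][1] + [s[j:i]])
    return best[len(s)][1]

print(segment("businessidentifier"))  # ['business', 'identifier']
```

Because "business identifier" has a lower total cost than "bus i ness identifier", the frequency model resolves exactly the ambiguity raised in the question.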


















python nlp






asked Nov 26 '18 at 19:34









jgcello










































1 Answer






































First we need a large list of English words; I used NLTK here. I then load every word into a dict keyed by its first letter (so, for example, all words starting with 'a' live under eng_dict['a']), which makes lookups faster, and sort each bucket by length in descending order so that we always try the longest candidates first: given 'businessidentifier' we check 'business' before, say, 'bus'.
With the words in this shape we can write a function that matches the sentence against them. The function is recursive: it scans the words that start with the same letter as the remaining string; when one matches as a prefix, it appends that word to the result list and recurses on the rest.



from nltk.corpus import words

# English word list from NLTK (run nltk.download('words') once if needed).
word_list = words.words()

# Bucket words by first letter, each bucket sorted longest-first so that
# e.g. 'business' is tried before 'bus'.
eng_dict = {chr(i): sorted([word for word in word_list if word[0] == chr(i)],
                           key=len, reverse=True)
            for i in range(ord('a'), ord('z') + 1)}

def split_into_words(x):
    ret = []
    for word in eng_dict[x[0]]:
        if x.startswith(word):  # greedy: the longest matching prefix wins
            ret.append(word)
            x = x[len(word):]
            break
    if len(x) != 0:
        ret.extend(split_into_words(x))
    return ret

raw_sentences = ["businessidentifier", "firstname", "streetaddress"]
final_sentence = [split_into_words(i) for i in raw_sentences]

print(final_sentence)


Output:



[['business', 'identifier'], ['first', 'name'], ['street', 'address']]
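One caveat worth noting: the greedy longest-prefix match never backtracks, so if the longest matching prefix leaves an unsegmentable remainder, the recursion dead-ends (and if no word matches at all, it recurses forever on the same string). A backtracking variant can be sketched as follows; the in-line `vocab` set here is a small illustrative stand-in for the NLTK word list, and `split_backtracking` is a hypothetical name, not part of the answer's code.

```python
# Illustrative vocabulary; a real run would use the NLTK word list instead.
vocab = {"bus", "business", "ines", "son", "identifier"}

def split_backtracking(x):
    if not x:
        return []
    # Try candidate prefixes longest-first, like the answer's sort order,
    # but fall back to shorter prefixes if the remainder can't be segmented.
    for end in range(len(x), 0, -1):
        if x[:end] in vocab:
            rest = split_backtracking(x[end:])
            if rest is not None:
                return [x[:end]] + rest
    return None  # no valid segmentation of x

print(split_backtracking("businesson"))  # ['bus', 'ines', 'son']
```

Here the greedy choice "business" leaves the unsegmentable remainder "on", so the function backs off and finds "bus ines son" instead of failing.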





























  • this is great! basically what I was in the process of creating, but much more concise. Thanks so much for writing this!

    – jgcello
    Nov 26 '18 at 20:09











  • Glad to have been of help!

    – Filip Młynarski
    Nov 26 '18 at 20:10












answered Nov 26 '18 at 20:00









Filip Młynarski












