Python function for splitting strung together word into individual words
I am trying to come up with a function that takes entries like
"businessidentifier", "firstname", "streetaddress"
and outputs
"business identifier", "first name", "street address"
This seems to be a fairly complicated problem involving NLP, since the function will have to iterate over a string and test against a vocabulary to see when it arrives at a word in the vocabulary, but for the first example "businessidentifier" might be seen first as "bus I ness identifier". Has anyone come across a function that accomplishes this task?
python nlp
You might need to look into a language model... a naive approach might be to try all possible ways to break the sentence such that you end up with valid words, then use a language model to tell you which possible sentence is most likely. I doubt there's any simple function you can call which does this for you, though
– scnerd
Nov 26 '18 at 19:40
I think this is a two step problem: 1) You're going to need to identify possible words which shouldn't be too hard given enough compute time and a corpus. 2) Given all the possible combos of words identify the "most" correct sentence given some NLP or other technique. Second step is definitely non-trivial and the first step will take exponentially long given large string(s) and a representative corpus.
– n8sty
Nov 26 '18 at 19:43
this is what I figured. thank you both for confirming my suspicions, I will give that a try.
– jgcello
Nov 26 '18 at 19:58
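The naive approach described in the comments above can be sketched as follows. This is a minimal illustration, not code from the question or answer: the small `VOCAB` set here is a hypothetical stand-in for a real word list (such as nltk's corpus), and the function simply enumerates every way the input can be broken into vocabulary words.

```python
# Naive enumeration of every possible segmentation, as suggested in the
# comments. VOCAB is a toy stand-in for a real corpus-derived word list.
VOCAB = {"bus", "business", "identifier", "first", "name",
         "street", "address"}

def all_segmentations(s):
    """Return every list of VOCAB words whose concatenation equals s."""
    if not s:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        if prefix in VOCAB:
            # Keep this prefix and recurse on the remainder.
            for rest in all_segmentations(s[i:]):
                results.append([prefix] + rest)
    return results

print(all_segmentations("businessidentifier"))
# -> [['business', 'identifier']]
```

A language model (or simpler heuristic) would then score the candidate segmentations to pick the most plausible one; with a realistic vocabulary, many strings admit more than one valid split.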
asked Nov 26 '18 at 19:34
jgcello
1 Answer
First we need a large set of English words; I used nltk here. I loaded the words into a dict, eng_dict, keyed by first letter, so that, for example, every word starting with 'a' sits under key 'a'. That makes looking up candidate words faster. Within each bucket I sorted the words by length, longest first, so that when we scan the sentence we try the longest candidates before shorter ones: given 'businessidentifier' we check 'business' before, say, 'bus'.
With the words in this shape we can write a function that matches the sentence against them. The recursive function below tries every word that starts with the same letter as the remaining string; when one matches, it is appended to the result list and the function recurses on the rest.
from nltk.corpus import words

word_list = words.words()
# Bucket words by first letter, sorted longest first, so the longest match is tried first.
eng_dict = {chr(i): sorted([word for word in word_list if word[0] == chr(i)], key=len, reverse=True) for i in range(ord('a'), ord('z')+1)}

def split_into_words(x):
    ret = []
    for word in eng_dict[x[0]]:
        if x.startswith(word):
            ret.append(word)
            x = x[len(word):]
            break
    if len(x) != 0:
        ret.extend(split_into_words(x))
    return ret

raw_sentences = ["businessidentifier", "firstname", "streetaddress"]
final_sentence = [split_into_words(i) for i in raw_sentences]
print(final_sentence)
Output:
[['business', 'identifier'], ['first', 'name'], ['street', 'address']]
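One caveat worth noting: the greedy longest-first match commits to the first matching prefix and never reconsiders, so it can dead-end when the longest prefix is not the right split. A backtracking variant fixes this; the sketch below is illustrative, with a tiny hypothetical `VOCAB` standing in for the nltk word list so it stays self-contained.

```python
# Backtracking variant of the greedy matcher: still prefers longer
# prefixes, but backs out of dead ends instead of failing.
# VOCAB is a toy stand-in for the nltk word list.
VOCAB = {"at", "ate", "end", "business", "identifier"}

def split_backtracking(s):
    """Return one segmentation of s into VOCAB words, or None."""
    if not s:
        return []
    # Try longer prefixes first, mirroring the greedy preference.
    for i in range(len(s), 0, -1):
        prefix = s[:i]
        if prefix in VOCAB:
            rest = split_backtracking(s[i:])
            if rest is not None:  # remainder segmented successfully
                return [prefix] + rest
    return None  # no valid split from this position

# Greedy would consume "ate" and strand "nd"; backtracking recovers.
print(split_backtracking("atend"))
# -> ['at', 'end']
```

For large inputs this can be exponential in the worst case; memoizing on the remaining string (e.g. with functools.lru_cache) keeps it polynomial.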
this is great! basically what I was in the process of creating, but much more concise. Thanks so much for writing this!
– jgcello
Nov 26 '18 at 20:09
Glad to have been of help!
– Filip Młynarski
Nov 26 '18 at 20:10
answered Nov 26 '18 at 20:00
Filip Młynarski