How to get frequency of each string across multiple sets of strings by sets [closed]
I have a text file for each set, as below:
Set1: Cow Goat Lion Mole
Set2: Mole Badger Snake
Set3: Goat Snake Zebra
My aim is to get a matrix showing the distribution of each unique value across the sets, plus a total count for each value:
Value S1 S2 S3 Total
Goat  Y  N  Y  2
Snake N  Y  Y  2
At first glance this may look like an Excel problem, but the data set is large and I am not sure a pivot table can do this. My approach would be in Python, but I am new to it and am looking for advice on the best approach:
- read each CSV into a DataFrame (concat?)
- find the unique values across all columns (store them in a df?)
- iterate over each unique value to get its frequency
- I am not sure how I would then keep track of the set count and produce the tabular output I want
Thanks
python dataframe pivot-table
closed as too broad by usr2564301, Mr. T, greg-449, David Maze, Rob Nov 25 '18 at 16:53
Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. Avoid asking multiple distinct questions at once. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.
What have you done so far? Which part of the coding do you have a problem with? Also, you speak of CSV, but the sample input you provide is not CSV.
– trincot
Nov 25 '18 at 10:40
Please reformat your question such that relations between the data involved are apparent.
– chb
Nov 25 '18 at 10:40
I edited it; I hope it is clearer now.
– user10701663
Nov 25 '18 at 10:51
Your input and expected output format is still not clear. I suggest reading How to create a Minimal, Complete, and Verifiable example and editing the question accordingly. I thought first that this was a pandas question, but it seems you do not want to use this library which makes this task rather simple.
– Mr. T
Nov 25 '18 at 11:13
I will try again. I have 'n' text files; let's call each of them a set. Each set holds a list of words; some words are present in several sets, while others are unique to one set. I want to find 1. all the unique words across the sets, and 2. for each unique word, how many sets contain that word. So, for my original question, I want to obtain that 'Goat' is present in sets 1 and 3, and that the total number of sets containing 'Goat' is 2. In my case each list runs to 1000+ words and there are several sets. I would not mind using pandas, though I would appreciate some hints. Whatever gives a solution.
– user10701663
Nov 25 '18 at 12:36
edited Nov 25 '18 at 11:23
MaJoR
asked Nov 25 '18 at 10:33
user10701663
1 Answer
Import the necessary packages:
import pandas as pd
import os
import glob
Set the path to the folder that holds all your .txt files:
path = r'C:rawdata_files'  # use your path
all_files = glob.glob(os.path.join(path, "*.txt"))  # os.path.join keeps the path concatenation OS-independent
The list would look something like this (shown for illustration only; running this line would overwrite the glob result):
all_files = ['val1.txt', 'val2.txt']
Make a df with one column per text file, named after the file, and that file's entries as rows (this assumes one entry per line in each file):
df = pd.concat([pd.read_csv(item, names=[item[:-4]]) for item in all_files], axis=1)  # item[:-4] strips ".txt" so it does not end up in the column name
Get the total count for every element across all columns:
df.stack().value_counts()
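Note that value_counts() returns only the totals. For the Y/N matrix asked for in the question, a minimal sketch could look like the one below. It is not part of the original answer: presence_matrix and sets_by_name are hypothetical names, the words are assumed to be already loaded into a dict of lists (one list per set), and the sample data is copied from the question.

```python
import pandas as pd


def presence_matrix(sets_by_name):
    """Build a Y/N table of words (rows) by set (columns) with a Total column.

    sets_by_name is a hypothetical input: a dict mapping a set name to an
    iterable of words. Repeated words within one set are counted once.
    """
    # Deduplicate the words inside each set.
    word_sets = {name: set(words) for name, words in sets_by_name.items()}
    # Every unique word across all sets, sorted for a stable row order.
    all_words = sorted(set().union(*word_sets.values()))
    # Boolean table: True where the word occurs in the set.
    present = pd.DataFrame(
        {name: [word in ws for word in all_words] for name, ws in word_sets.items()},
        index=all_words,
    )
    # Turn True/False into Y/N for display.
    result = present.apply(lambda col: col.map({True: "Y", False: "N"}))
    # Number of sets that contain each word.
    result["Total"] = present.sum(axis=1)
    return result


# Sample data from the question.
sample = {
    "S1": ["Cow", "Goat", "Lion", "Mole"],
    "S2": ["Mole", "Badger", "Snake"],
    "S3": ["Goat", "Snake", "Zebra"],
}
matrix = presence_matrix(sample)
print(matrix)
```

Working from plain Python sets first means a word that appears twice in the same file still contributes only 1 to Total, which matches "how many sets contain that word". How the files are read into the dict depends on their layout (one word per line, or space-separated), so that step is left out here.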
Thanks Rahul, on it now.
– user10701663
Nov 25 '18 at 18:51
If you found the answer helpful, please upvote and accept it.
– Rahul Agarwal
Nov 25 '18 at 19:05
@user10701663: Did it work?
– Rahul Agarwal
Nov 27 '18 at 19:08
Yes, it helped. Thank you.
– user10701663
Nov 28 '18 at 23:32
answered Nov 25 '18 at 13:23
Rahul Agarwal