How to get frequency of each string across multiple sets of strings by sets [closed]












-2















I have one text file per set, as below:

Set1: Cow Goat Lion Mole
Set2: Mole Badger Snake
Set3: Goat Snake Zebra

My aim is to get a matrix showing which sets each unique value appears in, plus a total count for each value:

        S1  S2  S3  Total
Goat    Y   N   Y   2
Snake   N   Y   Y   2

At first glance this may look like an Excel problem, but the data set is large and I am not sure a pivot table can do this. I would like to do it in Python, but I am new to it and looking for advice on the best approach:
- read each CSV into a dataframe (concat?)
- find the unique values across all columns (store them in a df?)
- iterate over each unique value to get its frequency
- I am not sure how I would then keep track of the per-set counts and produce the tabular output I want

Thanks.

























closed as too broad by usr2564301, Mr. T, greg-449, David Maze, Rob Nov 25 '18 at 16:53


Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. Avoid asking multiple distinct questions at once. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.














  • What have you done so far? Which part of the coding do you have a problem with? Also, you speak of CSV, but the sample input you provide is not CSV.

    – trincot
    Nov 25 '18 at 10:40











  • Please reformat your question such that relations between the data involved are apparent.

    – chb
    Nov 25 '18 at 10:40











  • I edited the question; I hope it is clearer now.

    – user10701663
    Nov 25 '18 at 10:51











  • Your input and expected output format is still not clear. I suggest reading How to create a Minimal, Complete, and Verifiable example and editing the question accordingly. I thought first that this was a pandas question, but it seems you do not want to use this library which makes this task rather simple.

    – Mr. T
    Nov 25 '18 at 11:13













  • I will try again. I have n text files; let's call each of them a set. Each set is a list of words; some words are present in several sets, while some are unique to one set. I want to find (1) all the unique words across the sets and (2) for each unique word, how many sets contain it. So in my original question, I want to obtain that 'Goat' is present in sets 1 and 3, and that the total number of sets containing 'Goat' is 2. In my case each list runs to 1000+ words and there are several sets. I would not mind using pandas, though I would appreciate some hints. Whatever gives a solution.

    – user10701663
    Nov 25 '18 at 12:36
















python dataframe pivot-table














edited Nov 25 '18 at 11:23 by MaJoR

asked Nov 25 '18 at 10:33 by user10701663




1 Answer


















0














Import the necessary packages:

import glob
import os

import pandas as pd

Set the path to the folder that contains all your .txt files:

path = r'C:rawdata_files'                           # use your path
all_files = glob.glob(os.path.join(path, "*.txt"))  # os.path.join keeps the concatenation OS independent

The list would look something like this:

all_files = ['val1.txt', 'val2.txt']

Build a DataFrame with one column per text file (named after the file) and the file's entries as rows:

# Use the bare file name, without directory or extension, as the column name
df = pd.concat([pd.read_csv(item, names=[os.path.splitext(os.path.basename(item))[0]]) for item in all_files], axis=1)

Get the total count of every element across all columns:

df.stack().value_counts()
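
The count above gives only the totals. Below is a minimal sketch of one way to also get the per-set Y/N matrix from the question. It assumes each file has already been read into a Python set of words; the `sets` dict and its S1..S3 names are hypothetical stand-ins for the real files, filled with the sample data from the question.

```python
import pandas as pd

# Hypothetical in-memory stand-in for the text files. For a real file, a set
# could be built with something like set(open(path).read().split()).
sets = {
    "S1": {"Cow", "Goat", "Lion", "Mole"},
    "S2": {"Mole", "Badger", "Snake"},
    "S3": {"Goat", "Snake", "Zebra"},
}

# One boolean column per set, indexed by the union of all words.
all_words = sorted(set().union(*sets.values()))
presence = pd.DataFrame(
    {name: [word in words for word in all_words] for name, words in sets.items()},
    index=all_words,
)

# Total = number of sets that contain the word.
total = presence.sum(axis=1)

# Render the booleans as the Y/N matrix from the question and append the total.
matrix = presence.apply(lambda col: col.map({True: "Y", False: "N"}))
matrix["Total"] = total
print(matrix)
```

Because each input is a set, a word repeated inside one file is counted once for that file, so Total is the number of sets containing the word rather than the raw number of occurrences.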





























  • Thanks Rahul, on it now.

    – user10701663
    Nov 25 '18 at 18:51











  • If you found the answer helpful, please upvote and accept it.

    – Rahul Agarwal
    Nov 25 '18 at 19:05











  • @user10701663: Did it work?

    – Rahul Agarwal
    Nov 27 '18 at 19:08











  • Yes, it helped. Thank you.

    – user10701663
    Nov 28 '18 at 23:32


















answered Nov 25 '18 at 13:23 by Rahul Agarwal

















