How to get frequency of each string across multiple sets of strings by sets [closed]

-2

I have a text file for each set as below

Set1: Cow Goat Lion Mole
Set2: Mole Badger Snake
Set3: Goat Snake Zebra

My aim is to get a matrix showing the distribution of each unique value across the sets, plus a total count for each value:

         S1  S2  S3  Total
Goat     Y   N   Y   2
Snake    N   Y   Y   2

At the outset it may look like an Excel problem, but the data set is large and I am not sure a pivot table can do this. My approach would be in Python, but I am new and looking for advice on the best approach:
- read each CSV into a DataFrame (concat?)
- find the unique values across all columns (store them in a df?)
- iterate over each unique value to get its frequency
- I am not sure how I would then keep track of the set count and produce the tabular output I want

Thanks

edited Nov 25 '18 at 11:23

MaJoR


asked Nov 25 '18 at 10:33

user10701663

closed as too broad by usr2564301, Mr. T, greg-449, David Maze, Rob Nov 25 '18 at 16:53

Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. Avoid asking multiple distinct questions at once. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.


What have you done so far? Which part of the coding do you have a problem with? Also, you speak of CSV, but the sample input you provide is not CSV.

– trincot
Nov 25 '18 at 10:40

Please reformat your question such that relations between the data involved are apparent.

– chb
Nov 25 '18 at 10:40

I edited; I hope it is clearer now.

– user10701663
Nov 25 '18 at 10:51

Your input and expected output format are still not clear. I suggest reading How to create a Minimal, Complete, and Verifiable example and editing the question accordingly. I thought at first that this was a pandas question, but it seems you do not want to use this library, which makes this task rather simple.

– Mr. T
Nov 25 '18 at 11:13

I will try again. I have n text files; let's call each of them a set. Each set holds a list of words; some words are present in several sets, while others are unique to one set. I want to find 1. all the unique words across the sets, and then 2. for each unique word, how many sets contain it. So, in my original question, I want to obtain that 'Goat' is present in sets 1 and 3 and that the total number of sets containing 'Goat' is 2. In my case each list runs to 1000+ words and there are several sets. I would not mind using pandas, though I would appreciate some hints. Whatever gives a solution.

– user10701663
Nov 25 '18 at 12:36

python dataframe pivot-table


1 Answer

Import necessary packages

import pandas as pd
import os
import glob

Set the path to the folder that holds all your .txt files

path = r'C:rawdata_files'                     # use your path

all_files = glob.glob(os.path.join(path, "*.txt"))     # os.path.join keeps the path concatenation OS independent

The list would look something like below:

all_files = ['val1.txt', 'val2.txt']

Make a DataFrame with one column per text file (named after the file) and the file's entries as rows

df = pd.concat([pd.read_csv(item, names=[os.path.splitext(os.path.basename(item))[0]]) for item in all_files], axis=1)  # name each column after its file, without the folder or the .txt extension
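A note on the read_csv step: it treats every line of a file as one entry. If your files instead hold several words per line separated by spaces (the sample in the question leaves this open), a minimal sketch along these lines might work. The function name read_word_sets and the file names set1.txt and set2.txt are made up for illustration.

```python
import tempfile
from pathlib import Path


def read_word_sets(folder):
    """Map each .txt file's stem to the set of whitespace-separated words in it."""
    return {
        path.stem: set(path.read_text().split())
        for path in sorted(Path(folder).glob("*.txt"))
    }


# Demonstration with throwaway files; the file names are hypothetical.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "set1.txt").write_text("Cow Goat Lion Mole\n")
    Path(folder, "set2.txt").write_text("Mole Badger Snake\n")
    word_sets = read_word_sets(folder)

print(word_sets)
```

Storing each file as a set also drops repeated words within a file, so a later count per word reflects the number of files that contain it.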

Get the total count for every element across columns (this equals the number of sets containing it, provided no file lists the same word twice):

df.stack().value_counts()
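value_counts gives the totals but not the per-set Y/N columns asked for in the question. One possible way to get both is sketched below, with the sample data from the question typed in by hand. presence_matrix and the S1/S2/S3 labels are my own names, not part of any library, so treat this as a starting point rather than a finished solution.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; in practice each list would come from a file.
sets = {
    "S1": ["Cow", "Goat", "Lion", "Mole"],
    "S2": ["Mole", "Badger", "Snake"],
    "S3": ["Goat", "Snake", "Zebra"],
}


def presence_matrix(sets):
    """Return a word-by-set table of Y/N flags plus a Total column."""
    # Long format: one (word, set) row per distinct word in each set.
    pairs = pd.DataFrame(
        [(word, name) for name, words in sets.items() for word in set(words)],
        columns=["word", "set"],
    )
    # crosstab counts each (word, set) pair: 1 if present, 0 otherwise.
    counts = pd.crosstab(pairs["word"], pairs["set"])
    # Keep the columns in the order the sets were given.
    counts = counts.reindex(columns=list(sets), fill_value=0)
    table = pd.DataFrame(
        np.where(counts > 0, "Y", "N"),
        index=counts.index,
        columns=counts.columns,
    )
    # Total = number of sets that contain the word.
    table["Total"] = counts.sum(axis=1)
    return table


matrix = presence_matrix(sets)
print(matrix)
```

Because each set is de-duplicated before counting, Total is the number of sets containing the word even if a file repeats it.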

answered Nov 25 '18 at 13:23

Rahul Agarwal


Thanks Rahul, on it now.

– user10701663
Nov 25 '18 at 18:51

If you found the answer helpful, do upvote and accept.

– Rahul Agarwal
Nov 25 '18 at 19:05

@user10701663: Did it work?

– Rahul Agarwal
Nov 27 '18 at 19:08

Yes, it helped. Thank you.

– user10701663
Nov 28 '18 at 23:32
