Subset datasets by variable before using expand.grid to calculate distance matrix





I have two datasets. One dataset has about 30k rows, and the second has about 60k rows. The smaller dataset (df1) has a unique identifier (upc), which is critical to my analysis.



The larger dataset (df2) does not have this unique identifier, but it does have a descriptive variable (product_title) that can be matched with a similar description variable in df1 and used to infer the unique identifier.



I am trying to keep things simple, so I used expand.grid to pair every df1 title with every df2 title:



df1_titles <- unique(df1$product_title)   # ~30k unique titles
df2_titles <- unique(df2$product_title)   # ~60k unique titles
r <- expand.grid(df1_titles, df2_titles)  # All pairwise title combinations
names(r) <- c("df1_titles", "df2_titles")
r$dist <- stringdist(r$df1_titles, r$df2_titles, method = "jw")  # Jaro-Winkler distance
r <- r[order(r$dist), ]
r <- r[!duplicated(r$df1_titles), ]  # Keep the best match per df1 title
r <- subset(r, dist < .10)


Unfortunately, R is struggling to expand such a large grid: 30k × 60k is 1.8 billion candidate pairs. So I had the idea to use a second variable (c1), present in both datasets, to constrain expand.grid to similar items. Let's assume the category values are the same in both datasets.
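(A rough back-of-the-envelope sketch of why it struggles; this assumes expand.grid stores the two factor columns as 4-byte integer codes plus an 8-byte double for dist:)

n_pairs <- 30e3 * 60e3       # 1.8 billion candidate pairs
n_pairs * (4 + 4 + 8) / 1e9  # ~28.8 GB before any distance is even computed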



While I know it's not recommended to grow data frames in a loop, I didn't have a better idea for how to feed subsets to expand.grid, so I tried this approach anyway:



categories <- c("Beauty", "Personal Care", "Grocery", "Household Essentials")  # Categories to subset on

for (i in seq_along(categories)) {
  df1_sub <- subset(df1, c1 == categories[i])
  df2_sub <- subset(df2, c1 == categories[i])
  df1_titles <- unique(df1_sub$product_title)
  df2_titles <- unique(df2_sub$product_title)
  ### HOW DO I CREATE A LIST/GRID DYNAMICALLY? ### <- expand.grid(df1_titles, df2_titles)
}


After creating these grids, the plan is to grab the unique identifier upc from df1 and assign it to matches in df2 before consolidating the datasets.
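For concreteness, a minimal sketch of that consolidation step (assuming df1 maps each product_title to a single upc, and r holds the accepted matches as above):

# Attach upc to the accepted matches, then carry it over to df2
matched <- merge(r, unique(df1[, c("product_title", "upc")]),
                 by.x = "df1_titles", by.y = "product_title")
df2_with_upc <- merge(df2, matched[, c("df2_titles", "upc")],
                      by.x = "product_title", by.y = "df2_titles",
                      all.x = TRUE)  # Unmatched df2 rows keep NA for upc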



I'm sure there is a better way to do this, and I hope that a good approach for reducing data frames to relevant subsets before calling expand.grid will be helpful to others!



dput(sample_n(subset(df1,select=c(product_title,c1)),50))
structure(list(product_title = c("Sriracha Hot Chili Sauce Single Packets 25 Count .25 oz each (3 Items Per Order, not per case)",  "Duncan Hines Double Fudge Decadent Brownie Mix 17.6 oz by Duncan Hines",  "Mikee Tropical Teriyaki Sauce, 20 oz, (Pack of 12)", "NESQUIK Strawberry Low Fat Milk 6-8 fl. oz. Bottles",  "Dove Nutritive Solutions Conditioner, Coconut & Hydration 12 oz (Pack of 12)",  "FLORATA 24\" Long Straight Velcro Wrap Around Ponytail Hair Extensions",  "Bing Cherries, Dried (16 oz, ZIN: 527111) - 3-Pack", "San-J Tamari Brown Sesame Crackers, 3.7 oz (Pack of 12)",  "PERDUE HARVESTLAND Breaded Chicken Breast Nugget (22 oz.)",  "Fray Bentos Just Chicken Pie (425g) - Pack of 6", "Product of Thomas Coffee Regular Roast, Portion Packs (64 ct.) - Ground Coffee [Bulk Savings]", "Bombay Basmati Rice White, 2 LB (Pack of 12)", "Herbs for Kids, Sugar Free Elderberry Syrup, Cherry-Berry Flavor, 4 fl oz (pack of 3)",  "Grain Millers BG13916 Grain Millers Rolled Oats No. 5 - 1x50LB",  "Tuning Fork C 512 C512 SURGICAL MEDICAL INSTRUMENTS NEW", "Garnier Fructis Style Pure Clean Finishing Paste, All Hair Types, 2 oz. (Packaging May Vary) (Pack of 8)",  "Stretch Island Organic Fruit Strips Grape -- 6 Pocket-Sized Fruit Strips pack of 6",  "Torani Cinnamon Syrup 750ml", "JFC Nori Maki Arare Crackers 3 oz each (6 Items Per Order)",  "FLORATA Ponytail Buns Wrap Bun Chignon Hair Extensions Wavy Curly Wedding Donut Hair Extensions Hairpiece Wig",  "Kenra Platinum Hot Spray #20 8oz, PACK OF 8", "GBS Red and Black Shampoo Scalp Massage Brushes Plus 1 Soft Pocket Brush Made In USA 3 Pack Promotes Healthy Hair Growth Compliments Any Shampoo and Conditioner",  "Clairol Professional Creme Permanent Developer - 20 volume (Size : 2 oz)",  "Garnier Nutrisse Ultra Color Permanent Haircolor R3 Light Intense Auburn 1.0 ea(pack of 12)",  "Kemps Swiss Style Chocolate Low Fat Milk, 1 gal", "Aussie Kids 3n1 Shampoo, Conditioner, & Bodywash with Pump Coral Reef Cupcake 29.2 oz.(pack of 4)",  "Dequmana Gordal Olives, 12 Oz", "Duncan Hines Caramel Creamy Home-Style Frosting 16 Oz Canister",  "Goya Goya Mole, 9 oz", "Fruit Roll-Ups Fruit Flavored Snacks Variety Pack (Pack of 16)",  "Wild Huckleberry Mountain Huckleberry Barbecue Sauce", "La Flor Spicy Hot Seasoned Salt, 13 oz",  "Clairol Nice n Easy Hair Color #79 Dark Brown, UK Loving Care (Pack of 3) + Beyond BodiHeat Patch, 1 Ct",  "White Vinegar Liquid ''1 gallon, 4 Count, Liquid''", "Metallic Gold Dried Canella Berries - 6 oz Bunch",  "La Flor Adobo All-Purpose Seasoning, 13 oz", "Marlos Bakeshop Marlos Bakeshop Biscotti, 1.25 oz",  "Sam's Choice Frozen Burrito Bowl, Fajita Vegetable, 12.5 oz",  "Conchita guava marmalade 14.1 oz Pack of 3", "HC Industries Kids Organics Kids Organics Shampoo, 12 oz",  "6 Pack - Head & Shoulders Full & Thick 2-in-1 Anti-Dandruff Shampoo + Conditioner 32.1 oz",  "Ice Breakers, Wintergreen Mints Tin, 1.5 Oz (Pack of 8)", "Mason Pearson - Boar Bristle & Nylon - Medium Junior Military Nylon & Bristle Hair Brush (Dark Ruby) -1pc",  "Dove Nutritive Solutions Revival Cleansing Shampoo, 20.4 oz",  "Boston's Best 12 Ct Jamaican Me Crazy", "Ultimate Baker Edible Glitter Mix It Up (1x3oz)",  "Nori Maki Arare Rice Crackers with Seaweed 5 oz per Pack (1 Pack)",  "H&S 2in1 MENS REFRESH POO 13.5oz-Pack of 5", "Keebler Club Mini Crackers, Multi-Grain, 11 Ounce (Pack of 20)",  "Briess Sparkling Amber Liquid Malt Extract (30 Pound Pail)"), 
c1 = c("Grocery", "Grocery", "Grocery", "Grocery", "Personal Care",
"Beauty", "Grocery", "Grocery", "Grocery", "Grocery", "Grocery",
"Grocery", "Grocery", "Grocery", "Beauty", "Beauty", "Grocery",
"Grocery", "Grocery", "Beauty", "Beauty", "Beauty", "Beauty",
"Beauty", "Grocery", "Beauty", "Grocery", "Grocery", "Grocery",
"Grocery", "Grocery", "Grocery", "Beauty", "Grocery", "Grocery",
"Grocery", "Grocery", "Grocery", "Grocery", "Personal Care",
"Beauty", "Grocery", "Beauty", "Beauty", "Grocery", "Grocery",
"Grocery", "Beauty", "Grocery", "Grocery")), row.names = c(16523L, 111871L, 28667L, 32067L, 8269L, 11076L, 50328L, 47200L, 99415L, 100031L, 39011L, 104854L, 29516L, 104643L, 3486L, 9689L, 52157L, 28995L, 47000L, 10895L, 3035L, 4992L, 3589L, 4276L, 32212L, 6055L, 22991L, 110279L, 27436L, 52282L, 14879L, 25710L, 6989L, 30133L, 51068L, 25490L, 45685L, 99073L, 18547L, 4991L, 5792L, 36241L, 10237L, 1430L, 40383L, 112458L, 46261L, 5875L, 46597L, 108099L ), class = "data.frame")

dput(sample_n(subset(df2,select=c(product_title,c1)),50))
structure(list(product_title = c("Drive Medical Heavy Duty Bariatric Plastic Seat Transfer Bench",
"Always Pure & Clean Ultra Thin Feminine Pads With Wings, Super Long",
"Patriot Candles Jar Candle Apple Clove Red", "Nature's Bounty Cardio-Health Probiotic Capsules",
"Finest Nutrition Biotin Plus Keratin", "Dr. Scholl's DuraGel Corn Remover",
"Humm Coconut Lime Kombucha 14 oz", "OneTouch Ultra Blue Test Strips",
"Kellogg's Rice Krispies Treats Bars M&M's", "Westbrae Natural Organic Chili Beans",
"Neutrogena Rapid Clear Acne Eliminating Spot Treatment Gel - 0.5 fl oz",
"Harris Bed Bug Killer", "Quart Storage Bags - 80ct - Up&Up cent (Compare to Ziploc Storage Bags)",
"Care Free Curl Gold Instant Curl Activator", "Purple Dessert Plate",
"Wexford Big Bubble Plastic Mailer 2", "L'Oreal Paris Advanced Haircare Total Repair Extreme Emergency Recovery Mask",
"Soap & Glory Spectaculips Matteallic Lip Cream Bronze Girl,Bronze Girl",
"No7 Instant Results Purifying Heating Mask - 2.5oz", "NuMe Classic Curling Wand",
"Revlon ColorSilk ColorStay Nourishing Conditioner Glowing Blonde",
"Weiman Lemon Oil Furniture Polish Lemon", "Dunkin' Donuts Ground Coffee Hazelnut",
"CocoaVia Cocoa Extract 375mg, Capsules", "Triple Paste AF Antifungal Ointment",
"Welch's Halloween Fruit Snacks 0.5oz 28 ct", "Studio 35 Purifying Natural Facial Wipes",
"Magnum Double Raspberry Mini Ice Cream Bars - 3ct", "CHI Twisted Fabric Finishing Paste",
"Creme Of Nature Argan Oil Intensive Conditioning Hair Treatment",
"Exergen Temporal Artery Thermometer", "Tolerex Formulated Liquid Diet Elemental Powder 6 Pack Unflavored",
"Gerber Nature Select 2nd Foods Nutritious Dinner Baby Food Chicken Noodle",
"Abreva Cold Sore Cream", "Super Macho Vitality and Stamina Dietary Supplement Softgel",
"M&M's Peanut Chocolates Halloween Ghoul's Mix - 3.27oz", "TruMoo protein milk cookies n' cream - 14 fl oz",
"DISNEY 25 Inch Plush Toy Assorted", "Beauty Infusion HYDRATING Manuka Honey & Collagen Sheet Mask",
"Edge Shave Gel, Twin Pack Sensitive Skin", "Haribo Sour Gold Bears Resealable Stand Up Pouch Pineapple",
"Jarrow Formulas Extra Virgin Coconut Oil, 1000mg, Softgels",
"Bliss Pore Patrol Oil-Free Hydrator with Willow Bark - 1.7oz",
"Airheads Candy Bites Watermelon", "Thrive Market Organic Sprouted Quinoa",
"Garnier Fructis Curl Stretch Loosening Pudding", "Systane Nighttime Lubricant Eye Ointment",
"SOHO Resort Organizer", "Enfamil Enfacare Lipil Infant Formula Powder",
"Fancy Feast Flaked Gourmet Cat Food Tuna"), c1 = c("Home Health Care Solutions",
"Personal Care", "Household Essentials", "Vitamin & Supplements",
"Vitamin & Supplements", "Personal Care", "Grocery", "Home Health Care Solutions",
"Grocery", "Grocery", "Beauty", "Household Essentials", "Household Essentials",
"Beauty", "Household Essentials", "Household Essentials", "Beauty",
"Beauty", "Beauty", "Beauty", "Beauty", "Household Essentials",
"Grocery", "Vitamin & Supplements", "Personal Care", "Grocery",
"Beauty", "Grocery", "Beauty", "Personal Care", "Personal Care",
"Home Health Care Solutions", "Grocery", "Personal Care", "Vitamin & Supplements",
"Grocery", "Grocery", "Baby, Kids & Toys", "Beauty", "Personal Care",
"Grocery", "Vitamin & Supplements", "Beauty", "Grocery", "Grocery",
"Beauty", "Personal Care", "Beauty", "Grocery", "Household Essentials"
)), row.names = c(39590L, 6987L, 13810L, 19403L, 26966L, 446L,
41599L, 28238L, 7622L, 19653L, 16458L, 18164L, 738L, 19819L,
43731L, 13310L, 17113L, 29729L, 29725L, 38903L, 25464L, 10048L,
42932L, 41179L, 37568L, 5830L, 14276L, 20526L, 31614L, 20119L,
40084L, 25978L, 1573L, 25121L, 3660L, 8850L, 10201L, 43313L,
17973L, 40423L, 10299L, 37320L, 32177L, 18491L, 32860L, 30439L,
24518L, 21579L, 24597L, 14687L), class = "data.frame")









asked Nov 26 '18 at 19:22 by roody (edited Nov 26 '18 at 21:01)

  • You might try looking at the fuzzyjoin package and trying to join on some measure of string distance. I haven't used it with data quite that large, though. [A sketch of this idea follows these comments.] – joran, Nov 26 '18 at 19:33

  • @CPak just spotted an error in how a column in df1 was labeled. Fixed now - let me know if you still have issues? – roody, Nov 26 '18 at 21:02

  • Still getting an error - Error: unexpected symbol in "dput(sample_n(subset(df1,select=c(product_title,c1)),50)) structure - df2 seems ok – CPak, Nov 26 '18 at 21:08
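A minimal sketch of the fuzzyjoin suggestion above (untested at this scale; it assumes both data frames carry a product_title column):

library(fuzzyjoin)

# Fuzzy inner join on product_title, keeping only pairs within the
# same cutoff used in the question (Jaro-Winkler, dist < .10)
matches <- stringdist_inner_join(
  df1, df2,
  by = "product_title",
  method = "jw",          # Jaro-Winkler, as in the question
  max_dist = 0.1,         # same threshold as subset(r, dist < .10)
  distance_col = "dist"   # keep the computed distance for inspection
)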
2 Answers
Your idea is good. One realization of it would be:



library(stringdist)

df2$upc <- NA
for (ctg in unique(df2$c1)) {
  # Distance matrix restricted to titles in the current category only
  d <- stringdistmatrix(df1[df1$c1 == ctg, "product_title"],
                        df2[df2$c1 == ctg, "product_title"], method = "jw")
  fuzz <- apply(d, 2, min)   # best (smallest) distance for each df2 title
  passThr <- fuzz < 0.1      # does the best match clear the threshold?
  df2$fuzz[df2$c1 == ctg] <- fuzz
  df2$upc[df2$c1 == ctg][passThr] <- df1[df1$c1 == ctg, "upc"][apply(d, 2, which.min)][passThr]
}


So each row in df2 gets assigned the upc value from the df1 row whose product_title has the smallest distance to the corresponding product_title in df2, provided that distance is below 0.1. How well this works depends on the number of categories, length(unique(df2$c1)): the more of them, the smaller each distance matrix and the faster the loop.






answered Nov 26 '18 at 20:37 by Julius Vainora (edited Nov 27 '18 at 19:26)

  • This is a really elegant approach, and will work well with a multi-level taxonomy like I have! Could you incorporate a "max.dist" filter into your example? I would like to determine some threshold (as in my example, dist<.10) to ensure the matches are sufficiently high quality. – roody, Nov 26 '18 at 23:26

  • @roody, you mean to do the assignment when the distance is below 0.1 and no assignment otherwise? – Julius Vainora, Nov 26 '18 at 23:30

  • Yes - exactly. In my code it is r<-subset(r,dist<.10) after calculating string.dist – roody, Nov 27 '18 at 16:19

  • @roody, see the update. – Julius Vainora, Nov 27 '18 at 16:46

  • One very dense follow-up that I just can't figure out. How would I create a new variable with the value of passThr? The reason why I ask is because my dataset seems to be including only perfect matches, so I want to analyze if I've done something wrong. [One possible way is sketched after these comments.] – roody, Nov 27 '18 at 18:33
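For future readers, a sketch of one way to keep that flag (building on the answer's loop; not part of the original answer):

# Record whether each df2 row's best match cleared the threshold
df2$passThr <- NA
for (ctg in unique(df2$c1)) {
  d <- stringdistmatrix(df1[df1$c1 == ctg, "product_title"],
                        df2[df2$c1 == ctg, "product_title"], method = "jw")
  fuzz <- apply(d, 2, min)
  df2$passThr[df2$c1 == ctg] <- fuzz < 0.1  # TRUE only for sub-threshold matches
}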
Consider extending your expand.grid approach by building a list of data frames of nested merge elements, then row binding them all at once outside the loop:



# Variable with categories to subset
categories <- c("Beauty", "Personal Care", "Grocery", "Household Essentials")

df_list <- vector("list", length = length(categories))

for (i in seq_along(categories)) {
  df1_sub <- subset(df1, c1 == categories[i])
  df2_sub <- subset(df2, c1 == categories[i])
  df1_titles <- unique(df1_sub$product_title)
  df2_titles <- unique(df2_sub$product_title)

  ### HOW DO I CREATE A LIST/GRID DYNAMICALLY?
  r <- expand.grid(df1_titles = df1_titles, df2_titles = df2_titles,
                   stringsAsFactors = FALSE)

  r$dist <- stringdist(r$df1_titles, r$df2_titles, method = "jw")
  r <- r[order(r$dist), ]
  r <- r[!duplicated(r$df1_titles), ]
  r <- subset(r, dist < .10)

  # ASSIGN NESTED MERGE (note [[i]] to store a data frame in the list,
  # and the *_titles column names created by expand.grid above)
  df_list[[i]] <- merge(merge(r, df1, by.x = "df1_titles", by.y = "product_title"),
                        df2, by.x = "df2_titles", by.y = "product_title")
}

# ROW BIND ALL DF ELEMENTS
final_df <- do.call(rbind, df_list)
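One note on the design: preallocating df_list and binding once with do.call(rbind, df_list) outside the loop avoids the quadratic copying that growing a data frame with repeated rbind calls inside the loop would cause.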





share|improve this answer
























    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53487716%2fsubset-datasets-by-variable-before-using-expand-grid-to-calculate-distance-matri%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    1














    Your idea is good. One realization of it then would be



    df2$upc <- NA
    for(ctg in unique(df2$c1)) {
    d <- stringdistmatrix(df1[df1$c1 == ctg, "product_title"], df2[df2$c1 == ctg, "product_title"], method = "jw")
    fuzz <- apply(d, 2, min)
    passThr <- fuzz < 0.1
    df2$fuzz[df2$c1 == ctg] <- fuzz
    df2$upc[df2$c1 == ctg][passThr] <- df1[df1$c1 == ctg, "upc"][apply(d, 2, which.min)][passThr]
    }


    So, for each line in df2, it gets assigned a upc value from df1 whose product.title_r has the smallest distance from the corresponding product_title from df2. How well this works will depend on the number of categories, length(unique(df2$c1)). The more of them, the faster the loop.






    share|improve this answer


























    • This is a really elegant approach, and will work well with a multi-level taxonomy like I have! Could you incorporate a "max.dist" filter into your example? I would like to determine some threshold (as in my example, dist<.10) to ensure the matches are sufficiently high quality.

      – roody
      Nov 26 '18 at 23:26











    • @roody, you mean to do the assignment when the distance is below 0.1 and no assignment otherwise?

      – Julius Vainora
      Nov 26 '18 at 23:30











    • Yes - exactly. in my code it is r<-subset(r,dist<.10) after calculating string.dist

      – roody
      Nov 27 '18 at 16:19











    • @roody, see the update.

      – Julius Vainora
      Nov 27 '18 at 16:46











    • One very dense follow-up that I just can't figure out. How would I create a new variable with the value of passThr? The reason why I ask is because my dataset seems to be including only perfect matches, so I want to analyze if I've done something wrong.

      – roody
      Nov 27 '18 at 18:33


















    1














    Your idea is good. One realization of it then would be



    df2$upc <- NA
    for(ctg in unique(df2$c1)) {
    d <- stringdistmatrix(df1[df1$c1 == ctg, "product_title"], df2[df2$c1 == ctg, "product_title"], method = "jw")
    fuzz <- apply(d, 2, min)
    passThr <- fuzz < 0.1
    df2$fuzz[df2$c1 == ctg] <- fuzz
    df2$upc[df2$c1 == ctg][passThr] <- df1[df1$c1 == ctg, "upc"][apply(d, 2, which.min)][passThr]
    }


    So, for each line in df2, it gets assigned a upc value from df1 whose product.title_r has the smallest distance from the corresponding product_title from df2. How well this works will depend on the number of categories, length(unique(df2$c1)). The more of them, the faster the loop.






    share|improve this answer


























    • This is a really elegant approach, and will work well with a multi-level taxonomy like I have! Could you incorporate a "max.dist" filter into your example? I would like to determine some threshold (as in my example, dist<.10) to ensure the matches are sufficiently high quality.

      – roody
      Nov 26 '18 at 23:26











    • @roody, you mean to do the assignment when the distance is below 0.1 and no assignment otherwise?

      – Julius Vainora
      Nov 26 '18 at 23:30











    • Yes - exactly. in my code it is r<-subset(r,dist<.10) after calculating string.dist

      – roody
      Nov 27 '18 at 16:19











    • @roody, see the update.

      – Julius Vainora
      Nov 27 '18 at 16:46











    • One very dense follow-up that I just can't figure out. How would I create a new variable with the value of passThr? The reason why I ask is because my dataset seems to be including only perfect matches, so I want to analyze if I've done something wrong.

      – roody
      Nov 27 '18 at 18:33
















    1












    1








    1







    Your idea is good. One realization of it then would be



    df2$upc <- NA
    for(ctg in unique(df2$c1)) {
    d <- stringdistmatrix(df1[df1$c1 == ctg, "product_title"], df2[df2$c1 == ctg, "product_title"], method = "jw")
    fuzz <- apply(d, 2, min)
    passThr <- fuzz < 0.1
    df2$fuzz[df2$c1 == ctg] <- fuzz
    df2$upc[df2$c1 == ctg][passThr] <- df1[df1$c1 == ctg, "upc"][apply(d, 2, which.min)][passThr]
    }


    So, for each line in df2, it gets assigned a upc value from df1 whose product.title_r has the smallest distance from the corresponding product_title from df2. How well this works will depend on the number of categories, length(unique(df2$c1)). The more of them, the faster the loop.






    share|improve this answer















    Your idea is good. One realization of it then would be



    df2$upc <- NA
    for(ctg in unique(df2$c1)) {
    d <- stringdistmatrix(df1[df1$c1 == ctg, "product_title"], df2[df2$c1 == ctg, "product_title"], method = "jw")
    fuzz <- apply(d, 2, min)
    passThr <- fuzz < 0.1
    df2$fuzz[df2$c1 == ctg] <- fuzz
    df2$upc[df2$c1 == ctg][passThr] <- df1[df1$c1 == ctg, "upc"][apply(d, 2, which.min)][passThr]
    }


    So, for each line in df2, it gets assigned a upc value from df1 whose product.title_r has the smallest distance from the corresponding product_title from df2. How well this works will depend on the number of categories, length(unique(df2$c1)). The more of them, the faster the loop.







    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Nov 27 '18 at 19:26

























    answered Nov 26 '18 at 20:37









    Julius VainoraJulius Vainora

    38.7k76886




    38.7k76886













    • This is a really elegant approach, and will work well with a multi-level taxonomy like I have! Could you incorporate a "max.dist" filter into your example? I would like to determine some threshold (as in my example, dist<.10) to ensure the matches are sufficiently high quality.

      – roody
      Nov 26 '18 at 23:26











    • @roody, you mean to do the assignment when the distance is below 0.1 and no assignment otherwise?

      – Julius Vainora
      Nov 26 '18 at 23:30











    • Yes - exactly. in my code it is r<-subset(r,dist<.10) after calculating string.dist

      – roody
      Nov 27 '18 at 16:19











    • @roody, see the update.

      – Julius Vainora
      Nov 27 '18 at 16:46











    • One very dense follow-up that I just can't figure out. How would I create a new variable with the value of passThr? The reason why I ask is because my dataset seems to be including only perfect matches, so I want to analyze if I've done something wrong.

      – roody
      Nov 27 '18 at 18:33
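    In case it helps others with the same follow-up: since the updated loop already stores the best distance in df2$fuzz, the threshold flag can be derived afterwards. A one-line sketch, assuming the loop above has run:

    # Derive the pass/fail flag from the stored distances
    df2$passThr <- df2$fuzz < 0.1

    # Quick audit: if only zero distances appear, the matching may be
    # stricter than intended
    summary(df2$fuzz)
    table(df2$passThr)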




































    Consider extending your expand.grid approach to build a list of data frames, one merged result per category, then row bind them all at once outside the loop.

    library(stringdist)

    # Variable with categories to subset
    categories <- c("Beauty", "Personal Care", "Grocery", "Household Essentials")

    df_list <- vector("list", length = length(categories))

    for (i in seq_along(categories)) {
      df1_sub <- subset(wmt, category == categories[i])
      df2_sub <- subset(m, category == categories[i])
      df1_titles <- unique(df1_sub$product_title)
      df2_titles <- unique(df2_sub$product_title)

      # THE GRID IS CREATED PER CATEGORY HERE
      r <- expand.grid(df1_titles = df1_titles, df2_titles = df2_titles,
                       stringsAsFactors = FALSE)

      r$dist <- stringdist(r$df1_titles, r$df2_titles, method = "jw")
      r <- r[order(r$dist), ]
      r <- r[!duplicated(r$df1_titles), ]
      r <- subset(r, dist < .10)

      # ASSIGN NESTED MERGE (note [[i]], and merge keys matching the grid's column names)
      df_list[[i]] <- merge(merge(r, df1, by.x = "df1_titles", by.y = "product_title"),
                            df2, by.x = "df2_titles", by.y = "product_title")
    }

    # ROW BIND ALL DF ELEMENTS
    final_df <- do.call(rbind, df_list)
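    As a usage note (not part of the original answer): if a category produces no rows under the threshold, the corresponding merge is a zero-row data frame, which do.call(rbind, ...) handles without complaint. A quick per-category check might look like:

    # How many matched rows each category contributed
    sapply(df_list, nrow)

    # If dplyr is loaded (the question already uses sample_n), this is an
    # equivalent one-liner that also tolerates NULL list elements:
    # final_df <- dplyr::bind_rows(df_list)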





        answered Nov 26 '18 at 21:54









    Parfait





























