Add column while maintaining correlation of the existing columns in Apache Spark Scala












I have a DataFrame with the columns review and rating in Spark (Scala):



val stopWordsList = scala.io.Source.fromFile("stopWords").getLines().toList
val downSampleReviewsDF = sqlContext.sql("SELECT review, rating FROM ds")


I have written a function that removes stop words from a given review (a String):



def cleanTextFunc(text: String, removeList: List[String]): String =
  removeList.foldLeft(text) { (acc, termToRemove) =>
    // Remove the stop word as a whole word, strip punctuation except '.', collapse repeated spaces
    acc.replaceAll("\\b" + termToRemove + "\\b", "")
      .replaceAll("""[\p{Punct}&&[^.]]""", "")
      .replaceAll(" +", " ")
  }



How do I add another column, new_review, alongside review and rating? Each row's new_review should be the result of calling cleanTextFunc() on that row's review. cleanTextFunc takes two arguments: (1) the text to clean and (2) the list of stop words to remove from it.



The output should have the columns Text | Rating | New_Text, with each cleaned value on the same row as the review it came from.










  • Why not use the StopWordsRemover?
    – Shaido
    Nov 21 at 1:47
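
For readers weighing the comment's suggestion, here is a minimal sketch of the same column built with Spark ML's StopWordsRemover instead of a hand-written function. It rests on several assumptions not in the question: the object and helper names (StopWordsRemoverSketch, addNewReview), the sample rows, and the intermediate column names words and filtered are all made up for illustration. StopWordsRemover works on an array-of-strings column, so the review is tokenized first. Tokenizer lowercases the text and splits on whitespace only, so unlike cleanTextFunc this does not strip punctuation, and null reviews would need handling before tokenizing.

```scala
import org.apache.spark.ml.feature.{StopWordsRemover, Tokenizer}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws}

object StopWordsRemoverSketch {
  // Hypothetical helper: adds new_review using Spark ML's built-in transformers
  def addNewReview(reviews: DataFrame, stopWords: Seq[String]): DataFrame = {
    // Tokenizer lowercases the text and splits it on whitespace
    val tokenized = new Tokenizer()
      .setInputCol("review")
      .setOutputCol("words")
      .transform(reviews)

    // StopWordsRemover filters the token array against the supplied list
    val filtered = new StopWordsRemover()
      .setInputCol("words")
      .setOutputCol("filtered")
      .setStopWords(stopWords.toArray)
      .transform(tokenized)

    // Join the remaining tokens back into one string and drop the helper columns
    filtered
      .withColumn("new_review", concat_ws(" ", col("filtered")))
      .drop("words", "filtered")
  }

  def main(args: Array[String]): Unit = {
    // Local session and sample data are assumptions for the sketch
    val spark = SparkSession.builder().master("local[*]").appName("stopwords-sketch").getOrCreate()
    import spark.implicits._
    val reviews = Seq(("The food is great", 5), ("A bad place", 1)).toDF("review", "rating")
    addNewReview(reviews, Seq("the", "a", "is")).show(truncate = false)
    spark.stop()
  }
}
```

Because every step is a row-wise column transformation, review and rating stay on the same row as the new_review derived from them.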
















scala apache-spark apache-spark-sql apache-spark-2.0






edited Nov 21 at 1:47

























asked Nov 21 at 1:44









Nick

1 Answer
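It only takes a few more lines: wrap cleanTextFunc in a UDF that closes over the stop-word list, then apply that UDF with withColumn.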



import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Factory method: builds a UDF that closes over the stop-word list
def getStopWordRemoverUdf(removeList: List[String]): UserDefinedFunction =
  udf { (text: String) => cleanTextFunc(text, removeList) }

// Create the UDF by passing in your stop-word list
val stopWordRemoverUdf: UserDefinedFunction = getStopWordRemoverUdf(stopWordsList)

// Apply the UDF row by row; review and rating are kept and new_review is added
val cleanedDownSampleReviewsDf: DataFrame =
  downSampleReviewsDF.withColumn("new_review", stopWordRemoverUdf(downSampleReviewsDF("review")))
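
To try the UDF approach end to end, the following is a self-contained sketch under stated assumptions: the object name NewReviewColumnSketch, the local SparkSession, the sample rows, and the three stop words are invented for illustration. The cleanTextFunc here follows the question's intent but quotes each stop word with Pattern.quote and adds a null guard in the UDF, neither of which is in the original post.

```scala
import java.util.regex.Pattern

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object NewReviewColumnSketch {
  // Same idea as cleanTextFunc in the question; Pattern.quote is an added safety
  // measure so stop words containing regex metacharacters are matched literally
  def cleanTextFunc(text: String, removeList: List[String]): String =
    removeList.foldLeft(text) { (acc, term) =>
      acc.replaceAll("\\b" + Pattern.quote(term) + "\\b", "")
        .replaceAll("""[\p{Punct}&&[^.]]""", "")
        .replaceAll(" +", " ")
    }

  def main(args: Array[String]): Unit = {
    // Local session and sample data are assumptions for the sketch
    val spark = SparkSession.builder().master("local[*]").appName("new-review-sketch").getOrCreate()
    import spark.implicits._

    val stopWords = List("the", "a", "is")
    val reviews = Seq(("the food is great", 5), ("a bad place", 1)).toDF("review", "rating")

    // The UDF closes over stopWords; null reviews are passed through unchanged
    val cleanUdf = udf((text: String) => if (text == null) null else cleanTextFunc(text, stopWords))

    // withColumn evaluates the UDF per row, so each new_review stays beside its review and rating
    val result = reviews.withColumn("new_review", cleanUdf(col("review")))
    result.show(truncate = false)

    spark.stop()
  }
}
```

Note that removing a leading stop word leaves a leading space in the cleaned text; appending .trim inside the UDF would remove it if that matters for your output.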




References




  • Passing extra parameters to UDF in Spark






answered Nov 21 at 2:04

y2k-shubham
