  1. scala - What is RDD in Spark - Stack Overflow

    Dec 23, 2015 · An RDD is, essentially, the Spark representation of a set of data, spread across multiple machines, with APIs that let you act on it. An RDD could come from any data source, e.g. text files, a …
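    A minimal spark-shell sketch of the two sources the snippet mentions (assumes the usual sc SparkContext; the file path is a placeholder):

        // an RDD from an in-memory collection, partitioned across the cluster
        val nums = sc.parallelize(1 to 100)
        // an RDD from a text file (placeholder path)
        val lines = sc.textFile("data/input.txt")
        println(nums.map(_ * 2).sum())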

  2. Difference between DataFrame, Dataset, and RDD in Spark

    I'm just wondering what the difference is between an RDD and a DataFrame in Apache Spark (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row]). Can you convert one to the other?
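    Yes, in both directions. A round-trip sketch in spark-shell style (assumes a spark SparkSession; the Person case class is a hypothetical example type):

        import spark.implicits._
        case class Person(name: String, age: Int)

        val rdd  = spark.sparkContext.parallelize(Seq(Person("Ann", 30), Person("Bo", 25)))
        val ds   = rdd.toDS()          // RDD -> Dataset[Person]
        val df   = ds.toDF()           // Dataset[Person] -> DataFrame (= Dataset[Row])
        val back = df.as[Person].rdd   // DataFrame -> typed Dataset -> RDD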

  3. scala - How to print the contents of RDD? - Stack Overflow

    But I think I know where this confusion comes from: the original question asked how to print an RDD to the Spark console (= shell), so I assumed he would run a local job, in which case foreach works fine.
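    For reference, the usual pattern: in cluster mode rdd.foreach(println) prints on the executors, not the driver, so collect (or cap) the data on the driver first. A sketch assuming an existing rdd:

        rdd.collect().foreach(println)   // fine for small RDDs
        rdd.take(20).foreach(println)    // cap the output for large ones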

  4. python - Pyspark JSON object or file to RDD - Stack Overflow

    I am trying to create an RDD on which I then hope to perform operations such as map and flatMap. I was advised to get the JSON in a JSON Lines format, but despite using pip to install jsonlines, I am unable to …
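    A sketch of the same idea, kept in Scala for consistency with the other examples (placeholder path; the file is assumed to be JSON Lines, one object per line, with a hypothetical name field):

        val df   = spark.read.json("data/events.jsonl")   // reads JSON Lines by default
        val rows = df.rdd                                 // drop to the underlying RDD[Row]
        rows.map(_.getAs[String]("name")).take(5).foreach(println)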

  5. How do I split an RDD into two or more RDDs? - Stack Overflow

    Oct 6, 2015 · I'm looking for a way to split an RDD into two or more RDDs. The closest I've seen is "Scala Spark: Split collection into several RDD?", which still yields a single RDD. If you're familiar with SAS, some...
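    There is no single split operation on RDDs; two common workarounds, sketched against an assumed rdd of integers:

        rdd.cache()                            // both children re-read the data
        val evens = rdd.filter(_ % 2 == 0)
        val odds  = rdd.filter(_ % 2 != 0)

        // or split randomly by weights, e.g. for train/test sets
        val Array(train, test) = rdd.randomSplit(Array(0.8, 0.2), seed = 42)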

  6. What is the difference between spark checkpoint and persist to a disk

    Feb 1, 2016 · RDD checkpointing is a different concept from checkpointing in Spark Streaming. The former is designed to address lineage issues; the latter is all about streaming reliability and …
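    A sketch of the RDD-side difference (placeholder checkpoint directory; assumes sc and an existing rdd):

        import org.apache.spark.storage.StorageLevel

        sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // placeholder path
        val costly = rdd.map(x => x * x)

        costly.persist(StorageLevel.DISK_ONLY)  // keeps the lineage; recomputed from it if lost
        costly.checkpoint()                     // writes to reliable storage and truncates the lineage
        costly.count()                          // an action is needed to materialize the checkpoint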

  7. hadoop - What is Lineage In Spark? - Stack Overflow

    Aug 18, 2017 · In Spark, the lineage graph is a dependency graph between existing RDDs and new RDDs. It means that all the dependencies between the RDDs are recorded in a graph, rather than …
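    The recorded lineage can be inspected with toDebugString; a small word-count sketch (placeholder path):

        val words  = sc.textFile("data/input.txt").flatMap(_.split("\\s+"))
        val counts = words.map((_, 1)).reduceByKey(_ + _)
        println(counts.toDebugString)   // prints the dependency (lineage) graph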

  8. python - Splitting a PySpark RDD into different columns and converting …

    Splitting a PySpark RDD into different columns and converting to a DataFrame · Asked 7 years, 9 months ago · Viewed 10k times
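    A Scala sketch of the same transformation, kept in one language with the other examples (assumes comma-separated lines and a spark session with implicits imported):

        import spark.implicits._
        val raw = sc.parallelize(Seq("a,1", "b,2"))
        val df = raw.map(_.split(","))
                    .map(a => (a(0), a(1).toInt))
                    .toDF("key", "value")
        df.show()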

  9. How do I iterate RDD's in apache spark (scala) - Stack Overflow

    Sep 18, 2014 · I use the following command to fill an RDD with a bunch of arrays containing 2 strings ["filename", "content"]. Now I want to iterate over each of those occurrences to do something with …
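    A sketch assuming the (filename, content) pairs come from wholeTextFiles (placeholder directory):

        val files = sc.wholeTextFiles("data/dir")   // RDD[(String, String)]

        // runs on the executors; side effects stay on the workers
        files.foreach { case (name, content) => println(s"$name: ${content.length} chars") }

        // or bring the results back to the driver
        files.map { case (name, content) => (name, content.length) }.collect().foreach(println)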

  10. scala - DataFrame.count() == 0 vs DataFrame.rdd.isEmpty(): please ...

    Jun 21, 2023 · DataFrame.count() requires materializing the query, which is costly. Is there a non-negligible cost [of materialization] to DataFrame.rdd, and how does that compare to the former? Is the …
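    The usual alternatives, sketched against an assumed DataFrame df:

        val e1 = df.rdd.isEmpty()    // converts to RDD[Row] first, which has its own cost
        val e2 = df.head(1).isEmpty  // stops after finding a single row
        val e3 = df.isEmpty          // built-in since Spark 2.4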