Ready to move beyond Word Count? Watch as John Hogue walks through a practical example of a data pipeline that feeds textual data for tagging with PySpark and ML. Learn how to leverage great existing Python libraries such as NLTK in Spark, and how to use some of Spark’s newer features. A GitHub repo with source code plus training and test data sets will be provided for attendees to explore and play with.
The talk will cover:
• Reading in data from Hive with SparkSQL
• Distributing non-Spark Python libraries with your PySpark job
• Performing common NLP preprocessing & feature extraction within PySpark DataFrames
• Training & evaluating available MLlib algorithms for text classification
• Saving your results back out to Hive