The MapReduce paper was published by Google in 2004. MapReduce is a programming model that describes how to do large-scale data processing on large clusters of commodity hardware.
The MapReduce paper marked the beginning of the “big data” movement. The Hadoop project is an open-source implementation of the ideas in the MapReduce paper. Doug Cutting and Mike Cafarella wrote software that allowed anybody to use MapReduce, as long as they had significant server operations knowledge and a rack of commodity servers.
Hadoop got deployed first at companies with internal engineering teams that could recognize its importance and implement it: companies like Yahoo and Microsoft. Word quickly spread about the leverage Hadoop could provide.
Around this time, every large company was waking up to the fact that they had tons of data and didn’t know how to take advantage of it. Billion-dollar corporations in areas like banking, insurance, manufacturing, and agriculture all wanted to use this amazing new way of looking at their data. But these companies did not have the engineering expertise to deploy Hadoop clusters.
Three big companies were formed to help bring Hadoop to large enterprises: Cloudera, Hortonworks, and MapR. Each of these companies worked with hundreds of large enterprise clients to build out their Hadoop clusters and help them access their data. Tomer Shiran spent five years at MapR, seeing the data problems of these large enterprises and observing how much value could be created by solving these data problems.
In 2015, eleven years had passed since MapReduce was first published, and companies were still having data problems. Tomer started working on Dremio, a company that was in stealth for another two years. I interviewed Tomer two years ago, when he still could not say much about what Dremio was doing. We talked about Apache Drill, an open-source project related to what Dremio eventually built.
Earlier this year, two of Tomer’s colleagues, Jacques Nadeau and Julien Le Dem, came on to discuss columnar data storage and interoperability. What I took away from that conversation was that today, data within an average enterprise is accessible, but the different formats are a problem. Some data is in MySQL, some is in Amazon S3, some is in Elasticsearch, and some is on HDFS stored in Parquet files. Different teams will set up different BI tools and charts that read from a specific silo of data.
At the lowest level, the different data formats are incompatible: you have to transform MySQL data in order to merge it with S3 data. On top of that, engineers doing data science work are using Spark, Pandas, and other tools that pull lots of data into memory. If the in-memory formats are not compatible, the data teams can’t get the most out of their work. On top of THAT, at the highest level, data analysts are working with different data analysis tools, so there is even more siloing.
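To make the format problem concrete, here is a minimal sketch in Python of what “transform MySQL data in order to merge it with S3 data” can look like. Everything in it is an assumption for illustration: the column names, the sample rows, and the helper functions are hypothetical, and no real database or S3 client is involved. It is not how Dremio does it; it only shows the kind of glue code each team ends up writing by hand.

```python
import json

# Hypothetical rows as they might come back from a MySQL query:
# positional tuples plus a separate list of column names.
mysql_columns = ("customer_id", "name")
mysql_rows = [(1, "Acme"), (2, "Globex")]

# Hypothetical newline-delimited JSON as it might sit in an S3 object.
# Note the different key name ("cust") and the id stored as a string.
s3_object_body = (
    '{"cust": "1", "total": 40.0}\n'
    '{"cust": "2", "total": 15.5}\n'
    '{"cust": "1", "total": 10.0}\n'
)


def normalize_mysql(columns, rows):
    """Turn positional tuples into dicts keyed by column name."""
    return [dict(zip(columns, row)) for row in rows]


def normalize_s3(body):
    """Parse JSON lines and coerce the id to the MySQL side's name and type."""
    records = []
    for line in body.splitlines():
        if not line.strip():
            continue  # skip blank lines
        raw = json.loads(line)
        records.append(
            {"customer_id": int(raw["cust"]), "total": float(raw["total"])}
        )
    return records


def total_by_customer(customers, orders):
    """Join the normalized record sets on customer_id and sum order totals."""
    totals = {c["customer_id"]: 0.0 for c in customers}
    for order in orders:
        if order["customer_id"] in totals:
            totals[order["customer_id"]] += order["total"]
    return {c["name"]: totals[c["customer_id"]] for c in customers}


result = total_by_customer(
    normalize_mysql(mysql_columns, mysql_rows),
    normalize_s3(s3_object_body),
)
print(result)
```

Even in this toy case, the two sources disagree on field names and types, and somebody has to write and maintain the code that reconciles them before any analysis can start.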
Now I understand why Dremio took two years to bring to market.
They are trying to solve data interoperability by making it easy to transform data sets between different formats. They are trying to solve data access speed by creating a sophisticated caching system. And they are trying to improve the effectiveness of the data analysts by providing the right abstractions for someone who is not a software engineer to study the different data sets across an organization.
Dremio is an exciting project because it is rare to see a pure software company put so many years into up-front stealth product development. After talking to Tomer in this conversation, I’m looking forward to seeing Dremio come to market. It was fascinating to hear him talk about how data engineering has evolved to today.
Some of the best episodes of Software Engineering Daily cover the history of data engineering, including an interview with Mike Cafarella, the co-founder of Hadoop, and another episode called “The History of Hadoop” in which we explore