00:03.97 greybeards And now it is my pleasure to introduce James Duong, co-founder of Bit Quill Technologies and senior staff developer at Bit Quill and Dremio, and David Li, Apache Arrow PMC member and software engineer at Voltron Data. So why don't you two tell us a little bit about yourselves and what Apache Arrow Flight is all about? 00:24.26 David Li Thanks, Ray. It's great to be here. So as mentioned, I am a PMC member for Apache Arrow. That means I'm one of its maintainers; I can vote on decisions. I'm also a software engineer at Voltron Data, which also contributes a lot towards the Arrow project. So before we introduce Arrow Flight, I think we should introduce Apache Arrow first. Just to be brief, Apache Arrow is an in-memory format, a cross-language, standardized in-memory format for columnar data. It also comes with other things, like an IPC format for serialization, and Arrow Flight, which is an RPC framework built on top of the in-memory format and the IPC format. So Arrow Flight is an RPC framework. It's specialized for transferring columnar data in Arrow format across the network. It's built on top of gRPC and Protobuf. It is just a framework. It's included with many of the Arrow libraries. But of course you need to take it and do interesting things with it, and one of the things recently is a project called Flight SQL, which I think James can introduce. 01:37.74 greybeards So James, why don't you tell us a little bit about yourself and. 01:37.85 James Duong Thanks, David. Sure. So yeah, as previously mentioned, I'm James Duong. I'm a senior staff developer at Bit Quill Technologies and Dremio Corporation. A while back at Dremio Corporation we decided to introduce this layer on top of the Arrow Flight project called Flight SQL, which is a standardized way of accessing SQL databases using the Arrow Flight protocol and server framework. So Flight SQL has a single Flight SQL client that can connect to any Flight SQL server.
02:20.81 greybeards And so, um, maybe just, yeah, I'm not a database expert, and some of the storage guys are not necessarily, but they work with database experts all the time. What's the distinction between row-based data and columnar-based data? Sorry, can I pronounce the thing? One of you, right? David, maybe. 02:43.74 David Li Yeah, so databases: there are both row-based databases and columnar databases. It's really just how you shape the data. If you look at a table of data, when you flatten it out, do you flatten it out along the rows or along the columns? Each of those has tradeoffs and advantages. Arrow focuses on columnar data; we think that has advantages for data science. For instance, columnar data compresses better because the values within a column are of the same type. 03:21.74 greybeards Ah, okay. 03:23.12 David Li And you can use different compression techniques, or even if you're just using a general-purpose compression technique, that'll probably work better. If you're doing processing on the data, columnar has advantages because, again, all of the data in a column is adjacent in memory, and you can apply things like SIMD or vectorization to get a speed boost. Of course, row-based has its advantages too: if you're seeking through data row by row, columnar is not necessarily going to be a good fit. But for a lot of applications, we think columnar has advantages. 03:59.25 greybeards And the other thing you mentioned about Arrow was that it was an in-memory solution. You want to talk a little bit about that? 04:07.20 David Li Yeah, so Wes McKinney, one of the co-founders of the project, when he introduces Arrow, he kind of likens it to breaking up the layers of a traditional analytical database so that you can use all of its components separately. So Arrow provides an in-memory format, which basically answers the question: say you have a column of integers.
How is that supposed to be laid out in memory? But also, once you have a column of integers, and you have a few columns, and you want to write that out to disk or transfer it over the network, how should you serialize that data? How should you encode that data on the network? And actually, for Arrow, we also try to focus on avoiding encoding and copying as much as possible, for efficiency. 04:58.20 greybeards And speed and performance and that sort of stuff. Yeah. 05:02.34 Matthew Leib That's interesting. You know, we talk a little bit from the storage side about in-memory advantages, and particularly lately about the inherent advantages of expanding that memory, leveraging Optane, for example. Are there benefits to Arrow from increasing the capacity of what's available in memory? 05:31.12 David Li Yeah, so I'm not super familiar with Optane specifically, but one of the properties that the Arrow in-memory and file formats try to maintain is being able to just memory-map a file and start working with the data right away. So you can work with larger in-memory data sets by just memory-mapping an Arrow file, without having to load it all into memory and without having to decode it first. 05:58.40 greybeards And so, yeah, the challenge, of course, is that there's only so much memory in the world, even with Optane. Maybe, I don't know, sixty-four terabytes per server might be a reasonable maximum or something like that. So if your Arrow database exists and it's, I don't know, let's say a couple hundred terabytes, you page that in and out of memory? Is that how it would work? I mean, just for in-memory processing. We haven't even started to talk about Flight yet, which is the other side of this coin. 06:32.94 Matthew Leib Sure. 06:34.80 David Li Right. Yeah, so at that point, um, yeah, you could... How do I want to say this? So columns.
Don't have to be entirely contiguous, in a sense. You can break up your columns into little chunks called record batches. 06:48.35 greybeards I got you. 06:54.52 David Li Inside of a record batch, everything's contiguous, but at a higher level you can then stream or iterate through record batches and process those incrementally. So yeah, in a sense you're paging data in and out, but this is accounted for in the file format, in the in-memory format, and so on. 07:10.93 greybeards And when you say a record batch, that would be all the columns for, let's say, the database across, you know, a thousand records or however many would fit into the record batch buffer or something like that? Or would it just be like column one, down to however much of column one actually fits into that record batch? 07:36.15 David Li It's the former. It's 2D; it's a rectangular chunk of data, all the columns, all with the same number of rows. 07:37.95 greybeards Okay. Okay, so bringing, you know, hundreds of terabytes of data back and forth into memory and writing it out must be quite I/O intensive. How do Arrow Flight and Arrow Flight SQL really improve that sort of overhead? It used to be you'd write out something from a database, it'd move from a database buffer to a memory cache buffer, and from the memory cache buffer to, let's say, a NIC, and from the NIC out to the storage, which has its own set of buffers, all that stuff, right? I mean, those are the old days, or maybe it's today, I don't know. So what does Arrow Flight bring to the table? 08:28.92 David Li Right. So Arrow Flight basically tries to make all that more convenient and faster, especially if you're working with Arrow data, or really only if you're working with Arrow data. So if you have Arrow data in memory and you want to transfer it over the network by using Arrow Flight.
You don't have to go implement all that yourself. You get these high-level methods, so you just say, I want to read the next record batch, or I want to write the next record batch. And Arrow will take your data and punch through the layers of all the networking stuff. It uses gRPC and Protobuf, avoids copying data as much as possible, and gets that onto the NIC and across the network as fast as possible. And then Flight, sorry, and then Flight SQL is taking those benefits, and taking the benefits of Arrow, and then trying to bring it. 09:14.44 greybeards And, go ahead. 09:24.91 David Li Towards SQL databases. 09:27.70 greybeards So, James, I guess SQL databases were always row-based, and you'd sit there and you'd do, like, if column X is, you know, Matt Leib's Social Security number, then bump his pay raise by 10% or something like that. It was a row-oriented, bring it in, do something with it, and put it out kind of stuff. So how does Flight SQL work with a columnar database? 09:58.10 James Duong Well, Flight SQL provides the protocol for high-performance transport by sending the data in a columnar format. Traditional APIs like ODBC and JDBC are row-oriented. When you have a JDBC driver and you're accessing data from a SELECT query, you get a ResultSet interface, and what you do is check if there's a row using ResultSet.next(), and then you get values on that single row using, say, getObject or getString on each column. So one at a time. Um. 10:30.56 greybeards Right. 10:35.11 James Duong If you're using the Flight SQL interface, you could just get a single record batch and pull out a vector representing a column for that batch.
And you could go through the stream getting record batches until you've gotten all the data. Now, if your application layer is working with Arrow data, that's when you really get the benefits out of Flight. You're already working with vectors that do not have to be deserialized. 11:09.35 greybeards Yeah, you mentioned serialization and deserialization before. Can you explain to me what that sort of process is, or what that means in this sense? 11:20.24 James Duong Yeah, so say you have a JDBC driver. Well, JDBC has its own formats for integers, strings, and timestamps, for example. So when you build a JDBC driver, you have to convert from the database's wire format representations of those to JDBC's format. Potentially the database also needs to convert from its own internal representation to the wire format as well. 11:50.20 greybeards So you've got a transfer from the database to the wire format and from the wire format to JDBC format. 11:59.21 James Duong Whereas if you're using Arrow Flight and, say, your database uses Arrow internally, it's just copying data to the wire, and then the client doesn't even have to deserialize the data but can just operate on it. 12:20.57 greybeards So there's very little format conversion required, unlike ODBC and JDBC, which would require multiple format conversions during the data transfer. Is that what you're saying? Okay, okay. There was some mention of. 12:34.65 James Duong That's correct. Yep. 12:40.68 greybeards Parallelization as part of Arrow Flight. Could you explain how that plays out in this game? 12:47.85 James Duong I could talk about this. So modern computing engines often support multi-node systems. Most systems are distributed nowadays. Um.
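As an aside on the access patterns described above, here is a minimal sketch of the difference between walking a result row by row (in the spirit of a JDBC `ResultSet` loop) and iterating over rectangular record batches and pulling out a whole column at a time. It is plain standard-library Python, not the Arrow or Flight SQL API; the table contents and the names `rows`, `columns`, `record_batches`, `sum_price_row_cursor`, and `sum_price_batches` are all hypothetical and exist only for illustration.

```python
from array import array

# Hypothetical toy table stored two ways (plain Python, not the Arrow API).
rows = [(1, 10.0), (2, 12.5), (3, 9.0), (4, 11.5)]   # row-oriented
columns = {                                          # column-oriented
    "id": array("q", [1, 2, 3, 4]),
    "price": array("d", [10.0, 12.5, 9.0, 11.5]),
}

def sum_price_row_cursor(rows):
    """Row-at-a-time access: visit each row, then pick one field out of it."""
    total = 0.0
    for row in rows:
        total += row[1]
    return total

def record_batches(columns, batch_rows):
    """Yield rectangular chunks: every column, each with the same row count."""
    n = len(next(iter(columns.values())))
    for start in range(0, n, batch_rows):
        yield {name: col[start:start + batch_rows]
               for name, col in columns.items()}

def sum_price_batches(columns, batch_rows=2):
    """Batch-at-a-time access: take the whole 'price' column of each batch."""
    total = 0.0
    for batch in record_batches(columns, batch_rows):
        total += sum(batch["price"])
    return total
```

Both functions compute the same total; the batch version touches only the one column it needs, stored contiguously, which is the property the speakers credit for compression and vectorization benefits.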
13:01.37 greybeards Yeah, yeah, and you're not talking about multi-core, you're talking multi-server node, right? 13:05.38 James Duong Multi-server, yeah. I'll use Dremio as an example. We have a coordinator node, and then we have several executor nodes for processing a query. The coordinator does planning, and then we delegate the work to executors, and they individually execute the query plan. What Arrow Flight provides is a way, as a response to a request, to report each endpoint where the data is being served, so that your client layer can then start consuming data at multiple different endpoints at once. Potentially, if your client itself is also distributed, you could have your client working with data on multiple nodes on its side as well. 13:53.27 greybeards And that's multiple servers, right? And so the other side of this is multiple cores. Could you have multiple clients sitting in the same server? 14:09.54 James Duong You mean multiple clients requesting from the same server? 14:12.10 greybeards Um, gosh, what do I mean? So each core is its own compute engine, effectively. If I've got parallel access to the data, can I have all the cores effectively working on their own columns or record batches separately? Or within a server, I guess, it's a record batch that this server gets and a record batch that some other server gets. So the unit of granularity for parallelization is record batches. 14:51.60 David Li So I would say Arrow and Arrow Flight give you all the tools to parallelize and split things up whichever way is best for your application. For instance, you can have multiple clients making requests to the same server; you can have one client making multiple requests to multiple servers; you can split data up. So there's a little detail here.
So the Arrow Java and Arrow C++ libraries conceptualize things slightly differently, but effectively a record batch or a VectorSchemaRoot is like a unit of data, yes. You can have each thread working independently on its own chunk of data, process that, and send it back over Arrow Flight independently. 15:45.21 Matthew Leib Is there any specialized hardware in creating these Arrow nodes? 15:54.78 David Li So Arrow as an in-memory format is intended to be hardware agnostic. It's intended to be designed in a way that it's efficient to implement, but it's not tied to particular hardware. For instance, Arrow's CI infrastructure tests Arrow on x86 machines, it tests Arrow on Macs, it tests Arrow on an s390x from IBM, and on some PowerPC machines. So we have all sorts of things. 16:27.24 greybeards Mainframe? Did you say mainframe? 16:29.33 David Li I said s390x, yeah. 16:30.17 James Duong Lick. 16:30.95 greybeards Ah, Arrow. Okay. And I guess the other side of this is, this is all open source, right? I mean, it's an Apache project, right? 16:39.99 David Li Yeah, Arrow is under the Apache umbrella. It's open source. We have contributors from many companies, we have contributors from all over the world, and we have Arrow projects in all sorts of different languages. The Julia project recently joined the main Arrow umbrella as well. So we have lots of things that are supported. Yeah. 17:09.28 greybeards And so, I mean, the reason I really wanted you guys to come on the show was because there's not a lot of high-performance access mechanisms or access protocols out there, especially in the open source community. Most of the high-performance access protocols are either proprietary or they're POSIX-based, effectively. So you would have a POSIX client for.
For vendor X, and they'd have their own servers to support their own parallelization. Now NFS is coming out with some parallelization in 4.2, I believe, but this is something different. Um, I'm now trying to think what the question is. 17:58.66 Matthew Leib Um, good. 17:59.72 greybeards So do you have any performance statistics on what Arrow Flight could potentially deliver, as far as gigabytes per second, or record batches per second, I guess, would be the other claim? 18:13.51 David Li Um, James, do you have any figures from Dremio? 18:19.19 James Duong Um, I could pull some up in a moment. 18:24.62 greybeards Yeah, yeah. 18:25.16 Matthew Leib Yeah, one of the things that occurs to me is that, while the software side, which is where you're working, is highly dependent, you know, it could be, who knows, line-level speeds. But it's going to require fast networking, and all of these hardware functionalities are variables that are going to be hard to compare apples to apples. 18:58.31 greybeards Yeah, I agree. 18:58.68 David Li Um, that is a good point. So I think I briefly mentioned Arrow Flight uses gRPC underneath. gRPC is an RPC framework from Google, and it's been pretty well optimized for TCP communication. But recently we're also looking at integrating the UCX networking library into Arrow Flight as well, because Arrow Flight abstracts away the underlying networking library it's using. UCX is a library that's designed to take advantage of specialized hardware like InfiniBand interfaces. We're hoping, well, the tests were conducted on a cluster that I can't disclose exact numbers from, but UCX does quite well when it has access to, ah. 19:34.48 greybeards Oh, okay. 19:42.30 greybeards Yeah, yeah, it's fine. 19:51.32 David Li Specialized hardware. 19:52.10 greybeards And this would be an InfiniBand solution. So let's talk about the hardware configuration here.
Um, Arrow Flight obviously requires client software sitting on the client, and there's server software as well, sitting on some server someplace. And then behind that server would be SSDs directly attached, or disks directly attached, or do you support other storage systems behind that? 20:19.31 David Li Yeah, so think of Arrow and Arrow Flight as more of a toolkit and a set of standards and protocols. So again, we're not trying to make particular requirements on the kind of hardware setup you have or anything like that. Basically, Arrow Flight at the network level is a set of APIs based on gRPC, and then we also ship client and server libraries that any application can use, in a variety of different programming languages, to build higher-level things like Flight SQL on top of these libraries. I see James has some performance figures, if you want to mention those. 21:03.20 greybeards Yeah, please do, James. 21:04.36 James Duong Yeah, so when we did some testing of this at Dremio, we saw throughput rates of twenty gigabytes per second without using Flight's parallelization features. Yep, so that's just. 21:14.96 greybeards Without Flight's parallelization. So you potentially could see twenty gigabytes per second per parallel transfer, if you had the hardware that could support it. And that's over Ethernet, that's TCP, right? I mean, it's not. 21:21.36 James Duong Um, that's right. 21:26.19 Matthew Leib That's pretty speedy. 21:33.77 greybeards You don't require any special switching or anything like that, right? 21:35.17 James Duong Right. 21:40.11 greybeards I mean, yeah. 21:40.36 Matthew Leib And yet you did mention InfiniBand, so is there some reliance on InfiniBand as a protocol? 21:49.36 David Li No. So once we have UCX fully integrated, we'll be able to take advantage of InfiniBand hardware if you have it. But if you don't, you can continue using gRPC and everything will just work over TCP.
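The parallelization feature mentioned in the throughput discussion, where a response reports every endpoint serving part of the result and the client reads from several at once, can be sketched roughly as follows. This is a simulation in standard-library Python, not the Arrow Flight client library: `FAKE_CLUSTER` and its contents are invented, and the local functions `get_flight_info` and `do_get` are stand-ins that merely borrow the names of Flight's GetFlightInfo and DoGet RPCs.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a cluster: endpoint name -> record batches it serves.
FAKE_CLUSTER = {
    "executor-1": [[1, 2, 3], [4, 5]],
    "executor-2": [[6, 7], [8]],
    "executor-3": [[9, 10, 11]],
}

def get_flight_info(query):
    """Metadata step (simulated): report which endpoints hold result data."""
    return sorted(FAKE_CLUSTER)

def do_get(endpoint):
    """Data step (simulated): pull every record batch from one endpoint."""
    return FAKE_CLUSTER[endpoint]

def fetch_all(query):
    """Ask where the data lives, then read from all endpoints concurrently."""
    endpoints = get_flight_info(query)
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        # pool.map returns results in the same order as `endpoints`.
        per_endpoint = list(pool.map(do_get, endpoints))
    return [batch for batches in per_endpoint for batch in batches]
```

Calling `fetch_all("SELECT ...")` returns the batches from all three simulated endpoints. In a real deployment each `do_get` would be a network stream, which is where reading endpoints concurrently would be expected to pay off.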
22:04.33 greybeards And you mentioned IPC as well, as another protocol that you use. Maybe just for my own edification, can you tell me the distinction between IPC and gRPC? 22:17.72 David Li Yeah, so we have the Arrow in-memory format. That's just how the data gets laid out in RAM. If we want to serialize it and then send it to another process or write it to disk or something, that's when you use the IPC format. The IPC format basically specifies how you pack the buffers on the wire, the message headers, stuff like that. That all gets sent over gRPC. gRPC is an RPC framework from Google and the Cloud Native Computing Foundation; it handles all the networking details. So those are like the three layers here in Arrow Flight. 22:56.62 greybeards Oh, okay, I got you. They're not alternate layers; they're all combined to support the transfers and that sort of stuff. So where do you see in-memory databases being used these days, columnar and in-memory databases? I mean, what sort of clients or customers would be using them, and what would they be doing with them? 23:25.61 James Duong So I'd say in-memory databases are really good at doing batch analytics, dealing with large fact tables, and being able to produce meaningful data using BI tools. 23:41.29 greybeards Um, and you don't see that being applicable to things like machine learning or anything of that nature? 23:53.57 James Duong Um, I can see this being used in machine learning. One of the big use cases for Arrow is to be able to efficiently process data using Spark. 24:11.35 greybeards Ah, okay. 24:13.39 James Duong Being able to load Arrow data into Spark without serialization. Normally, when you work with Spark jobs, you write a Python script and then you send the work to a JVM that processes it. But if you're using Arrow.
24:15.88 greybeards Right, right, right. 24:28.87 James Duong Um, there's no serialization required to go from the Python data to the JVM data, because it's just Arrow data. 24:34.90 David Li Yeah, that's a good example. So I guess think of Arrow as kind of like a bridge between all these different systems. Spark uses Arrow to implement its Python user-defined functions, and other systems like BigQuery and Snowflake also use Arrow to transfer data at different points, I think in the client libraries in these cases. 25:01.68 greybeards Kafka could be a potential solution here as well? I know Kafka has some Spark support. 25:10.87 David Li Off the top of my head, I can't think of anyone combining Kafka and Arrow per se, but there's no reason stopping you if you need to get columnar data from point A to point B. 25:27.24 greybeards So what about high availability and those sorts of things? I mean, in-memory data is great, but, you know, fault tolerance and that sort of thing are sometimes required attributes, especially of databases, quite frankly, because they become so critical to BI and other critical corporate functionality. Does Arrow Flight offer high availability, or is that something you just kind of configure it with? 26:05.56 David Li So ultimately, that's up to the application being built on top, but Arrow Flight does provide things to try to make it easier to implement reliable applications. So again, because we're building on top of gRPC. 26:09.67 greybeards Um. 26:24.20 David Li It means we inherit a lot of the tooling. gRPC is its own, like, rich ecosystem.
So Arrow Flight building on top of that means we inherit all the tooling, all the best practices that have been built up on top of gRPC over the years, all the observability, monitoring, and logging tooling, all the knowledge of how to debug things. All of that still applies to Arrow Flight because it is gRPC underneath. 26:53.24 greybeards Okay, so you get the advantage of gRPC and that sort of stuff. Ah, but. 26:59.51 James Duong I'd like to mention that Arrow Flight's ability to report multiple endpoints can be used for data redundancy as well. So if, say, one of the endpoints that hosts the data has gone down, you could go to another one. 27:13.12 greybeards Right, if you've got a copy of it at that other endpoint. I see, I see. 27:19.62 Matthew Leib That's interesting. And forgive me if I'm making too much of an assumption, Ray, but it seems as if you and I are thinking in terms of how hardware might resolve a lot of these issues. But. 27:31.83 greybeards Um, right. 27:34.42 Matthew Leib But ultimately, with a database language or a file format, we're really looking here at how those problems are actually resolved by software. So, you know, split brain taking place, that's a transaction that. 27:43.11 greybeards Um. 27:52.73 Matthew Leib Doesn't necessarily complete, and is there cache coherency from site to site? And you're saying that's not really a function of Arrow. That's really a function of the overriding architecture that actually handles the transactions. 28:00.88 greybeards At the application layer. 28:12.35 Matthew Leib Would that be a correct statement? 28:12.57 David Li Yeah, that's correct. 28:16.75 greybeards I got you. But you mentioned that you could automatically replicate or mirror Arrow Flight data onto different storage servers, if I'm using the correct terminology, just by configuring it. 28:32.72 David Li Yeah, ah, kind of. So Flight.
Well, sorry, so I guess there's always more layers to peel back here, right? So, ah, Flight just defines. 28:36.49 greybeards That way, I guess. 28:40.38 Matthew Leib Correct. 28:44.28 greybeards Yeah, yeah. 28:50.72 David Li A protocol and some RPC methods that you can use to build things like that, and it kind of tries to be suggestive and corral you into doing stuff like that. So, for instance, when you're requesting data from a Flight service, the recommended pattern is that you first make a metadata call called GetFlightInfo, and that tells you where this data set can be fetched from and how it's partitioned and, as James mentioned, alternative endpoints that you can fetch data from if the primary endpoint is down. 29:23.56 greybeards Right, right. 29:28.27 David Li And as long as your client implements that, then yeah, you can build in redundancy, you can build in parallelism. It's still up to your application to actually implement those details, but Flight tries to encourage you to do that and make it easy for you to do that. 29:39.45 greybeards Right, right. The other challenge that open source has had historically is operations or configuration and that sort of stuff. I would say open source is typically developed by technical development teams, and there aren't necessarily usability teams associated with that. How hard is it to configure and make use of something like Arrow and Arrow Flight and Arrow Flight SQL? 30:18.12 David Li That's something that the community is actively working on, I would say. We've been trying to improve the documentation, especially in languages like Java. We recently started an Arrow Cookbook initiative to try to provide these simple, reusable recipes for accomplishing common tasks with the Arrow libraries. Now, this is maybe a common cop-out of open source projects. But if.
If there's something that's not clear, if there's something that you want improved, please let us know. At least for me, because I've been in the project for a few years now, it can be hard to see where things are confusing or unclear, so having these questions. 31:04.88 greybeards Yeah. 31:09.76 David Li Having these questions really helps me as a contributor know where to focus my efforts, know where we need to explain more, basically. 31:20.33 greybeards Right. 31:20.70 Matthew Leib That's a very valid point, you know, the forest-for-the-trees conversation. But the difficulty with open source in general has always been a lack of support, um, because you get. 31:35.20 greybeards Which is the other side of this. Yeah, I agree. 31:39.97 Matthew Leib Yeah, so I imagine that community support, though, is quite robust in a project of this magnitude. 31:47.81 David Li Yeah, I'd say so. Well, I guess there are a couple of ways to approach it. Dremio, as far as I can see, is actually fairly active in the project itself, and one of Dremio's co-founders was also a co-founder of the Arrow project. But also, yeah, I and many of the other contributors do our best to monitor Stack Overflow, our mailing list, GitHub issues, all that, to try to provide support as best we can. And of course that's not guaranteed, but I think we try our best to address everyone's questions. 32:28.58 Matthew Leib Nobody's denying that. I think that, you know, the historic need has been finding a community of practitioners who actually understand the product and actually understand the avenue that the end user is attempting to go through to resolve these questions, et cetera. In my mind, you know, when you've got a product of this significance, you've.
More than likely got people that have faced similar issues in the past and can send you in a decent direction, even if it's ad hoc support. So, you know, I think that we're not seeing the same issues we used to see. 33:23.40 greybeards Um. 33:27.34 David Li Right. Yes, I think the Arrow community has grown a lot and is still growing. So yeah, there is a fairly active community around it now across all these different languages, and of course there's also commercial support. But. 33:46.25 greybeards Oh, and there is commercial support for Arrow and Arrow Flight? 33:46.52 David Li That's always an option. Yeah, so it's through my employer, Voltron Data. So I won't speak too much, but it's also an option. 33:54.48 greybeards Ah, okay, right, that's good. We were kind of probing to see if that was available as an option and. 34:00.59 Matthew Leib To. 34:10.27 Matthew Leib And that's good. I mean, obviously, if you are a modern institution, a bank, you're relying on the data and the accessibility in the long term. You want to know if you can. 34:12.38 greybeards Yes. 34:27.52 Matthew Leib Gain greater levels of support, and obviously you can. 34:30.38 greybeards Can you guys speak to some of, let's say, your bigger installations? You don't have to actually name the company, but you might talk about what they're doing from a vertical perspective with Arrow Flight and perhaps Arrow Flight SQL. 34:48.21 David Li Ah, well, Dremio is the obvious candidate here. 34:51.90 greybeards James. 34:53.11 James Duong Right. So Dremio has recently made Dremio Cloud available, and with Dremio Cloud comes support for Arrow Flight through a centralized service now. So that's one of the big changes.
We adopted Arrow Flight into Dremio Enterprise Edition about a couple of years ago. We added support for Arrow Flight on its own and then started the initiative to do Flight SQL. We're currently building up Flight SQL support. 35:28.94 greybeards Can you tell us just a little bit about Dremio as a company, what you guys are doing? Because, to tell you the truth, I've heard about them but I don't know exactly what you're doing. 35:37.71 James Duong Dremio is a query engine for accessing data lakes efficiently and executing SQL using an Arrow-based execution engine. So we take advantage of the features that David's mentioned, like being able to vectorize computations on data for the purpose of processing SQL, as well as exposing data to users using Arrow Flight. Dremio can connect to a lot of different sources, including, you know, Azure data lakes, Amazon S3, and Google Cloud Storage, so data lakes, as well as more traditional sources like relational databases such as SQL Server, PostgreSQL, or. 36:24.26 greybeards Right. 36:35.24 James Duong Or Redshift, for example. Now MySQL's included, Oracle as well. 36:36.98 greybeards MySQL. Yeah, yeah. 36:42.54 Matthew Leib Oracle as well. That's the first mention we've had thus far, and it had occurred to me. If you've got raw data sitting in an Oracle database, it seems logical that there be an interpreter of that data into Arrow. 37:00.67 James Duong So what Dremio does is provide a connector based on JDBC to suck data in from a traditional database and then get it into the Arrow format so that Dremio can work with it. It tries to push as much work as possible down to the backend database, though. 37:20.46 greybeards Right, right, right.
You mentioned vectorization, and I would assume that because you have this data sitting in, you know, a columnar format in memory, vector operations would be very useful here. So are you using things like GPUs to do those sorts of things, or is this something where you're using, I'll call it, vector instructions like on x86, et cetera?
37:44.13 James Duong So we use a component of Arrow that was developed at Dremio called Gandiva. Yeah, so Dremio is a Java-based server, and we use Gandiva to be able to access some of Arrow's lower-level features, including its
37:52.92 greybeards I'm sorry, Gandiva?
38:04.19 James Duong SIMD operations.
38:05.66 greybeards Single instruction, multiple data operations. I just want to translate. So this is vectorization, but vectorization could occur at the CPU level, it could occur in a GPU, it could occur in an FPGA.
38:08.11 James Duong Right.
38:21.55 greybeards Am I right in assuming that you're using primarily the SIMD instruction sets for the CPUs that you're operating on?
38:29.99 James Duong I'm actually not sure about the answer to that. David, would you know? This is kind of abstracted from me.
38:32.87 greybeards Yeah, no, no.
38:33.30 David Li Yeah, no. So Gandiva is based on the LLVM compiler framework. As far as I know, it targets CPUs mostly. The interesting thing there is that Gandiva is written in C++ even though much of Dremio uses Java. But because Arrow is a standardized memory layout, those two languages can share data between them without having to copy it at all. They can just pass pointers around. So that's a big advantage of Arrow here, that a JVM-based system can take full advantage of native C and C++ capabilities. But you mentioned GPUs and FPGAs, and I want to say, so the NVIDIA RAPIDS ecosystem has a library called cuDF, which implements data frame operations using the Arrow memory layout on GPUs. So we
39:09.27 greybeards Right.
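The zero-copy sharing David describes can be illustrated in miniature with Python's standard library alone. This is only a sketch under stated assumptions, not Arrow's or Gandiva's API: `array` and `memoryview` stand in for an Arrow column buffer, and the names `column`, `producer_view`, `consumer_view`, and `column_sum` are hypothetical.

```python
from array import array

# A contiguous column of 64-bit integers, standing in for an Arrow column buffer.
column = array("q", [10, 20, 30, 40])

# Two "components" obtain views of the same memory; no bytes are copied.
producer_view = memoryview(column)
consumer_view = memoryview(column)

# A write made through one view is visible through the other,
# because both refer to the same underlying buffer.
producer_view[1] = 99


def column_sum(view):
    """Sum a column by scanning its contiguous values in order."""
    return sum(view)


total = column_sum(consumer_view)

# Both views point back at the original buffer object rather than a copy.
shares_memory = consumer_view.obj is column
```

The point of the sketch is that both sides agree on one memory layout, so handing over a reference is enough. Arrow applies the same idea across languages, which is what lets a Java process and a C++ library work on the same columns.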
39:27.26 David Li We do see Arrow usage with GPUs as well. And the name escapes me at the moment, but there is also a project that works with FPGAs and Arrow. Basically, you can give it an Arrow schema,
39:44.81 greybeards Right.
39:46.91 David Li basically the data types, and it'll generate, I think, VHDL or Verilog to work with that data on an FPGA.
39:52.87 greybeards Wait a minute, wait a minute. So you can give an Arrow schema to this process and it'll generate the hardware description language to program an FPGA to process it?
39:56.58 Matthew Leib Set up.
40:10.39 greybeards Yeah, that's what you're saying?
40:10.95 David Li Yes. You still have to provide the actual processing part, but it will generate... Yeah, it's called Fletcher. It'll generate all the interfaces for you. So it reduces the amount of work you have to do to program the FPGA.
40:19.63 greybeards The interfaces or something like that to allow it to access the same... Yeah, yeah, Fletcher. Okay.
40:30.44 David Li Again, that's called Fletcher if you want to look at it. Yes, there are lots and lots of Arrow-based puns.
40:38.22 greybeards Yeah, okay, all right. Well, hey, this has been great. David and James, any last questions for Matt or myself, or is there something you wanted to say to our listening audience? That's probably the better thing.
40:52.98 Matthew Leib Will that. But.
40:53.98 David Li Well, as I'm trying to represent the open source project here: if you want to get involved, please reach out on the mailing list, dev@arrow.apache.org, or you can send GitHub issues or pull requests on GitHub at apache/arrow.
41:14.67 greybeards Okay, great. Matt, anything you'd like to ask before we leave?
41:19.79 Matthew Leib No, no questions. But I just want to thank you guys. This is a very interesting project you're working on, and I learned a lot.
41:29.90 greybeards Yeah, yeah. Well, this has been great. David and James, thank you for being on our show today. That's it for now. Bye David, bye James, bye Matt. Until next time.
41:35.82 David Li Thanks for having us.
41:36.69 James Duong Thank you.
41:44.49 Matthew Leib All right.
41:44.66 David Li Goodbye.
41:48.70 greybeards Have a good day.