darkone (\/) (;,,,;) (\/) 11610 Posts |
Random question for anyone who might have some ideas about a mystery I've run into:
Context: I've mounted some NFS shares on Windows. Read performance from these shares seems acceptable for everything I can think of except Matlab. If I copy files from the shares via FTP or robocopy from my Windows machine, I average about 40-45 MB/s.
Matlab behaves differently: I wrote a Matlab script that reads data from 10 different HDF files to use as a benchmark. If the files are on the local drive, the benchmark runs in just over 6 seconds. If the files are on the NFS share, the benchmark takes 330 seconds. Using robocopy to copy the benchmark files from the NFS share to a local directory took 22 seconds. Something strange is happening in Matlab that doesn't make sense to me.
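The benchmark itself is nothing fancy; roughly this shape, with paths and the dataset name swapped for placeholders (assuming HDF5 for the sketch; the HDF4 equivalent would use hdfread):

% Timed loop over 10 files; paths and dataset name are placeholders.
files = arrayfun(@(k) sprintf('Z:\\share\\test_%02d.h5', k), 1:10, ...
                 'UniformOutput', false);
tic;
for k = 1:numel(files)
    data = h5read(files{k}, '/some_dataset');  % read each file in full
end
toc;  % just over 6 s from local disk, ~330 s from the NFS share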
Anyone have any ideas? 6/29/2015 5:07:10 PM |
clalias All American 1580 Posts |
Reading/writing data line by line or record by record across a network is always slower for us. I'm not sure why, but it's true for other languages as well: C, Python, Java, etc. I've always chalked it up to some magic the OS does to move things in bulk quickly. IDK.
We always copy files over to a local server and do our runs then move the data back.
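In Matlab the staging step is only a few lines; a minimal sketch, with a made-up remote path and dataset name:

% Stage the file locally, read it there, clean up afterwards.
remote = '\\server\share\granule_001.h5';  % hypothetical path
local  = fullfile(tempdir, 'granule_001.h5');
copyfile(remote, local);                 % one bulk transfer
data = h5read(local, '/some_dataset');   % fast local reads
delete(local);                           % remove the staging copy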
If you find anything else out, I'd be interested to hear.
[Edited on June 29, 2015 at 7:25 PM. Reason : .] 6/29/2015 7:24:53 PM |
darkone (\/) (;,,,;) (\/) 11610 Posts |
I haven't tried to benchmark this for other file types and I don't know anything about the under-the-hood workings of the HDF libraries. Obviously, a bunch of small reads is going to be really slow because of overhead.
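A contrived illustration of that overhead, with a made-up file and dataset: one library call per element versus one call for everything.

fname = 'test.h5';   % hypothetical file with a 1e6-element dataset /x
n = 1e6;
slow = zeros(n, 1);
for i = 1:n
    slow(i) = h5read(fname, '/x', i, 1);  % start = i, count = 1
end
fast = h5read(fname, '/x');   % one call amortizes the overhead

Locally the loop is merely slow; over NFS each of those calls can also pay network latency.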
My workflow usually prohibits moving files between machines. I constructed my benchmark to read 10 files, but for actual tasks I'm reading tens of thousands of files with a combined weight of dozens of terabytes. Usually I execute code on the machines where the data lives, but I've been exploring alternatives for how we interact with our data servers. I'd like to drag my lab out of the 90s, technology-wise.
We have 5 CentOS servers with a combined 14 RAID 5 & 6 data volumes and 300+ TB total writable capacity. The servers all share the volumes with each other via NIS/NFS. Most of the users (10-15 max) have windows workstations and do their work directly on the data servers via SSH + X-Win (usually Matlab). I know this is an archaic setup, but I was trained as a scientist, not a sys admin. I suppose we really need a consultant but I'm pretty sure we can't afford one. 6/30/2015 11:16:54 AM |
clalias All American 1580 Posts |
Executing the code on the file server is as fast as you're going to get with the current process. I'd look into new ways of handling the data. Do you really need to read it all in for a single run -- tens of terabytes? Can you load a smaller chunk and then fork a process to do something with it while you read in the data for the next run, i.e. parallel reading/tasking?
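A rough sketch of that overlap using the Parallel Computing Toolbox; process_chunk and the file names are placeholders:

% Compute on the current chunk in a worker while reading the next file.
files = {'f1.h5', 'f2.h5', 'f3.h5'};           % placeholder names
chunk = h5read(files{1}, '/some_dataset');
results = cell(1, numel(files));
for k = 2:numel(files)
    f = parfeval(@process_chunk, 1, chunk);    % async compute on a worker
    chunk = h5read(files{k}, '/some_dataset'); % read next chunk meanwhile
    results{k-1} = fetchOutputs(f);            % wait for compute to finish
end
results{end} = process_chunk(chunk);           % final chunk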
Are you just looking for a subset of the data? Would a SQL server handle the data better, with queries returning just what you need to your local machine? Since you can't afford a consultant, I'm guessing you're not getting new hardware either. So, living within your paradigm, I think you can only optimize the way you use the data. 6/30/2015 12:38:29 PM |
darkone (\/) (;,,,;) (\/) 11610 Posts |
I'm aggregating the data to get at various spatial statistics, so I do need to read every file. However, the reading can be parallelized. I do this kind of analysis just infrequently enough that I haven't wanted to rework all my data structures to handle being split and sent to different nodes (cores). I let this chart be my guide for this sort of thing: https://xkcd.com/1205/
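If I ever do rework it, the simplest shape is probably a parfor over the file list with reduction variables for the statistics. A placeholder sketch (directory, dataset, and the statistic are all made up), assuming the Parallel Computing Toolbox:

% Parallel per-file reads feeding one aggregate statistic.
d = dir(fullfile('D:\data', '*.h5'));
total = 0; count = 0;
parfor k = 1:numel(d)
    x = h5read(fullfile('D:\data', d(k).name), '/some_dataset');
    total = total + sum(x(:));   % parfor reduction variables
    count = count + numel(x);
end
grand_mean = total / count;      % e.g. a domain-wide mean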
We don't often run into a situation where we need a database, but we are building one to catalog data from our Multi-Angle Snowflake Camera. Usually I design my output structures so that I can make subsets of the data post-aggregation, since I usually don't know what kind of subsets I want until I start wading through the data. 6/30/2015 1:14:15 PM |
darkone (\/) (;,,,;) (\/) 11610 Posts |
Fun fact: Not that this is surprising, but my benchmark runs almost twice as fast on a local SSD versus a local spinning disk drive. 7/22/2015 3:54:56 PM |
BigMan157 no u 103354 Posts |
i'm surprised it's only twice as fast 7/22/2015 4:08:04 PM |
darkone (\/) (;,,,;) (\/) 11610 Posts |
It's not a 100% I/O benchmark. 7/22/2015 10:48:27 PM |