The Artima Developer Community

Java Answers Forum
Processing large files suggestions

10 replies on 1 page. Most recent reply: Jun 24, 2004 8:53 PM by Gerardo Horvilleur

Emilio Gagliardi

Posts: 5
Nickname: emilio
Registered: Jun, 2004

Processing large files suggestions Posted: Jun 2, 2004 1:02 PM
Greetings Java Gurus!
My first post and hopefully an easy one!
I'm writing a small program to do linear algebra on a large dataset. The file is about 6 GB of integers that will later be placed into matrices to do the math on. As the subject suggests, I am seeking suggestions about how to tackle processing a large dataset that cannot fit in memory. I can wrap my brain around one object incrementally reading the file, processing the values, and writing out results, but I'd like to hear and learn about sending 1,000 objects to do the task. Mostly I'd like to know how to reduce the time required. As it stands, I ran some preliminary tests and it looks like it might take 2 days just to make the file (because I'm crunching 150,000,000,000 shorts down into 2,500,000,000 ints) with one object running the code. One catchword I've heard is *threads*, and I'm wondering if threads are what would allow me to do this. I would really appreciate some solid reading material and examples if possible, but beggars can't be choosers.

Thanks for your time,
Emilio


Adam Duffy

Posts: 168
Nickname: adamduffy
Registered: Feb, 2003

Re: Processing large files suggestions Posted: Jun 2, 2004 2:36 PM
> As the subject suggests I am seeking suggestions
> about how to tackle processing a large dataset that cannot
> fit in memory.

This would indicate that the first thing you should do is ensure that you are using buffering.

Check out this link for an explanation on Java IO Streams and, in particular, buffered streams.

http://java.sun.com/developer/technicalArticles/Streams/ProgIOStreams/
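For example, a buffered read loop looks roughly like this (just a sketch, not tested; the file name "data.bin" and the 256 KB buffer size are only placeholders):

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;

public class BufferedReadSketch {
    public static void main(String[] args) throws IOException {
        // Wrap the raw file stream in a BufferedInputStream so each readInt()
        // is served from an in-memory buffer instead of hitting the disk.
        DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream("data.bin"), 256 * 1024));
        long sum = 0;
        try {
            while (true) {
                sum += in.readInt();   // readInt() throws EOFException at end of file
            }
        } catch (EOFException eof) {
            // reached the end of the file
        } finally {
            in.close();
        }
        System.out.println("sum = " + sum);
    }
}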

After that, you should probably check out Knuth's The Art of Computer Programming (possibly Volume 2) for a starting point in finding an efficient algorithm for the linear algebra that you're performing.

The use of threads would not necessarily reduce the time required.

Adam

Emilio Gagliardi

Posts: 5
Nickname: emilio
Registered: Jun, 2004

Re: Processing large files suggestions Posted: Jun 2, 2004 2:49 PM
Hi Adam, thank you for your suggestions! I'm actually using FileChannels and ByteBuffers to read and write these files as that seemed an efficient choice.
What I'm looking for is information on how to design my objects so that I can process many segments of the file at once and bring all the pieces back together at the end. As it stands, I read a range of integers, do some math on them, and then write the results to file. But I only have one object doing this, so it will take quite a while to run through that many integers linearly. What I'm looking for is a simple strategy that lets me say "take this file and break it into 500 regions, hand one region to each object, have them all process on their own, and somehow bring their work back together". Does that make sense?
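For reference, the read loop I have now looks roughly like this (simplified, with the math and the writing left out, and "matrix.dat" is just a stand-in for my real file name):

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class ChannelReadSketch {
    public static void main(String[] args) throws IOException {
        FileChannel channel = new FileInputStream("matrix.dat").getChannel();
        // One row of integer data in my file is 195392 bytes.
        ByteBuffer buf = ByteBuffer.allocateDirect(195392);
        long total = 0;
        while (channel.read(buf) != -1) {
            buf.flip();                    // switch the buffer from filling to draining
            while (buf.remaining() >= 4) {
                total += buf.getInt();     // the real math would happen here
            }
            buf.compact();                 // keep any partial int for the next read
        }
        channel.close();
        System.out.println("total = " + total);
    }
}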

Cheers,
Emilio

Adam Duffy

Posts: 168
Nickname: adamduffy
Registered: Feb, 2003

Re: Processing large files suggestions Posted: Jun 3, 2004 3:01 AM
Hello Emilio,

Will each of these objects run
a) on the same computer, or
b) on different computers?

If (a), then I don't know if breaking up the task will be of much help, since there are bottlenecks in a computer such as file I/O and (more importantly for computationally intensive applications, which appears to be the case here) CPU access.

A typical example is a supermarket. Assume only one cashier is working. Which would be faster: one person (thread) buying 10 items (regions), or 10 people buying one item each?

If (b) then it's a whole different kettle of fish called Distributed Computing.

Either way, I don't know if threads are going to be of much help. They tend to be most useful when performing different operations, e.g. one person at the cashier and one person looking for a shopping trolley that doesn't have a mind of its own....

Any help to you?

Adam

Emilio Gagliardi

Posts: 5
Nickname: emilio
Registered: Jun, 2004

Re: Processing large files suggestions Posted: Jun 3, 2004 1:47 PM
Hi Adam,
this program will only run on one machine at the moment, so I guess not breaking it up will be better. I appreciated your analogy! At first I was like, "well, it depends how efficient the cashier is", but then I thought about the overhead of taking the money, producing change, and handing the receipt over. With one customer all the logistical stuff happens only once, but with 10 customers it happens 10 times...

As well, I'm not performing different operations, so it looks like I should just leave it to one object and make sure I code it as leanly as possible. :)

Do you know of any example programs or tutorials kicking around about linear algebra and Java? I've found lots of libraries, but they don't cover implementation, as that is situation-specific. Be that as it may, I'd still like to look at *one* implementation and modify it as needed...

Cheers,
Emilio

Adam Duffy

Posts: 168
Nickname: adamduffy
Registered: Feb, 2003

Re: Processing large files suggestions Posted: Jun 6, 2004 12:06 PM
Hi Emilio,

The phrase "linear algebra" covers a lot of territory so I'm afraid I can't offer much in the way of advice without knowing the specifics.

Many people around the world, myself included, are interested in numerical computing. Chances are that someone has already solved, and possibly implemented a solution to, a problem similar to yours. Google for your specific problem. You may get lucky.

I'm not aware of a standard or favoured Java library for linear algebra. However, that is not to say that there isn't one.

Adam

Matt Gerrans

Posts: 1153
Nickname: matt
Registered: Feb, 2002

Re: Processing large files suggestions Posted: Jun 6, 2004 4:58 PM
> If (b) then it's a whole different kettle of fish called Distributed Computing.

Also "grid computing." (I'm not sure what the distinction between the two is).

> Google for your specific problem.

This looks like a promising list: http://math.nist.gov/javanumerics/

Back on the main question, you may additionally want to do a little profiling or quasi-profiling to determine whether most of your time is spent doing calculations or doing I/O. At the very least, look at a CPU and I/O meter while you run what you have.

Buffer size is important. I wrote a program for calculating CRCs and tuned it for very large files. It turns out that when it comes to buffers, bigger is not necessarily better. A relatively small buffer of about 100K bytes was close to the sweet spot (on current high-end hardware). When you think about it, it makes sense: if your buffer is too big, then the CPU has to sit idle for long periods doing almost nothing while the buffer is being filled. (By the way, this lesson also applies to CPU cache: advertising a huge L2 cache sounds impressive, but it may not affect performance much.)

In fact, you could get improved performance by having calculation threads separated from I/O threads, but I think the performance improvement on a single-CPU machine would be pretty small (a few percent) in comparison with the amount of additional work and complexity. Getting the buffer size right is easier and the payoff can be quite a bit more substantial (orders of magnitude).

The jrMan ray-tracing project also processes huge files, and the author told me that he got amazing performance (even better than the C++ program, RenderMan, upon which it is based) by using memory-mapped I/O. I've only used memory-mapped I/O on Windows with the Windows API, so I don't know what Java interface he is using, but if you want to pursue that, you can have a look at the project link: http://sourceforge.net/projects/jrman/
(Also, I don't know whether that technique or something similar is available on non-Windows platforms.)
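In Java, I would guess the memory-mapped route goes through FileChannel.map(). Roughly something like this (untested; the file name is made up, and since a single MappedByteBuffer is limited to 2 GB, a 6 GB file has to be mapped a window at a time):

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedReadSketch {
    public static void main(String[] args) throws IOException {
        FileChannel channel = new FileInputStream("data.bin").getChannel();
        long fileSize = channel.size();
        long window = 64L * 1024 * 1024;   // map 64 MB at a time (a multiple of 4,
                                           // so ints never straddle a window boundary)
        long sum = 0;
        for (long pos = 0; pos < fileSize; pos += window) {
            long len = Math.min(window, fileSize - pos);
            // The OS pages the data in on demand; there are no explicit read() calls.
            MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
            while (map.remaining() >= 4) {
                sum += map.getInt();
            }
        }
        channel.close();
        System.out.println("sum = " + sum);
    }
}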

Joe Cheng

Posts: 65
Nickname: jcheng
Registered: Oct, 2002

Re: Processing large files suggestions Posted: Jun 8, 2004 9:00 AM
It might be worth using two threads if one thread does the loading (assuming the loading is completely I/O-bound) and another does the processing (assuming the processing is completely CPU-bound), or actually, one processing thread for each physical processor. If the results can be written to a separate physical disk, you might then also benefit from a separate thread for writing the results.

This will net you at *best* a 3X performance improvement on a single-processor machine. You have to decide whether that improvement is worth learning a whole new programming paradigm for. (Personally I think it is worth learning about, especially if you already intuitively thought of workers doing their thing and then joining back together.)
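In very rough outline, the reader/processor split might look something like this (it assumes the java.util.concurrent classes that ship with Java 5, or Doug Lea's earlier util.concurrent library with minor changes; the chunk size and loop count are made up, and the file reading and the math are only stubbed in):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineSketch {
    // A marker object telling the processor there is no more data.
    private static final int[] POISON = new int[0];

    public static void main(String[] args) throws InterruptedException {
        final BlockingQueue<int[]> queue = new ArrayBlockingQueue<int[]>(4);

        Thread reader = new Thread(new Runnable() {
            public void run() {
                try {
                    for (int chunk = 0; chunk < 100; chunk++) {
                        int[] data = new int[48848];   // stand-in for a row read from the file
                        // ... fill 'data' from the file here ...
                        queue.put(data);               // blocks if the processor falls behind
                    }
                    queue.put(POISON);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        Thread processor = new Thread(new Runnable() {
            public void run() {
                try {
                    long sum = 0;
                    while (true) {
                        int[] data = queue.take();     // blocks until a chunk is ready
                        if (data == POISON) break;
                        for (int i = 0; i < data.length; i++) {
                            sum += data[i];            // stand-in for the real calculation
                        }
                    }
                    System.out.println("sum = " + sum);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        reader.start();
        processor.start();
        reader.join();
        processor.join();
    }
}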

If you are interested in learning about threads, you really need to pick up a good book: it is very dangerous to try to figure out how to write these kinds of programs on an ad-hoc basis, because of the subtlety of the kinds of bugs that can and will pop up (i.e., the kind that never get uncovered in testing and only show up in production). I highly recommend Doug Lea's Concurrent Programming in Java: Design Principles and Patterns; it will give you a good solid grounding if you read it carefully.

Emilio Gagliardi

Posts: 5
Nickname: emilio
Registered: Jun, 2004

Re: Processing large files suggestions Posted: Jun 8, 2004 10:17 AM
Thank you all very much for your input!
I think threads are a little out of my league at this point.
As far as buffer size goes maybe you can help me think this through?
Currently I have it set up so that I create a buffer just big enough to hold one row of data; for integers that's 195,392 bytes, and for doubles 390,784 bytes. My thinking was that there is no end-of-line marker since I'm writing bytes directly to the file, so I use my buffer size to grab one unit of data. So now, according to you, this is probably too big. Are you suggesting that I, for example, make the buffer half as big and simply read twice, and that this could improve performance?

I did as you suggested and it looks like the calculations take 20 minutes versus 4 minutes for I/O (it took 24 minutes to write 10 GB of doubles generated in a simple for loop based on one calculation).

Do you have any suggestions about how to play around with buffer size to find the sweet spot?
Cheers,
Emilio

Matt Gerrans

Posts: 1153
Nickname: matt
Registered: Feb, 2002

Re: Processing large files suggestions Posted: Jun 8, 2004 2:49 PM
Hmm. Based on your comments, you need a certain minimum-size chunk of data to begin calculations. If that chunk is around 200K to 400K bytes, I would just stick with that, since going smaller just complicates matters. My experience was that the performance difference between buffers of 50K and 500K was relatively small, so there isn't too much value in really super-fine tuning. You just have to get near the sweet spot, not right on it. When you try a really small buffer (like 1 byte) or a really big one (like 100 MB or 1 GB), then you see the big performance drop-off.

I just wanted to make it clear that reading into a huge buffer is probably not such a hot idea. At first glance, if you have, say, 1 GB of RAM, you might think that the more you read into RAM, the faster the calculations will go. However, it usually turns out that reading in something under 1 MB results in better performance.

Once again, I think you are fine with just using your "minimum calculable chunk" size.
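If you do want to experiment, a quick-and-dirty way to look for the sweet spot is to time a full pass over the file with a few different buffer sizes, something like this (untested; the sizes are arbitrary, and keep in mind the OS file cache can make the second and later passes look faster than they really are):

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class BufferSizeTest {
    public static void main(String[] args) throws IOException {
        int[] sizes = { 16 * 1024, 64 * 1024, 256 * 1024, 1024 * 1024, 16 * 1024 * 1024 };
        for (int i = 0; i < sizes.length; i++) {
            long start = System.currentTimeMillis();
            long bytes = readWholeFile(args[0], sizes[i]);
            long elapsed = System.currentTimeMillis() - start;
            System.out.println((sizes[i] / 1024) + " KB buffer: "
                    + elapsed + " ms for " + bytes + " bytes");
        }
    }

    // Read the whole file with the given buffer size, discarding the data,
    // so only the raw read speed is being measured.
    private static long readWholeFile(String name, int bufferSize) throws IOException {
        FileChannel channel = new FileInputStream(name).getChannel();
        ByteBuffer buf = ByteBuffer.allocateDirect(bufferSize);
        long total = 0;
        int n;
        while ((n = channel.read(buf)) != -1) {
            total += n;
            buf.clear();
        }
        channel.close();
        return total;
    }
}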

Gerardo Horvilleur

Posts: 1
Nickname: mago
Registered: Jun, 2004

Re: Processing large files suggestions Posted: Jun 24, 2004 8:53 PM
Some clarifications:

Actually, jrMan is not a raytracer; it uses the REYES algorithm (Renders Everything You Ever Saw).

I don't think jrMan is faster than Pixar's RenderMan (which is written in C, not C++, as far as I know).

What you saw was jrMan rendering images 2 to 4 times faster than Aqsis, another open source REYES based renderer written in C++.

The memory mapping in jrMan is used to access the texture files, which can be very large. This has given us quite good I/O performance, plus the OS does the texture caching for us.
