TWW - Cpp question

clalias
All American
1580 Posts

Any help would be greatly appreciated <by the way I am new to c++ so I might need a clear explanation>.

I need to read in a file that has comma separated data. Specifically,

RDU,ILM,AA,1996,2,102
JAX,BWI,AE,1997,1,30
LAX,JFK,SW,1997,3,79

The file actually has over 6 million lines of data but I am testing just for this small example.

I need to read in the file and store the variables, then run tests on those. My problem is I can't read in the data.

Here is what I started so far.---
Quote :
"
#include <iostream>
#include <fstream>

using namespace std;

int main()
{

char orig[10][4]; // first variable in the line
char dest[10][4]; //2nd variable
char carr[10][4]; //3rd variable

int year; //4th variable
int qtr; //5th variable
int pass; //6th variable

int j,i; //dummy

ifstream in("Feed.txt", ios::in); // stream to Data file

// Check
if(!in) {
cout <<"can't open file\n";
return 1;
}

for(j=0;j<=2;j++) {
in.getline(orig[j], 20 ,',');
//in.getline(dest[j], 15);
}

for(i=0;i<=2;i++) {
cout << orig[i]<< "\n";
}

in.close();
return 0;

}"

[Edited on March 28, 2006 at 6:45 PM. Reason : .]

3/28/2006 6:42:45 PM

skokiaan
All American
26447 Posts

strtok

atoi

or exec("perl readFile.pl Feed.txt")
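
For reference, a minimal sketch of what the strtok/atoi suggestion could look like for one line of the file. The Flight struct and the parse_with_strtok name are made up for illustration, and the sketch assumes every line has exactly the six fields shown in the first post.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical record type for one line of Feed.txt.
struct Flight {
    std::string orig, dest, carr;
    int year = 0, qtr = 0, pass = 0;
};

// Splits one line such as "RDU,ILM,AA,1996,2,102" in place.
// strtok writes '\0' over each delimiter, so the buffer must be writable.
// Returns false if the line has fewer than six fields.
bool parse_with_strtok(char* line, Flight& out) {
    char* field[6];
    int n = 0;
    char* tok = std::strtok(line, ",\r\n");
    while (tok != nullptr && n < 6) {
        field[n++] = tok;
        tok = std::strtok(nullptr, ",\r\n");
    }
    if (n < 6) return false;
    out.orig = field[0];
    out.dest = field[1];
    out.carr = field[2];
    out.year = std::atoi(field[3]);  // atoi returns 0 on bad input
    out.qtr  = std::atoi(field[4]);
    out.pass = std::atoi(field[5]);
    return true;
}
```

Two things to keep in mind: strtok treats a run of commas as a single delimiter, so an empty field would be silently skipped, and atoi cannot tell "0" apart from text that is not a number.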

[Edited on March 28, 2006 at 7:35 PM. Reason : dfsd]

3/28/2006 7:33:39 PM

clalias
All American
1580 Posts

^thanks.

I think I'll try to take the data and read in every line as one string then use strtok.
Not too sure about the perl suggestion. Like I said I just started learning c++ two weeks ago.

I still think there should be a simpler/shorter way. Any other suggestions? I have MS VS 2005 so if anyone can think of any class etc.. that I have access to that would be useful.

3/28/2006 9:41:03 PM

Excoriator
Suspended
10214 Posts

why don't you just write a perl script

3/28/2006 10:59:48 PM

JaegerNCSU
Veteran
245 Posts

Quote :
"The file actually has over 6 million lines of data"

Quote :
"I think I'll try to take the data and read in every line as one string"

If you really have over 6 million lines in that file that approach is going to be slow as balls. You should read the entire file into a memory buffer with one disc read and then parse the data in memory.

3/28/2006 11:35:14 PM

clalias
All American
1580 Posts

^^I don't know perl.

^That is what I was trying to do from the get-go. But I couldn't figure out how to read in each line and pick out the variables that are comma delimited. Any suggestions on how to do this?

I am expecting this code to take a while. I'll let it run all night if it has to, and I have lots of memory. I don't need to do this frequently. Just once for now anyway. Unless I get updated or more complete data.

3/29/2006 12:01:32 AM

Excoriator
Suspended
10214 Posts

well you don't know c++ either so what's the difference.

perl:


open(INFILE, "< Feed.txt") or die("Can't open file: $!");

my @lines = <INFILE>;
close(INFILE);

foreach $line (@lines){
  chomp($line);
  @words = split(/,/, $line);
  foreach $word (@words){
    #PROCESS YOUR WORD HERE
  }
}

I'm guessing it will take about 5 minutes to parse, assuming you're not doing anything crazy with each word.

[Edited on March 29, 2006 at 12:14 AM. Reason : s]

3/29/2006 12:13:02 AM

clalias
All American
1580 Posts

haha nice,
I'll take a look at that.

-------

nevermind, I see: 5 min to parse.

[Edited on March 29, 2006 at 12:19 AM. Reason : misread]

[Edited on March 29, 2006 at 12:20 AM. Reason : asd]

3/29/2006 12:13:55 AM

Excoriator
Suspended
10214 Posts

text parsing is perl's forte

3/29/2006 12:15:57 AM

joe17669
All American
22728 Posts

I really need to learn Perl... I'm doing text processing on shitloads of data that's generated from my simulations... ive just been using matlab, but it's kinda slow. (its just what i know )

3/29/2006 12:27:11 AM

Excoriator
Suspended
10214 Posts

damn i hope i didn't just help a terrorist parse out flight schedules...

3/29/2006 12:29:49 AM

clalias
All American
1580 Posts

yep you figured it out. gg --well except the terrorist part

3/29/2006 12:33:38 AM

Excoriator
Suspended
10214 Posts

what processing are you trying to do - i'm betting you'll need to use a hash of arrays

3/29/2006 12:40:59 AM

LimpyNuts
All American
16860 Posts

Is this on a Windows-based computer? Microsoft's ActiveX Data Objects database drivers (ADODB) support comma-separated files. (It will do everything for you to make the file as searchable as an actual database, and it'll probably be faster than any code you write.)

here's a VB example. i googled and didn't see one in C

http://www.vb-helper.com/howto_ado_load_csv.html

You need to import the adodb dll:

#import "c:\Program Files\Common Files\System\ADO\msado##.dll"

## represents the ADO version. someone on here might have experience using ADO in C++. Look for an ADO example online and just use the text driver "Microsoft Text Driver (*.txt; *.csv)"

[Edited on March 29, 2006 at 12:56 AM. Reason : ]

3/29/2006 12:48:09 AM

clalias
All American
1580 Posts

I can compile VB. I'll take a look at that--thanks

yea that's something like what I was looking for. I need something I can just grab off the internet.
------------------------
I need to compare every line of data to every other line of data and test certain relations.

like

for(j=0;j<=sizedata;j++)

for(i=0;i<=sizedata;i++)

if( year(j)==year(i) && qtr(j)==qtr(i) && j!=i && carr(j)==carr(i) && ( dest(j)==dest(i) ||
dest(j)==orig(i) || orig(j)==dest(i) || orig(j)==orig(i) ))
{ sum++;
}

END

COUNT(j)=sum
sum=0

END

[Edited on March 29, 2006 at 12:55 AM. Reason : TOP part]

[Edited on March 29, 2006 at 12:56 AM. Reason : .]

[Edited on March 29, 2006 at 12:58 AM. Reason : .]

3/29/2006 12:53:07 AM

Excoriator
Suspended
10214 Posts

what info are you trying to get specifically?

3/29/2006 1:08:55 AM

clalias
All American
1580 Posts

I forgot to add a " && pass(i) !=0 " in the if()

^Kinda hard to explain. But, here goes. It calculates the number of "spokes" (this was what I called 'count' in the above code) in an airport-pair market for a given airline and a given year and quarter. A spoke is like the number of connection points to a given airport.

So if the year, quarter, and airline are the same, and the dest-origin of one market is connected to either dest-origin in another market, then that creates a spoke. But I need to reject the case where the # of passengers in a given market for a given quarter is 0.

Kinda helps to draw a graph of airports and lines connecting them.
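
The rule described above can be sketched in C++ roughly as follows. This is a direct, brute-force reading of the pseudocode earlier in the thread plus the pass(i) != 0 check; the Market struct and the function names are assumptions for illustration, not the code that was actually run.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record type: one airport-pair market for one carrier,
// year and quarter.
struct Market {
    std::string orig, dest, carr;
    int year, qtr, pass;
};

// True if the two markets have at least one airport in common.
bool share_airport(const Market& a, const Market& b) {
    return a.dest == b.dest || a.dest == b.orig ||
           a.orig == b.dest || a.orig == b.orig;
}

// For each market j, counts the other markets i that have the same
// carrier, year and quarter, nonzero passengers, and a shared airport.
std::vector<int> count_spokes(const std::vector<Market>& m) {
    std::vector<int> spokes(m.size(), 0);
    for (std::size_t j = 0; j < m.size(); ++j) {
        for (std::size_t i = 0; i < m.size(); ++i) {
            if (i != j && m[i].pass != 0 &&
                m[j].year == m[i].year && m[j].qtr == m[i].qtr &&
                m[j].carr == m[i].carr && share_airport(m[j], m[i])) {
                ++spokes[j];
            }
        }
    }
    return spokes;
}
```

The work grows with the square of the number of lines, since every line is compared against every other line.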

[Edited on March 29, 2006 at 1:22 AM. Reason : .]

[Edited on March 29, 2006 at 1:22 AM. Reason : .]

3/29/2006 1:21:57 AM

Excoriator
Suspended
10214 Posts

alright then, create a hash of arrays. the hash key will be the concatenation of year, airline, and quarter. The array accessed by the key will be the list of source/destination terminals

$hash{AA|1996|2} ....> (RDU|ILM, JAX|BWI, LAX|JFK)

the routes with zero passengers can be automatically culled as you build your hash structure with a simple if statement
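
Since the original question was about C++, here is a rough equivalent of the hash-of-arrays idea using std::map. It is only a sketch: the RouteTable typedef, make_key and add_route are hypothetical names, and the key layout (carrier|year|quarter) just follows the example above.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One airport pair, e.g. {"RDU", "ILM"}.
typedef std::pair<std::string, std::string> Route;
// Key is "carrier|year|quarter"; value is the list of routes for that key.
typedef std::map<std::string, std::vector<Route> > RouteTable;

// Builds the grouping key, e.g. "AA|1996|2".
std::string make_key(const std::string& carr, int year, int qtr) {
    return carr + "|" + std::to_string(year) + "|" + std::to_string(qtr);
}

// Adds one record to the table, culling routes with zero passengers.
void add_route(RouteTable& table, const std::string& orig,
               const std::string& dest, const std::string& carr,
               int year, int qtr, int pass) {
    if (pass == 0) return;
    table[make_key(carr, year, qtr)].push_back(Route(orig, dest));
}
```

Once the table is built, routes only need to be compared against other routes stored under the same key, rather than against every line in the file.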

I don't fully understand what you're doing yet, though - what do you mean by market?

[Edited on March 29, 2006 at 1:29 AM. Reason : s]

3/29/2006 1:28:12 AM

skokiaan
All American
26447 Posts

study algorithms

3/29/2006 1:33:34 AM

clalias
All American
1580 Posts

oh Ok,

E.g. the flights from RDU <--> IAD form a market / "airport-pair".

3/29/2006 1:33:57 AM

clalias
All American
1580 Posts

I think I just figured this out in FORTRAN. It's actually pretty easy. just a simple formatted read statement.

I wish that the C++ method would be as easy. There has to be a way to read up to a comma then store everything before that in one var and keep going... Oh well.
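
For what it's worth, standard C++ can do "read up to a comma, store it, keep going": std::getline takes an optional delimiter character. A minimal sketch, where split_on_commas is a made-up helper name and the input is assumed to be one line with no quoted fields:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Splits one line on commas using std::getline with ',' as the delimiter.
std::vector<std::string> split_on_commas(const std::string& line) {
    std::vector<std::string> fields;
    std::istringstream in(line);
    std::string field;
    while (std::getline(in, field, ',')) {
        fields.push_back(field);
    }
    return fields;
}
```

Unlike strtok, this keeps an empty field that sits between two commas. The numeric fields still come back as text, so they would need a conversion such as std::atoi(fields[3].c_str()).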

Thanks for your help everyone.

I'd still like to know if someone comes up with a simple method in C++.

3/29/2006 2:06:02 AM

dakota_man
All American
26584 Posts

fscanf

3/29/2006 9:02:28 AM

scud
All American
10804 Posts

^ eww I know one of my coworkers didn't just suggest a potentially very unsafe method

good thing we're not in the security business anymore

3/29/2006 9:25:18 AM

clalias
All American
1580 Posts

that wouldn't be unsafe for me, right? does that create some kind of vulnerability like buffer overflow?

Actually fscanf looks nice for my application.
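
On the safety question: the usual risk with the scanf family is an unbounded %s or %[ conversion writing past the end of a char array. Giving each text conversion a maximum width avoids that. A minimal sketch, assuming 3-letter airport codes and 2-letter carrier codes as in the sample data; Row and read_row are names invented for illustration.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical fixed-size record; each buffer has room for the code
// plus the terminating '\0'.
struct Row {
    char orig[4];
    char dest[4];
    char carr[3];
    int year, qtr, pass;
};

// Reads one "RDU,ILM,AA,1996,2,102" record from fp.
// %3[^,] stores at most 3 non-comma characters, so orig and dest cannot
// overflow; %2[^,] does the same for the carrier code. The leading space
// skips the newline left over from the previous record.
// Returns true only if all six fields were converted.
bool read_row(std::FILE* fp, Row& r) {
    int got = std::fscanf(fp, " %3[^,],%3[^,],%2[^,],%d,%d,%d",
                          r.orig, r.dest, r.carr,
                          &r.year, &r.qtr, &r.pass);
    return got == 6;
}
```

If a field is longer than expected, the next literal comma in the format fails to match and read_row returns false instead of overrunning the buffer.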

3/29/2006 10:09:51 AM

ZeroDegrez
All American
3897 Posts

To read in the entire file, like JaegerNCSU suggested do something like this (but for ascii)

http://www.cplusplus.com/doc/tutorial/files.html


// reading a complete binary file
#include <iostream>
#include <fstream>
using namespace std;

ifstream::pos_type size;
char * memblock;

int main () {
  ifstream file ("example.txt", ios::in|ios::binary|ios::ate);
  if (file.is_open())
  {
    size = file.tellg();
    memblock = new char [size];
    file.seekg (0, ios::beg);
    file.read (memblock, size);
    file.close();

    cout << "the complete file content is in memory";

    delete[] memblock;
  }
  else cout << "Unable to open file";
  return 0;
}

Then use sscanf to process the information in memory. sscanf only returns the number of items it converted, not how far it read, so add a '%n' to your format string. '%n' does not scan any value; instead it stores, through the pointer you pass for it, the number of characters read so far. You can use that count to advance into the memory block holding all your text.

Note:
'%n' would need to be the last item in your format string.

See here for more details:
http://man.he.net/man3/sscanf

I think that should do the trick.
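
A rough sketch of the sscanf/%n loop described above, under a few assumptions: the buffer is NUL-terminated (the file-reading example allocates exactly size bytes, so it would need one extra byte set to '\0'), airport codes are 3 letters and carrier codes 2 letters. Rec and parse_buffer are made-up names.

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

struct Rec {
    char orig[4], dest[4], carr[3];
    int year, qtr, pass;
};

// Walks a NUL-terminated buffer holding the whole file and parses every
// record. %n stores how many characters sscanf has consumed so far and
// does not count toward sscanf's return value, so a full record still
// returns 6.
std::vector<Rec> parse_buffer(const char* buf) {
    std::vector<Rec> rows;
    const char* p = buf;
    Rec r;
    int used = 0;
    while (std::sscanf(p, " %3[^,],%3[^,],%2[^,],%d,%d,%d%n",
                       r.orig, r.dest, r.carr,
                       &r.year, &r.qtr, &r.pass, &used) == 6) {
        rows.push_back(r);
        p += used;  // advance past the record just parsed
    }
    return rows;
}
```

One caveat, stated cautiously: some C library implementations measure the whole remaining string on every sscanf call, which can make this loop slow on a very large buffer. If that shows up, parsing line by line is a reasonable fallback.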

[Edited on March 29, 2006 at 8:32 PM. Reason : note:]

3/29/2006 8:30:16 PM

clalias
All American
1580 Posts

^got it! Thanks.

In case anyone is interested-------

It takes less than 1 min to load the entire data set, then I performed the nested loop on a small sample of 600,000 observations (1/10 total) and wrote out the result in under 30 min. Basically it's able to test 2000 lines of data in 7 sec. That means it is looping 2000 times x 600,000 innerloops in under 30 min.

(compared to Matlab testing about 1 line of data per second on the same machine -- though I didn't try mex'ing) I did try to compile matlab script to c using the matlab compiler and that only gave marginal improvement.

So I should be able to run the entire file in around 4-5 hrs.

The Fortran code I tried using was going to take 12 hrs, but I *was* on a different computer. I think it was using a lot of the pagefile not to mention the slower processor. school supplied me with VC++2005 and it's a good thing I have 2GB of system memory.

I did have to set the compiler options to allow a commit stack of 300,000,000 Bytes and a reserve size of 400,000,000. And declare everything as short if I could. Probably overkill but I was tired of the damn stack overflow error. VC++ default is only 1MB. I think gcc is around 5MB.

[Edited on March 30, 2006 at 1:15 AM. Reason : .]

[Edited on March 30, 2006 at 1:18 AM. Reason : .]

[Edited on March 30, 2006 at 1:23 AM. Reason : .]

3/30/2006 1:15:18 AM

skokiaan
All American
26447 Posts

why would there be a stack error on non recursive code?

3/30/2006 1:22:25 AM

clalias
All American
1580 Posts

The arrays were too large.
Quote :
" const int NUMOBS=643500; // 6435000

char orig[NUMOBS][4]; //1st variable in list
char dest[NUMOBS][4]; //2nd
char carr[NUMOBS][3]; //3rd

int year[NUMOBS]; //4th
int qtr[NUMOBS]; //5th
int pass[NUMOBS]; //6th

int network[NUMOBS]; // The output. This is the number of "spokes""

http://www.devx.com/tips/Tip/14276

[Edited on March 30, 2006 at 1:27 AM. Reason : link]

3/30/2006 1:25:23 AM

skokiaan
All American
26447 Posts

umm, dynamic memory allocation? welcome to the 80s

3/30/2006 1:32:28 AM

ZeroDegrez
All American
3897 Posts

Quote :
"why would there be a stack error on non recursive code?"

hurray for java...

Arrays created like this...

char orig[NUMOBS];

are created on the stack. if he had created the arrays using 'new', it would have put them on the heap.

....but yeah seriously, 80's, create those arrays on the heap.
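
One way to get those arrays onto the heap without calling new and delete by hand is std::vector. This is a sketch only; the Columns struct is a hypothetical wrapper that mirrors the array declarations quoted above, and it assumes a compiler with std::array (C++11 or later).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Column-per-field storage mirroring the arrays in the quoted post.
// std::vector keeps its elements on the heap, so only a small fixed-size
// handle per column lives on the stack, whatever n is.
struct Columns {
    std::vector<std::array<char, 4> > orig, dest;
    std::vector<std::array<char, 3> > carr;
    std::vector<int> year, qtr, pass, network;

    // Allocates n zero-initialized entries in every column.
    explicit Columns(std::size_t n)
        : orig(n), dest(n), carr(n),
          year(n), qtr(n), pass(n), network(n) {}
};
```

Indexing stays the same as with the plain arrays (c.year[j], c.orig[j][0], and so on), and the memory is released automatically when the Columns object goes out of scope, so no stack-size compiler options should be needed.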

[Edited on March 30, 2006 at 1:36 AM. Reason : argh array not stack...brain died]

3/30/2006 1:33:08 AM

clalias
All American
1580 Posts

^explain please.

^^haha. ~~But, how would dynamic memory allocation have helped?~~
nevermind... duh...
[Edited on March 30, 2006 at 1:37 AM. Reason : .]

[Edited on March 30, 2006 at 1:38 AM. Reason : .]

3/30/2006 1:36:13 AM

skokiaan
All American
26447 Posts

^the way you did it, the array is stored on the stack. if you use dynamic memory allocation, it's stored on the heap

http://c.ittoolbox.com/documents/popular-q-and-a/stack-vs-heap-2112

[Edited on March 30, 2006 at 1:37 AM. Reason : I guess i should be happy about job security]

[Edited on March 30, 2006 at 1:38 AM. Reason : then again, c++ sucks. 50% c++, 50% you]

3/30/2006 1:37:28 AM

clalias
All American
1580 Posts

" I guess i should be happy about job security"

I mean, I guess. If you like to compare yourself with someone who learned FORTRAN 5 years ago and hasn't programmed since. Then said person buys a c++ book 2 weeks ago and spent a total of 5 hours or so reading it.

This is just some shit I had to learn to do my research. I don't care if it's "perfect" If it gets the fucking job done and I get on with my life.

[Edited on March 30, 2006 at 1:42 AM. Reason : .]

but this is good to know
Quote :
"Also, there is no save way to protect the stack space from being overwritten (other than adjusting the ESP register yourself).

In conclusion, unless you're coding in assembly, don't use the stack for general data."

[Edited on March 30, 2006 at 1:45 AM. Reason : .]

3/30/2006 1:41:09 AM

ZeroDegrez
All American
3897 Posts

So in the example I gave you about how to read in the whole file at once.


char * memblock;
memblock = new char [size];

Assuming size is defined and assigned elsewhere. What this does is create the memblock pointer, which is a single address-sized value (32 bits on a 32-bit build) holding the memory address where the data starts (in the heap).

Using the 'new' keyword will allocate a block of memory in the heap (think giant jello cube). So it carves out a piece of the jello block, and there is where your array exists.

Then when you index the pointer, ie: memblock[100]; what that does is say start at the address memblock points to, and move 100 blocks into memory, each block being the size of the type of item in the array. In our case a single byte character. so 100 bytes into the string of characters.

So indexing works the same way....same code. It's just that instead of everything having to fit on the stack, which has a fixed size limit (1MB by default in VC++), it lives on the heap, which can use whatever memory the system can give you.

If we were using C, you would use malloc/calloc/realloc.

[Edited on March 30, 2006 at 1:54 AM. Reason : s]

3/30/2006 1:49:27 AM

clalias
All American
1580 Posts

yeah, I admit I only used bits and pieces of the stuff you linked to. But I got it working. The thing is, I don't have a lot of time to spend on it-- full time grad student and an RA-- you get the idea.

But thanks, I definitely need to look at what you're telling me.

3/30/2006 1:56:34 AM

ZeroDegrez
All American
3897 Posts

no problem. If you change over to dynamic allocation, I would be willing to bet you would see performance improvements though. but it's up to you.

3/30/2006 2:05:25 AM

dakota_man
All American
26584 Posts

i would like for scud to explain why fscanf is particularly insecure

but he won't because he's on vacation

oh well...

can't you also fin >> var1 >> "," >> var2 >> "," >> var3 >> etc?
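
Not quite as written: >> cannot extract into a string literal like ",", and >> into a std::string reads up to the next whitespace, so the first text field would swallow the whole line. A char variable can absorb the commas between the numeric fields, though. A minimal sketch, with Entry and parse_line as hypothetical names:

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

struct Entry {
    std::string orig, dest, carr;
    int year = 0, qtr = 0, pass = 0;
};

// Reads one record. The text fields use getline with ',' as the
// delimiter; the numeric fields use >>, with a char absorbing each comma.
std::istream& operator>>(std::istream& in, Entry& e) {
    char comma = 0;
    in >> std::ws;  // skip the newline left by the previous record
    std::getline(in, e.orig, ',');
    std::getline(in, e.dest, ',');
    std::getline(in, e.carr, ',');
    in >> e.year >> comma >> e.qtr >> comma >> e.pass;
    return in;
}

// Convenience wrapper for a single line held in a string.
bool parse_line(const std::string& line, Entry& e) {
    std::istringstream in(line);
    return !(in >> e).fail();
}
```

With an ifstream named fin, a loop such as while (fin >> e) { ... } would then read the file one record at a time. The sketch does not check that the absorbed characters really are commas.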

3/30/2006 10:11:07 PM
