clalias All American 1580 Posts
Any help would be greatly appreciated <by the way I am new to c++ so I might need a clear explanation>.
I need to read in a file that has comma separated data. Specifically,
RDU,ILM,AA,1996,2,102
JAX,BWI,AE,1997,1,30
LAX,JFK,SW,1997,3,79
The file actually has over 6 million lines of data but I am testing just for this small example.
I need to read in the file and store the variables, then run tests on those. My problem is I can't read in the data.
Here is what I have so far.---
Quote : | " #include <iostream>
#include <fstream>
using namespace std;

int main() {
    char orig[10][4]; // first variable in the line
    char dest[10][4]; // 2nd variable
    char carr[10][4]; // 3rd variable
    int year; // 4th variable
    int qtr; // 5th variable
    int pass; // 6th variable
    int j,i; // dummy

    ifstream in("Feed.txt", ios::in); // stream to data file
    // Check
    if(!in) { cout << "can't open file\n"; return 1; }

    for(j=0;j<=2;j++) {
        in.getline(orig[j], 4, ','); // limit matches the 4-byte buffer
        //in.getline(dest[j], 15);
    }
    for(i=0;i<=2;i++) { cout << orig[i] << "\n"; }
    in.close();
    return 0;
}" |
[Edited on March 28, 2006 at 6:45 PM. Reason : .]3/28/2006 6:42:45 PM |
skokiaan All American 26447 Posts
strtok
atoi
or exec("perl readFile.pl Feed.txt")
[Edited on March 28, 2006 at 7:35 PM. Reason : dfsd] 3/28/2006 7:33:39 PM |
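For anyone following along, here is a minimal sketch of the strtok/atoi route suggested above. The `Flight` struct, `copy_field`, and `parse_line` are hypothetical names made up for this illustration, and the field sizes assume the 3-letter airport and 2-letter carrier codes in the sample data.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical record type; sizes assume 3-letter airports and 2-letter carriers.
struct Flight {
    char orig[4];
    char dest[4];
    char carr[3];
    int year;
    int qtr;
    int pass;
};

// Bounded copy so an over-long field cannot overflow the destination array.
static void copy_field(char* dst, std::size_t cap, const char* src) {
    std::strncpy(dst, src, cap - 1);
    dst[cap - 1] = '\0';
}

// Splits one comma-separated line in place with strtok and converts the
// numeric fields with atoi. Returns false if fewer than six fields are found.
// strtok writes '\0' into the buffer, so `line` must be writable.
bool parse_line(char* line, Flight& out) {
    char* fields[6];
    int n = 0;
    for (char* tok = std::strtok(line, ",\r\n"); tok != nullptr && n < 6;
         tok = std::strtok(nullptr, ",\r\n")) {
        fields[n++] = tok;
    }
    if (n < 6) return false;

    copy_field(out.orig, sizeof(out.orig), fields[0]);
    copy_field(out.dest, sizeof(out.dest), fields[1]);
    copy_field(out.carr, sizeof(out.carr), fields[2]);
    out.year = std::atoi(fields[3]);
    out.qtr = std::atoi(fields[4]);
    out.pass = std::atoi(fields[5]);
    return true;
}
```

Note that atoi returns 0 for text it cannot convert, so a bad numeric field is silently read as zero in this sketch.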
clalias All American 1580 Posts
^thanks.
I think I'll try to take the data and read in every line as one string then use strtok. Not too sure about the perl suggestion. Like I said I just started learning c++ two weeks ago.
I still think there should be a simpler/shorter way. Any other suggestions? I have MS VS 2005 so if anyone can think of any class etc.. that I have access to that would be useful. 3/28/2006 9:41:03 PM |
Excoriator Suspended 10214 Posts
why don't you just write a perl script 3/28/2006 10:59:48 PM |
JaegerNCSU Veteran 245 Posts
Quote : | "The file actually has over 6 million lines of data" |
Quote : | "I think I'll try to take the data and read in every line as one string" |
If you really have over 6 million lines in that file that approach is going to be slow as balls. You should read the entire file into a memory buffer with one disc read and then parse the data in memory.3/28/2006 11:35:14 PM |
clalias All American 1580 Posts
^^I don't know perl.
^That is what I was trying to do from the get-go. But I couldn't figure out how to read in each line and pick out the variables that are comma delimited. Any suggestions on how to do this?
I am expecting this code to take a while. I'll let it run all night if it has to, and I have lots of memory. I don't need to do this frequently. Just once for now anyway. Unless I get updated or more complete data. 3/29/2006 12:01:32 AM |
Excoriator Suspended 10214 Posts
well you don't know c++ either so what's the difference.
perl:
open(INFILE, "< Feed.txt") or die("Can't open file: $!");
my @lines = <INFILE>;
close(INFILE);
foreach my $line (@lines) {
    chomp($line);
    my @words = split(/,/, $line);  # split takes the pattern first, then the string
    foreach my $word (@words) {
        # PROCESS YOUR WORD HERE
    }
}
I'm guessing it will take about 5 minutes to parse, assuming you're not doing anything crazy with each word.
[Edited on March 29, 2006 at 12:14 AM. Reason : s]3/29/2006 12:13:02 AM |
clalias All American 1580 Posts
haha nice, I'll take a look at that.
-------
nevermind I see: 5 min to parse.
[Edited on March 29, 2006 at 12:19 AM. Reason : misread]
[Edited on March 29, 2006 at 12:20 AM. Reason : asd] 3/29/2006 12:13:55 AM |
Excoriator Suspended 10214 Posts
text parsing is perl's forte 3/29/2006 12:15:57 AM |
joe17669 All American 22728 Posts
I really need to learn Perl... I'm doing text processing on shitloads of data that's generated from my simulations... ive just been using matlab, but it's kinda slow. (its just what i know ) 3/29/2006 12:27:11 AM |
Excoriator Suspended 10214 Posts
damn i hope i didn't just help a terrorist parse out flight schedules... 3/29/2006 12:29:49 AM |
clalias All American 1580 Posts
yep you figured it out. gg --well except the terrorist part 3/29/2006 12:33:38 AM |
Excoriator Suspended 10214 Posts
what processing are you trying to do - i'm betting you'll need to use a hash of arrays 3/29/2006 12:40:59 AM |
LimpyNuts All American 16859 Posts
Is this on a Windows-based computer? Microsoft's ActiveX Data Objects database drivers (ADODB) support comma-separated files. (it will do everything for you to make it as searchable as an actual database and it'll probably be faster than any code you write)
here's a VB example. i googled and didn't see one in C
http://www.vb-helper.com/howto_ado_load_csv.html
You need to import the adodb dll:
#import "c:\Program Files\Common Files\System\ADO\msado##.dll"
## represents the ADO version. someone on here might have experience using ADO in C++. Look for an ADO example online and just use the text driver "Microsoft Text Driver (*.txt; *.csv)"
[Edited on March 29, 2006 at 12:56 AM. Reason : ]3/29/2006 12:48:09 AM |
clalias All American 1580 Posts
I can compile VB. I'll take a look at that--thanks
yea that's something like what I was looking for. I need something I can just grab off the internet.
------------------------
I need to compare every line of data to every other line of data and test certain relations.
like
for(j=0;j<=sizedata;j++)
  for(i=0;i<=sizedata;i++)
    if( year(j)==year(i) && qtr(j)==qtr(i) && j!=i && carr(j)==carr(i) && ( dest(j)==dest(i) || dest(j)==orig(i) || orig(j)==dest(i) || orig(j)==orig(i) )) { sum++; }
  END
  COUNT(j)=sum
  sum=0
END
[Edited on March 29, 2006 at 12:55 AM. Reason : TOP part]
[Edited on March 29, 2006 at 12:56 AM. Reason : .]
[Edited on March 29, 2006 at 12:58 AM. Reason : .] 3/29/2006 12:53:07 AM |
Excoriator Suspended 10214 Posts
what info are you trying to get specifically? 3/29/2006 1:08:55 AM |
clalias All American 1580 Posts
I forgot to add a " && pass(i) !=0 " in the if()
^Kinda hard to explain. But, here goes. It calculates the number of "spokes" (this was what I called 'count' in the above code) in an airport-pair market for a given airline and a given year and quarter. A spoke is like the number of connection points to a given airport.
So if the year, quarter, and the airline are the same, then if the dest-origin of one market is connected to either dest-origin in another market, that creates a spoke. But I need to reject the case where the # of passengers in a given market for a given quarter is 0.
Kinda helps to draw a graph of airports and lines connecting them.
[Edited on March 29, 2006 at 1:22 AM. Reason : .]
[Edited on March 29, 2006 at 1:22 AM. Reason : .] 3/29/2006 1:21:57 AM |
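The spoke count described above can be sketched in C++ as a direct translation of the nested loop, including the `pass(i) != 0` check. `Market`, `shares_airport`, and `count_spokes` are hypothetical names for this illustration, and the code assumes the whole data set already sits in a vector.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record type for one line of the data file.
struct Market {
    std::string orig, dest, carr;
    int year, qtr, pass;
};

// True when the two markets have at least one airport in common.
bool shares_airport(const Market& a, const Market& b) {
    return a.dest == b.dest || a.dest == b.orig ||
           a.orig == b.dest || a.orig == b.orig;
}

// For every market j, counts the other markets i with the same year, quarter
// and carrier, a non-zero passenger count, and a shared airport.
std::vector<int> count_spokes(const std::vector<Market>& m) {
    std::vector<int> count(m.size(), 0);
    for (std::size_t j = 0; j < m.size(); ++j) {
        for (std::size_t i = 0; i < m.size(); ++i) {
            if (i != j && m[i].pass != 0 &&
                m[j].year == m[i].year && m[j].qtr == m[i].qtr &&
                m[j].carr == m[i].carr && shares_airport(m[j], m[i])) {
                ++count[j];
            }
        }
    }
    return count;
}
```

This compares every row with every other row, so the work grows with the square of the number of rows.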
Excoriator Suspended 10214 Posts
alright then, create a hash of arrays. the hash key will be the concatenation of year, airline, and quarter. The array accessed by the key will be the list of source/destination terminals
$hash{AA|1996|2} ....> (RDU|ILM, JAX|BWI, LAX|JFK)
the routes with zero passengers can be automatically culled as you build your hash structure with a simple if statement
I don't fully understand what you're doing yet, though - what do you mean by market?
[Edited on March 29, 2006 at 1:29 AM. Reason : s] 3/29/2006 1:28:12 AM |
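The hash-of-arrays idea is described in Perl terms, but the same structure can be sketched in C++ with std::unordered_map. This is only an illustration under assumptions: `Market`, `Route`, `RouteTable`, and `group_routes` are made-up names, and the key uses the carrier|year|quarter order from the `$hash{AA|1996|2}` example.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical record type for one line of the data file.
struct Market {
    std::string orig, dest, carr;
    int year, qtr, pass;
};

using Route = std::pair<std::string, std::string>;  // (origin, destination)
using RouteTable = std::unordered_map<std::string, std::vector<Route>>;

// Groups routes under a "carrier|year|quarter" key. Routes with zero
// passengers are dropped while the table is being built.
RouteTable group_routes(const std::vector<Market>& markets) {
    RouteTable table;
    for (const Market& m : markets) {
        if (m.pass == 0) continue;  // cull empty markets
        const std::string key =
            m.carr + "|" + std::to_string(m.year) + "|" + std::to_string(m.qtr);
        table[key].emplace_back(m.orig, m.dest);
    }
    return table;
}
```

With the routes grouped this way, the pairwise comparison only has to run inside each group rather than across the whole file.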
skokiaan All American 26447 Posts
study algorithms 3/29/2006 1:33:34 AM |
clalias All American 1580 Posts
oh Ok,
E.g. the flights from RDU <--> IAD form a market / "airport-pair". 3/29/2006 1:33:57 AM |
clalias All American 1580 Posts
I think I just figured this out in FORTRAN. It's actually pretty easy. just a simple formatted read statement.
I wish that the C++ method would be as easy. There has to be a way to read up to a comma then store everything before that in one var and keep going... Oh well.
Thanks for your help everyone.
I'd still like to know if someone comes up with a simple method in C++. 3/29/2006 2:06:02 AM |
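One standard-library way to "read up to a comma, store it, and keep going" is std::getline with a delimiter argument. The sketch below assumes the six-field layout from the first post; `Flight` and `read_flight` are hypothetical names, and std::stoi needs a C++11 compiler.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical record type for one line of the data file.
struct Flight {
    std::string orig, dest, carr;
    int year = 0, qtr = 0, pass = 0;
};

// Reads one line from `in` and splits it on commas.
// Returns false at end of input or when a field is missing.
bool read_flight(std::istream& in, Flight& f) {
    std::string line;
    if (!std::getline(in, line)) return false;

    std::istringstream fields(line);
    std::string year, qtr, pass;
    if (!std::getline(fields, f.orig, ',') || !std::getline(fields, f.dest, ',') ||
        !std::getline(fields, f.carr, ',') || !std::getline(fields, year, ',') ||
        !std::getline(fields, qtr, ',') || !std::getline(fields, pass)) {
        return false;
    }
    // std::stoi throws if a field does not start with a number.
    f.year = std::stoi(year);
    f.qtr = std::stoi(qtr);
    f.pass = std::stoi(pass);
    return true;
}
```

Because the function takes any std::istream, the same code works on an std::ifstream opened on the data file or on a string stream for testing.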
dakota_man All American 26584 Posts
fscanf 3/29/2006 9:02:28 AM |
scud All American 10804 Posts
^ eww I know one of my coworkers didn't just suggest a potentially very unsafe method
good thing we're not in the security business anymore 3/29/2006 9:25:18 AM |
clalias All American 1580 Posts
that wouldn't be unsafe for me, right? does that create some kind of vulnerability like buffer overflow?
Actually fscanf looks nice for my application. 3/29/2006 10:09:51 AM |
ZeroDegrez All American 3897 Posts
To read in the entire file, like JaegerNCSU suggested do something like this (but for ascii)
http://www.cplusplus.com/doc/tutorial/files.html
// reading a complete binary file
#include <iostream>
#include <fstream>
using namespace std;

ifstream::pos_type size;
char * memblock;

int main () {
  ifstream file ("example.txt", ios::in|ios::binary|ios::ate);
  if (file.is_open())
  {
    size = file.tellg();
    memblock = new char [size];
    file.seekg (0, ios::beg);
    file.read (memblock, size);
    file.close();

    cout << "the complete file content is in memory";

    delete[] memblock;
  }
  else cout << "Unable to open file";
  return 0;
}
Then use sscanf to process the information in memory. Because sscanf only returns the number of items it matched, add a '%n' to your format string. '%n' does not scan any value; it stores the number of characters read so far in the int pointer you pass for that item. You can use that count to advance through the memory block holding all your text.
Note: '%n' would need to be the last item in your format string.
See here for more details: http://man.he.net/man3/sscanf
I think that should do the trick.
[Edited on March 29, 2006 at 8:32 PM. Reason : note:]3/29/2006 8:30:16 PM |
clalias All American 1580 Posts
^got it! Thanks.
In case anyone is interested-------
It takes less than 1 min to load the entire data set, then I performed the nested loop on a small sample of 600,000 observations (1/10 total) and wrote out the result in under 30 min. Basically it's able to test 2000 lines of data in 7 sec. That means it is looping 2000 times x 600,000 innerloops in under 30 min.
(compared to Matlab testing about 1 line of data per second on the same machine -- though I didn't try mex'ing) I did try to compile matlab script to c using the matlab compiler and that only gave marginal improvement.
So I should be able to run the entire file in around 4-5 hrs.
The Fortran code I tried using was going to take 12 hrs, but I *was* on a different computer. I think it was using a lot of the pagefile not to mention the slower processor. school supplied me with VC++2005 and it's a good thing I have 2GB of system memory.
I did have to set the compiler options to allow a commit stack of 300,000,000 Bytes and a reserve size of 400,000,000. And declare everything as short if I could. Probably overkill but I was tired of the damn stack overflow error. VC++ default is only 1MB. I think gcc is around 5MB.
[Edited on March 30, 2006 at 1:15 AM. Reason : .]
[Edited on March 30, 2006 at 1:18 AM. Reason : .]
[Edited on March 30, 2006 at 1:23 AM. Reason : .] 3/30/2006 1:15:18 AM |
skokiaan All American 26447 Posts
why would there be a stack error on non recursive code? 3/30/2006 1:22:25 AM |
clalias All American 1580 Posts
The arrays were too large. Quote : | " const int NUMOBS=643500; // 6435000
char orig[NUMOBS][4]; //1st variable in list
char dest[NUMOBS][4]; //2nd
char carr[NUMOBS][3]; //3rd
int year[NUMOBS]; //4th
int qtr[NUMOBS]; //5th
int pass[NUMOBS]; //6th
int network[NUMOBS]; // The output. This is the number of "spokes"" |
http://www.devx.com/tips/Tip/14276
[Edited on March 30, 2006 at 1:27 AM. Reason : link]3/30/2006 1:25:23 AM |
skokiaan All American 26447 Posts
umm, dynamic memory allocation? welcome to the 80s 3/30/2006 1:32:28 AM |
ZeroDegrez All American 3897 Posts
Quote : | "why would there be a stack error on non recursive code?" |
hurray for java...
Arrays created like this...
char orig[NUMOBS];
are created on the stack. if he had created the arrays using 'new', it would have put them on the heap.
....but yeah seriously, 80's, create those arrays on the heap.
[Edited on March 30, 2006 at 1:36 AM. Reason : argh array not stack...brain died]3/30/2006 1:33:08 AM |
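As a sketch of the heap-based alternative, std::vector is swapped in here for raw 'new', since a vector also keeps its elements on the heap and frees them automatically. `Flight` and `DataSet` are hypothetical names; the size in the usage note is the NUMOBS value from the quoted declarations.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical record type matching the quoted parallel arrays.
struct Flight {
    char orig[4];
    char dest[4];
    char carr[3];
    int year;
    int qtr;
    int pass;
};

// The vector objects themselves are small; the n elements they manage are
// allocated on the heap, so no enlarged stack is needed.
struct DataSet {
    std::vector<Flight> flights;
    std::vector<int> network;  // the output: number of "spokes" per row

    explicit DataSet(std::size_t n) : flights(n), network(n, 0) {}
};
```

Usage would be along the lines of `DataSet data(643500);`, after which `data.flights[i].year` and `data.network[i]` index the same way the fixed-size arrays did.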
clalias All American 1580 Posts
^explain please.
^^haha. But, how would dynamic memory allocation have helped? nevermind... duh... [Edited on March 30, 2006 at 1:37 AM. Reason : .]
[Edited on March 30, 2006 at 1:38 AM. Reason : .] 3/30/2006 1:36:13 AM |
skokiaan All American 26447 Posts
^the way you did it, the array is stored on the stack. if you use dynamic memory allocation, it's stored on the heap
http://c.ittoolbox.com/documents/popular-q-and-a/stack-vs-heap-2112
[Edited on March 30, 2006 at 1:37 AM. Reason : I guess i should be happy about job security]
[Edited on March 30, 2006 at 1:38 AM. Reason : then again, c++ sucks. 50% c++, 50% you] 3/30/2006 1:37:28 AM |
clalias All American 1580 Posts
" I guess i should be happy about job security"
I mean, I guess. If you like to compare yourself with someone who learned FORTRAN 5 years ago and hasn't programmed since. Then said person buys a c++ book 2 weeks ago and spent a total of 5 hours or so reading it.
This is just some shit I had to learn to do my research. I don't care if it's "perfect" If it gets the fucking job done and I get on with my life.
[Edited on March 30, 2006 at 1:42 AM. Reason : .]
but this is good to know Quote : | "Also, there is no save way to protect the stack space from being overwritten (other than adjusting the ESP register yourself).
In conclusion, unless you're coding in assembly, don't use the stack for general data." |
[Edited on March 30, 2006 at 1:45 AM. Reason : .]3/30/2006 1:41:09 AM |
ZeroDegrez All American 3897 Posts
So in the example I gave you about how to read in the whole file at once.
char * memblock; memblock = new char [size];
Assuming size is defined and assigned elsewhere. What this does is create the memblock pointer, which is a single address (32 bits wide on a 32-bit build) holding the location in memory where the data starts (in the heap).
Using the 'new' keyword will allocate a block of memory in the heap (think giant jello cube). So it carves out a piece of the jello block, and there is where your array exists.
Then when you index the pointer, ie: memblock[100]; what that does is say start at the address memblock points to, and move 100 blocks into memory, each block being the size of the type of item in the array. In our case a single byte character. so 100 bytes into the string of characters.
So indexing works the same way....same code. It's just that instead of everything existing on this huge stack...which probably has to be paged a lot. It exists on the system ram...somewhere.
If we were using C, you would use malloc/calloc/realloc.
[Edited on March 30, 2006 at 1:54 AM. Reason : s]3/30/2006 1:49:27 AM |
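A small sketch of the indexing rule described above, using raw new[] and delete[] as in the post. `read_slot` is a hypothetical helper that exists only to show that `block[index]` and `*(block + index)` name the same element.

```cpp
#include <cstddef>

// Allocates n ints on the heap, fills slot i with i * 10, and reads one slot
// back two ways. Returns the value, or -1 if the two reads ever disagreed.
// Assumes index < n.
int read_slot(std::size_t n, std::size_t index) {
    int* block = new int[n];  // heap allocation; block holds the start address
    for (std::size_t i = 0; i < n; ++i) {
        block[i] = static_cast<int>(i) * 10;
    }

    const int via_index = block[index];       // array-style indexing
    const int via_pointer = *(block + index); // same element: start + index * sizeof(int) bytes

    delete[] block;  // every new[] needs a matching delete[]
    return via_index == via_pointer ? via_index : -1;
}
```

For a char array each step is one byte; for the int array here each step is sizeof(int) bytes, which is the "size of the type of item" part of the explanation.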
clalias All American 1580 Posts
yeah, I admit I only used bits and pieces of the stuff you linked to. But I got it working. The thing is, I don't have a lot of time to spend on it-- full time grad student and an RA-- you get the idea.
But thanks, I definitely need to look at what you're telling me. 3/30/2006 1:56:34 AM |
ZeroDegrez All American 3897 Posts
no problem. If you change over to dynamic allocation, I would be willing to bet you would see performance improvements though. but it's up to you. 3/30/2006 2:05:25 AM |
dakota_man All American 26584 Posts
i would like for scud to explain why fscanf is particularly insecure
but he won't because he's on vacation
oh well...
can't you also fin >> var1 >> "," >> var2 >> "," >> var3 >> etc? 3/30/2006 10:11:07 PM |