JeffreyBSG All American 10165 Posts user info edit post |
Hi, guys. Okay, I'm a grad student in mathematics, and I'm trying to determine the asymptotic growth of the interval profile of a suffix tree. I've written a simulation in C++, but it pretty much sucks since I suck at coding. So I'm hoping you guys can help me improve its efficiency.
We're looking at a suffix tree with n leaves, and we want to know how many internal nodes there are on the kth level. k is an integer which will be in the neighborhood of log(n).
We start with three parameters, n (a big fucking integer, on the order of 10^6 or larger.) p (a probability, i.e. real number between 0 and 1), alpha (a positive number around 1.)
We then define the integer k=alpha*log(n) (rounded to the nearest integer.)
To get our internal profile, then, we 1. Construct a random binary string of X consisting of n+k-1 entries, with each entry taking value 1 with probability p, 0 with probability 1-p. 2. Consider the n consecutive substrings of length k of X, i.e. X(1...k), X(2...k+1), ... , X(n...n+k-1), and count how many of these substrings appear at least twice. This is M_{n,k}, the internal profile, our variable of interest. 3. Estimate the asymptotic variance of M_{n,k} by running this simulation a bunch of times for really fucking big n.
My approach is to convert each substring to an integer on [0,2^k - 1], and then store it in a set. And if it's already in the set (i.e. it's appeared as an earlier substring) I put it in another set, the size of which is, at the end, my profile. But I'm sure I'm doing this super-inefficiently. I'm an amateur programmer at best. Here's the second of my code in which I build my n successive substrings and then count how many appear more than once by the placing of each in sets. but if you could suggest a better method, that would be awesome.
Also, I don't know anything about pointers, so I can't build the fucking tree. And I'm not sure if that would be optimal.
int n=1000000; int k=round(log(n)); int p=.75; int profilesize=0; int curval=0; int wordmask=0; std::set myset; std::set profileset;
// building the first k characters of our string, and converting them into an integer on [0, 2^k -1] for (int i=0; icurval=1-floor((float)rand()/(p*RAND_MAX)); wordmask=wordmask+curval*pow(2,i); }
// generating n strings from this base-string, each time lopping off the first character // and adding the last, then checking to see if this is a string we've already // encountered by examining our list of encountered strings for (int i=0; icurval=1-floor((float)rand()/(p*RAND_MAX)); wordmask=wordmask/2 + curval*pow(2,k-1); if( myset.find(wordmask) != myset.end() ){ profileset.insert(wordmask);} else{myset.insert(wordmask);}
profilesize =profileset.size();
The variable "profilesize" is our final variable; then we generate it a shitload of times for large values of n, and try to estimate its variance empirically. Any help that could be offered would be appreciated. 4/19/2014 11:01:32 PM |
JeffreyBSG All American 10165 Posts user info edit post |
k 4/19/2014 11:02:49 PM |
JeffreyBSG All American 10165 Posts user info edit post |
4/19/2014 11:11:53 PM |
moron All American 34142 Posts user info edit post |
Why are you generating a word mask and looking for it in the set, rather than iterating through the set and counting substrings of the length you're looking for? 4/20/2014 12:19:34 AM |
JeffreyBSG All American 10165 Posts user info edit post |
maybe I'm misunderstanding your question, but I'm converting substrings to wordmasks because it seems impractical to compare the jth substring to the previous j-1 substrings, letter by letter, just to see if it matches
in essence, by converting to wordmatch, I'm allowing myself to test whether two k-strings are identical simply by comparing two integers.
maybe that's suboptimal, though. 4/20/2014 12:24:44 AM |
moron All American 34142 Posts user info edit post |
Okay perhaps I don't understand your problem...
But you have a string of size 1000000 You have substrings of length 6 (let's say) that is every 6-length substring in your string And You want to know which 6-length substring is the most common Or You want to know how many times each 6-length substring occurs
Or neither of these? 4/20/2014 12:34:06 AM |
JeffreyBSG All American 10165 Posts user info edit post |
Quote : | "You have substrings of length 6 (let's say) that is every 6-length substring in your string " |
Yes, every consecutive substring of length 6, just to be absolutely clear. If my string were made of English letters and started boobsvagina.... my first three substrings would be boobsv oobsva obsvag
Quote : | "And You want to know which 6-length substring is the most common Or You want to know how many times each 6-length substring occurs
Or neither of these?" |
Neither. I want to know how many substrings appear at least twice.
Thanks for taking an interest in this, definitely. I'm messing with multiset right now, but I suck at C++.4/20/2014 12:37:31 AM |
moron All American 34142 Posts user info edit post |
Okay, it looks to me like you're randomly generating a substring:
for (int i=0; icurval=1-floor((float)rand()/(p*RAND_MAX)); wordmask=wordmask+curval*pow(2,i); }
The looking for this random string in your set of existing substrings:
if( myset.find(wordmask) != myset.end() ){ profileset.insert(wordmask);} else{myset.insert(wordmask);}
Is this correct?
If you have a set of existing substrings (set myset) (or is this the full string? it doesn't matter either way), why not look through that set for occurrences of 2 or more of that substring, rather than generating a substring, then looking to see if it exists?
It's like if your string is boobsvagina
you generate the random substring "cat" then see if it's in boobsvagina?
basically, in the process of generating "myset" you can implicitly keep count of how many of each element has been added. then just read out the counts, discarding any that's <2.
[Edited on April 20, 2014 at 12:58 AM. Reason : ] 4/20/2014 12:44:30 AM |
moron All American 34142 Posts user info edit post |
http://en.cppreference.com/w/cpp/container/map
You can modify your existing thing to use a map instead, where the "value" you're storing is the count.
What you're doing with the 2 sets is not extraordinarily different, or more computationally intensive.
It just doesn't make sense to generate a random substring, look for it in the set, when you have the full set, and you can just look for substrings that are duplicates of each other (if that's what you're doing).
[Edited on April 20, 2014 at 12:56 AM. Reason : ] 4/20/2014 12:56:25 AM |
JeffreyBSG All American 10165 Posts user info edit post |
Quote : | "Okay, it looks to me like you're randomly generating a substring:
...
The looking for this random string in your set of existing substrings:
... " |
More or less correct, yes. I should be clear, though: I don't have my whole giant string constructed at the beginning. Instead I generate the first substring randomly, then lop of the first letter and generate a new last letter, i.e. I take "boobsv," lop off the b and randomly generates a g so the new substring is "oobsva." I'm using the substrings to generate the master string, which is fine because they're all I care about, and the behavior of the system is the same.
Quote : | " If you have a set of existing substrings (set myset), why not look through that set for occurrences of 2 or more of that substring, rather than generating a substring, then looking to see if it exists?[
It's like if your string is boobsvagina
you generate the random substring "cat" then see if it's in boobsvagina? " |
Okay, first of all, if my current substring is "cat", that means that my previous substring was "hca" or something that ended in ca.
You're right that I should just stick all my substrings in a set and then count how many appear twice or more. I guess the difficulty was that sets can only contain one of an element, so I built a set to contain the first occurrence of an element, then for every element, if it was in that set I put it another set which contained things that appeared at least twice. This "looking for every element" business was really inefficient.
Multiset can hold multiple copies of the same element; hopefully it'll be vastly more efficient.4/20/2014 12:57:46 AM |
moron All American 34142 Posts user info edit post |
so you have the super-long string stored some where, you know your substrings are length k: psuedocode:
std::map substringKCountsMap;
while(i=0,i+k<superLongString.length(),i++){
newMapKey=readSuperLongString(i,i+k);
if (substringKCounts.contains(newMapKey)) substringKCountsMap[newMapKey]++;
else substringKCountsMap.add({newMapKey,0})
}
foreach(countedSubstring in substringKCountsMap){
if (countedSubstring.value()>2) printf(countedSubstring.key);
}
Also, this problem is VERY parallelizable, if you wanted to make this threaded, it'd run a lot faster (probably).
[Edited on April 20, 2014 at 1:10 AM. Reason : ] 4/20/2014 1:07:56 AM |
JeffreyBSG All American 10165 Posts user info edit post |
the ideas of the pseudocode all make sense...I'll try to translate them into some sensible C tomorrow. thanks a lot.
also, I have no idea in the world how I would exploit parallelizability (that's way beyond the scope of my C++ knowledge), but I know what parallizing is (I think) and that's cool that it's possible here.
I'd love to bang out some results for n=10^10 or something. that would be the shit. 4/20/2014 1:17:52 AM |
moron All American 34142 Posts user info edit post |
cool
just make sure to look at this: http://en.cppreference.com/w/cpp/container/map
It's a key-value pair structure, it makes your life easier.
map is the sister structure to std::set and is nearly drop-in replacement for using the std::set that you're already using.
http://stackoverflow.com/questions/205945/why-does-the-c-stl-not-provide-any-tree-containers
Both are implemented with binary trees, so it doesn't matter either that you don't know pointers, C++ has this taken care of for you 4/20/2014 1:21:23 AM |
JeffreyBSG All American 10165 Posts user info edit post |
I checked out "map" and tried for a bit to implement it, but got kinda stuck (didn't try for all that long, admittedly...my students have an exam tomorrow and bombared me with questions.)
also, I'm not entirely clear how I can store my monstrous (ideally, > 10^7 entries) binary string, or in what format. but I'm still tinkering with it.
just an update. thanks for the suggestion that I use map, I checked out the links and I'm sure it'll all become clear soonish. 4/21/2014 12:14:47 AM |
moron All American 34142 Posts user info edit post |
Storing 10^7 bits is pretty trivial... That's like a megabyte of data.
A map is just a dictionary, or a lookup table. Or is it the implementation you're more having issue with? 4/21/2014 1:12:41 AM |
Shadowrunner All American 18332 Posts user info edit post |
This sounds tractable analytically; are you simulating it to check your work? 4/21/2014 11:56:04 AM |
JeffreyBSG All American 10165 Posts user info edit post |
^^ yeah, but the thing is , I want to generate a random string of 10^7 bits of data. In fact, I want to generate dozens or hundreds of them, all within the same program, since I'm trying to simulate the statistical properties of a random variable depending on a random enormous string of data.
I guess I'm not sure how/where to store the gigantic base-string. C++ flatly refuses to store it as an array, of course, and that's kinda the only structure I'm familiar with.
^ yeah, I think it's tractable, but nobody's done it yet. this is my thesis, actually. I'm not precisely checking my work; it's more, I want to get a feel for what's happening, how this variable behaves for certain ranges of k/log(n). The expectation is more or less understood; I'm looking at the variance. 4/21/2014 4:55:46 PM |
moron All American 34142 Posts user info edit post |
Isn't there anyone in your department that can sit down and help you? These programming problems aren't very big ones to overcome, a senior computer engineering student, or computer science student can help you figure them out easily. 4/21/2014 5:30:07 PM |
lewisje All American 9196 Posts user info edit post |
Quote : | "I don't know anything about pointers" | http://cslibrary.stanford.edu/104/4/21/2014 7:33:10 PM |
JeffreyBSG All American 10165 Posts user info edit post |
yeah, I'm sure there is. I just randomly decided to ask TWW first though, for some reason 4/21/2014 9:55:46 PM |
aaronburro Sup, B 53068 Posts user info edit post |
seriously, like moron is saying, use a map. A tree is not what you want in this scenario. You want something with a fast lookup, and a binary tree has O(log2(n)) lookup time (normal case) while a map has O(1). Even better, you have no need for pointers with a map, especially one that's already made for you like what moron suggested.
Also, I'd suggest going away from C++ for this for the time being, unless you have a background in it. You'd probably have better luck banging out the basics in java or C#, and those will keep you away from nastier bugs that are easy to write in C++ when you're just starting out. Both java and C# have a dictionary (or map) as part of the base libraries, too.
Another word of advice would be not to build your string as you go along. Generate your huge random string first, and then pass it in to the algorithm that analyzes it. This will be much more parallelizable (and trivially so, at that), and it has the added benefit of allowing you to easily pass in your own strings for any "weird cases" that you might want to explore. It's also going to run faster, even with the worst possible implementations. At the end of the day, you want to analyze a random string, so it doesn't matter if the string is generated as you go or if it's generated all at once. 4/22/2014 12:19:46 AM |
moron All American 34142 Posts user info edit post |
map is still implemented as a binary tree, to my knowledge, not a hash table.
A hash table would be a great option for this too, assuming he could find a good hashing algorithm. If he couldn't, there would be memory issues if he tried to scale it up.
This is easy in C#, but C# is so Microsofty...
[Edited on April 22, 2014 at 12:32 AM. Reason : ] 4/22/2014 12:31:14 AM |
lewisje All American 9196 Posts user info edit post |
javascript man
javascript 4/22/2014 2:20:12 AM |