quagmire02 All American 44225 Posts user info edit post |
so i have a database populated with partner organizations, groups that help fund our research...i'm trying to automate a registration process for our intranet, but my regex skills are lacking...what i'd like to do is pull the organization's website out of the database, strip out the http:// and www. if they exist, and compare the remaining string against the registrant's email address to see if they're one of our supporters...pulling the information isn't really hard, nor is the comparison, but i'm running into problems based on the websites
for example, MIT is a partner and the website we have on file is http://www.mit.edu/ (this is exactly as it appears in the database)...i'd like to strip all except for the mit.edu (exactly) and compare that against their email address (username@mit.edu)
that's great, but we have funding partners that also might have international addresses like .co.uk or .com.au or us.company.com/ca.company.com so i can't just write a generic script that strips out everything except the domain and the block of text preceding it (i mean, that would work great for the us.company.com/ca.company.com example, but not for the others)
suggestions? any of you PHP regex gurus want to give me some pointers? 6/8/2009 12:08:19 PM |
Stein All American 19842 Posts user info edit post |
$string = preg_replace('#^(https?://)?([w]{3}\.)?([^/]+)(.*)#', "$3", $orgString);
[Edited on June 8, 2009 at 12:16 PM. Reason : works and now tested!] 6/8/2009 12:12:28 PM |
Fail Boat Suspended 3567 Posts user info edit post |
remind me what this
[^/]+
does? 6/8/2009 12:18:09 PM |
Stein All American 19842 Posts user info edit post |
Selects all non forward slashes as long as there is at least 1. 6/8/2009 12:22:05 PM |
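For anyone reading along, here is a minimal sketch of that regex at work. The wrapper name hostFromUrl() and the sample URLs other than the MIT one are made up for illustration; the pattern itself is the one posted above.

```php
<?php
// Sketch: reduce a stored website URL to its host, minus a leading "www.".
// hostFromUrl() is a hypothetical wrapper name, not something from the thread.
function hostFromUrl(string $url): string
{
    // Group 1: optional scheme. Group 2: optional "www.".
    // Group 3: [^/]+ grabs everything up to the first "/".
    // Group 4: whatever is left. Only group 3 is kept.
    return preg_replace('#^(https?://)?([w]{3}\.)?([^/]+)(.*)#', '$3', $url);
}

echo hostFromUrl('http://www.mit.edu/'), "\n";          // mit.edu
echo hostFromUrl('https://us.company.com/about'), "\n"; // us.company.com
echo hostFromUrl('www.example.co.uk'), "\n";            // example.co.uk
```

Because group 3 stops at the first slash rather than trying to guess the top-level domain, hosts like .co.uk or us.company.com come through whole.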
quagmire02 All American 44225 Posts user info edit post |
that's actually not what i was thinking of, but then had a duh moment and realized it was easier to do that, split their email address at the "@", and then use in_array to compare their email domain against an array of the authorized domains
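A rough sketch of the approach described above (split the address at the "@", then in_array against the authorized domains). The function name isPartnerEmail() and the domain list are assumptions for illustration only.

```php
<?php
// Sketch: check whether an email address belongs to an authorized domain.
// isPartnerEmail() and the sample domain list are hypothetical.
function isPartnerEmail(string $email, array $partnerDomains): bool
{
    $parts = explode('@', $email);
    if (count($parts) < 2) {
        return false; // no "@" at all, so not an email address
    }
    // Take whatever follows the last "@" and compare case-insensitively.
    $domain = strtolower(end($parts));
    return in_array($domain, $partnerDomains, true);
}

$partnerDomains = ['mit.edu', 'us.company.com']; // made-up list
var_dump(isPartnerEmail('username@mit.edu', $partnerDomains));   // bool(true)
var_dump(isPartnerEmail('someone@gmail.com', $partnerDomains));  // bool(false)
```

This only matches exact domains, so an address at a subdomain of a listed partner would not pass.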
while i'm in this thread, might as well ask my next question...i've never had to do this before, but how does one pass session variables out of a function that redirects to another page? i've got a redirect function that takes in a generated error message and a redirect URL, assigns the message to a session variable, and then redirects via the provided URL...the function is what requires the special case, since if i use the code outside of a function in each instance it works swimmingly
page1.php
<?php
function redirect($message,$link) {
    $_SESSION['error'] = $message;
    session_write_close();
    header("location:$link");
    exit();
}
$error = "Oh, bollocks.";
$gohere = "page2.php";
redirect($error,$gohere);
?>
page2.php
<?php echo $_SESSION['error']; ?>
6/10/2009 10:36:45 AM |
evan All American 27701 Posts user info edit post |
global $_SESSION['error'] = $message;
good job on not using session_register(), but if you set a session var from within a function you need to make sure it goes to the global namespace.
also, make sure you're calling session_start() before anything else on those pages if you're using cookies to handle the SIDs 6/10/2009 10:41:22 AM |
quagmire02 All American 44225 Posts user info edit post |
^ i tried using global before, but i got this, so i assumed that wasn't it:
Quote : | "Parse error: syntax error, unexpected '[', expecting ',' or ';'" |
also, i'm calling session_start() on both pages 6/10/2009 10:54:03 AM |
evan All American 27701 Posts user info edit post |
hm, i stand corrected
Quote : | "Note: As of PHP 4.1.0, $_SESSION is available as a global variable just like $_POST, $_GET, $_REQUEST and so on. Unlike $HTTP_SESSION_VARS, $_SESSION is always global. Therefore, you do not need to use the global keyword for $_SESSION." |
are you sure it's actually registering a session?
try using the URL method instead:
header("location:" . $link . "?" . SID);
see if it sticks a long md5 hash in the URL. 6/10/2009 11:11:56 AM |
Stein All American 19842 Posts user info edit post |
Ignore evan. Sessions are superglobal.
I've never used session_write_close() but the manual page does show some people having similar issues as you're having. http://us2.php.net/manual/en/function.session-write-close.php#86791 seems to be a solution to your problem. 6/10/2009 11:14:12 AM |
quagmire02 All American 44225 Posts user info edit post |
ah, session_regenerate_id(true) is what was needed to get it to pass those session variables
thanks, y'all 6/10/2009 11:16:01 AM |
evan All American 27701 Posts user info edit post |
^^it wasn't in previous versions. i rarely use sessions, so i wasn't aware of this. sorry.
also, sounds like this is your solution:
Quote : | "I was having the same problem as many here regarding setting session data just before a header location redirect and having the session data just not be there. I tried everything people here said, and none of their combinations worked. What did finally work for me was to fire off a session_regenerate_id(true) call just prior to the header() and die() calls.
session_regenerate_id(true); header('location: blah blah'); die();
Without the regenerate id call, the write close did not seem to do anything. session_write_close() doesn't seem to matter at all. It certainly didn't fix anything on its own for me.
This is a rather annoying issue with php sessions that I've never run into before. I store my sessions to /dev/shm (which is RAM) so file IO blocking can't be the problem. Now I'm nervous that some other session data might not be getting updated prior to a header() location change, which is extremely important and common in any web app." |
[Edited on June 10, 2009 at 11:17 AM. Reason : heh, never mind, you already saw it]6/10/2009 11:17:36 AM |
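Putting the pieces of this exchange together, a redirect helper might look roughly like the sketch below. The names redirectWithMessage() and takeMessage() are made up; the session_regenerate_id(true) call is the fix reported in this thread for this particular setup, not something every installation is known to need.

```php
<?php
// Sketch: store a message in the session, then redirect.
// redirectWithMessage() and takeMessage() are hypothetical names.
function redirectWithMessage(string $message, string $link): void
{
    if (session_status() !== PHP_SESSION_ACTIVE) {
        session_start(); // must run before any output is sent
    }
    $_SESSION['error'] = $message; // $_SESSION is a superglobal, no "global" needed
    session_regenerate_id(true);   // the call the thread found necessary
    header('Location: ' . $link);
    exit();
}

// On the target page, after session_start(): read the message once and clear it.
function takeMessage(): ?string
{
    $message = $_SESSION['error'] ?? null;
    unset($_SESSION['error']);
    return $message;
}
```

On page2.php the usage would be something like session_start(); followed by echo takeMessage();. Clearing the value after reading keeps a stale message from showing up on a later page load.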
Stein All American 19842 Posts user info edit post |
Quote : | "^^it wasn't in previous versions. i rarely use sessions, so i wasn't aware of this. sorry. " |
Not to be a dick, but it's been like this for almost 7.5 years. 6/10/2009 11:33:17 AM |
evan All American 27701 Posts user info edit post |
which is right around the time i started learning php. i learned on 3, 4 wasn't out then. i haven't had a need to register session variables as i do it a different way, so, yeah.
btw, you actually were being a dick. 6/10/2009 11:36:01 AM |
Ernie All American 45943 Posts user info edit post |
Previous versions being PHP 3.0.
[Edited on June 10, 2009 at 11:37 AM. Reason : ha I was just joshin']
[Edited on June 10, 2009 at 11:40 AM. Reason : and ha super globals were introduced in 4.1, late 2001] 6/10/2009 11:37:00 AM |
quagmire02 All American 44225 Posts user info edit post |
is there a better way (more concise or best practice) to clean up file names than this (strip everything but alphanumeric, including periods, but keep the file extension and last period)?
function cleanFilename($str) {
    $str = strtolower(trim(basename($str)));
    // gets file extension
    $i = strrpos($str,".");
    if(!$i){return "";}
    $l = strlen($str)-$i;
    $ext = substr($str,$i+1,$l);
    // replaces characters
    $pos = strrpos($str,"."); // position of last . in string (strpos does the first)
    $str = preg_replace("/[^a-zA-Z0-9\s]/","",substr($str,0,$pos)); // remove all non-alphanumeric characters before last . in string
    $str = preg_replace("/\s+/","_",$str); // compress internal whitespace and replace with _
    $str = preg_replace("/\W-/","",$str); // remove all non-alphanumeric characters except _ and -
    return $str.".".$ext;
}
[Edited on June 19, 2009 at 10:53 AM. Reason : .] 6/19/2009 10:51:49 AM |
Noen All American 31346 Posts user info edit post |
I would go with below:
function cleanFilename ($str) {
    $str = basename($str);
    $fileExtensionPosition = strrpos($str, ".");

    if($fileExtensionPosition) {
        $patterns[0] = '/[^a-zA-Z0-9\s]/';
        $replacements[0] = '';
        $patterns[1] = '/\s\s+/';
        $replacements[1] = '_';
        $fileName = preg_replace($patterns, $replacements, substr($str,0,$fileExtensionPosition));
        $fileExtension = substr($str,$fileExtensionPosition);
        return $fileName.$fileExtension;
    }
    return false;
}
Differences:
- extension check is the primary pass/fail logic of the function, it should be blocked to prevent boundary return conditions.
- offload as much computation as possible until after you do the extension check.
- string length is an optional parameter of substr, you can kill all that
- why chop out the "." of the ext, and then manually reinsert it, when you aren't doing any transforms on the extension?
- grouped the reg expressions into arrays (best practice)
- slight tweak to your whitespace replace, so also remove redundant whitespace ( _____ goes to _)
- the last replace should be redundant, as you already removed *all* non alphanumeric characters in the first replacement, then inserted _'s.
- variable naming, return value for failure should never be "", as it doesn't tell you if that's a false return, or an empty filename (ie: --- is a valid filename, but would return as "" in your function.)
[Edited on June 19, 2009 at 4:33 PM. Reason : .] 6/19/2009 4:30:52 PM |
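One detail worth checking when trying this out: a pattern of \s\s+ only matches runs of two or more whitespace characters, so a single space would pass through untouched. The sketch below is one possible variant, not the posted function; it uses \s+ instead, compares strrpos() against false explicitly, and treats names with no stem (a leading dot only) as failures. The name cleanFilenameSketch() is made up.

```php
<?php
// Sketch: strip everything but letters and digits from the name part,
// turn each run of whitespace into one "_", and keep the extension as-is.
// cleanFilenameSketch() is a hypothetical name for this variant.
function cleanFilenameSketch(string $path)
{
    $name = basename($path);
    $dot  = strrpos($name, '.');
    if ($dot === false || $dot === 0) {
        return false; // no extension, or nothing before the dot
    }
    $stem = substr($name, 0, $dot);
    $ext  = substr($name, $dot); // still includes the leading "."

    $stem = preg_replace('/[^a-zA-Z0-9\s]/', '', $stem); // drop punctuation
    $stem = preg_replace('/\s+/', '_', trim($stem));     // any whitespace run -> "_"
    return $stem . $ext;
}

var_dump(cleanFilenameSketch('/tmp/My  File (v2).final.TXT')); // string(19) "My_File_v2final.TXT"
var_dump(cleanFilenameSketch('noext'));                        // bool(false)
```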
quagmire02 All American 44225 Posts user info edit post |
okay, so now my question is related to my first post in this thread...similar situation, but again my regex skills are lacking
we might have on record http://www.sponsor.com/ and i can get just the sponsor.com (which is what i want), but i've just come across a case where the email address of the user is something like username@us.sponsor.com so that when i do the compare, it tries to compare us.sponsor.com to sponsor.com and it fails (obviously)
i could do a reverse compare (where i check for sponsor.com inside us.sponsor.com), but i'm trying to avoid that...what i want to do is take the user's domain from their email address (us.sponsor.com) and strip out everything EXCEPT sponsor.com...so if their email was username@we.are.a.sponsor.com or username@us.sponsor.com or regardless of the number of subdomains, it will always return JUST sponsor.com
suggestions? 9/8/2009 3:32:22 PM |
Stein All American 19842 Posts user info edit post |
$whatever = preg_replace('#(.*(\.|@))?([^\.]+\.[^\.]+)$#', "$3", $whatever);
[Edited on September 8, 2009 at 3:40 PM. Reason : there we go] 9/8/2009 3:36:20 PM |
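A quick sketch of what that pattern returns, since it always keeps the last two dot-separated labels. That is what was asked for with sponsor.com, but it also means an address under .co.uk comes back as co.uk. The second function is one hypothetical way around that, using a small hand-maintained list of two-part suffixes; both function names and the suffix list are assumptions, not anything from the thread.

```php
<?php
// The posted regex, wrapped in a hypothetical helper name.
function lastTwoLabels(string $emailOrHost): string
{
    return preg_replace('#(.*(\.|@))?([^\.]+\.[^\.]+)$#', '$3', $emailOrHost);
}

// Hypothetical alternative: keep three labels when the host ends in a known
// two-part suffix. The default suffix list is an illustrative assumption.
function baseDomain(string $emailOrHost, array $twoPartSuffixes = ['co.uk', 'com.au']): string
{
    // Everything after the last "@", or the whole string if there is none.
    $host   = strtolower(substr(strrchr('@' . $emailOrHost, '@'), 1));
    $labels = explode('.', $host);
    $keep   = 2;
    $tail   = implode('.', array_slice($labels, -2));
    if (count($labels) >= 3 && in_array($tail, $twoPartSuffixes, true)) {
        $keep = 3;
    }
    return implode('.', array_slice($labels, -$keep));
}

echo lastTwoLabels('username@we.are.a.sponsor.com'), "\n"; // sponsor.com
echo lastTwoLabels('user@example.co.uk'), "\n";            // co.uk
echo baseDomain('user@example.co.uk'), "\n";               // example.co.uk
```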
quagmire02 All American 44225 Posts user info edit post |
^ thanks!
i really need to brush up on my regex 9/8/2009 3:41:49 PM |
qntmfred retired 40726 Posts user info edit post |
http://www.sellsbrothers.com/tools/#regexd is a great little tool for testing out regex btw
[Edited on September 8, 2009 at 3:46 PM. Reason : it's built using .net regex, which is mostly the same as php. but still helpful] 9/8/2009 3:44:30 PM |
quagmire02 All American 44225 Posts user info edit post |
^ that's actually pretty cool...thanks for the heads up 9/9/2009 7:59:21 AM |
qntmfred retired 40726 Posts user info edit post |
Bump 4/27/2011 9:35:25 AM |
quagmire02 All American 44225 Posts user info edit post |
i suck at regex...i have this function to automatically parse text for email addresses:
function emailit($str) {
    $regex = '/(\S+@\S+\.\S+)/i';
    $replace = "<a href='mailto:$1'>$1</a>";
    $str = preg_replace($regex, $replace, $str);
    return $str;
    $str = preg_match($regex, $str);
    return $str;
}
and it works great as long as the string only has the email address and not the href tag...so it works well for:
blah blah blah myemailgoeshere@fakemail.com blah blah blah
but not:
blah blah blah <a href="mailto:myemailgoeshere@fakemail.com">myemailgoeshere</a> blah blah blah
suggestions? 4/27/2011 9:40:50 AM |
FroshKiller All American 51911 Posts user info edit post |
Why are you parsing the presentation layer? 4/27/2011 10:37:42 AM |
BigMan157 no u 103354 Posts user info edit post |
$regex = '/([^"\'\s]+@\S+\.\[^"\'\s]+)/i';
maybe 4/27/2011 10:46:17 AM |
Stein All American 19842 Posts user info edit post |
preg_match_all('#[a-zA-Z0-9\-_\.]+@[a-zA-Z\-_\.]+#', $testString, $matches);
You can make it more specific if you really want by adding something that ensure the backhalf actually has a valid domain, but I mean, this will basically work.
[Edited on April 27, 2011 at 11:06 AM. Reason : .] 4/27/2011 11:05:22 AM |
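A small sketch of pulling addresses out of mixed text with that kind of character-class pattern. Two things here are assumptions layered on top of the posted regex: digits are allowed in the domain part, and a trailing period is trimmed, since "." is inside the class and a sentence-ending dot would otherwise be captured. findEmails() is a made-up name.

```php
<?php
// Sketch: list the email-looking strings in a block of text.
// findEmails() is hypothetical; the pattern is a variant of the one posted,
// with 0-9 added to the domain class.
function findEmails(string $text): array
{
    preg_match_all('#[a-zA-Z0-9\-_\.]+@[a-zA-Z0-9\-_\.]+#', $text, $matches);
    // Trim a sentence-ending "." that the character class swallowed.
    return array_map(function ($m) {
        return rtrim($m, '.');
    }, $matches[0]);
}

$text = 'My email address is bob@email.com. Or <a href="mailto:fred@mail2.example.org">here</a>';
print_r(findEmails($text)); // bob@email.com and fred@mail2.example.org
```

The ":" and the quote are not in either class, so an address inside a mailto href is matched on its own rather than with the markup around it.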
quagmire02 All American 44225 Posts user info edit post |
^^^ i'm not...not exactly, anyway
^^ that did it...thxu
[Edited on April 27, 2011 at 11:07 AM. Reason : carets] 4/27/2011 11:07:17 AM |
FroshKiller All American 51911 Posts user info edit post |
So you're screen-scraping. Just because it's not your presentation layer doesn't mean it's not the presentation layer. 4/27/2011 11:14:09 AM |
quagmire02 All American 44225 Posts user info edit post |
i'm working with pre-existing data and i'm trying to clean it up to serve my purposes
once again, contributing to a thread by not contributing to it...thanks for your input 4/27/2011 12:24:52 PM |
quagmire02 All American 44225 Posts user info edit post |
actually, BigMan157, that didn't do it...at least, it takes care of the condition i mentioned, but now the other condition is ignored 4/27/2011 1:02:45 PM |
FroshKiller All American 51911 Posts user info edit post |
The first question you should always ask yourself is whether there's a better approach than the one that has led you to the problem you're currently dealing with. I don't know why that is so hard for you to appreciate.
If you're wanting to scrape e-mail addresses out of HTML and you're using PHP, why don't you just strip out the HREF attributes of any A elements in the document prior to parsing for e-mail addresses? Jesus.
[Edited on April 27, 2011 at 1:10 PM. Reason : The idea being that PHP has easy-to-use DOM traversal and manipulation.] 4/27/2011 1:07:50 PM |
quagmire02 All American 44225 Posts user info edit post |
okay, i'll bite
database entry is exactly this (minus any changes tww's crazy code makes):
My name is Bob. My email address is bob@email.com.
i have no control over the content of the database, just the display...what is your suggestion as to the best way, using PHP, to make that email address into a mailto link? 4/27/2011 1:19:32 PM |
FroshKiller All American 51911 Posts user info edit post |
You don't have control over what's in the database record, but what do you expect to find in a record? Is it reasonably reliable that if the record contains a mailto link, the closing tag will be included? If so, you could just run strip_tags() on it prior to regexing for an e-mail address. That would obviate the need to test for an e-mail address in an anchor's HREF entirely.
[Edited on April 27, 2011 at 1:26 PM. Reason : ...] 4/27/2011 1:26:07 PM |
quagmire02 All American 44225 Posts user info edit post |
reasonably, yes...but the content isn't always that (it was just an example, though a realistic one)
sometimes there will be HTML character entities and sometimes the tags will be encoded as their entity name/number
my thought is to create a function to convert all entity names/numbers to their character and then search for URLs and email addresses to convert to their appropriate links 4/27/2011 1:31:29 PM |
FroshKiller All American 51911 Posts user info edit post |
You could run html_entity_decode() then strip_tags() then run your regex, then. The first function shouldn't hurt anything if there aren't any HTML character references in the input string. 4/27/2011 1:35:17 PM |
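A minimal sketch of that three-step pipeline, so it is clear what comes out the other end. visibleEmails() is a made-up name and the email pattern is only an illustrative assumption.

```php
<?php
// Sketch: decode entities, drop tags, then look for addresses in what is left.
// visibleEmails() is a hypothetical name; the regex is illustrative only.
function visibleEmails(string $raw): array
{
    $decoded = html_entity_decode($raw, ENT_QUOTES, 'UTF-8'); // &lt;p&gt; becomes <p>
    $plain   = strip_tags($decoded);                          // tags and their attributes go away
    preg_match_all('#[a-zA-Z0-9\-_\.]+@[a-zA-Z0-9\-_\.]+#', $plain, $matches);
    return $matches[0];
}

print_r(visibleEmails('&lt;p&gt;You should email me at mary@email.com!&lt;/p&gt;'));
print_r(visibleEmails('You can email me <a href="mailto:john@email.com">here</a>.'));
```

The first call finds mary@email.com. The second finds nothing, because strip_tags() removes the href along with the tag, so an address that only appears inside a mailto attribute is lost. The result is plain text for searching, not something to display if the original markup is supposed to be kept.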
quagmire02 All American 44225 Posts user info edit post |
i suppose i'm not sure what that will accomplish
html_entity_decode() is obvious, and i'm doing that already...but why would i WANT to strip tags? i want to keep them there (and yes, i realize i could except certain tags, but i want to keep them all) 4/27/2011 1:39:36 PM |
FroshKiller All American 51911 Posts user info edit post |
Maybe I'm not fully understanding the issue. I thought you were saying your regex wasn't working as expected when the input string contained an anchor with a mailto HREF. Are you actually trying to replace instances of e-mail addresses in your input with a different e-mail address? 4/27/2011 1:43:13 PM |
quagmire02 All American 44225 Posts user info edit post |
okay, i'll try to do a better job of explaining...below are possible entries:
My name is Bob. My email address is bob@email.com
My name is John. You can email me <a href="mailto:john@email.com">here</a>.
<p>My name is Fred. My email address is <a href="mailto:fred@email.com">fred@email.com</a>.</p>
<p>My name is Mary.<br /><br />You should email me at mary@email.com!</p>
My name is Anna. My website is http://www.mywebsite.com/.
or any variation
if HTML characters are there, i want them displayed...if not, i want to convert email address and URLs into the appropriate link
[Edited on April 27, 2011 at 1:53 PM. Reason : imagine that one of those examples has & l t ; and & g t ; since TWW converted them] 4/27/2011 1:50:40 PM |
AstralEngine All American 3864 Posts user info edit post |
^ So those are the possible entries, but I'm not sure what you want:
1. to strip out everything except the email address and return it, or
2. replace the email address with a mailto tag and return that?
[Edited on April 27, 2011 at 2:03 PM. Reason : ] 4/27/2011 1:56:57 PM |
FroshKiller All American 51911 Posts user info edit post |
Okay, so for e-mail addresses, couldn't you just include the colon as a potential starting character? 4/27/2011 2:04:02 PM |
quagmire02 All American 44225 Posts user info edit post |
nevermind, i think i've got it all in one function now:
function linkylinky($str) {
    $str = ' '.$str;
    $str = preg_replace("#(^|[\n ])([\w]+?://[\w]+[^ \"\n\r\t<]*)#ise", "'\\1<a href=\"\\2\" >\\2</a>'", $str);
    $str = preg_replace("#(^|[\n ])((www|ftp)\.[^ \"\t\n\r<]*)#ise", "'\\1<a href=\"http://\\2\" >\\2</a>'", $str);
    $str = preg_replace("#(^|[\n ])([a-z0-9&\-_\.]+?)@([\w\-]+\.([\w\-\.]+\.)*[\w]+)#i", "\\1<a href=\"mailto:\\2@\\3\">\\2@\\3</a>", $str);
    $str = substr($str, 1);
    return $str;
}
[Edited on April 27, 2011 at 2:20 PM. Reason : code...i had the first two lines working fine, but couldn't get the third...now it works, i think] 4/27/2011 2:09:37 PM |
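Note for anyone reusing this on a current PHP: the "e" pattern modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so the first two calls will not run there as written. The sketch below is a rough equivalent without that modifier, using plain replacement strings; the name linkify() is made up and the patterns are otherwise the same idea as above. It does no HTML escaping, so it assumes the input is already safe to output.

```php
<?php
// Sketch: turn bare URLs and email addresses into links, leaving existing
// <a> tags alone. linkify() is a hypothetical name. No HTML escaping is done.
function linkify(string $text): string
{
    // Pad with a space so a match can start at the very beginning.
    $text = ' ' . $text;
    // scheme://something, preceded by start, newline, or space
    $text = preg_replace('#(^|[\n ])(\w+?://\w+[^ "\n\r\t<]*)#i',
        '$1<a href="$2">$2</a>', $text);
    // www.something or ftp.something without a scheme
    $text = preg_replace('#(^|[\n ])((www|ftp)\.[^ "\t\n\r<]*)#i',
        '$1<a href="http://$2">$2</a>', $text);
    // bare email addresses
    $text = preg_replace('#(^|[\n ])([a-z0-9&\-_.]+?)@([\w\-]+\.([\w\-.]+\.)*\w+)#i',
        '$1<a href="mailto:$2@$3">$2@$3</a>', $text);
    return substr($text, 1); // drop the padding space
}

echo linkify('My name is Bob. My email address is bob@email.com'), "\n";
echo linkify('<a href="mailto:fred@email.com">fred@email.com</a>'), "\n"; // unchanged
```

Each pattern only fires when the URL or address follows a space, a newline, or the start of the string, which is why an address already sitting inside href="mailto:..." or between tags is left as it is.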
moron All American 34142 Posts user info edit post |
This question relates to the second post, but just out of curiosity, why would you use the session header to pass a message between pages instead of a form? 4/27/2011 2:26:14 PM |
quagmire02 All American 44225 Posts user info edit post |
oh, that was a long time ago
page1 (front-end): form, submit to page2 page2 (back-end): process form variables, generate message (success or fail) page3 (front-end): display message
[Edited on April 27, 2011 at 2:49 PM. Reason : is there something wrong with that process?] 4/27/2011 2:48:58 PM |
Stein All American 19842 Posts user info edit post |
It's just sort of a silly way to do it if you're not passing anything you're planning on displaying. 4/27/2011 2:51:25 PM |
quagmire02 All American 44225 Posts user info edit post |
what do you mean? the message is displayed 4/27/2011 2:57:00 PM |
Stein All American 19842 Posts user info edit post |
You made it sound like you're just sending "Success" or "Failed", which is something you could just pass in the URL and then use an if statement to actually display whatever message you wanted to show.
If that's the case, you're creating additional server overhead using a session for no real reason.
Now if you're transmitting a whole error message, like "The operation failed for X, Y, Z" reason, that's a different story. 4/27/2011 3:03:12 PM |
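For the simple success/failed case, the URL approach could look something like the sketch below. The function name messageForStatus(), the "status" parameter, the page names, and the message texts are all made up for illustration.

```php
<?php
// Sketch: map a short status code from the query string to display text.
// messageForStatus() and its strings are hypothetical.
function messageForStatus(?string $status): string
{
    if ($status === 'success') {
        return 'Your registration was saved.';
    }
    if ($status === 'failed') {
        return 'Something went wrong. Please try again.';
    }
    return ''; // unknown or missing status: show nothing
}

// page2.php would redirect with: header('Location: page3.php?status=success');
// page3.php would then do:       echo messageForStatus($_GET['status'] ?? null);
echo messageForStatus('success'), "\n";
```

Only the fixed strings in the function are ever printed, so nothing from the URL itself ends up on the page.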
quagmire02 All American 44225 Posts user info edit post |
Quote : | "Now if you're transmitting a whole error message, like "The operation failed for X, Y, Z" reason, that's a different story." |
exactly...it's not common, but when it happens, it's usually a paragraph or two 4/27/2011 3:08:37 PM |