clean data

Clean Data, Animal Shelter Management, Structured Data, R, Çankaya Belediyesi Animal Shelter Donations


A few months ago, I started taking R lectures on Coursera. R is a programming language that can help me analyze data. My main interest is unstructured text data, but since the lectures were so good, I used R in this case to convert unstructured text into tabular data. Maybe Python would have been a better choice.

The raw data: http://kuyruksuzbipolarpisi.blogspot.com.tr/p/bagisraw.html

What I converted it to: (first 10 rows) http://kuyruksuzbipolarpisi.blogspot.com.tr/p/namez-tarih-bagis-1-nalcabesmez-11.html

namez | tarih | bagis
Murat Nalçabesmez | 11.05.2014 | 2.5 kg Adolt goody Mama
MAT-TAV | 10.05.2014 | 250 kg Tavuk Eti
Gökser Yasar | 10.05.2014 | 21adet410gr Konserve Köpek Mamasi
MAT-TAV | 09.05.2014 | 300 kg Tavuk Eti
Rabia Sen,Ekin Çaliskan,Nida Kuttas,Merve Demir | 07.05.2014 | 8 adet 1lt Süt,5kg Köpek Mamasi Açik
Deniz Ünsal- Kargo | 06.05.2014 | Smart Dog 15 kg Kuru Köpek Mamasi
Çagla Jansel | 04.05.2014 | 10 adet Konserve Mama
Pinar Karabudak | 30.04.2014 | 30 Konserve Mama,4 adet10 kg Kutu Mama
Canan Sayin. Doga Ipek Sayin | 29.04.2014 | Goody 2.5 Kg Kuru Mama
Göksu Bilgiç | 29.04.2014 | Goody 2.5 kg Kuru Mama
Mat_Tav | 28.04.2014 | 310 kg Tavuk Eti

I am not a programmer, but I don't like seeing data in an unstructured form. Unstructured data cannot be analyzed, and reports cannot easily be produced from it. In fact, a person doesn't know what s/he has if the data is kept in text format.
If you don't want to search for information, or to be responsible for searching it, the easiest way is to keep the data the way it was kept at https://cankayabldbarinagi.wordpress.com
(As far as I can see, they are deleting the records :))
The people who donated to the Mühye shelter were recorded in an HTML file, not in a table. As a result, the following cannot be identified:

  • The people who donate the most
  • The categories of donations: food, vaccines, infrastructure materials
  • The people who donate regularly
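
Once the records are in a table, questions like these become a few lines of R. The following is only a minimal sketch: it uses four name/date pairs copied from the sample table earlier in the post, and it assumes (this is not confirmed anywhere in the data) that "MAT-TAV" and "Mat_Tav" are the same donor spelled two ways.

```r
# Four rows copied from the sample table (name and date columns only).
donations <- data.frame(
  namez = c("MAT-TAV", "MAT-TAV", "Pinar Karabudak", "Mat_Tav"),
  tarih = c("10.05.2014", "09.05.2014", "30.04.2014", "28.04.2014"),
  stringsAsFactors = FALSE
)

# Assumption: spelling variants such as "MAT-TAV" and "Mat_Tav" are one donor,
# so compare names in upper case with "_" replaced by "-".
donor <- gsub("_", "-", toupper(donations$namez), fixed = TRUE)

# Count donations per donor, most frequent first.
counts <- sort(table(donor), decreasing = TRUE)
top_donor <- names(counts)[1]
print(counts)
```

The same `counts` table also hints at regular donors: anyone with more than one entry has donated on more than one occasion in this sample.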

Format 1 – 2014 Name (Date) Donation

Murat Nalçabesmez(11.05.2014) 2.5 kg Adolt goody Mama

MAT-TAV(10.05.2014) 250 kg Tavuk Eti

Gökser Yaşar(10.05.2014)21adet410gr Konserve Köpek Maması

MAT-TAV(09.05.2014) 300 kg Tavuk Eti

Rabia Şen,Ekin Çalışkan,Nida Kuttaş,Merve Demir(07.05.2014) 8 adet 1lt Süt,5kg Köpek Maması Açık

Deniz Ünsal- Kargo(06.05.2014)Smart Dog 15 kg Kuru Köpek Maması
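
Format 1 can be pulled apart with a single regular expression. This is a rough sketch rather than the code actually used for the post (that one relies on strsplit and appears further down); the variable names `raw_lines`, `pattern` and `parsed` are made up for the illustration, and the two input lines are copied from the examples above.

```r
# Two sample lines in "Name(Date) Donation" form.
raw_lines <- c(
  "MAT-TAV(10.05.2014) 250 kg Tavuk Eti",
  "MAT-TAV(09.05.2014) 300 kg Tavuk Eti"
)

# One capture group each for name, date (dd.mm.yyyy) and donation text.
pattern <- "^(.*)\\(([0-9]{2}\\.[0-9]{2}\\.[0-9]{4})\\)(.*)$"

parsed <- data.frame(
  namez = trimws(sub(pattern, "\\1", raw_lines)),
  # Converting to Date makes sorting and filtering by date possible later.
  tarih = as.Date(sub(pattern, "\\2", raw_lines), format = "%d.%m.%Y"),
  bagis = trimws(sub(pattern, "\\3", raw_lines)),
  stringsAsFactors = FALSE
)
print(parsed)
```

A line that does not match the pattern is left unchanged by sub(), so its date becomes NA, which makes badly formatted records easy to spot.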

Format 2- Name-Substrings

Talatpaşa İÖO Hayvanları koruma kulübü (01.06.2011)

2 x 20 kg. köpek kuru maması

2 x 13,5 kg. köpek kuru maması

Fulya Aydın.Fatma Şahin (31.05.2011)

7,5 numara cerrahi eldiven

kutu non steril eldiven

kutu cerrahi maske
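
In Format 2 the donation lines have to be attached to the header line above them. One possible way, sketched here under assumptions, is to mark every line that contains a date as a header and number the groups with cumsum(). This is a different technique from the backward loop in the code further down. The input lines are the examples above with Turkish letters transliterated to ASCII, and the names `raw_lines`, `items` and `records` are hypothetical.

```r
# Sample Format 2 lines (Turkish letters transliterated to ASCII for this sketch).
raw_lines <- c(
  "Talatpasa IOO Hayvanlari koruma kulubu (01.06.2011)",
  "2 x 20 kg. kopek kuru mamasi",
  "2 x 13,5 kg. kopek kuru mamasi",
  "Fulya Aydin.Fatma Sahin (31.05.2011)",
  "7,5 numara cerrahi eldiven",
  "kutu non steril eldiven",
  "kutu cerrahi maske"
)

date_pattern <- "[0-9]{2}\\.[0-9]{2}\\.[0-9]{4}"
is_header <- grepl(date_pattern, raw_lines)

# Each header starts a new group; the lines below it inherit its group number.
group <- cumsum(is_header)
headers <- raw_lines[is_header]

# Collect the donation lines per group; the factor levels keep headers
# that have no donation lines at all.
items <- split(raw_lines[!is_header],
               factor(group[!is_header], levels = seq_along(headers)))

records <- data.frame(
  namez = trimws(sub("\\(.*$", "", headers)),
  tarih = regmatches(headers, regexpr(date_pattern, headers)),
  bagis = unname(vapply(items, paste, character(1), collapse = ";")),
  stringsAsFactors = FALSE
)
print(records)
```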

Six different formats were used to record the donations... Congrats 🙂

The code is messy, but I got bored and don't want to refactor it. This format has the names of the people who donated, the dates and the donations. The donations should be classified, too.
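
The classification step is not implemented in this post. A very rough sketch of how it might start is a keyword lookup on the donation text. The function name `classify_donation`, the keyword lists and the category labels "medical supplies" and "other" are all assumptions made for this illustration; the sample strings are taken from the donation examples above.

```r
# Hypothetical keyword-based classifier for the donation text.
classify_donation <- function(bagis) {
  text <- tolower(bagis)
  ifelse(grepl("mama|tavuk|konserve", text), "food",
         ifelse(grepl("eldiven|maske", text), "medical supplies", "other"))
}

sample_bagis <- c(
  "250 kg Tavuk Eti",
  "10 adet Konserve Mama",
  "kutu cerrahi maske",
  "7,5 numara cerrahi eldiven"
)

categories <- classify_donation(sample_bagis)
print(table(categories))
```

A real keyword list would have to be much longer and would need to handle Turkish characters and spelling variants, so this only shows the shape of the idea.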

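# Lines 1-201 of bagis.txt: "Name(Date) Donation" on a single line (Format 1).
bagistodf<-function(){
  setwd("D:/Belgeler/Coursera")
  bagisfile<-readLines("bagis.txt",encoding="UTF-8")
  bagis201<-bagisfile[1:201]
  # Splitting on "(" or ")" yields three fields: name, date, donation.
  bagis201s<-strsplit(bagis201,"\\(|\\)")
  f<-function(x){x[1]}
  g<-function(x){x[2]}
  j<-function(x){x[3]}
  namez<-sapply(bagis201s,f)
  tarih<-sapply(bagis201s,g)
  bagis<-sapply(bagis201s,j)
  df<-data.frame(namez,tarih,bagis)
  # First chunk: creates the CSV file and writes the header row.
  write.table(df,"bagis333.csv",sep=",")
  head(df)
}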
# Lines 202-304 of bagis.txt: a "Name (Date)" header line followed by donation lines (Format 2).
bagis202304<-function(){
  setwd("D:/Belgeler/Coursera")
  bagisfile<-readLines("bagis.txt",encoding="UTF-8")
  bagis304<-bagisfile[202:304]
  # Named "parts" rather than "split" so that base::split is not masked.
  parts<-strsplit(bagis304,"\\(|\\)")
  pattern<-"[0-9][0-9]\\.[0-9][0-9]\\.[0-9][0-9][0-9][0-9]"
  has_date<-function(x){any(grepl(pattern,x))}
  # Walk backwards so that donation lines are merged before they reach their header.
  for (i in length(parts):2){
    if (!has_date(parts[[i]]) && !has_date(parts[[i-1]])) {
      # Two consecutive donation lines: join them with ";" on the upper line.
      parts[[i-1]][[1]]<-paste(parts[[i]][[1]],parts[[i-1]][[1]],sep=";")
    } else if (!has_date(parts[[i]]) && has_date(parts[[i-1]])) {
      # Donation line directly under a header: store it as the header's third field.
      parts[[i-1]][[3]]<-parts[[i]][[1]]
    }
  }
  # Keep only the completed header records (name, date, donation).
  newsplit<-Filter(function(x){length(x)>=3},parts)
  f<-function(x){x[1]}
  g<-function(x){x[2]}
  j<-function(x){x[3]}
  namez<-sapply(newsplit,f)
  tarih<-sapply(newsplit,g)
  bagis<-sapply(newsplit,j)
  df304<-data.frame(namez,tarih,bagis)
  # Appended chunk: the header row was already written by bagistodf().
  write.table(df304,"bagis333.csv",sep=",",append=TRUE,col.names=FALSE)
}
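# Lines 560-685 of bagis2.txt: the field order is donation, name, date.
bagis560685<-function(){
  setwd("D:/Belgeler/Coursera")
  bagisfile<-readLines("bagis2.txt",encoding="UTF-8")
  bagis685<-bagisfile[560:685]
  # Turn the first "?" on each line into "%" so it can serve as a field separator.
  b6<-sub("\\?","%",bagis685)
  b6<-strsplit(b6,"\\(|\\)|%")
  # Rows 83 and 84 are split incorrectly.
  f<-function(x){x[1]}
  g<-function(x){x[2]}
  j<-function(x){x[3]}
  bagis<-sapply(b6,f)
  namez<-sapply(b6,g)
  tarih<-sapply(b6,j)
  df685<-data.frame(namez,tarih,bagis)
  # Appended chunk: the header row was already written by bagistodf().
  write.table(df685,"bagis333.csv",sep=",",append=TRUE,col.names=FALSE)
}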
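# Lines 391-539 of bagis2.txt: donation and name separated by "?"; no date on the line.
bagis391539<-function(){
  setwd("D:/Belgeler/Coursera")
  bagisfile<-readLines("bagis2.txt",encoding="UTF-8")
  bagis539<-bagisfile[391:539]
  # Turn the first "?" on each line into "%" and split on it.
  b539<-sub("\\?","%",bagis539)
  b539<-strsplit(b539,"%")
  f<-function(x){x[1]}
  g<-function(x){x[2]}
  bagis<-sapply(b539,f)
  namez<-sapply(b539,g)
  # Every record in this block gets the same fixed date.
  tarih<-rep("17.03.2013",length(b539))
  df539<-data.frame(namez,tarih,bagis)
  # Appended chunk: the header row was already written by bagistodf().
  write.table(df539,"bagis333.csv",sep=",",append=TRUE,col.names=FALSE)
}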

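# Lines 544-559 of bagis2.txt: same layout as lines 391-539 (donation?name, no date).
bagis544559<-function(){
  setwd("D:/Belgeler/Coursera")
  bagisfile<-readLines("bagis2.txt",encoding="UTF-8")
  bagis559<-bagisfile[544:559]
  # Turn the first "?" on each line into "%" and split on it.
  b559<-sub("\\?","%",bagis559)
  b559<-strsplit(b559,"%")
  f<-function(x){x[1]}
  g<-function(x){x[2]}
  bagis<-sapply(b559,f)
  namez<-sapply(b559,g)
  # Every record in this block gets the same fixed date.
  tarih<-rep("17.03.2013",length(b559))
  df559<-data.frame(namez,tarih,bagis)
  # Appended chunk: the header row was already written by bagistodf().
  write.table(df559,"bagis333.csv",sep=",",append=TRUE,col.names=FALSE)
}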
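# Lines 334-340 of bagis2.txt: the field order is donation, date, name.
bagis334340<-function(){
  setwd("D:/Belgeler/Coursera")
  bagisfile<-readLines("bagis2.txt",encoding="UTF-8")
  bagis340<-bagisfile[334:340]
  # Replace the first "?" on each line with "%", then split on the parentheses.
  b340<-sub("\\?","%",bagis340)
  b340<-strsplit(b340,"\\(|\\)")
  f<-function(x){x[1]}
  g<-function(x){x[2]}
  j<-function(x){x[3]}
  bagis<-sapply(b340,f)
  tarih<-sapply(b340,g)
  namez<-sapply(b340,j)
  df340<-data.frame(namez,tarih,bagis)
  # Appended chunk: the header row was already written by bagistodf().
  write.table(df340,"bagis333.csv",sep=",",append=TRUE,col.names=FALSE)
}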
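# Lines 377-388 of bagis2.txt: same layout as lines 334-340 (donation, date, name).
bagis377388<-function(){
  setwd("D:/Belgeler/Coursera")
  bagisfile<-readLines("bagis2.txt",encoding="UTF-8")
  bagis388<-bagisfile[377:388]
  # Replace the first "?" on each line with "%", then split on the parentheses.
  b388<-sub("\\?","%",bagis388)
  b388<-strsplit(b388,"\\(|\\)")
  f<-function(x){x[1]}
  g<-function(x){x[2]}
  j<-function(x){x[3]}
  bagis<-sapply(b388,f)
  tarih<-sapply(b388,g)
  namez<-sapply(b388,j)
  df388<-data.frame(namez,tarih,bagis)
  # Appended chunk: the header row was already written by bagistodf().
  write.table(df388,"bagis333.csv",sep=",",append=TRUE,col.names=FALSE)
}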
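
The seven functions above mostly differ in which split field ends up in which column. If the code were ever refactored, one possible direction is a single helper that takes the field positions as arguments. This is only a sketch: `fields_to_df` is a hypothetical name, not something used in the code above, and the first sample line is taken from Format 1 while the second is a made-up line in "donation(name)date" order for illustration.

```r
# Hypothetical helper: build the data frame from strsplit() output,
# given the position of each column among the split fields.
fields_to_df <- function(parts, namez = 1, tarih = 2, bagis = 3) {
  pick <- function(k) vapply(parts, function(x) x[k], character(1))
  data.frame(namez = pick(namez), tarih = pick(tarih), bagis = pick(bagis),
             stringsAsFactors = FALSE)
}

# Format 1 order: name, date, donation (the defaults).
parts1 <- strsplit("MAT-TAV(10.05.2014) 250 kg Tavuk Eti", "\\(|\\)")
df1 <- fields_to_df(parts1)

# A made-up line where the order is donation, name, date.
parts2 <- strsplit("250 kg Tavuk Eti(MAT-TAV)10.05.2014", "\\(|\\)")
df2 <- fields_to_df(parts2, bagis = 1, namez = 2, tarih = 3)

print(rbind(df1, df2))
```

A missing field comes back as NA instead of raising an error, because indexing a character vector past its end returns NA in R.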

Categories: clean data, Patilerle İlgili
